# Assignment 3: End-to-End Machine Learning Project
The material in this assignment is based on Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow, by Aurélien Géron.

# What problem are we trying to solve and how will we solve it?
In our case, we will be trying to build a model of California housing prices using Census data.   The primary goal: be able to predict the median housing price in any California district, using the data available in this dataset.  This problem is an example of **regression**, where the prediction of our model (or its output) is a continuous variable. This is in contrast to **classification**, where the prediction of our model (or its output) is a class or group.

The typical steps in such an analysis vary depending on the problem, but they usually include the following:
1.  Get the data.
2.  Minimally clean and prepare the data.
3.  Explore the data, typically using visualizations.
4.  Select a model appropriate for your particular problem and train it.
5.  Test the model using unseen data.
6.  Fine tune the model.
7.  Present the results.

We will go through all of these steps.  We won't dwell on the details of the model - we will use it like a **black box**.   Later on in the course, we will spend more time on the details.

# 1) Get the data

In [2]:
import pandas as pd

# Load the housing data and print the column names to the screen
housing = pd.read_csv("https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch2/housing.csv")
print("Housing columns:",housing.columns)

Housing columns: Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


Our goal in this assignment: predict the **median_house_value** given all of the other data.   "median_house_value" will be our label.  All of the other columns are our **features**.
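As a minimal sketch of what "label" and "features" mean in code, the example below separates the two on a tiny made-up table. The rows are invented for illustration, and the variable names `toy`, `labels`, and `features` are assumptions, not names required by the assignment.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real census values.
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"],
    "median_house_value": [100000.0, 200000.0, 300000.0],
})

# The label is the single column we want to predict.
labels = toy["median_house_value"].copy()

# The features are all of the remaining columns.
features = toy.drop(columns=["median_house_value"])

print(list(features.columns))
print(labels.tolist())
```

Note that `drop` returns a new dataframe, so the original table still contains the label column.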

# 2) Explore the data
As we did in chapter 1, we are going to want to explore the data.

1.  Look at a few rows of the dataset: use housing.head().
2.  Get some info about the names and types of the columns in the dataframe, the number of rows, and how much memory the dataframe takes up: using housing.info()
3.  Get some basic statistical info about the dataframe (mean, std, etc): use housing.describe()
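4.  Get correlations among all of the numerical columns: use housing.corr(numeric_only=True).  Recent pandas versions raise an error if a text column such as ocean_proximity is included.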
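The four calls above can be tried on a small stand-in table first. The sketch below uses a few made-up rows (not real census values) with a subset of the housing column names. It assumes pandas 1.5 or newer, where `corr` accepts `numeric_only=True` to skip text columns.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT real census values.
toy = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0, 8.0],
    "housing_median_age": [40.0, 30.0, 25.0, 15.0, 10.0],
    "median_house_value": [90000.0, 150000.0, 210000.0, 320000.0, 450000.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR OCEAN", "NEAR BAY"],
})

print(toy.head())        # first few rows
toy.info()               # column names, dtypes, row count, memory use
print(toy.describe())    # mean, std, quartiles of the numeric columns

# Correlations between the numeric columns only (assumes pandas >= 1.5).
corr_matrix = toy.corr(numeric_only=True)

# Rank every numeric column by its correlation with the label.
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```

Sorting a single column of the correlation matrix is usually easier to read than the full matrix when you only care about the label.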

# 3) Feature Engineering

Feature engineering refers to combining existing features to form new ones. These combinations might be simple (like the result of adding, subtracting, or multiplying columns) or they could be more complex, like the results of a sophisticated analysis. The basic idea is to add information for each candidate data point, which will hopefully improve whatever model we end up using to perform our predictions.

In our case there are some obvious new features we can create.

1.  rooms_per_household
2.  bedrooms/household
3.  bedrooms/room
4.  people per household
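The four ratios listed above can be sketched as simple column divisions. The rows below are made up for illustration, and the new column names are one possible choice rather than names fixed by the assignment.

```python
import pandas as pd

# Made-up district totals for illustration only.
toy = pd.DataFrame({
    "total_rooms": [1000.0, 6000.0, 1500.0],
    "total_bedrooms": [200.0, 1200.0, 250.0],
    "population": [400.0, 2400.0, 500.0],
    "households": [100.0, 1000.0, 250.0],
})

# Each new feature is an element-wise ratio of two existing columns.
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
toy["bedrooms_per_household"] = toy["total_bedrooms"] / toy["households"]
toy["bedrooms_per_room"] = toy["total_bedrooms"] / toy["total_rooms"]
toy["population_per_household"] = toy["population"] / toy["households"]

print(toy.round(3))
```

After adding the columns, rerunning the correlation check against median_house_value shows whether a ratio carries more information than the raw totals.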

Think about the stratified sampling that we did earlier, and note that by far the variable most correlated with our label is **median_income**.   So when we split our data, we would like to be sure that the median income distribution of our test sample is close to that of our training sample.   Will this be true if we just randomly split the data?   We already know that the answer is "not quite".

To test this, let's make a **categorical** variable which describes median income.   We will have 5 categories, running from 1.0 (low) to 5.0 (high).

In [0]:
import numpy as np

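# Divide by 1.5 and round up to get a small number of income categories
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
# Merge everything at or above 5.0 into category 5.0 (assigning the result back
# avoids in-place modification of a column selection, which newer pandas versions warn about)
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5.0, 5.0)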
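To see what the two steps above do, here is a small sketch applied to a handful of made-up income values. The variable names `incomes` and `income_cat` are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up median_income values chosen to hit every category.
incomes = pd.Series([0.5, 1.5, 2.0, 4.4, 6.0, 6.1, 15.0], name="median_income")

# Step 1: divide by 1.5 and round up.
income_cat = np.ceil(incomes / 1.5)

# Step 2: keep values below 5.0, replace everything else with 5.0.
income_cat = income_cat.where(income_cat < 5.0, 5.0)

print(pd.DataFrame({"median_income": incomes, "income_cat": income_cat}))
```

The last two incomes would land in categories 5 and 10 after the first step, and the cap in the second step places both in category 5.0.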


# 4)  Train/Test Splitting
Our goal is to design an algorithm to predict housing prices.   To test our model, we will want to split our data into two parts:
1.  Training sample: This is the sample we will train our model on.
2.  Testing sample: This is the **unseen** data that we will test our trained model on.  Good performance on this sample indicates that our model generalizes well.

Use a split of 80% train and 20% test.   You can do a **random split**, but a better split is stratified according to the income category variable we defined above.   No matter what split you end up using, make sure you check how well the test and train sets agree in that variable.
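One way to compare the two kinds of split is sketched below, using the `stratify` argument of scikit-learn's `train_test_split`. The data is randomly generated as a stand-in for the housing table, and the helper `cat_shares` is a hypothetical name introduced here. Other tools, such as `StratifiedShuffleSplit`, would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; values are randomly generated.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

# Purely random 80/20 split.
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

# Stratified 80/20 split: each income category keeps nearly the same share.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["income_cat"]
)

def cat_shares(df):
    """Fraction of rows in each income category (hypothetical helper)."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

# Side-by-side category shares: full data vs. the two test samples.
comparison = pd.DataFrame({
    "overall": cat_shares(toy),
    "random_test": cat_shares(rand_test),
    "stratified_test": cat_shares(strat_test),
}).fillna(0.0)
print(comparison)
```

On the real data, replace `toy` with `housing`. The stratified column should track the overall column closely, while the random column can drift by a few percent.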



# 5) Dealing with missing data
You could try:
1.  Removing all rows with any missing data
2.  Replacing the missing data with the mean of the column:  **NOTE**: if you do this, you must get the means from the **training** set.    Think about why this is the case.
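Both options can be sketched on a tiny example. The frames below are made up, the column name is used only for illustration, and `SimpleImputer` from scikit-learn is one possible tool for the second option (computing the mean by hand with pandas would also work).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up train/test frames with gaps in one column.
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 200.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 500.0]})

# Option 1: drop every row that has any missing value.
train_dropped = train.dropna()

# Option 2: fill gaps with the column mean learned from the TRAINING set only.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)   # learns the training mean (200.0)
test_filled = imputer.transform(test)         # reuses that same training mean

print("rows kept after dropna:", len(train_dropped))
print("train filled:", train_filled.ravel())
print("test filled:", test_filled.ravel())
```

The test gap is filled with the training mean, not the test mean. This keeps the test set truly unseen: nothing measured on it leaks into the preparation steps.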

# 6) Feature Scaling
We will use feature scaling as we did with the flight dataset.  In this case, use **standardization**.   Remember: you need to use the **training** set to **fit** the transformer, and you need to use the **transformer** on **both** the training and test sets.

Remember that we do not use these techniques for **categorical** columns (something different will be done).
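Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation come from the training column. A minimal sketch with scikit-learn's `StandardScaler` follows. The arrays are made up, and the names `train_num` and `test_num` are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numerical features: 3 training rows and 1 test row, 2 columns each.
train_num = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_num = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_num)  # fit on the training data only
test_scaled = scaler.transform(test_num)        # reuse the training mean and std

# Training columns now have mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# Test columns are scaled with the training statistics, so they need not be.
print(test_scaled)
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, and the model would see features on a different scale than it was trained on.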


# 7) Combining everything before fitting

Refer to the earlier workbook titled "Putting Humpty-Dumpty back together!"

After all of our above work we should have:
1.   two numpy arrays containing our "scaled" numerical features, one for our training sample and one for our testing sample
2.   two one-hot-encoded numpy arrays for our categorical variable, one for our training sample and one for our testing sample

We need to combine these so we have **one** training numpy array, and **one** testing numpy array.   Along with each of these, we will have **label** arrays, made from the median_house_value column for the test and train samples.
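A minimal sketch of the combination step follows, assuming the numerical features have already been scaled. All values and variable names are made up for illustration, and `OneHotEncoder` plus `np.hstack` is one possible approach, not necessarily the one used in the earlier workbook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up scaled numerical features (2 columns): 3 training rows, 2 test rows.
train_num_scaled = np.array([[-1.2, 0.3], [0.0, -0.5], [1.2, 0.2]])
test_num_scaled = np.array([[0.4, 1.1], [-0.7, 0.0]])

# Made-up categorical column for the same rows.
train_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
test_cat = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND"]})

# Fit the encoder on the training categories, then transform both samples.
encoder = OneHotEncoder(handle_unknown="ignore")
train_cat_1hot = encoder.fit_transform(train_cat).toarray()
test_cat_1hot = encoder.transform(test_cat).toarray()

# Stack the numerical and one-hot columns side by side.
train_prepared = np.hstack([train_num_scaled, train_cat_1hot])
test_prepared = np.hstack([test_num_scaled, test_cat_1hot])

print(train_prepared.shape, test_prepared.shape)
```

Both arrays must end up with the same number of columns in the same order, which is why the encoder is fitted once and reused for the test sample.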


# 8) Fit the data and test the fit
As before, the model will be linear regression (we are using more than a single feature, but it is still just linear regression).   Test the fit on both the training and test sets.

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

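model = LinearRegression()

# Fit on the training set only, then predict on the test set.
# NOTE: train_housing_prepared, train_housing_labels and test_housing_prepared
# are assumed names for the arrays built in step 7; substitute your own.
model.fit(train_housing_prepared, train_housing_labels)
test_housing_labels_pred = model.predict(test_housing_prepared)

# Do this on the test AND training sets!
lin_mse = mean_squared_error(test_housing_labels, test_housing_labels_pred)
lin_rmse = np.sqrt(lin_mse)    # Remember to take the square root!
print("Mean squared error and root mean squared error:", lin_mse, lin_rmse)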
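To show the full fit-and-test cycle on both samples, here is a self-contained sketch on synthetic data. The data is generated from a known linear relationship plus noise, so it is not housing data, and the helper `rmse` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y is a linear function of two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Simple 80/20 split (the rows are already in random order).
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = LinearRegression()
model.fit(X_train, y_train)          # fit on the training sample only

def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
print("train RMSE:", train_rmse, "test RMSE:", test_rmse)
```

When the two numbers are close, the model generalizes about as well as it fits. A test error much larger than the training error is a sign of overfitting.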



# 9) Some extra stuff

If you are looking for more to do!
1.  **Making maps**
This data is interesting since it has latitude and longitude. Previously we made world maps, but this depended on our data having tags which were country names. This is different. This will be more like a scatter-plot, but arranged on an existing map (primarily California). How do we do this?
Search for "plotly map scatter", take the code from the first example, and modify it.

2.  *IF* you used a random sample for your test/train split, try using stratified sampling based on the income category variable instead.

3.   We probably should have done this first... but how do we *know* that our fit improved our knowledge?   Is there a simple predictor that we could have used instead?   How about if we predict the price simply based on the mean (or the median) of all housing prices?   Use the mean squared error to compare this simple predictor with the fit.

4.   Try another predictor from sklearn:  RandomForestRegressor and/or DecisionTreeRegressor.   Make sure you test the fit results (using mean_squared_error) on BOTH the training AND test sets!
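Items 3 and 4 can be sketched together: a mean-only baseline and a decision tree, each scored on both samples. The data below is synthetic, the helper `rmse` is a hypothetical name, and `DummyRegressor` is just one convenient way to build the "always predict the mean" baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature, linear trend plus noise (not housing data).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(600, 1))
y = 20.0 * X[:, 0] + rng.normal(scale=15.0, size=600)
X_train, X_test, y_train, y_test = X[:480], X[480:], y[:480], y[480:]

def rmse(model, X_eval, y_eval):
    """Root mean squared error of a fitted model (hypothetical helper)."""
    return np.sqrt(mean_squared_error(y_eval, model.predict(X_eval)))

# Baseline: always predict the mean of the training labels.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Fully grown decision tree with default settings.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Store (train RMSE, test RMSE) for each model.
results = {
    name: (rmse(m, X_train, y_train), rmse(m, X_test, y_test))
    for name, m in [("baseline", baseline), ("tree", tree)]
}
for name, (train_err, test_err) in results.items():
    print(f"{name}: train RMSE = {train_err:.2f}, test RMSE = {test_err:.2f}")
```

The unconstrained tree memorizes the training sample, so its training error is essentially zero while its test error is not. This is why the fit must be checked on BOTH samples, and why a model is only useful if it beats the baseline on the test set.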