# Term Project Example

In this mini-project, our goal is to look at data published by the US Census Bureau about the housing market in California. Given a set of features, we want to find a model that predicts the median housing price in any district. The case is taken from the book "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron.

Like all seasoned data scientists we start by loading our notebook with the standard toolbox of packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

### I - First we frame the problem:

The first thing we need to establish is the objective of our machine learning problem: how do we intend to use this data and model in the future? Knowing the objective is crucial for every decision we take while building the model, from how we clean the data to how we evaluate the model.

Let's assume we are investors trying to find undervalued districts. For this we chose the California Housing Prices dataset, which is based on data from the 1990 California census. It is not exactly recent (you could still afford a nice house in the Bay Area at the time), but it has many qualities for learning, so we will pretend it is recent data. Your boss asks you to build a model to predict the median housing price in any district given the metrics in the data.


The first question we need to answer is: what kind of problem are we looking at?

In this particular case, we are clearly dealing with a supervised learning problem that requires a regression analysis. More precisely, it is a multiple regression, since we use several features to predict a single value per district.

That said, we could still reframe the problem at this stage by making the target a price range instead of a median price. In that case, we would be dealing with a classification problem.



### II - Get the Data

In [None]:
data_table = pd.read_csv('housing.csv')

In [None]:
data_table.head()

The content and structure of our dataset look fairly comprehensible. We can explore the data in more depth by calling the .info() method on it.

In [None]:
data_table.info()

#### Initial observations:

   - The __total_bedrooms__ feature has only 20433 non-null values, which means we need to deal with those missing values.
   - All features are numerical and stored as type float64 except __ocean_proximity__. Pandas loaded it as type 'object', which can hold any Python object, but by comparing with the .head() output we know we are dealing with strings.
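
The two observations above can also be checked programmatically. The following is a minimal sketch on a small made-up table (`toy` and its values are assumptions for illustration, not the real housing data); the same calls would apply to `data_table`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Number of missing entries per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Columns that are not numeric (here, the string-valued category column).
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```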

In [None]:
data_table['ocean_proximity'].value_counts()

Next, let's look at the summary of the other features:

In [None]:
data_table.describe()

In [None]:

data_table.hist(bins=50, figsize=(20,15))

### To Notice:

- The median income feature does not seem to be expressed in US dollars. After checking, I realized that the data has been scaled and capped at 15 on the upper side and 0.5 on the lower side.
- The housing median age and the median house value are also capped. The latter could be a problem since it is our target attribute. This is not ideal, and we need to see how we can fix it. A model trained on values capped at 500,000 will predict wrongly for houses worth more than 500k.
- The features in general vary a lot in scale.
- Many features have distributions that are far from normal; several are noticeably skewed.
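
One possible way to put numbers on the capping and the skew is sketched below. The series `toy_values` is made-up data standing in for `median_house_value`, so the printed figures say nothing about the real dataset.

```python
import pandas as pd

# Made-up values standing in for median_house_value; not real census data.
toy_values = pd.Series([120000.0, 250000.0, 500000.0, 90000.0, 500000.0, 310000.0])

# Share of rows sitting exactly at the maximum: a large share suggests a cap.
cap = toy_values.max()
share_at_cap = (toy_values == cap).mean()

# Sample skewness: values far from 0 indicate an asymmetric distribution.
skewness = toy_values.skew()

print(f"cap={cap:.0f}, share at cap={share_at_cap:.2f}, skewness={skewness:.2f}")
```

On the real table, the same two lines could be applied column by column, for example to `data_table["median_house_value"]`.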

### Visualising geographical Data

In [None]:
data_table.plot(kind="scatter", x="longitude", y="latitude")


In [None]:
data_table.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

By adding the alpha argument we get a much more nuanced visualisation of California, with two concentrated areas around Los Angeles and the Central Valley.

Now let's add housing prices to the picture:

In [None]:
data_table.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=data_table["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

### Exploring correlations:

In [None]:
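# Correlations are only defined for numeric columns, so ocean_proximity is left out.
corr_matrix = data_table.select_dtypes(include="number").corr()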

In [None]:
corr_matrix

In [None]:
from pandas.plotting import scatter_matrix

In [None]:


attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(data_table[attributes], figsize=(12, 8))

In [None]:
data_table.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])


One last thing you might want to do while preparing your data is to combine some features. For example, the total number of rooms in a district is not very useful if you don't know how many households there are. What you really want is the number of rooms per household. You might also want to look at the number of bedrooms relative to the total number of rooms. We therefore create a few new features:

In [None]:
data_table["rooms_per_household"] = data_table["total_rooms"]/data_table["households"]
data_table["bedrooms_per_room"] = data_table["total_bedrooms"]/data_table["total_rooms"]
data_table["population_per_household"]=data_table["population"]/data_table["households"]

In [None]:
corr_matrix = data_table.select_dtypes(include="number").corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

This type of analysis is not exhaustive. It is just an illustrative example of how to think about your data.

### Prepare the Data for Machine Learning

Let's start by cleaning the data.

### Missing Data:

We saw earlier that the total_bedrooms feature has some missing values. We have three options to deal with that:

- Get rid of the corresponding districts
- Get rid of the whole feature
- Set the missing values to some value (mean, median, zero, etc.)


pandas offers a method for each option:

In [None]:
sample_incomplete_rows = data_table[data_table.isnull().any(axis=1)].head()
sample_incomplete_rows

In [None]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1

In [None]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2

In [None]:
median = data_table["total_bedrooms"].median()
sample_incomplete_rows.fillna({"total_bedrooms": median})  # option 3

In [None]:
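# Fill the missing bedroom counts with the median, then recompute the ratio
# from the filled column (a ratio should not be filled with a bedroom count).
data_table["total_bedrooms"] = data_table["total_bedrooms"].fillna(median)
data_table["bedrooms_per_room"] = data_table["total_bedrooms"] / data_table["total_rooms"]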

In [None]:
data_table.info()

We opt for the third method. 
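
As an alternative to calling `fillna` by hand, scikit-learn provides `SimpleImputer`, which learns the fill value from the data and can reuse it later on new data. This is only a sketch on a made-up table (`toy` is hypothetical); it is not what the cells above do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table with two missing bedroom counts.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
})

# Learn the median of each column and use it to fill the gaps.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(imputer.statistics_)  # the per-column medians that were learned
print(filled)
```

The fitted imputer can then be applied to a test set with `imputer.transform(...)`, so the test rows are filled with the medians learned from the training rows.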

### Categorical Features:

In [None]:
housing_cat = data_table[["ocean_proximity"]]


In [None]:
housing_cat

Most machine learning algorithms work only with numbers. Hence we need to transform those categories into numbers.

For this we will use the scikit-learn class called OrdinalEncoder.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded = pd.DataFrame(housing_cat_encoded)
housing_cat_encoded.columns = ['Category']

What is the problem with this encoding?

In [None]:
housing_cat_encoded

One issue with this encoding, as we discussed in a previous class, is that some ML algorithms will assume that two nearby values are more similar than two distant values. One-hot encoding avoids this by creating one binary column per category.

In [None]:
data_table.shape

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
columns = cat_encoder.categories_[0].tolist()
columns

In [None]:
data = housing_cat_1hot.toarray()


In [None]:
n_frame = pd.DataFrame(data)


In [None]:
n_frame.columns=columns

In [None]:
n_frame

In [None]:
n_data_table = pd.concat([data_table,n_frame],axis=1,sort=False)

In [None]:
n_data_table

In [None]:
n_data_table = n_data_table.drop('ocean_proximity',axis=1)

In [None]:
n_data_table.shape

In [None]:
n_data_table.info()
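
The one-hot steps above (encode, convert to an array, build a frame, concatenate, drop the original column) can also be condensed into a single pandas call. The sketch below uses `pd.get_dummies` on a made-up table (`toy` is hypothetical). Note that the resulting column names carry an `ocean_proximity_` prefix, unlike the names produced above.

```python
import pandas as pd

# Hypothetical toy table with one numeric and one categorical column.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=float)

print(encoded)
```

`OneHotEncoder` remains the better fit when the same encoding has to be reapplied to new data, because it remembers the categories it saw during fitting.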

### Split the Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()


y = n_data_table['median_house_value']
X = n_data_table.drop('median_house_value',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



### Apply Regression 

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import  Lasso
from sklearn.preprocessing import PolynomialFeatures
linreg = LinearRegression().fit(X_train_scaled, y_train)

We first look at the outcome of an OLS linear regression:


In [None]:
print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test_scaled, y_test)))

The results for linear regression are not great. Oddly, the test score is higher than the training score, which is usually a sign of underfitting.
Next, we try a lasso regression.


In [None]:
linlasso = Lasso(alpha=20.0, max_iter = 10000).fit(X_train_scaled, y_train)

print('California housing dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X_train), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))

In [None]:

from sklearn.model_selection import cross_val_score



cv_scores = cross_val_score(linlasso, X, y, cv=5)

print('Cross-validation scores (5-fold):', cv_scores)
print('Mean cross-validation score (5-fold): {:.3f}'
     .format(np.mean(cv_scores)))
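
The cross-validation above is run on the unscaled `X`, whereas the lasso model was fitted on scaled features. One way to keep the two consistent is to wrap the scaler and the model in a pipeline, so that the scaler is refitted on the training part of every fold. This is a sketch on synthetic data (`X_toy`, `y_toy` and their coefficients are made up), not a result for the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: three features on a large scale, linear target.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1000, size=(200, 3))
y_toy = 50.0 * X_toy[:, 0] - 20.0 * X_toy[:, 1] + rng.normal(0, 100, size=200)

# Scaling happens inside the pipeline, separately for each fold.
model = make_pipeline(MinMaxScaler(), Lasso(alpha=20.0, max_iter=10000))
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5)

print('Cross-validation scores (5-fold):', cv_scores)
print('Mean cross-validation score (5-fold): {:.3f}'.format(np.mean(cv_scores)))
```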

In [None]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))

In [None]:
from sklearn.model_selection import validation_curve

# Start above 0: scikit-learn advises against using Lasso with alpha=0.
param_range = np.linspace(0.5, 50, 10)
train_scores, test_scores = validation_curve(Lasso( max_iter = 10000), X, y,
                                            param_name='alpha',
                                            param_range=param_range, cv=3)
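
`validation_curve` returns one row of scores per parameter value and one column per fold. A possible way to summarise them is to average over the folds and look for the alpha with the best mean validation score. The sketch below does this on synthetic data (`X_toy`, `y_toy` and the alpha grid are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import validation_curve

# Synthetic stand-in data with features already in the [0, 1] range.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(150, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 0.1, size=150)

param_range = np.array([0.001, 0.01, 0.1, 1.0])
train_scores, test_scores = validation_curve(
    Lasso(max_iter=10000), X_toy, y_toy,
    param_name='alpha', param_range=param_range, cv=3)

# Average over the folds (axis 1) to get one score per alpha.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
best_alpha = param_range[np.argmax(mean_test)]

for alpha, tr, te in zip(param_range, mean_train, mean_test):
    print('alpha = {:.3f}: train R^2 = {:.3f}, validation R^2 = {:.3f}'
          .format(alpha, tr, te))
print('Best alpha on this toy data:', best_alpha)
```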

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train_scaled, y_train)

In [None]:
housing_predictions = tree_reg.predict(X_train_scaled)
tree_mse = mean_squared_error(y_train, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

In [None]:
housing_predictions = tree_reg.predict(X_test_scaled)
tree_mse = mean_squared_error(y_test, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
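
An unconstrained decision tree can memorise its training data, so the training RMSE alone says little about how well it generalises. Besides the held-out test set used above, cross-validation gives several validation RMSE values to compare against. A minimal sketch on synthetic data (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one informative feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 3))
y_toy = 100000 * X_toy[:, 0] + rng.normal(0, 10000, size=200)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_toy, y_toy)

# RMSE on the data the tree was trained on.
train_rmse = np.sqrt(mean_squared_error(y_toy, tree.predict(X_toy)))

# scikit-learn reports negated MSE, so flip the sign before the square root.
neg_mse = cross_val_score(tree, X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-neg_mse)

print('Training RMSE:', train_rmse)
print('Cross-validation RMSE per fold:', cv_rmse)
print('Mean cross-validation RMSE:', cv_rmse.mean())
```

A large gap between the training RMSE and the cross-validation RMSE would point to overfitting.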

In [None]:
y

In [None]:
# Bin the prices into five ordered classes in one pass:
# (0, 100k] -> 1, (100k, 200k] -> 2, ..., above 400k (including capped values) -> 5.
bins = [0, 100000, 200000, 300000, 400000, np.inf]
y_classes = pd.cut(y, bins=bins, labels=[1, 2, 3, 4, 5])


In [None]:
y_classes
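
With the target expressed as price-range classes, the problem becomes the classification task mentioned when we framed the problem. The sketch below shows what fitting a classifier on such classes could look like. Everything in it is illustrative: the data is synthetic, and the choice of `DecisionTreeClassifier` with `max_depth=3` is an assumption, not a recommendation from the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: price grows with income, plus noise, clipped to a cap.
rng = np.random.default_rng(0)
income = rng.uniform(0.5, 15, size=300)
prices = pd.Series(
    np.clip(40000 * income + rng.normal(0, 20000, size=300), 15000, 500000))

# Turn prices into five ordered classes.
bins = [0, 100000, 200000, 300000, 400000, np.inf]
price_class = pd.cut(prices, bins=bins, labels=[1, 2, 3, 4, 5]).astype(int)

# Fit a small classifier and measure accuracy on held-out rows.
X_toy = income.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, price_class, random_state=3)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)

print('Class counts:')
print(price_class.value_counts().sort_index())
print('Accuracy (test): {:.3f}'.format(accuracy))
```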