<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">COMP5611M - Building a Machine Learning Pipeline</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Marc de Kamps and University of Leeds</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

# Building a Machine Learning Pipeline (Part 3)

## Objectives

Here, we finally analyse the data. We will experiment with different regressors, perform cross validation and parameter search, and use the *scikit-learn* interface to do so.

In particular, we will 
- Apply linear regression to make predictions
- Apply decision trees to make predictions
- Apply random forests to make predictions
- Compare these different regression methods on the mean squared error criterion
- Perform cross validation and grid search

In [None]:
import os
import numpy as np
import pandas as pd
import tarfile

from sklearn.model_selection import StratifiedShuffleSplit

local_path = 'datasets/housing'


def restore():
    housing_tgz=tarfile.open(os.path.join(local_path,'./housing.tgz'))
    housing_tgz.extractall(path=local_path)
    housing_tgz.close()

    csv_path=os.path.join(local_path,'./housing.csv')
    housing = pd.read_csv(csv_path)

    # create test training set with stratified sampling (see previous notebook)
    housing["income_category"]=np.ceil(housing["median_income"]/1.5)
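    # cap the category at 5; assign the result back rather than modifying a column selection in place
    housing["income_category"]=housing["income_category"].where(housing["income_category"] < 5, 5.0)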

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2,random_state=42)

    for train_index, test_index in split.split(housing,housing["income_category"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]
    
    for set_ in (strat_train_set, strat_test_set):
        set_.drop(("income_category"),axis=1,inplace=True)
        
   
    return strat_train_set, strat_test_set

strat_train_set, strat_test_set = restore()

In [None]:
housing=strat_train_set.drop("median_house_value",axis=1)
housing_labels=strat_train_set["median_house_value"].copy()

In [None]:
housing_num=housing.drop("ocean_proximity",axis=1)

In [None]:
num_attribs=list(housing_num)
cat_attribs=["ocean_proximity"]

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
from sklearn.impute import SimpleImputer

imputer=SimpleImputer(strategy="median")

In [None]:
from sklearn.base import BaseEstimator

class CombinedAttributesAdder(BaseEstimator):

    def __init__(self, do_add_bedrooms_per_room = False):
        
        # flag: whether to also add the bedrooms_per_room feature
        self.do_add_bedrooms_per_room = do_add_bedrooms_per_room
        
        # These are the column indices of the respective columns. OK for illustration purposes.
        # For more robust code you would want to extract these values from the DataFrame by name.
        self.rooms_ix      = 3
        self.bedrooms_ix   = 4
        self.population_ix = 5
        self.household_ix  = 6
        
    def fit(self, X, y=None):
        # We don't transform the target values here
        return self
    
    def transform(self, X, y=None):
        rooms_per_household = X[:,self.rooms_ix]/X[:,self.household_ix]
        population_per_household = X[:, self.population_ix]/ X[:,self.household_ix]
        if self.do_add_bedrooms_per_room:
            bedrooms_per_room = X[:,self.bedrooms_ix]/X[:,self.rooms_ix]
            return np.c_[X,rooms_per_household, population_per_household,bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
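
The comment in the constructor suggests looking up column positions by name rather than hard-coding them. A minimal sketch of that idea follows. It assumes the column names and order of the housing CSV used in this notebook; the one-row DataFrame `demo` and the helper `column_indices` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for housing_num, assuming the same column order as the real data
demo = pd.DataFrame(
    [[-122.2, 37.8, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3]],
    columns=["longitude", "latitude", "housing_median_age", "total_rooms",
             "total_bedrooms", "population", "households", "median_income"],
)

def column_indices(df, names):
    """Return the integer position of each named column in df (hypothetical helper)."""
    return [df.columns.get_loc(name) for name in names]

rooms_ix, bedrooms_ix, population_ix, household_ix = column_indices(
    demo, ["total_rooms", "total_bedrooms", "population", "households"]
)
print(rooms_ix, bedrooms_ix, population_ix, household_ix)  # 3 4 5 6
```

Indices obtained this way could be passed to the transformer instead of the literal values, so that a reordered CSV does not silently change the meaning of the derived features.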
        

In [None]:
class DataFrameSelector(BaseEstimator):
    
    def __init__(self, attribute_names):
        self.attribute_names= attribute_names
        
    def fit(self,X, y = None):
        return self
    
    def transform(self, X):
        return X[self.attribute_names].values



In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
num_pipeline= Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer',SimpleImputer(strategy="median")),
    ('attribs_adder',CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
])

In [None]:
cat_pipeline = Pipeline([
    ('selector',DataFrameSelector(cat_attribs)),
    ('one hot',OneHotEncoder())
])

In [None]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline",num_pipeline),
    ("cat_pipeline",cat_pipeline)
])

In [None]:
housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
housing_prepared

In [None]:
housing_prepared.shape
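
More recent *scikit-learn* versions also offer `ColumnTransformer`, which selects columns by name and can therefore stand in for the `DataFrameSelector` plus `FeatureUnion` combination used above. The following is only a sketch on a made-up four-row DataFrame, not a replacement for the pipeline in this notebook; the column names and values are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with one missing numerical value and one categorical column
toy = pd.DataFrame({
    "rooms": [880.0, 7099.0, None, 1274.0],
    "income": [8.3, 8.3, 7.3, 5.6],
    "proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

# Numerical columns: impute the median, then standardise
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns the transformer is applied to)
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

toy_prepared = toy_full_pipeline.fit_transform(toy)
print(toy_prepared.shape)  # (4, 4): two scaled numerical columns plus two one-hot columns
```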

### Regression
Up until this point, the pipeline is a succinct version of all we did in Part 2, with one small exception that we encourage you to find. *housing_prepared* is the fully imputed and processed version of the training set; *housing_labels* holds the associated labels. The code for linear regression is shockingly simple.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg=LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

It is always a good idea to see what this looks like on the dataset:

In [None]:
some_data=housing.iloc[:5]
some_labels=housing_labels.iloc[:5]
some_data_prepared=full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse=mean_squared_error(housing_labels, housing_predictions)
lin_rmse=np.sqrt(lin_mse)
print(lin_rmse)

It does something sensible, but the root mean squared error is sizeable. Although some of this is caused by the deviation of prediction and outcome for the more expensive homes, inspection of individual predictions shows substantial deviations for some data points. And this is on the **training set**.
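
As a reminder of what the number above measures: the root mean squared error is the square root of the average squared difference between prediction and label. The sketch below computes it by hand on three made-up values and compares the result with the *scikit-learn* route used in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the MSE is 5/3
rmse_by_hand = np.sqrt(np.mean((y_pred - y_true) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_by_hand, rmse_sklearn)
```

Because the errors are squared before averaging, a few large deviations (such as those for expensive homes) weigh heavily on the result.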

Let's try a decision tree prediction.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()

tree_reg.fit(housing_prepared, housing_labels)

Note the similarity of interface. Prediction also looks similar.

In [None]:
housing_predictions=tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse=np.sqrt(tree_mse)
tree_rmse

No error at all! Actually this is somewhat suspect. There is no doubt the tree has learnt the training set perfectly, but this may indicate overfitting.
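
To see this effect in isolation, here is a small sketch on synthetic data (a noisy sine curve, not the housing data). An unrestricted tree reproduces the noisy training labels exactly, while its error on held-out points is clearly larger. The data, the split and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve with Gaussian noise on the labels
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# No depth limit: the tree keeps splitting until every training point is fitted
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```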

### Cross validation
In cross validation the dataset is partitioned. In n-fold cross validation, the following experiment is repeated n times: use a fraction (n-1)/n of the patterns for training and set the remaining 1/n apart for evaluation, with a different part held out each time. This gives n different scores. The implementation is straightforward with only one caveat: *scikit-learn* expects a utility function (higher is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE. The code below compensates for that.

In [None]:
from sklearn.model_selection import cross_val_score

scores =cross_val_score(tree_reg, housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)

tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print('Scores:',scores)
    print('Mean:',scores.mean())
    print('Standard deviation:',scores.std())
    
display_scores(tree_rmse_scores)

If anything, the performance is now slightly worse than for linear regression. This is a clear example of overfitting by the original tree.
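
To make the mechanics concrete, the sketch below spells out the loop that `cross_val_score` runs for us, using `KFold` on synthetic data (not the housing data), and checks that both routes give the same per-fold RMSE. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = DecisionTreeRegressor(random_state=0)
folds = KFold(n_splits=5)

# Manual n-fold loop: train on n-1 parts, evaluate on the part held out
manual_rmse = []
for train_idx, val_idx in folds.split(X):
    model.fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(fold_mse))

# The same experiment via cross_val_score, which returns the negative MSE per fold
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
print(np.allclose(manual_rmse, np.sqrt(-neg_mse)))
```

Note the sign flip in the last line: the scores are negative MSE values, so they are negated before the square root is taken.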

**Exercise** Carry out cross validation for the linear regression.

Now, we will try a random forest regressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg=RandomForestRegressor()
forest_reg.fit(housing_prepared,housing_labels)

housing_predictions=forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse=np.sqrt(forest_mse)
forest_rmse

This looks encouraging: a better error than linear regression, but no clear overfitting. Again, let's cross validate.

In [None]:
scores=cross_val_score(forest_reg, housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)
forest_rmse_scores=np.sqrt(-scores)

display_scores(forest_rmse_scores)

## Grid Search

The cross-validated random forest is clearly better than the linear regressor. We have used the random forest 'out of the box', not bothering to tweak its parameters. The lectures should have given you some ideas on how different parameter settings can be used to alter decision trees and random forests. In a *grid search* you systematically try out combinations of relevant parameters. As you can check from the *scikit-learn* documentation, you can vary (at least) the following parameters:

- n_estimators
- max_features
- bootstrap

Suppose you want to systematically explore a number of parameter settings, then *GridSearchCV* can help you do this:

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid  = [{'n_estimators': [3, 10, 30],'max_features':[2, 4, 6, 8]}, 
               {'bootstrap': [False],'n_estimators':[3,10],'max_features':[2,3,4]}]

forest_reg=RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
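
A grid search can take a while, so it helps to estimate its cost first. The sketch below counts the parameter combinations in the grid above with `ParameterGrid` and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# 3*4 combinations from the first dict plus 1*2*3 from the second
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5: every candidate is trained once per fold
print(n_candidates, n_fits)  # 18 90
```

With the default `refit=True`, `GridSearchCV` trains the best candidate one more time on the full training set after the search.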

After this has finished, you can retrieve the best-scoring parameter combination and the corresponding model:

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

The evaluation scores are available:

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"],cvres["params"]):
    print(np.sqrt(-mean_score),params)

You should explore the possibility to do a **RandomizedSearchCV**.

**Exercise** Why is this sometimes a better option? When would you use it?
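
As a starting point for your own experiments, here is a minimal sketch of the `RandomizedSearchCV` interface on synthetic data. The data, the parameter ranges and the choice of `n_iter=5` are arbitrary assumptions; adapt them before running it on the housing data.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem with 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# Distributions to sample from, instead of fixed lists of values
param_distributions = {
    "n_estimators": randint(3, 31),  # integers 3..30 (upper bound is exclusive)
    "max_features": randint(2, 9),   # integers 2..8
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_)
print(np.sqrt(-random_search.best_score_))
```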

### Feature Importance

It can be useful to explore feature importance:

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
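# The pipeline uses CombinedAttributesAdder() with its default setting (no bedrooms_per_room),
# so only two extra columns were added; the names must match the columns one-to-one.
extra_attribs= ["rooms_per_hhold", "pop_per_hhold"]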

# Here we use the OneHotEncoder class to retrieve the original categories
encoder=OneHotEncoder()
encoder.fit(housing[cat_attribs])
cat_one_hot_attribs = [ el for el in encoder.categories_[0]]
print(cat_one_hot_attribs)
attributes=num_attribs + extra_attribs + cat_one_hot_attribs


In [None]:
sorted(zip(feature_importances,attributes),reverse=True)

Interestingly, a single ocean-proximity category seems to be the only important categorical feature. You can consider leaving out the others.
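
Leaving out features can be done by keeping only the columns with the highest importance. The sketch below shows the indexing on made-up importances and a small dense array; the names, the values and the choice of `k` are all invented for illustration.

```python
import numpy as np

# Made-up importances, names and data, for illustration only
importances = np.array([0.05, 0.40, 0.02, 0.30, 0.23])
names = ["a", "b", "c", "d", "e"]
X = np.arange(15, dtype=float).reshape(3, 5)

k = 3
# argsort is ascending, so the last k entries are the k most important columns;
# sorting them again keeps the original column order
top_k_idx = np.sort(np.argsort(importances)[-k:])
X_reduced = X[:, top_k_idx]

print([names[i] for i in top_k_idx])  # ['b', 'd', 'e']
print(X_reduced.shape)                # (3, 3)
```

The reduced data set needs a freshly trained model, and the same column selection must be applied to the test set.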


## Evaluation on the test set

This is a step you should always carry out. If the results are substantially worse than the cross-validated ones, you must still distrust your model.

In [None]:
final_model=grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value",axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions=final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test,final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

This looks reasonably close to the cross validated results on the training set.