## Machine Learning Model Building Pipeline: Feature Selection

In the following videos, we will take you through a practical example of each one of the steps in the Machine Learning model building pipeline, which we described in the previous lectures. There will be a notebook for each one of the Machine Learning Pipeline steps:

1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building

**This is the notebook for step 3: Feature Selection**


We will use the house price dataset available on [Kaggle.com](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). See below for more details.

===================================================================================================

## Predicting Sale Price of Houses

The aim of the project is to build a machine learning model to predict the sale price of homes based on different explanatory variables describing aspects of residential houses. 

### Why is this important? 

Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or under-estimated.

### What is the objective of the machine learning model?

We aim to minimise the difference between the real price and the price estimated by our model. We will evaluate model performance using the mean squared error (MSE) and the root mean squared error (RMSE).
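
As a minimal illustration of these two metrics, the sketch below computes them with NumPy. The arrays `y_true` and `y_pred` are made-up values used only for this example; they are not taken from the dataset.

```python
import numpy as np

# hypothetical true and predicted prices (illustrative values only)
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# mean squared error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# root mean squared error: the square root of the MSE,
# expressed in the same units as the target
rmse = np.sqrt(mse)

print('mse: {}'.format(mse))
print('rmse: {}'.format(rmse))
```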

### How do I download the dataset?

To download the House Price dataset, go to this website:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Scroll down to the bottom of the page, click on the link 'train.csv', and then click the blue 'download' button towards the right of the screen to download the dataset. Rename the file as 'houseprice.csv' and save it to a directory of your choice.

**Note the following:**
- You need to be logged in to Kaggle in order to download the datasets.
- You need to accept the terms and conditions of the competition to download the dataset.
- If you save the file to the same directory where you saved this Jupyter notebook, then you can run the code as it is written here.

====================================================================================================

## House Prices dataset: Feature Selection

In the following cells, we will select a group of variables, the most predictive ones, to build our machine learning model. 

### Why do we select variables?

- For production: Fewer variables mean smaller client input requirements (e.g. customers filling out a form on a website or mobile app), and hence less code for error handling. This reduces the chances of introducing bugs.

- For model performance: Fewer variables tend to give simpler, more interpretable models, which often generalize better.


**We will select variables using the Lasso regression: Lasso has the property of setting the coefficient of non-informative variables to zero. This way we can identify those variables and remove them from our final model.**
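
To make this property concrete, here is a small sketch on synthetic data (not the house price data). The names `X_demo`, `y_demo` and `lasso_demo`, as well as the choice of `alpha=0.1`, are assumptions made for this illustration. Only the first two columns influence the target, so we would expect the Lasso to set the coefficients of the remaining three to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# synthetic data: 5 features, of which only the first 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# fit a Lasso with an illustrative penalty
lasso_demo = Lasso(alpha=0.1, random_state=0)
lasso_demo.fit(X_demo, y_demo)

# the non-informative features should show a coefficient of exactly zero
print(lasso_demo.coef_)
```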


### Setting the seed

It is important to note that we are engineering variables and pre-processing data with the aim of deploying the model. Therefore, from now on, for each step that includes some element of randomness, it is extremely important that we **set the seed**. This way, we can obtain reproducibility between our research and our development code.

This is perhaps one of the most important lessons that you need to take away from this course: **Always set the seeds**.
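
As a small, generic sketch of what setting the seed buys us (the variable names are made up for this example), two random number generators created with the same seed produce exactly the same numbers:

```python
import numpy as np

seed = 0

# two independent generators initialised with the same seed
first_run = np.random.default_rng(seed).normal(size=3)
second_run = np.random.default_rng(seed).normal(size=3)

# the two runs are identical, so the result is reproducible
print(np.array_equal(first_run, second_run))
```

In scikit-learn, the same idea is usually expressed through the `random_state` parameter of the estimator or function.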

Let's go ahead and load the dataset.

In [12]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [13]:
# load the train and test set with the engineered variables

# we built and saved these datasets in the previous lecture.
# If you haven't done so, go ahead and check the previous notebook
# to find out how to create these datasets

X_train = pd.read_csv('../data/processed/xtrain.csv')
X_test = pd.read_csv('../data/processed/xtest.csv')

X_train.head()

Unnamed: 0,id,postcode,primary_address,secondary_address,street,latitude,longitude,grid_ref,county,district,ward,district_code,ward_code,county_code,constituency,region,london_zone,middle_layer_super_output_area,postcode_area,postcode_district,quality,user_type,last_updated,nearest_station,distance_to_station,postcode_area.1,postcode_district.1,police_force,water_company,plus_code,average_income,sewage_company,travel_to_work_area,rural_urban,altitude,region_name,area_code,adjusted_price,type_D,type_F,type_O,type_S,type_T,land_F,new_build_Y
0,0,0.0,0.384615,0.0,0.0,0.29761,0.708851,0.0,0.95122,0.666667,0.0,0.666667,0.0,0.95122,0.0,0.777778,1.0,0.0,0.522727,0.0,0.0,0.0,0.0,0.0,0.689014,0.522727,0.0,0.947368,0.888889,0.0,0.626321,1.0,0.708333,0.666667,0.269953,0.666667,0.666667,12.926339,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0,0.0,0.846154,1.0,0.0,0.240519,0.738338,0.0,1.0,0.666667,0.0,0.666667,0.0,1.0,0.0,1.0,0.666667,0.0,0.522727,0.0,0.0,0.0,0.0,0.0,-0.375478,0.522727,0.0,1.0,0.944444,0.0,0.582384,1.0,1.0,0.666667,0.110329,0.666667,0.666667,12.416423,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0,0.0,0.461538,0.0,0.0,0.420328,0.552343,0.0,0.292683,0.666667,0.0,0.666667,0.0,0.292683,0.0,0.555556,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1.025923,0.5,0.0,0.368421,0.277778,0.0,0.591111,0.4,0.333333,0.333333,0.298122,0.666667,0.666667,12.862997,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0,0.0,0.923077,0.0,0.0,0.470684,0.716527,0.0,0.146341,0.666667,0.0,0.666667,0.0,0.146341,0.0,0.444444,1.0,0.0,0.340909,0.0,0.0,0.0,0.0,0.0,2.408421,0.340909,0.0,0.263158,0.333333,0.0,0.680413,0.4,0.375,1.0,0.028169,0.666667,0.666667,13.213782,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0,0.0,0.153846,0.0,0.0,0.443657,0.597107,0.0,0.414634,0.666667,0.0,0.666667,0.0,0.414634,0.0,0.444444,1.0,0.0,0.477273,0.0,0.0,0.0,0.0,0.0,1.169875,0.477273,0.0,0.5,0.277778,0.0,0.541677,0.4,0.458333,0.5,0.178404,0.666667,0.666667,12.209063,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [14]:
# capture the target (remember that the target is log transformed)
y_train = X_train['adjusted_price']
y_test = X_test['adjusted_price']

# drop unnecessary variables from our training and testing sets
X_train.drop(['id', 'adjusted_price'], axis=1, inplace=True)
X_test.drop(['id', 'adjusted_price'], axis=1, inplace=True)

### Feature Selection

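Let's go ahead and select a subset of the most predictive features. Remember to set the seed so that the results are reproducible. Note that in scikit-learn the `random_state` of the Lasso only has an effect when `selection='random'`; with the default settings the fit is deterministic, but setting the seed remains a good habit.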

In [15]:
# pandas and matplotlib.pyplot were imported at the top of the notebook
from sklearn.linear_model import LassoCV

# LassoCV chooses alpha by cross-validation;
# it raises a ValueError if the data contain NaN or infinite values
reg = LassoCV(random_state=0)
X = X_train
y = y_train
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
# note: this score is the R^2 on the training data
print("Best score using built-in LassoCV: %f" % reg.score(X, y))
coef = pd.Series(reg.coef_, index=X.columns)

print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")

# plot the coefficients, sorted by value
imp_coef = coef.sort_values()
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
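
The error above means that the data passed to the model contain missing or infinite values. One way to track them down is sketched below. The helper `find_non_finite_columns` and the small `demo` dataframe are hypothetical, written for this illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_non_finite_columns(df):
    """Return the numeric columns of df that contain NaN or +/-inf."""
    numeric = df.select_dtypes(include=[np.number])
    # True wherever a value is NaN, inf or -inf
    mask = ~np.isfinite(numeric)
    return numeric.columns[mask.any(axis=0).to_numpy()].tolist()


# hypothetical dataframe with one NaN and one infinite value
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [1.0, np.nan, 3.0],
    'c': [np.inf, 0.0, 1.0],
})

bad_cols = find_non_finite_columns(demo)
print(bad_cols)
```

Calling the same helper on `X_train` should list the columns that need attention in the feature engineering step before the model can be fitted.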

In [4]:
# We will do the model fitting and feature selection
# together in a few lines of code

# first, we specify the Lasso regression model, and we
# select a suitable alpha (the strength of the penalty).
# The bigger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which
# will automatically select the features whose coefficients are non-zero

# remember to set the seed, the random state in this function
sel_ = SelectFromModel(Lasso(alpha=0.005, random_state=0))

# train Lasso model and select features
sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, random_state=0))
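
To see how the choice of alpha affects the number of selected features, the sketch below repeats the selection for two penalties on synthetic data. The data, the variable names and the alpha values are assumptions made for this illustration; the numbers do not come from the house price data.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic data: 5 features, of which only the first 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# count the selected features for a weak and a stronger penalty
n_selected = {}
for alpha in [0.001, 0.1]:
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    selector.fit(X_demo, y_demo)
    n_selected[alpha] = int(selector.get_support().sum())

print(n_selected)
```

With the stronger penalty, only the informative features should survive, whereas the weak penalty may keep some of the noise features as well.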

In [5]:
# let's visualise those features that were selected.
# (selected features marked with True)

sel_.get_support()

array([False,  True,  True,  True,  True,  True, False,  True,  True,
       False, False, False,  True, False,  True,  True, False,  True,
       False,  True,  True, False, False,  True,  True, False,  True,
        True, False,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False])

In [6]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feats = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 46
selected features: 32
features with coefficients shrunk to zero: 14


In [7]:
# print the selected features
selected_feats

Index(['primary_address', 'secondary_address', 'street', 'latitude',
       'longitude', 'county', 'district', 'county_code', 'region',
       'london_zone', 'postcode_area', 'quality', 'user_type',
       'distance_to_station', 'postcode_area.1', 'police_force',
       'water_company', 'average_income', 'sewage_company',
       'travel_to_work_area', 'rural_urban', 'altitude', 'region_name',
       'type_D', 'type_F', 'type_O', 'type_S', 'type_T', 'land_F', 'land_L',
       'land_U', 'new_build_N'],
      dtype='object')

### Identify the selected variables

In [8]:
# this is an alternative way of identifying the selected features,
# based on the non-zero Lasso coefficients:

selected_feats = X_train.columns[sel_.estimator_.coef_ != 0]

selected_feats

Index(['primary_address', 'secondary_address', 'street', 'latitude',
       'longitude', 'county', 'district', 'county_code', 'region',
       'london_zone', 'postcode_area', 'quality', 'user_type',
       'distance_to_station', 'postcode_area.1', 'police_force',
       'water_company', 'average_income', 'sewage_company',
       'travel_to_work_area', 'rural_urban', 'altitude', 'region_name',
       'type_D', 'type_F', 'type_O', 'type_S', 'type_T', 'land_F', 'land_L',
       'land_U', 'new_build_N'],
      dtype='object')

In [9]:
# save the selected features, so that we can use them in the next notebook
pd.Series(selected_feats).to_csv('../data/processed/selected_features.csv', index=False)
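
As a sketch of how the saved list could be used later on, the example below writes a list of features to a temporary file in the same way, reads it back, and uses it to subset a dataframe. The feature names, the data and the temporary path are made up for this illustration.

```python
import os
import tempfile

import pandas as pd

# hypothetical selected features and data (illustrative values only)
demo_feats = pd.Index(['latitude', 'longitude'])
demo_X = pd.DataFrame({
    'latitude': [0.1, 0.2],
    'longitude': [0.3, 0.4],
    'ward': [0.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')

    # save the features as in the cell above
    pd.Series(demo_feats).to_csv(path, index=False)

    # read them back: the file has a header row and a single column
    loaded_feats = pd.read_csv(path).iloc[:, 0].tolist()

# keep only the selected features
demo_X_selected = demo_X[loaded_feats]
print(demo_X_selected.columns.tolist())
```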

That is all for this notebook. In the next video, we will go ahead and build the final model using the selected features. See you then!