# Building Machine Learning Pipelines: Feature Selection Phase

In this and the upcoming videos, we will focus on building Machine Learning pipelines that cover the full life cycle of a Data Science project. This will be especially useful for professionals who have not yet worked with large datasets.

## Project Name: House Prices: Advanced Regression Techniques

The main aim of this project is to predict house prices based on various features, which we will discuss as we go along.

#### The dataset can be downloaded from the link below
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

### The Life Cycle of a Data Science Project
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

### Contents 

1st Pipeline : House Price - Exploratory Data Analysis 

1. Data Analysis aka EDA ( Exploratory Data Analysis )
    a.   Missing Values
              i)   Explore how missing values affect our data
              ii)  Check the relationship between features with missing values and the dependent feature ( SalePrice ) using a BARPLOT
    b.   Numerical Variables
              i)   Explore numerical features in our data
              ii)  Explore Temporal Variable ( Eg : Datetime )
              iii) Types of Numerical Variables
              iv)  Distribution of Numerical Variables in our data
              v)   Log Transformation of Variable
    c.   Categorical Variables and its cardinality  
              i)   Explore categorical features in our data
              ii)  Relationship between Categorical Variables and dependent feature
    d.   Outliers 


2nd Pipeline : House Price - Feature Engineering  

2. Feature Engineering  
    a.   Handling Missing Values  
              i)   Handling Categorical Values
              ii)  Handling Numerical Values
    b.   Handling Temporal Variables  
    c.   Handling Numerical Variables  
              i)  Log Transformation of Skewed Data to Approximate a Normal Distribution
    d.   Handling Categorical Variables  
              i)  Rare Category Elimination
              ii) Conversion of Category to Numerical Values
    e.   Standardize the values of Variables  

3rd Pipeline : House Price - Feature Selection

3. Feature Selection  
    a.   Lasso and SelectFromModel to select the features

## After applying Lasso and SelectFromModel, we end up selecting only 21 of the 82 features.  
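
As a rough illustration of the idea (not the notebook's actual run), the sketch below applies Lasso and SelectFromModel to a small synthetic dataset. The column names `f0`..`f5`, the generated data, and `alpha=0.05` are all assumptions made up for this example; only `f0` and `f3` influence the target, so those are the columns we would expect the selector to keep.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for scaled training data (hypothetical columns f0..f5).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 6)),
                      columns=[f"f{i}" for i in range(6)])
# Only f0 and f3 drive the target; the remaining columns are pure noise.
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f3"] + rng.normal(0, 0.05, size=200)

# Lasso shrinks the coefficients of uninformative features to zero;
# SelectFromModel keeps the features whose coefficients survive.
selector = SelectFromModel(Lasso(alpha=0.05, random_state=0))
selector.fit(X_demo, y_demo)  # fit needs both the features and the target

# get_support() returns a boolean mask aligned with the columns.
selected = X_demo.columns[selector.get_support()]
print(list(selected))
```

Indexing the DataFrame columns with the boolean mask keeps the feature names, which is what makes it possible to apply the same selection to another dataset later.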

In [16]:
#Importing Libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

pd.set_option('display.max_columns',None)
pd.set_option("display.max_rows",None)

#For feature selection
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

In [17]:
#Importing Dataset
X_test = pd.read_csv('X_test_after_FE.csv')

In [18]:
X_test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontagenan,MasVnrAreanan,BsmtFullBathnan,BsmtHalfBathnan,GarageYrBltnan
0,1461,0.0,0.6,0.593445,0.56636,1.0,0.5,1.0,1.0,0.0,1.0,0.0,0.5,0.125,0.5,0.0,0.333333,0.444444,0.625,0.384615,0.822581,0.2,0.0,0.846154,0.866667,0.75,0.0,1.0,1.0,0.2,1.0,1.0,1.0,0.833333,0.116708,0.5,0.094364,0.126168,0.173111,0.0,1.0,1.0,1.0,0.312253,0.0,0.0,0.312253,0.0,0.0,0.25,0.0,0.333333,0.5,1.0,0.166667,1.0,0.0,0.6,0.166667,0.792994,1.0,0.2,0.490591,1.0,1.0,1.0,0.098315,0.0,0.0,0.0,0.208333,0.0,1.0,0.75,0.333333,0.0,0.454545,0.0,1.0,0.8,0.0,0.0,0.0,0.0,0.0
1,1462,0.0,0.8,0.598957,0.622527,1.0,0.5,0.0,1.0,0.0,0.0,0.0,0.5,0.25,0.5,0.0,0.333333,0.555556,0.625,0.407692,0.870968,0.6,0.0,0.923077,0.933333,0.25,0.083721,1.0,1.0,0.2,1.0,1.0,1.0,0.0,0.230175,1.0,0.0,0.18972,0.260844,0.0,1.0,1.0,1.0,0.468253,0.0,0.0,0.468253,0.0,0.0,0.25,0.5,0.5,0.5,0.5,0.25,1.0,0.0,0.6,0.166667,0.802548,1.0,0.2,0.209677,1.0,1.0,1.0,0.275983,0.048518,0.0,0.0,0.0,0.0,1.0,0.5,0.0,0.735294,0.454545,0.0,1.0,0.8,0.0,0.0,0.0,0.0,0.0
2,1463,0.235294,0.8,0.558854,0.614005,1.0,0.5,0.0,1.0,0.0,1.0,0.0,0.333333,0.25,0.5,0.0,0.666667,0.444444,0.5,0.107692,0.225806,0.2,0.0,0.846154,0.866667,0.75,0.0,1.0,1.0,0.4,0.5,1.0,1.0,0.333333,0.197257,1.0,0.0,0.064019,0.182139,0.0,0.5,1.0,1.0,0.326139,0.376477,0.0,0.548792,0.0,0.0,0.5,0.5,0.5,0.5,1.0,0.25,1.0,0.25,1.0,0.166667,0.678344,0.0,0.4,0.323925,1.0,1.0,1.0,0.148876,0.045822,0.0,0.0,0.0,0.0,1.0,0.75,0.333333,0.0,0.181818,0.0,1.0,0.8,0.0,0.0,0.0,0.0,0.0
3,1464,0.235294,0.8,0.582212,0.524583,1.0,0.5,0.0,1.0,0.0,1.0,0.0,0.333333,0.25,0.5,0.0,0.666667,0.555556,0.625,0.1,0.225806,0.2,0.0,0.846154,0.866667,0.25,0.015504,1.0,1.0,0.4,1.0,1.0,1.0,0.333333,0.150125,1.0,0.0,0.151402,0.181747,0.0,0.0,1.0,1.0,0.325285,0.364125,0.0,0.542672,0.0,0.0,0.5,0.5,0.5,0.5,0.5,0.333333,1.0,0.25,0.4,0.166667,0.675159,0.0,0.4,0.31586,1.0,1.0,1.0,0.252809,0.048518,0.0,0.0,0.0,0.0,1.0,0.5,0.333333,0.0,0.454545,0.0,1.0,0.8,0.0,0.0,0.0,0.0,0.0
4,1465,0.588235,0.8,0.317987,0.335596,1.0,0.5,0.0,0.333333,0.0,1.0,0.0,0.916667,0.25,0.5,1.0,0.333333,0.777778,0.5,0.146154,0.322581,0.2,0.0,0.461538,0.4,0.75,0.0,0.666667,1.0,0.4,0.5,1.0,1.0,0.0,0.065586,1.0,0.0,0.475234,0.251227,0.0,0.0,1.0,1.0,0.453388,0.0,0.0,0.453388,0.0,0.0,0.5,0.0,0.333333,0.5,0.5,0.166667,1.0,0.0,0.6,0.166667,0.694268,0.666667,0.4,0.340054,1.0,1.0,1.0,0.0,0.110512,0.0,0.0,0.25,0.0,1.0,0.5,0.333333,0.0,0.0,0.0,1.0,0.8,0.0,0.0,0.0,0.0,0.0


In [19]:
fea = ['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'YearRemodAdd',
       'RoofStyle', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive',
       'SaleCondition']
fea

['MSSubClass',
 'MSZoning',
 'Neighborhood',
 'OverallQual',
 'YearRemodAdd',
 'RoofStyle',
 'BsmtQual',
 'BsmtExposure',
 'HeatingQC',
 'CentralAir',
 '1stFlrSF',
 'GrLivArea',
 'BsmtFullBath',
 'KitchenQual',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageCars',
 'PavedDrive',
 'SaleCondition']

In [20]:
X_test = X_test[fea]
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,Neighborhood,OverallQual,YearRemodAdd,RoofStyle,BsmtQual,BsmtExposure,HeatingQC,CentralAir,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,PavedDrive,SaleCondition
0,0.0,0.6,0.5,0.444444,0.822581,0.2,1.0,1.0,1.0,1.0,0.312253,0.312253,0.0,1.0,0.0,0.6,0.166667,1.0,0.2,1.0,0.8
1,0.0,0.8,0.5,0.555556,0.870968,0.6,1.0,1.0,1.0,1.0,0.468253,0.468253,0.0,0.5,0.0,0.6,0.166667,1.0,0.2,1.0,0.8
2,0.235294,0.8,0.333333,0.444444,0.225806,0.2,0.5,1.0,0.5,1.0,0.326139,0.548792,0.0,1.0,0.25,1.0,0.166667,0.0,0.4,1.0,0.8
3,0.235294,0.8,0.333333,0.555556,0.225806,0.2,1.0,1.0,0.0,1.0,0.325285,0.542672,0.0,0.5,0.25,0.4,0.166667,0.0,0.4,1.0,0.8
4,0.588235,0.8,0.916667,0.777778,0.322581,0.2,0.5,1.0,0.0,1.0,0.453388,0.453388,0.0,0.5,0.0,0.6,0.166667,0.666667,0.4,1.0,0.8


In [21]:
X_test.to_csv("X_test_after_Feature_Selection.csv",index=False)
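
Typing the feature list by hand works, but it can drift out of sync with the training notebook. One possible alternative, sketched below under assumptions, is to write the selected names to a small file when the training data is processed and read them back here. The file name `selected_features.json`, the temporary directory, the shortened feature list, and the tiny `demo_test` frame are all hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

# Shortened, made-up stand-in for the Index produced on the training data.
selected_feature = pd.Index(["OverallQual", "GrLivArea", "GarageCars"])

# Training side: persist the names (hypothetical file name and location).
path = os.path.join(tempfile.mkdtemp(), "selected_features.json")
with open(path, "w") as f:
    json.dump(list(selected_feature), f)

# Test side: load the names and keep only those columns, in the same order.
with open(path) as f:
    fea = json.load(f)

# Made-up frame standing in for the engineered test set.
demo_test = pd.DataFrame({
    "Id": [1461, 1462],
    "GrLivArea": [0.31, 0.47],
    "OverallQual": [0.44, 0.56],
    "LotArea": [0.57, 0.62],
    "GarageCars": [0.2, 0.2],
})
demo_test = demo_test[fea]  # raises KeyError if a selected column is missing
print(demo_test.columns.tolist())
```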

In [10]:
# #Apply Feature Selection
# #Using a Lasso Regression model, we need to choose alpha ( the penalty strength ). Bigger alpha -> fewer features are selected
# #After that, we use SelectFromModel, which keeps the features whose coefficients are non-zero
# #Note: the selector must be fitted on the training data together with its target; calling fit(X_test) without y raises "ValueError: y cannot be None"

# feature_sel_model = SelectFromModel(Lasso(alpha=0.005,random_state=0))
# feature_sel_model.fit(X_train, y_train)
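
The effect of alpha can be checked empirically. The sketch below uses synthetic data, with made-up sizes, coefficients, and alpha values, and counts how many features survive at each setting. It is only meant to show the general trend that a stronger penalty leaves fewer features; it does not reproduce the house-price numbers.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, of which only the first four matter (assumed values).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(300, 10))
true_coef = np.array([5.0, 3.0, 1.0, 0.2, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(0, 0.1, size=300)

# Fit one selector per alpha and record how many features it keeps.
counts = {}
for alpha in (0.001, 0.05, 1.0):
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    selector.fit(X_demo, y_demo)
    counts[alpha] = int(selector.get_support().sum())

print(counts)
```

With a very large alpha every coefficient is pushed to zero and nothing is selected, so in practice alpha is tuned, for example by cross-validation, rather than fixed in advance.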

In [17]:
# feature_sel_model.get_support()

array([ True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True, False,
       False,  True,  True, False, False, False, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False, False, False,  True, False,
       False,  True,  True,  True, False,  True,  True, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False])

In [19]:
# #Let's see how many features are selected

# selected_feature = X_train.columns[(feature_sel_model.get_support())]
# selected_feature

Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'YearRemodAdd',
       'RoofStyle', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive',
       'SaleCondition'],
      dtype='object')

In [26]:
# print("\n Total Features : {}".format((X_train.shape[1])))
# print("\n Selected Features : {}".format(len(selected_feature)))
# print("\n Features with coefficients shrunk to zero : {}".format(np.sum(feature_sel_model.estimator_.coef_ == 0)))


 Total Features : 82

 Selected Features : 21

 Features with coefficients shrunk to zero : 61
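
The three numbers above add up (21 + 61 = 82) because, for an L1-penalised estimator such as Lasso, SelectFromModel by default drops features whose absolute coefficient is below 1e-5, which in practice means the coefficients Lasso set to zero. A minimal check of that bookkeeping on synthetic data (the data and `alpha=0.05` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(0, 0.05, size=200)

selector = SelectFromModel(Lasso(alpha=0.05, random_state=0)).fit(X_demo, y_demo)

n_total = X_demo.shape[1]
n_selected = int(selector.get_support().sum())
# estimator_ is the fitted Lasso; coef_ holds one coefficient per feature.
n_zero = int(np.sum(selector.estimator_.coef_ == 0))

print(n_total, n_selected, n_zero)
```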


In [27]:
# selected_feature

Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'YearRemodAdd',
       'RoofStyle', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive',
       'SaleCondition'],
      dtype='object')

In [28]:
# X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,0.235294,0.75,0.418208,0.366344,1,1.0,0.0,0.333333,1,0.0,0.0,0.636364,0.4,1,0.75,1.0,0.666667,0.5,0.036765,0.098361,0.0,0,1.0,1.0,0.5,0.1225,0.666667,1.0,1.0,0.75,0.75,0.25,1.0,0.125089,0.833333,0.0,0.064212,0.140098,1.0,1.0,1,1.0,0.356155,0.413559,0.0,0.577712,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.5,1.0,0.0,0.2,0.8,0.046729,0.666667,0.5,0.38646,0.666667,1.0,1.0,0.0,0.111517,0.0,0.0,0.0,0.0,0,1.0,1.0,0.0,0.090909,0,0.666667,0.75,0,0,0
1,0.0,0.75,0.495064,0.391317,1,1.0,0.0,0.333333,1,0.5,0.0,0.5,0.2,1,0.75,0.6,0.555556,0.875,0.227941,0.52459,0.0,0,0.4,0.3,0.25,0.0,0.333333,1.0,0.5,0.75,0.75,1.0,0.666667,0.173281,0.833333,0.0,0.121575,0.206547,1.0,1.0,1,1.0,0.503056,0.0,0.0,0.470245,0.0,0.5,0.666667,0.0,0.375,0.333333,0.333333,0.333333,1.0,0.333333,0.6,0.8,0.28972,0.666667,0.5,0.324401,0.666667,1.0,1.0,0.347725,0.0,0.0,0.0,0.0,0.0,0,1.0,1.0,0.0,0.363636,0,0.666667,0.75,0,0,0
2,0.235294,0.75,0.434909,0.422359,1,1.0,0.333333,0.333333,1,0.0,0.0,0.636364,0.4,1,0.75,1.0,0.666667,0.5,0.051471,0.114754,0.0,0,1.0,1.0,0.5,0.10125,0.666667,1.0,1.0,0.75,0.75,0.5,1.0,0.086109,0.833333,0.0,0.185788,0.150573,1.0,1.0,1,1.0,0.383441,0.41937,0.0,0.593095,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.333333,1.0,0.333333,0.6,0.8,0.065421,0.666667,0.5,0.428773,0.666667,1.0,1.0,0.0,0.076782,0.0,0.0,0.0,0.0,0,1.0,1.0,0.0,0.727273,0,0.666667,0.75,0,0,0
3,0.294118,0.75,0.388581,0.390295,1,1.0,0.333333,0.333333,1,0.25,0.0,0.727273,0.4,1,0.75,1.0,0.666667,0.5,0.669118,0.606557,0.0,0,0.2,0.4,0.25,0.0,0.333333,1.0,0.25,0.5,1.0,0.25,0.666667,0.038271,0.833333,0.0,0.231164,0.123732,1.0,0.75,1,1.0,0.399941,0.366102,0.0,0.579157,0.333333,0.0,0.333333,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.333333,0.8,0.4,0.074766,0.333333,0.75,0.45275,0.666667,1.0,1.0,0.0,0.063985,0.492754,0.0,0.0,0.0,0,1.0,1.0,0.0,0.090909,0,0.666667,0.0,0,0,0
4,0.235294,0.75,0.513123,0.468761,1,1.0,0.333333,0.333333,1,0.5,0.0,1.0,0.4,1,0.75,1.0,0.777778,0.5,0.058824,0.147541,0.0,0,1.0,1.0,0.5,0.21875,0.666667,1.0,1.0,0.75,0.75,0.75,1.0,0.116052,0.833333,0.0,0.20976,0.187398,1.0,1.0,1,1.0,0.466237,0.509927,0.0,0.666523,0.333333,0.0,0.666667,0.5,0.5,0.333333,0.666667,0.583333,1.0,0.333333,0.6,0.8,0.074766,0.666667,0.75,0.589563,0.666667,1.0,1.0,0.224037,0.153565,0.0,0.0,0.0,0.0,0,1.0,1.0,0.0,1.0,0,0.666667,0.75,0,0,0


In [29]:
# X_train.shape

(1460, 82)

In [30]:
# X_train = X_train[selected_feature]

In [31]:
# X_train.shape

(1460, 21)

In [32]:
# X_train.head()

Unnamed: 0,MSSubClass,MSZoning,Neighborhood,OverallQual,YearRemodAdd,RoofStyle,BsmtQual,BsmtExposure,HeatingQC,CentralAir,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,PavedDrive,SaleCondition
0,0.235294,0.75,0.636364,0.666667,0.098361,0.0,0.75,0.25,1.0,1,0.356155,0.577712,0.333333,0.666667,0.0,0.2,0.8,0.666667,0.5,1.0,0.75
1,0.0,0.75,0.5,0.555556,0.52459,0.0,0.75,1.0,1.0,1,0.503056,0.470245,0.0,0.333333,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
2,0.235294,0.75,0.636364,0.666667,0.114754,0.0,0.75,0.5,1.0,1,0.383441,0.593095,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
3,0.294118,0.75,0.727273,0.666667,0.606557,0.0,0.5,0.25,0.75,1,0.399941,0.579157,0.333333,0.666667,0.333333,0.8,0.4,0.333333,0.75,1.0,0.0
4,0.235294,0.75,1.0,0.777778,0.147541,0.0,0.75,0.75,1.0,1,0.466237,0.666523,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.75,1.0,0.75


In [33]:
# X_train.to_csv("X_train_after_Feature_Selection.csv",index=False)

In [34]:
# y_train.to_csv("y_train_after_Feature_Selection.csv",index=False)

### Now we move on to the Model Building Phase