# Feature Selection on the 111011001 dataset

In this notebook we examine the effect of dropping subsets of features that are linearly dependent or highly correlated. We work specifically with the dataset formed by dropping (31, 496, 524, 692, 917, 1299), since that gave the third-lowest validation error.
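As a rough illustration of what "linearly dependent" means here, the sketch below flags columns that add no rank to the columns before them. The helper name `dependent_columns`, the tolerance, and the toy columns are assumptions made for this example; the notebook itself does not run this check.

```python
import numpy as np

def dependent_columns(X, names, tol=1e-8):
    """Greedily flag columns that lie in the span of the columns kept before them."""
    kept, flagged = [], []
    for j, name in enumerate(names):
        trial = X[:, kept + [j]]
        # A column is independent only if it raises the rank by one
        if np.linalg.matrix_rank(trial, tol=tol) == len(kept) + 1:
            kept.append(j)
        else:
            flagged.append(name)
    return flagged

# Toy data (not the housing set): 'total' is exactly a + b
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X_toy = np.column_stack([a, b, a + b])
flagged = dependent_columns(X_toy, ['a', 'b', 'total'])
print(flagged)
```

The result depends on column order: whichever member of a dependent group comes last is the one flagged, so this only says that a redundancy exists, not which feature is best to drop.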

In [1]:
import itertools
import numpy as np
import pandas as pd

pd.set_option('display.precision',20)
pd.set_option('display.max_colwidth',100)

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_predict, KFold, cross_val_score, \
                                    GridSearchCV, RandomizedSearchCV, ShuffleSplit 
from time import time
from scipy.stats import randint as sp_randint

import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from matplotlib import pyplot
rcParams['figure.figsize'] = 12, 4
%matplotlib inline

In [2]:
# Root-mean-squared error, used to compare goodness of fit between models
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [3]:
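# Cross-validation sets
# (random_state only matters with shuffle=True; recent scikit-learn raises an error if it is set without shuffling)
kfold = KFold(n_splits=10)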

# We are using LassoLarsCV as part of our metric
lr = linear_model.LassoLarsCV(verbose=False, max_iter=5000,precompute='auto', cv=kfold, max_n_alphas=1000, n_jobs=-1)

In [4]:
df = pd.read_csv("./input/train_tidy_111011001.csv")

In [5]:
ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

In [6]:
X = df.values

In [7]:
for train_idx, validation_idx in ss.split(X):
    train_df = df.iloc[train_idx]
    validation_df = df.iloc[validation_idx]

We will establish a baseline by keeping all features.

In [8]:
y_validation = validation_df['SalePrice'].values
x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
y_train = train_df['SalePrice'].values
x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
lr.fit(x_train, y_train)
y_pred = lr.predict(x_validation)
baseline = rmse(y_validation, y_pred)
baseline



0.11777391463448175

## Features

We have a collection of features. Some were identified as potentially weak predictors of the response, and others are known to be highly correlated with other features. We want to identify subsets of features that we can drop to improve the regression.

In [9]:
drop_cands = [
    'LotFrontage', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF',
    'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AllSizesSumLin', 'AreasSum',
    'X1stFlrSF','X1stLin', 'X2ndFlrSF', 'X2ndLin',
    'TotalBath', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'Age', 'AgeLin', 'RemodAgeLin','RemodAge',
    'MasVnrArea', 'MasVnrAreaLin',
    'DeckPorchLin','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch'
]

In [10]:
corr_df = df[drop_cands].corr()

In [11]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotFrontage,LotArea,BsmtUnfSF,LowQualFinSF,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,...,RemodAgeLin,RemodAge,MasVnrArea,MasVnrAreaLin,DeckPorchLin,WoodDeckSF,OpenPorchSF,EnclosedPorch,X3SsnPorch,ScreenPorch
LotFrontage,1.0,,,,,,,,,,...,,,,,,,,,,
LotArea,,1.0,,,,,,,,0.985989837216064,...,,,,,,,,,,
BsmtUnfSF,,,1.0,,,,,,,,...,,,,,,,,,,
LowQualFinSF,,,,1.0,,,,,,,...,,,,,,,,,,
LogGrLivArea,,,,,1.0,0.974125055753377,0.8621269929181344,0.7668297864445509,0.9206549563844088,,...,,,,,,,,,,
GrLivArea,,,,,0.974125055753377,1.0,0.8320939400208281,0.76709010302273,0.9131568449887744,,...,,,,,,,,,,
TotalHouseArea,,,,,0.8621269929181344,0.8320939400208281,1.0,0.8038653080395538,0.8767247150827902,,...,,,,,,,,,,
LivArea,,,,,0.7668297864445509,0.76709010302273,0.8038653080395538,1.0,0.9460434912760392,,...,,,,,,,,,,
LivAreaWt,,,,,0.9206549563844088,0.9131568449887744,0.8767247150827902,0.9460434912760392,1.0,,...,,,,,,,,,,
AllSizesSum,,0.985989837216064,,,,,,,,1.0,...,,,,,,,,,,


We'll restrict our drop set to the highly correlated features to make this more readable.

In [12]:
drop_cands = [
    'LotArea', 'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AreasSum',
    'X1stFlrSF','X1stLin', 'TotalBath', 'FullBath', 'MasVnrArea', 'MasVnrAreaLin'
]

In [13]:
corr_df = df[drop_cands].corr()

In [14]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotArea,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,AreasSum,X1stFlrSF,X1stLin,TotalBath,FullBath,MasVnrArea,MasVnrAreaLin
LotArea,1.0,,,,,,0.985989837216064,,,,,,,
LogGrLivArea,,1.0,0.974125055753377,0.8621269929181344,0.7668297864445509,0.9206549563844088,,0.8337296055971176,,,,,,
GrLivArea,,0.974125055753377,1.0,0.8320939400208281,0.76709010302273,0.9131568449887744,,0.8200370464656889,,,,,,
TotalHouseArea,,0.8621269929181344,0.8320939400208281,1.0,0.8038653080395538,0.8767247150827902,,0.9313600210017792,,,,,,
LivArea,,0.7668297864445509,0.76709010302273,0.8038653080395538,1.0,0.9460434912760392,,0.7967334784991007,,,,,,
LivAreaWt,,0.9206549563844088,0.9131568449887744,0.8767247150827902,0.9460434912760392,1.0,,0.8629419501491828,,,,,,
AllSizesSum,0.985989837216064,,,,,,1.0,,,,,,,
AreasSum,,0.8337296055971176,0.8200370464656889,0.9313600210017792,0.7967334784991007,0.8629419501491828,,1.0,,,,,,
X1stFlrSF,,,,,,,,,1.0,0.9783148221289164,,,,
X1stLin,,,,,,,,,0.9783148221289164,1.0,,,,
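Reading the pairs off a masked correlation matrix gets tedious as the candidate list grows. A minimal sketch of listing them programmatically follows; the helper name `correlated_pairs` and the toy data are assumptions for illustration. It uses the same one-sided `> 0.75` cutoff as the cells above, so strong negative correlations are ignored here too.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.75):
    """Return (feature_a, feature_b, correlation) for each pair above threshold."""
    corr = frame.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    strong = stacked[stacked > threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Toy data (not the housing set): 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.01 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the notebook's data the call would be `correlated_pairs(df[drop_cands])`, assuming `df` and `drop_cands` are defined as in the cells above.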


Let's first compare the pairs.

In [15]:
drop_cands = ['LotArea', 'AllSizesSum']

In [16]:
# Refit on every subset of drop_cands (including the empty set) and record the validation RMSE
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [17]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
2,"('AllSizesSum',)",0.117766572206909,-7.3424275726e-06
1,"('LotArea',)",0.1177734412565202,-4.733779615e-07
0,(),0.1177739146344817,0.0
3,"('LotArea', 'AllSizesSum')",0.1178853206825503,0.0001114060480685


We should keep these features: dropping either one alone changes the RMSE by less than 1e-5, and dropping both makes it worse.
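The same subset sweep is repeated for each group of candidates in the cells that follow. One way to avoid the copy-paste is a small helper, sketched below. The names `sweep_drops` and `LeastSquares` are hypothetical, and the plain least-squares model is only a stand-in so the example runs without scikit-learn; the notebook fits `LassoLarsCV` instead.

```python
import itertools
import numpy as np
import pandas as pd

class LeastSquares:
    """Stand-in model exposing fit/predict (the notebook uses LassoLarsCV)."""
    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), X])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.coef_

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def sweep_drops(model, train_df, validation_df, drop_cands, always_drop, target):
    """Fit once per subset of drop_cands and return validation RMSEs, best first."""
    y_train = train_df[target].values
    y_validation = validation_df[target].values
    rows = []
    for size in range(len(drop_cands) + 1):
        for subset in itertools.combinations(drop_cands, size):
            cols = list(always_drop) + list(subset)
            model.fit(train_df.drop(cols, axis=1).values, y_train)
            y_pred = model.predict(validation_df.drop(cols, axis=1).values)
            rows.append({'Dropped': str(subset), 'RMSE': rmse(y_validation, y_pred)})
    results = pd.DataFrame(rows)
    # The empty subset is always row 0, so it serves as the baseline
    results['Diff from Base'] = results['RMSE'] - results.loc[0, 'RMSE']
    return results.sort_values('RMSE')

# Toy data (not the housing set): x1_copy nearly duplicates x1
rng = np.random.default_rng(1)
n = 120
toy = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
toy['x1_copy'] = toy['x1'] + 0.01 * rng.normal(size=n)
toy['y'] = 2 * toy['x1'] - toy['x2'] + 0.1 * rng.normal(size=n)
train, valid = toy.iloc[:90], toy.iloc[90:]

out = sweep_drops(LeastSquares(), train, valid, ['x1', 'x1_copy'],
                  always_drop=['y'], target='y')
print(out)
```

With the notebook's objects, the equivalent call would presumably be `sweep_drops(lr, train_df, validation_df, drop_cands, ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'], 'SalePrice')`. Collecting rows in a list and building the frame once also avoids growing a DataFrame cell by cell.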

In [18]:
drop_cands = ['X1stFlrSF','X1stLin']

In [19]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [20]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
1,"('X1stFlrSF',)",0.1177734111425147,-5.03491967e-07
0,(),0.1177739146344817,0.0
3,"('X1stFlrSF', 'X1stLin')",0.1179182861621037,0.000144371527622
2,"('X1stLin',)",0.1180205787928044,0.0002466641583227


In [21]:
drop_cands = ['TotalBath', 'FullBath']

In [22]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [23]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
2,"('FullBath',)",0.1177734412565203,-4.733779614e-07
0,(),0.1177739146344817,0.0
3,"('TotalBath', 'FullBath')",0.1184471042653967,0.000673189630915
1,"('TotalBath',)",0.1184586643814084,0.0006847497469266


Dropping FullBath alone lowers the RMSE by only about 5e-07, and any subset that drops TotalBath raises it. That gain isn't big enough to justify dropping a predictor that we'd consider strong by hedonic reasoning, so we keep both.

In [24]:
drop_cands = ['MasVnrArea', 'MasVnrAreaLin']

In [25]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [26]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
1,"('MasVnrArea',)",0.1177464575505045,-2.74570839772e-05
2,"('MasVnrAreaLin',)",0.1177734111425149,-5.034919668e-07
0,(),0.1177739146344817,0.0
3,"('MasVnrArea', 'MasVnrAreaLin')",0.1178189087455109,4.49941110292e-05


In [27]:
drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AreasSum']

In [28]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [29]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
44,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'AreasSum')",0.11584740756896885761,-0.00192650706551289463
58,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'AreasSum')",0.11609183723881348616,-0.00168207739566826608
46,"('LogGrLivArea', 'GrLivArea', 'LivArea', 'AreasSum')",0.11622059444764813729,-0.00155332018683361495
59,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum')",0.11623034092542024187,-0.00154357370906151037
25,"('LogGrLivArea', 'GrLivArea', 'AreasSum')",0.11623505029948129341,-0.00153886433500045883
47,"('LogGrLivArea', 'GrLivArea', 'LivAreaWt', 'AreasSum')",0.11633757429912623682,-0.00143634033535551542
15,"('GrLivArea', 'AreasSum')",0.11659159189640913579,-0.00118232273807261645
36,"('GrLivArea', 'LivArea', 'AreasSum')",0.11659578339835725835,-0.00117813123612449389
50,"('LogGrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum')",0.11668531058552206181,-0.00108860404895969043
18,"('TotalHouseArea', 'AreasSum')",0.11669604460212060215,-0.00107787003236115009
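The row labels in the table above are positions in the order that `itertools.combinations` enumerates the subsets (smallest subsets first). With six candidates there are 2^6 = 64 subsets, so this cell refits the model 64 times, and only the ten best rows are shown. The short sketch below reproduces the enumeration on its own to show how a label maps back to a subset.

```python
import itertools

drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea',
              'LivArea', 'LivAreaWt', 'AreasSum']

# Same nesting as the sweep: subset size first, then combinations of that size
subsets = [
    subset
    for size in range(len(drop_cands) + 1)
    for subset in itertools.combinations(drop_cands, size)
]

print(len(subsets))   # number of model fits in the sweep
print(subsets[44])    # subset behind the best row in the table
print(subsets[15])
```

Because the count doubles with every added candidate, an exhaustive sweep like this stays practical only for small groups.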
