# Feature Selection on the 111001001 dataset

In this notebook we examine the effect of dropping subsets of features that are linearly dependent or strongly correlated with one another. We work specifically with the dataset formed by dropping the points (31, 496, 524, 917, 1299), since that choice resulted in the lowest validation error.

In [6]:
import itertools
import numpy as np
import pandas as pd

pd.set_option('display.precision',20)
pd.set_option('display.max_colwidth',100)

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_predict, KFold, cross_val_score, \
                                    GridSearchCV, RandomizedSearchCV, ShuffleSplit 
from time import time
from scipy.stats import randint as sp_randint

import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from matplotlib import pyplot
rcParams['figure.figsize'] = 12, 4
%matplotlib inline

In [2]:
# Root-mean-squared error, used to compare goodness of fit between models
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
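
As a quick sanity check on the metric, the sketch below computes the same quantity directly with NumPy on made-up numbers. The name `rmse_numpy` and the toy arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Root-mean-squared error computed directly: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values (made up): the errors are 2 and 0, so MSE = (4 + 0) / 2 = 2.
toy_true = [3.0, 5.0]
toy_pred = [1.0, 5.0]
toy_rmse = rmse_numpy(toy_true, toy_pred)  # equals sqrt(2)
print(toy_rmse)
```

Because the squared errors are averaged before the square root, a single large miss moves this metric more than several small ones.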

In [11]:
# Cross-validation sets
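# random_state has no effect unless shuffle=True (recent scikit-learn versions
# raise an error for this combination), so it is omitted; the folds stay unshuffled.
kfold = KFold(n_splits=10)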

# We are using LassoLarsCV as part of our metric
lr = linear_model.LassoLarsCV(verbose=False, max_iter=5000,precompute='auto', cv=kfold, max_n_alphas=1000, n_jobs=-1)

In [4]:
df = pd.read_csv("./input/train_tidy_111001001.csv")

In [7]:
ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

In [8]:
X = df.values

In [9]:
for train_idx, validation_idx in ss.split(X):
    train_df = df.iloc[train_idx]
    validation_df = df.iloc[validation_idx]
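
To make the hold-out split concrete, here is a small sketch on ten made-up rows. It assumes scikit-learn is installed and uses a toy index array (`toy_rows`, a hypothetical name) in place of `df`. With `test_size=0.20`, the single split reserves two of the ten rows for validation and leaves eight for training.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten made-up row positions standing in for the real data frame.
toy_rows = np.arange(10)

# Same settings as the notebook: one shuffled split, 20% held out.
splitter = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

# With n_splits=1 there is exactly one (train, validation) pair to take.
toy_train_idx, toy_validation_idx = next(splitter.split(toy_rows))

print(len(toy_train_idx), len(toy_validation_idx))
# The two index sets never overlap.
print(set(toy_train_idx) & set(toy_validation_idx))
```

Using `next(...)` is equivalent to the `for` loop above when there is only one split; the loop form simply keeps the indices from its only iteration.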

We will establish a baseline by keeping all features.

In [13]:
y_validation = validation_df['SalePrice'].values
x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
y_train = train_df['SalePrice'].values
x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
lr.fit(x_train, y_train)
y_pred = lr.predict(x_validation)
baseline = rmse(y_validation, y_pred)
baseline



0.10890203559597167

This RMSE of 0.10890203559597167 is noticeably better, in relative terms, than both the baseline of 0.12503266782864195 over the whole dataset and the RMSE of 0.11751015503521701488 that we obtained by merely dropping points from that dataset. The reason is that the R code drops the outliers before preprocessing, so several engineered features are refit to this dataset.
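
To put "relatively speaking" into numbers, the following sketch computes the relative reduction in RMSE from the three values quoted above. The variable names and the helper `relative_reduction` are illustrative assumptions; only the three RMSE values come from the notebook.

```python
# RMSE values quoted in the text above.
rmse_this_dataset = 0.10890203559597167       # baseline in this notebook
rmse_whole_dataset = 0.12503266782864195      # baseline over the whole dataset
rmse_points_dropped = 0.11751015503521701488  # points dropped, features not refit

def relative_reduction(reference, new):
    """Fraction by which `new` is lower than `reference`."""
    return (reference - new) / reference

vs_whole = relative_reduction(rmse_whole_dataset, rmse_this_dataset)
vs_points_dropped = relative_reduction(rmse_points_dropped, rmse_this_dataset)

print(f"vs whole dataset:  {vs_whole:.1%}")           # roughly 12.9%
print(f"vs points dropped: {vs_points_dropped:.1%}")  # roughly 7.3%
```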

## Features

We have a collection of features, some of which were identified as potentially weak predictors of the response, while others are known to be highly correlated with other features. We want to identify subsets of features that we can drop to improve the regression.

In [14]:
drop_cands = [
    'LotFrontage', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF',
    'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AllSizesSumLin', 'AreasSum',
    'X1stFlrSF','X1stLin', 'X2ndFlrSF', 'X2ndLin',
    'TotalBath', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'Age', 'AgeLin', 'RemodAgeLin','RemodAge',
    'MasVnrArea', 'MasVnrAreaLin',
    'DeckPorchLin','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch'
]

In [17]:
corr_df = df[drop_cands].corr()

In [19]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotFrontage,LotArea,BsmtUnfSF,LowQualFinSF,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,...,RemodAgeLin,RemodAge,MasVnrArea,MasVnrAreaLin,DeckPorchLin,WoodDeckSF,OpenPorchSF,EnclosedPorch,X3SsnPorch,ScreenPorch
LotFrontage,1.0,,,,,,,,,,...,,,,,,,,,,
LotArea,,1.0,,,,,,,,0.9857428216413608,...,,,,,,,,,,
BsmtUnfSF,,,1.0,,,,,,,,...,,,,,,,,,,
LowQualFinSF,,,,1.0,,,,,,,...,,,,,,,,,,
LogGrLivArea,,,,,1.0,0.9728471135046108,0.863172565804008,0.7685355486502758,0.9208469956238144,,...,,,,,,,,,,
GrLivArea,,,,,0.9728471135046108,1.0,0.8327416426526644,0.7699351845077972,0.9141842787267807,,...,,,,,,,,,,
TotalHouseArea,,,,,0.863172565804008,0.8327416426526644,1.0,0.8050487642385311,0.8772149199222812,,...,,,,,,,,,,
LivArea,,,,,0.7685355486502758,0.7699351845077972,0.8050487642385311,1.0,0.946727520954228,,...,,,,,,,,,,
LivAreaWt,,,,,0.9208469956238144,0.9141842787267807,0.8772149199222812,0.946727520954228,1.0,,...,,,,,,,,,,
AllSizesSum,,0.9857428216413608,,,,,,,,1.0,...,,,,,,,,,,
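
The masked matrix above shows every pair twice and also includes the diagonal. One possible alternative view, sketched here on a made-up frame, lists each pair once. The helper name `high_corr_pairs` and the toy columns are assumptions for illustration. Like the cell above, it thresholds the signed correlation, so strongly negative pairs are not reported.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.75):
    """Return (col_a, col_b, r) for each column pair with correlation > threshold."""
    corr = frame.corr()
    # Boolean mask for the strict upper triangle: each pair once, no diagonal.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack().dropna()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, r) for (a, b), r in pairs.items()]

# Made-up data: b is an exact multiple of a, while c is uncorrelated with both.
toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [1.0, -1.0, 1.0, -1.0, 1.0],
})
toy_pairs = high_corr_pairs(toy_df)
print(toy_pairs)
```

In the notebook this could be called as `high_corr_pairs(df[drop_cands])` to get a compact list of the groups examined below.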


We'll restrict our drop set to the highly correlated features to make this more readable.

In [29]:
drop_cands = [
    'LotArea', 'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AreasSum',
    'X1stFlrSF','X1stLin', 'TotalBath', 'FullBath', 'MasVnrArea', 'MasVnrAreaLin'
]

In [30]:
corr_df = df[drop_cands].corr()

In [31]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotArea,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,AreasSum,X1stFlrSF,X1stLin,TotalBath,FullBath,MasVnrArea,MasVnrAreaLin
LotArea,1.0,,,,,,0.9857428216413608,,,,,,,
LogGrLivArea,,1.0,0.9728471135046108,0.863172565804008,0.7685355486502758,0.9208469956238144,,0.834771823729551,,,,,,
GrLivArea,,0.9728471135046108,1.0,0.8327416426526644,0.7699351845077972,0.9141842787267807,,0.8211232829867705,,,,,,
TotalHouseArea,,0.863172565804008,0.8327416426526644,1.0,0.8050487642385311,0.8772149199222812,,0.9316843070445244,,,,,,
LivArea,,0.7685355486502758,0.7699351845077972,0.8050487642385311,1.0,0.946727520954228,,0.798434854896966,,,,,,
LivAreaWt,,0.9208469956238144,0.9141842787267807,0.8772149199222812,0.946727520954228,1.0,,0.8639227146638802,,,,,,
AllSizesSum,0.9857428216413608,,,,,,1.0,,,,,,,
AreasSum,,0.834771823729551,0.8211232829867705,0.9316843070445244,0.798434854896966,0.8639227146638802,,1.0,,,,,,
X1stFlrSF,,,,,,,,,1.0,0.9781566931924364,,,,
X1stLin,,,,,,,,,0.9781566931924364,1.0,,,,


Let's first compare the pairs.

In [32]:
drop_cands = ['LotArea', 'AllSizesSum']

In [36]:
# Refit the model for every subset of drop_cands (including the empty subset)
# and record the validation RMSE and its difference from the baseline.
always_drop = ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin']
rows = []
for L in range(len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = always_drop + list(subset)
        x_train = train_df.drop(drop_cols, axis=1).values
        x_validation = validation_df.drop(drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        rows.append({'Dropped': str(subset), 'RMSE': error, 'Diff from Base': error - baseline})
col_drop_results_df = pd.DataFrame(rows, columns=['Dropped', 'RMSE', 'Diff from Base'])
output_df = col_drop_results_df.sort_values(['RMSE'])



In [37]:
output_df

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 0 | () | 0.1089020355959716 | 0.0 |
| 2 | ('AllSizesSum',) | 0.1090726115501068 | 0.0001705759541352 |
| 3 | ('LotArea', 'AllSizesSum') | 0.1092386087374811 | 0.0003365731415094 |
| 1 | ('LotArea',) | 0.1094126320562572 | 0.0005105964602855 |


Dropping either feature, or both, raises the validation RMSE, so we should keep these features.
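
The nested loop above visits every subset of the candidates, from the empty set up to the full list. A minimal sketch of that enumeration on its own (the helper name `all_subsets` is an assumption for illustration) shows the order in which subsets are generated and how quickly their number grows:

```python
import itertools

def all_subsets(candidates):
    """Every subset of `candidates`, ordered by size, as tuples."""
    return [
        subset
        for size in range(len(candidates) + 1)
        for subset in itertools.combinations(candidates, size)
    ]

pair_subsets = all_subsets(['LotArea', 'AllSizesSum'])
for position, subset in enumerate(pair_subsets):
    print(position, subset)

# n candidates give 2**n subsets, so each extra candidate doubles the model fits.
six_candidates = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea',
                  'LivArea', 'LivAreaWt', 'AreasSum']
print(len(all_subsets(six_candidates)))
```

The position of each subset in this list matches the row index in the results tables, which is why the sorted output shows the index out of order.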

In [38]:
drop_cands = ['X1stFlrSF','X1stLin']

In [39]:
# Refit the model for every subset of drop_cands (including the empty subset)
# and record the validation RMSE and its difference from the baseline.
always_drop = ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin']
rows = []
for L in range(len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = always_drop + list(subset)
        x_train = train_df.drop(drop_cols, axis=1).values
        x_validation = validation_df.drop(drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        rows.append({'Dropped': str(subset), 'RMSE': error, 'Diff from Base': error - baseline})
col_drop_results_df = pd.DataFrame(rows, columns=['Dropped', 'RMSE', 'Diff from Base'])
output_df = col_drop_results_df.sort_values(['RMSE'])



In [40]:
output_df

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 0 | () | 0.1089020355959716 | 0.0 |
| 1 | ('X1stFlrSF',) | 0.1091514098798149 | 0.0002493742838432 |
| 2 | ('X1stLin',) | 0.1091546665817606 | 0.0002526309857889 |
| 3 | ('X1stFlrSF', 'X1stLin') | 0.109316047212378 | 0.0004140116164064 |


In [41]:
drop_cands = ['TotalBath', 'FullBath']

In [42]:
# Refit the model for every subset of drop_cands (including the empty subset)
# and record the validation RMSE and its difference from the baseline.
always_drop = ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin']
rows = []
for L in range(len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = always_drop + list(subset)
        x_train = train_df.drop(drop_cols, axis=1).values
        x_validation = validation_df.drop(drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        rows.append({'Dropped': str(subset), 'RMSE': error, 'Diff from Base': error - baseline})
col_drop_results_df = pd.DataFrame(rows, columns=['Dropped', 'RMSE', 'Diff from Base'])
output_df = col_drop_results_df.sort_values(['RMSE'])



In [43]:
output_df

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 3 | ('TotalBath', 'FullBath') | 0.1088135076854177 | -8.85279105539e-05 |
| 0 | () | 0.1089020355959716 | 0.0 |
| 1 | ('TotalBath',) | 0.1093681780094671 | 0.0004661424134954 |
| 2 | ('FullBath',) | 0.1094122455938405 | 0.0005102099978688 |


The gain from dropping both of these (about 8.9e-05 in RMSE) isn't big enough to justify dropping a predictor that we'd consider strong by hedonic reasoning.

In [44]:
drop_cands = ['MasVnrArea', 'MasVnrAreaLin']

In [45]:
# Refit the model for every subset of drop_cands (including the empty subset)
# and record the validation RMSE and its difference from the baseline.
always_drop = ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin']
rows = []
for L in range(len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = always_drop + list(subset)
        x_train = train_df.drop(drop_cols, axis=1).values
        x_validation = validation_df.drop(drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        rows.append({'Dropped': str(subset), 'RMSE': error, 'Diff from Base': error - baseline})
col_drop_results_df = pd.DataFrame(rows, columns=['Dropped', 'RMSE', 'Diff from Base'])
output_df = col_drop_results_df.sort_values(['RMSE'])



In [46]:
output_df

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 0 | () | 0.1089020355959716 | 0.0 |
| 3 | ('MasVnrArea', 'MasVnrAreaLin') | 0.1089154100349273 | 1.33744389556e-05 |
| 2 | ('MasVnrAreaLin',) | 0.1089684727515818 | 6.64371556101e-05 |
| 1 | ('MasVnrArea',) | 0.1090087118186114 | 0.0001066762226397 |


In [47]:
drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AreasSum']

In [48]:
# Refit the model for every subset of drop_cands (including the empty subset)
# and record the validation RMSE and its difference from the baseline.
always_drop = ['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin']
rows = []
for L in range(len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = always_drop + list(subset)
        x_train = train_df.drop(drop_cols, axis=1).values
        x_validation = validation_df.drop(drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        rows.append({'Dropped': str(subset), 'RMSE': error, 'Diff from Base': error - baseline})
col_drop_results_df = pd.DataFrame(rows, columns=['Dropped', 'RMSE', 'Diff from Base'])
output_df = col_drop_results_df.sort_values(['RMSE'])



In [49]:
output_df

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 50 | ('LogGrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum') | 0.10722313108186661001 | -0.00167890451410505903 |
| 28 | ('LogGrLivArea', 'TotalHouseArea', 'AreasSum') | 0.10746951750457771346 | -0.00143251809139395558 |
| 59 | ('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum') | 0.10752050532433635177 | -0.00138153027163531728 |
| 44 | ('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'AreasSum') | 0.10752517460839859653 | -0.00137686098757307251 |
| 58 | ('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'AreasSum') | 0.10760185118523347969 | -0.00130018441073818936 |
| 49 | ('LogGrLivArea', 'TotalHouseArea', 'LivArea', 'AreasSum') | 0.10771577702029228041 | -0.00118625857567938864 |
| 39 | ('TotalHouseArea', 'LivArea', 'AreasSum') | 0.10774770163508118337 | -0.00115433396089048568 |
| 31 | ('LogGrLivArea', 'LivAreaWt', 'AreasSum') | 0.10777198241332501538 | -0.00113005318264665366 |
| 21 | ('LivAreaWt', 'AreasSum') | 0.10777288039828231136 | -0.00112915519768935768 |
| 47 | ('LogGrLivArea', 'GrLivArea', 'LivAreaWt', 'AreasSum') | 0.10784821323846748020 | -0.00105382235750418884 |


In [50]:
output_df.to_csv('col_drop_results_111001001.csv', header=True)

The best result, dropping ('LogGrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum'), is a modest gain of about 0.0017 in RMSE. Again, we can choose other top candidates to generate additional datasets to use in an ensemble.
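
One simple way to combine models trained on several such datasets is to average their predictions. The sketch below is only an illustration of that idea under stated assumptions: the prediction vectors are made up, and the names `average_predictions` and `rmse_numpy` are hypothetical helpers rather than anything from this notebook.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Root-mean-squared error between two equal-length vectors."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

def average_predictions(predictions):
    """Element-wise mean of several equal-length prediction vectors."""
    return np.mean(np.vstack(predictions), axis=0)

# Made-up targets and predictions from three hypothetical models, each imagined
# as fit on a dataset with a different subset of features dropped.
toy_y = np.array([12.0, 11.5, 12.3, 11.9])
toy_preds = [
    np.array([12.1, 11.4, 12.2, 12.0]),
    np.array([11.9, 11.6, 12.5, 11.8]),
    np.array([12.2, 11.5, 12.1, 11.9]),
]

ensemble_pred = average_predictions(toy_preds)
individual_errors = [rmse_numpy(toy_y, p) for p in toy_preds]
ensemble_error = rmse_numpy(toy_y, ensemble_pred)

print(individual_errors)
print(ensemble_error)
```

The RMSE of an averaged prediction is never worse than the worst individual model, and it tends to help most when the models make different errors, which is the motivation for varying the dropped feature subsets.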