# Feature Selection on the 111101111 dataset

In this notebook we examine the effect of dropping subsets of features that are linearly dependent or correlated to some degree. We work specifically with the dataset formed by dropping (31, 496, 524, 534, 917, 969, 1183, 1299), since that resulted in the 5th-lowest validation error.

In [1]:
import itertools
import numpy as np
import pandas as pd

pd.set_option('display.precision',20)
pd.set_option('display.max_colwidth',100)

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (train_test_split, cross_val_predict, KFold, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV, ShuffleSplit)
from time import time
from scipy.stats import randint as sp_randint

import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from matplotlib import pyplot
rcParams['figure.figsize'] = 12, 4
%matplotlib inline

In [2]:
# Root-mean-squared error, used below to compare goodness of fit on the validation set
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
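
As a quick sanity check of what this metric computes, here is a minimal standalone sketch. The name `rmse_numpy` and the toy values are made up for illustration and are not part of the notebook; the function is intended to give the same value as `rmse` above, just written with NumPy only.

```python
import numpy as np


def rmse_numpy(y_true, y_pred):
    """Root-mean-squared error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values (illustrative only, not taken from the dataset).
toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.0, 2.0, 5.0]

# Residuals are 0, 0 and -2, so the result is sqrt((0 + 0 + 4) / 3).
toy_error = rmse_numpy(toy_true, toy_pred)
print(toy_error)
```

Lower is better, and the value is in the same units as the response.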

In [3]:
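# Cross-validation sets. The folds are not shuffled, so random_state is omitted:
# it has no effect without shuffle=True, and recent scikit-learn versions raise
# a ValueError when it is set anyway.
kfold = KFold(n_splits=10)

# We are using LassoLarsCV as part of our metric
lr = linear_model.LassoLarsCV(verbose=False, max_iter=5000, precompute='auto', cv=kfold, max_n_alphas=1000, n_jobs=-1)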

In [4]:
df = pd.read_csv("./input/train_tidy_111101111.csv")

In [5]:
ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

In [6]:
X = df.values

In [7]:
for train_idx, validation_idx in ss.split(X):
    train_df = df.iloc[train_idx]
    validation_df = df.iloc[validation_idx]
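
Because `n_splits=1`, the loop above runs exactly once and simply unpacks a single 80/20 split. The sketch below shows the same idea on a made-up array (`toy_X` and the other `toy_` names are hypothetical), using `next()` to take the one split directly.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten toy rows with two columns; stands in for df.values.
toy_X = np.arange(20).reshape(10, 2)

toy_ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

# split() is a generator; with n_splits=1 it yields a single (train, test) pair.
toy_train_idx, toy_validation_idx = next(toy_ss.split(toy_X))

print(len(toy_train_idx), len(toy_validation_idx))
```

The two index arrays are disjoint and together cover every row, so no row is used for both fitting and validation.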

We will establish a baseline by keeping all features.

In [8]:
y_validation = validation_df['SalePrice'].values
x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
y_train = train_df['SalePrice'].values
x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
lr.fit(x_train, y_train)
y_pred = lr.predict(x_validation)
baseline = rmse(y_validation, y_pred)
baseline



0.10398725860824831

## Features

We have a collection of features, some of which were identified as potentially low-quality predictors of the response, and others of which are known to be highly correlated with other features. We want to identify subsets of features that we can drop to improve the regression.

In [9]:
drop_cands = [
    'LotFrontage', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF',
    'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AllSizesSumLin', 'AreasSum',
    'X1stFlrSF','X1stLin', 'X2ndFlrSF', 'X2ndLin',
    'TotalBath', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'Age', 'AgeLin', 'RemodAgeLin','RemodAge',
    'MasVnrArea', 'MasVnrAreaLin',
    'DeckPorchLin','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch'
]

In [10]:
corr_df = df[drop_cands].corr()

In [11]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotFrontage,LotArea,BsmtUnfSF,LowQualFinSF,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,...,RemodAgeLin,RemodAge,MasVnrArea,MasVnrAreaLin,DeckPorchLin,WoodDeckSF,OpenPorchSF,EnclosedPorch,X3SsnPorch,ScreenPorch
LotFrontage,1.0,,,,,,,,,,...,,,,,,,,,,
LotArea,,1.0,,,,,,,,0.986079408145959,...,,,,,,,,,,
BsmtUnfSF,,,1.0,,,,,,,,...,,,,,,,,,,
LowQualFinSF,,,,1.0,,,,,,,...,,,,,,,,,,
LogGrLivArea,,,,,1.0,0.9757104048378819,0.8604402726916431,0.7632868098973198,0.9190208417314308,,...,,,,,,,,,,
GrLivArea,,,,,0.9757104048378819,1.0,0.8354586492266463,0.7674492056729174,0.9145474715895584,,...,,,,,,,,,,
TotalHouseArea,,,,,0.8604402726916431,0.8354586492266463,1.0,0.8019416365972616,0.8756392020364028,,...,,,,,,,,,,
LivArea,,,,,0.7632868098973198,0.7674492056729174,0.8019416365972616,1.0,0.9452567587491754,,...,,,,,,,,,,
LivAreaWt,,,,,0.9190208417314308,0.9145474715895584,0.8756392020364028,0.9452567587491754,1.0,,...,,,,,,,,,,
AllSizesSum,,0.986079408145959,,,,,,,,1.0,...,,,,,,,,,,


We'll restrict our drop set to the highly correlated features to make this more readable.

In [12]:
drop_cands = [
    'LotArea', 'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AreasSum',
    'X1stFlrSF','X1stLin', 'TotalBath', 'FullBath', 'MasVnrArea', 'MasVnrAreaLin'
]

In [13]:
corr_df = df[drop_cands].corr()

In [14]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotArea,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,AreasSum,X1stFlrSF,X1stLin,TotalBath,FullBath,MasVnrArea,MasVnrAreaLin
LotArea,1.0,,,,,,0.986079408145959,,,,,,,
LogGrLivArea,,1.0,0.9757104048378819,0.8604402726916431,0.7632868098973198,0.9190208417314308,,0.8323976831762752,,,,,,
GrLivArea,,0.9757104048378819,1.0,0.8354586492266463,0.7674492056729174,0.9145474715895584,,0.8205514194042111,,,,,,
TotalHouseArea,,0.8604402726916431,0.8354586492266463,1.0,0.8019416365972616,0.8756392020364028,,0.932924425208797,,,,,,
LivArea,,0.7632868098973198,0.7674492056729174,0.8019416365972616,1.0,0.9452567587491754,,0.7950502382033682,,,,,,
LivAreaWt,,0.9190208417314308,0.9145474715895584,0.8756392020364028,0.9452567587491754,1.0,,0.8617028205556126,,,,,,
AllSizesSum,0.986079408145959,,,,,,1.0,,,,,,,
AreasSum,,0.8323976831762752,0.8205514194042111,0.932924425208797,0.7950502382033682,0.8617028205556126,,1.0,,,,,,
X1stFlrSF,,,,,,,,,1.0,0.9790535893861382,,,,
X1stLin,,,,,,,,,0.9790535893861382,1.0,,,,
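
The masked matrix above shows each pair twice and includes the diagonal. As a possible alternative view, the sketch below lists each pair above the threshold once. The helper name `correlated_pairs` and the toy columns are hypothetical; like the cell above, it only looks at positive correlations.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.75):
    """Return each column pair whose correlation exceeds `threshold`, once."""
    corr = frame.corr()
    # Keep only the strict upper triangle so (a, b) is not repeated as (b, a)
    # and the diagonal of 1.0 values is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison also filters out any NaN entries left by the mask.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame with placeholder column names (not columns of the dataset).
area = np.arange(50.0)
toy_frame = pd.DataFrame({
    'area': area,
    'log_area': np.log1p(area),            # monotone transform of 'area'
    'unrelated': np.tile([1.0, -1.0], 25),  # alternating, nearly uncorrelated
})

toy_pairs = correlated_pairs(toy_frame)
print(toy_pairs)
```

On the notebook's data this would be called as `correlated_pairs(df[drop_cands])`.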


Let's first compare the features that are correlated only in pairs, starting with LotArea and AllSizesSum.

In [15]:
drop_cands = ['LotArea', 'AllSizesSum']

In [16]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Fit one model per subset of drop_cands; the empty subset reproduces the baseline.
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [17]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
0,(),0.1039872586082483,0.0
2,"('AllSizesSum',)",0.1040789552519688,9.16966437205e-05
3,"('LotArea', 'AllSizesSum')",0.1041231111608038,0.0001358525525555
1,"('LotArea',)",0.1044119356067539,0.0004246769985056


Every subset we drop raises the validation RMSE above the baseline, so we should keep both features.
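
The subset search in the cell above is repeated verbatim for each group of candidates that follows. One way to avoid the copy-paste would be a small helper along these lines. This is only a sketch: `evaluate_drops`, its parameters, and the synthetic columns are hypothetical names, and `LinearRegression` stands in for `lr` purely to keep the toy example fast.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def evaluate_drops(estimator, train, valid, target, candidates, always_drop=()):
    """Fit `estimator` once per subset of `candidates`; report validation RMSE."""
    y_train = train[target].values
    y_valid = valid[target].values
    rows = []
    for size in range(len(candidates) + 1):
        for subset in itertools.combinations(candidates, size):
            dropped = [target, *always_drop, *subset]
            estimator.fit(train.drop(columns=dropped).values, y_train)
            y_pred = estimator.predict(valid.drop(columns=dropped).values)
            error = float(np.sqrt(np.mean((y_valid - y_pred) ** 2)))
            rows.append({'Dropped': str(subset), 'RMSE': error})
    results = pd.DataFrame(rows)
    # Row 0 is the empty subset, i.e. the baseline with nothing extra dropped.
    results['Diff from Base'] = results['RMSE'] - results.loc[0, 'RMSE']
    return results.sort_values('RMSE')


# Synthetic toy data; the column names are placeholders, not dataset columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
toy['x1_copy'] = toy['x1'] + 0.01 * rng.normal(size=200)  # near-duplicate of x1
toy['y'] = 3 * toy['x1'] + 2 * toy['x2'] + 0.1 * rng.normal(size=200)

toy_results = evaluate_drops(LinearRegression(), toy.iloc[:150], toy.iloc[150:],
                             'y', ['x1', 'x1_copy'])
print(toy_results)
```

In the notebook this could be called as `evaluate_drops(lr, train_df, validation_df, 'SalePrice', drop_cands, always_drop=['HouseId', 'GarageAge', 'GarageAgeLin'])`, assuming the same frames and column names as in the cells above.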

In [18]:
drop_cands = ['X1stFlrSF','X1stLin']

In [19]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Fit one model per subset of drop_cands; the empty subset reproduces the baseline.
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [20]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
0,(),0.1039872586082483,0.0
3,"('X1stFlrSF', 'X1stLin')",0.1040235757438792,3.63171356309e-05
2,"('X1stLin',)",0.1041152342766769,0.0001279756684286
1,"('X1stFlrSF',)",0.1050028442833461,0.0010155856750977


In [21]:
drop_cands = ['TotalBath', 'FullBath']

In [22]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Fit one model per subset of drop_cands; the empty subset reproduces the baseline.
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [23]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
0,(),0.1039872586082483,0.0
1,"('TotalBath',)",0.1040298596982717,4.26010900234e-05
2,"('FullBath',)",0.1044119356067539,0.0004246769985056
3,"('TotalBath', 'FullBath')",0.105474369753301,0.0014871111450526


Dropping either or both of these raises the validation RMSE, so there is no gain that would justify dropping a predictor that we'd consider strong by hedonic reasoning.

In [24]:
drop_cands = ['MasVnrArea', 'MasVnrAreaLin']

In [25]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Fit one model per subset of drop_cands; the empty subset reproduces the baseline.
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [26]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
3,"('MasVnrArea', 'MasVnrAreaLin')",0.1039834680223525,-3.7905858957e-06
0,(),0.1039872586082483,0.0
2,"('MasVnrAreaLin',)",0.1040896092766,0.0001023506683517
1,"('MasVnrArea',)",0.1050635749467884,0.0010763163385401


In [27]:
drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AreasSum']

In [28]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Fit one model per subset of drop_cands; the empty subset reproduces the baseline.
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
output_df = col_drop_results_df.sort_values(['RMSE'])



In [29]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
59,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum')",0.10360817226279898928,-0.00037908634544932263
58,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'AreasSum')",0.10365739713993106508,-0.00032986146831724683
44,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'AreasSum')",0.10366177279650089227,-0.00032548581174741964
47,"('LogGrLivArea', 'GrLivArea', 'LivAreaWt', 'AreasSum')",0.10372816811309794327,-0.00025909049515036864
24,"('LogGrLivArea', 'GrLivArea', 'LivAreaWt')",0.10375766254330309746,-0.00022959606494521445
22,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea')",0.10386160988924549031,-0.00012564871900282160
43,"('LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivAreaWt')",0.10387194813726273457,-0.00011531047098557734
0,(),0.10398725860824831191,0.00000000000000000000
12,"('GrLivArea', 'TotalHouseArea')",0.10407926191761297796,0.00009200330936466605
1,"('LogGrLivArea',)",0.10408960927660008966,0.00010235066835177775
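
The exhaustive search above fits one model for every subset of the candidate list, including the empty one, so a group of n candidates costs 2^n fits. The short sketch below counts the subsets with the same `itertools.combinations` enumeration used in the loop; the helper name `count_subsets` is hypothetical.

```python
import itertools


def count_subsets(candidates):
    """Count the subsets visited by the nested combinations loop."""
    return sum(
        1
        for size in range(len(candidates) + 1)
        for _ in itertools.combinations(candidates, size)
    )


six_candidates = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea',
                  'LivArea', 'LivAreaWt', 'AreasSum']

n_fits_six = count_subsets(six_candidates)   # 2 ** 6
n_fits_pair = count_subsets(['LotArea', 'AllSizesSum'])  # 2 ** 2
print(n_fits_six, n_fits_pair)
```

This is why the pairs were tested separately: searching all 14 highly correlated candidates at once would mean 2^14 = 16384 `LassoLarsCV` fits.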
