# Feature Selection on the 111001011 dataset

In this notebook we examine the effect of dropping subsets of features that are linearly dependent or highly correlated with one another.  We work specifically with the dataset formed by dropping (31, 496, 524, 917, 1183, 1299), since it produced the second-lowest validation error.

In [1]:
import itertools
import numpy as np
import pandas as pd

pd.set_option('display.precision',20)
pd.set_option('display.max_colwidth',100)

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (train_test_split, cross_val_predict, KFold, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV, ShuffleSplit)
from time import time
from scipy.stats import randint as sp_randint

import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from matplotlib import pyplot
rcParams['figure.figsize'] = 12, 4
%matplotlib inline

In [2]:
# Root-mean-squared error, used to compare goodness of fit on the validation set
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [3]:
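# Cross-validation sets
# shuffle defaults to False, so the folds are deterministic. Recent scikit-learn
# versions raise a ValueError if random_state is set without shuffle=True.
kfold = KFold(n_splits=10)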

# We are using LassoLarsCV as part of our metric
lr = linear_model.LassoLarsCV(verbose=False, max_iter=5000,precompute='auto', cv=kfold, max_n_alphas=1000, n_jobs=-1)

In [4]:
df = pd.read_csv("./input/train_tidy_111001011.csv")

In [5]:
ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=573)

In [6]:
X = df.values

In [7]:
for train_idx, validation_idx in ss.split(X):
    train_df = df.iloc[train_idx]
    validation_df = df.iloc[validation_idx]

We will establish a baseline by keeping all features.

In [8]:
y_validation = validation_df['SalePrice'].values
x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
y_train = train_df['SalePrice'].values
x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'],axis=1).values
lr.fit(x_train, y_train)
y_pred = lr.predict(x_validation)
baseline = rmse(y_validation, y_pred)
baseline



0.11607509736404908

## Features

We have a collection of features, some of which were identified as potentially weak predictors of the response, and others of which are known to be highly correlated with other features.  We want to identify subsets of features that we can drop to improve the regression.

In [9]:
drop_cands = [
    'LotFrontage', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF',
    'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AllSizesSumLin', 'AreasSum',
    'X1stFlrSF','X1stLin', 'X2ndFlrSF', 'X2ndLin',
    'TotalBath', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'Age', 'AgeLin', 'RemodAgeLin','RemodAge',
    'MasVnrArea', 'MasVnrAreaLin',
    'DeckPorchLin','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch'
]

In [10]:
corr_df = df[drop_cands].corr()

In [11]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotFrontage,LotArea,BsmtUnfSF,LowQualFinSF,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,...,RemodAgeLin,RemodAge,MasVnrArea,MasVnrAreaLin,DeckPorchLin,WoodDeckSF,OpenPorchSF,EnclosedPorch,X3SsnPorch,ScreenPorch
LotFrontage,1.0,,,,,,,,,,...,,,,,,,,,,
LotArea,,1.0,,,,,,,,0.9859962513932156,...,,,,,,,,,,
BsmtUnfSF,,,1.0,,,,,,,,...,,,,,,,,,,
LowQualFinSF,,,,1.0,,,,,,,...,,,,,,,,,,
LogGrLivArea,,,,,1.0,0.9744189233039614,0.8620777000288236,0.7665390104237368,0.9201471380639324,,...,,,,,,,,,,
GrLivArea,,,,,0.9744189233039614,1.0,0.8323040704032365,0.7663793670521539,0.9127579429474992,,...,,,,,,,,,,
TotalHouseArea,,,,,0.8620777000288236,0.8323040704032365,1.0,0.803673933992691,0.8764379547614753,,...,,,,,,,,,,
LivArea,,,,,0.7665390104237368,0.7663793670521539,0.803673933992691,1.0,0.9461723750244064,,...,,,,,,,,,,
LivAreaWt,,,,,0.9201471380639324,0.9127579429474992,0.8764379547614753,0.9461723750244064,1.0,,...,,,,,,,,,,
AllSizesSum,,0.9859962513932156,,,,,,,,1.0,...,,,,,,,,,,


We'll restrict our drop set to the highly correlated features to make this more readable.

In [12]:
drop_cands = [
    'LotArea', 'LogGrLivArea', 
    'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AllSizesSum', 'AreasSum',
    'X1stFlrSF','X1stLin', 'TotalBath', 'FullBath', 'MasVnrArea', 'MasVnrAreaLin'
]

In [13]:
corr_df = df[drop_cands].corr()

In [14]:
corr_df[corr_df > 0.75]

Unnamed: 0,LotArea,LogGrLivArea,GrLivArea,TotalHouseArea,LivArea,LivAreaWt,AllSizesSum,AreasSum,X1stFlrSF,X1stLin,TotalBath,FullBath,MasVnrArea,MasVnrAreaLin
LotArea,1.0,,,,,,0.9859962513932156,,,,,,,
LogGrLivArea,,1.0,0.9744189233039614,0.8620777000288236,0.7665390104237368,0.9201471380639324,,0.8336788219135303,,,,,,
GrLivArea,,0.9744189233039614,1.0,0.8323040704032365,0.7663793670521539,0.9127579429474992,,0.8203626645783046,,,,,,
TotalHouseArea,,0.8620777000288236,0.8323040704032365,1.0,0.803673933992691,0.8764379547614753,,0.9313480826320144,,,,,,
LivArea,,0.7665390104237368,0.7663793670521539,0.803673933992691,1.0,0.9461723750244064,,0.796603177686269,,,,,,
LivAreaWt,,0.9201471380639324,0.9127579429474992,0.8764379547614753,0.9461723750244064,1.0,,0.862807222807978,,,,,,
AllSizesSum,0.9859962513932156,,,,,,1.0,,,,,,,
AreasSum,,0.8336788219135303,0.8203626645783046,0.9313480826320144,0.796603177686269,0.862807222807978,,1.0,,,,,,
X1stFlrSF,,,,,,,,,1.0,0.9782924014873928,,,,
X1stLin,,,,,,,,,0.9782924014873928,1.0,,,,
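
The masked matrix above shows each correlated pair twice and includes the diagonal. As a minimal sketch (the helper name `correlated_pairs` and the toy columns are hypothetical, not part of the original notebook), the same information can be flattened into one row per pair by keeping only the strict upper triangle of the correlation matrix:

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.75):
    """Return one row per feature pair whose correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of 1.0 values is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return pairs.rename_axis(['feature_a', 'feature_b']).reset_index(name='corr')


# Toy data (hypothetical): 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    'a': a,
    'b': a + 0.05 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
pairs_df = correlated_pairs(toy_df)
print(pairs_df)
```

In the notebook this would presumably be called as `correlated_pairs(df[drop_cands])`; like the mask above, it only reports positive correlations above the threshold.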


Let's first compare the pairs.

In [15]:
drop_cands = ['LotArea', 'AllSizesSum']

In [16]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Try every subset of drop_cands, from the empty set (baseline) up to all candidates
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
# Lowest validation RMSE first
output_df = col_drop_results_df.sort_values(['RMSE'])



In [17]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
1,"('LotArea',)",0.1152921019799307,-0.0007829953841183
2,"('AllSizesSum',)",0.1160530498431961,-2.20475208529e-05
3,"('LotArea', 'AllSizesSum')",0.1160596243253509,-1.54730386981e-05
0,(),0.116075097364049,0.0


Dropping LotArea alone gives the largest gain, but it is less than 0.001 RMSE, so we keep both features.
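
The same subset loop is rerun for each candidate group in the rest of this notebook. As a rough sketch only, it could be wrapped in a reusable function. The name `evaluate_drop_subsets`, its parameters, and the toy data are all hypothetical, and `LinearRegression` stands in for the notebook's `LassoLarsCV` so the example runs quickly:

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def evaluate_drop_subsets(model, train_df, validation_df, target, always_drop, drop_cands):
    """Fit `model` once per subset of `drop_cands` and report validation RMSE."""
    y_train = train_df[target].values
    y_validation = validation_df[target].values
    rows = []
    for size in range(len(drop_cands) + 1):
        for subset in itertools.combinations(drop_cands, size):
            cols = list(always_drop) + [target] + list(subset)
            model.fit(train_df.drop(cols, axis=1).values, y_train)
            y_pred = model.predict(validation_df.drop(cols, axis=1).values)
            error = np.sqrt(mean_squared_error(y_validation, y_pred))
            rows.append({'Dropped': str(subset), 'RMSE': error})
    results = pd.DataFrame(rows)
    # The empty subset is always evaluated first, so row 0 is the baseline.
    results['Diff from Base'] = results['RMSE'] - results.loc[0, 'RMSE']
    return results.sort_values('RMSE')


# Toy data (hypothetical): y depends on f1 only, so dropping f1 should hurt.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
toy['Id'] = np.arange(100)
toy['y'] = 2.0 * toy['f1'] + rng.normal(scale=0.1, size=100)
toy_results = evaluate_drop_subsets(
    LinearRegression(), toy.iloc[:80], toy.iloc[80:],
    target='y', always_drop=['Id'], drop_cands=['f1', 'f2'])
print(toy_results)
```

With the notebook's objects, the call would presumably pass `lr`, `train_df`, `validation_df`, `'SalePrice'`, and `['HouseId', 'GarageAge', 'GarageAgeLin']`. Computing the baseline inside the function keeps the "Diff from Base" column consistent with the data actually passed in.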

In [18]:
drop_cands = ['X1stFlrSF','X1stLin']

In [19]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Try every subset of drop_cands, from the empty set (baseline) up to all candidates
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
# Lowest validation RMSE first
output_df = col_drop_results_df.sort_values(['RMSE'])



In [20]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
1,"('X1stFlrSF',)",0.1153782618848132,-0.0006968354792358
0,(),0.116075097364049,0.0
3,"('X1stFlrSF', 'X1stLin')",0.1160755089566281,4.11592579e-07
2,"('X1stLin',)",0.1164618480370895,0.0003867506730404


In [21]:
drop_cands = ['TotalBath', 'FullBath']

In [22]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Try every subset of drop_cands, from the empty set (baseline) up to all candidates
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
# Lowest validation RMSE first
output_df = col_drop_results_df.sort_values(['RMSE'])



In [23]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
1,"('TotalBath',)",0.1146798066440508,-0.0013952907199982
2,"('FullBath',)",0.1160596243253508,-1.54730386982e-05
0,(),0.116075097364049,0.0
3,"('TotalBath', 'FullBath')",0.1162046520620499,0.0001295546980008


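The gain from dropping TotalBath alone (about 0.0014 RMSE) isn't big enough to justify dropping a predictor that we'd consider strong by hedonic reasoning, and dropping both features makes the error slightly worse than the baseline.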

In [44]:
drop_cands = ['MasVnrArea', 'MasVnrAreaLin']

In [45]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Try every subset of drop_cands, from the empty set (baseline) up to all candidates
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
# Lowest validation RMSE first
output_df = col_drop_results_df.sort_values(['RMSE'])



In [46]:
output_df

Unnamed: 0,Dropped,RMSE,Diff from Base
0,(),0.1089020355959716,0.0
3,"('MasVnrArea', 'MasVnrAreaLin')",0.1089154100349273,1.33744389556e-05
2,"('MasVnrAreaLin',)",0.1089684727515818,6.64371556101e-05
1,"('MasVnrArea',)",0.1090087118186114,0.0001066762226397


In [24]:
drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AreasSum']
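
This group has six candidates rather than two, so the exhaustive search grows from 4 subsets to 64, and each subset triggers a full LassoLarsCV fit. The short sketch below only counts the subsets and does no fitting:

```python
import itertools

drop_cands = ['LogGrLivArea', 'GrLivArea', 'TotalHouseArea', 'LivArea', 'LivAreaWt', 'AreasSum']

# Enumerate every subset, from the empty set up to all six features.
subsets = [
    subset
    for size in range(len(drop_cands) + 1)
    for subset in itertools.combinations(drop_cands, size)
]
print(len(subsets))  # 64, i.e. 2 ** 6
```

The count doubles with every added candidate, which is one reason to search the correlated groups separately rather than all fourteen features at once.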

In [None]:
col_drop_results_df = pd.DataFrame(dtype='float64')
count = 0
# Try every subset of drop_cands, from the empty set (baseline) up to all candidates
for L in range(0, len(drop_cands) + 1):
    for subset in itertools.combinations(drop_cands, L):
        drop_cols = list(subset)
        col_drop_results_df.loc[count, 'Dropped'] = str(subset)
        x_train = train_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        x_validation = validation_df.drop(['HouseId', 'SalePrice', 'GarageAge', 'GarageAgeLin'] + drop_cols, axis=1).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        col_drop_results_df.loc[count, 'RMSE'] = error
        col_drop_results_df.loc[count, 'Diff from Base'] = error - baseline
        count += 1
# Lowest validation RMSE first
output_df = col_drop_results_df.sort_values(['RMSE'])



In [None]:
output_df

In [50]:
output_df.to_csv('col_drop_results_111001011.csv', header=True)

The best result, dropping ('LogGrLivArea', 'TotalHouseArea', 'LivAreaWt', 'AreasSum'), is a modest gain.  Again, we can choose other top candidates to generate additional datasets to use in an ensemble.
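
Because the Dropped column stores each subset as a string, picking the top candidates for an ensemble means parsing those strings back into column names. The sketch below shows one way to do that. The helper `top_drop_sets` is hypothetical, and the rows are copied (rounded) from the TotalBath/FullBath comparison earlier in the notebook, since the six-feature results are not shown here:

```python
import ast

import pandas as pd

# Rows copied from the TotalBath/FullBath comparison above, rounded for brevity.
results_df = pd.DataFrame({
    'Dropped': ["('TotalBath',)", "('FullBath',)", "()", "('TotalBath', 'FullBath')"],
    'RMSE': [0.114680, 0.116060, 0.116075, 0.116205],
})


def top_drop_sets(results, k):
    """Return the k lowest-RMSE drop sets as tuples of column names."""
    best = results.nsmallest(k, 'RMSE')
    # 'Dropped' was stored with str(subset); literal_eval safely turns it back into a tuple.
    return [ast.literal_eval(text) for text in best['Dropped']]


top_sets = top_drop_sets(results_df, k=2)
print(top_sets)  # [('TotalBath',), ('FullBath',)]
```

Each returned tuple could then be passed to `df.drop(list(cols), axis=1)` to write out one reduced dataset per candidate.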