We are going to implement filter-style feature selection based on the following criteria:

* Features with small variance are removed
* From each pair of features whose correlation exceeds a threshold x, one feature is removed

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_numeric = pd.read_csv('df_numeric.csv')
df_numeric.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,Street,LotShape,Utilities,LandSlope,OverallQual,OverallCond,YearBuilt,...,MoSold,YrSold,SalePrice,GarageYrBlt_missing_ind,LotFrontage_missing_ind,MasVnrArea_missing_ind,1stFlrSF_log,1stFlr_2ndFlr_SF,OverallGrade,SimplGarageQual
0,60,65.0,8450,2,4,4,3,7,5,2003,...,2,2008,208500,0,0,0,6.75227,1710,35,1
1,20,80.0,9600,2,4,4,3,6,8,1976,...,5,2007,181500,0,0,0,7.140453,1262,48,1
2,60,68.0,11250,2,3,4,3,7,5,2001,...,9,2008,223500,0,0,0,6.824374,1786,35,1
3,70,60.0,9550,2,3,4,3,7,5,1915,...,2,2006,140000,0,0,0,6.867974,1717,35,1
4,60,84.0,14260,2,3,4,3,8,5,2000,...,12,2008,250000,0,0,0,7.04316,2198,40,1


Before doing any transformations, we will extract the target variable so that it stays unchanged. We may want to transform the target as well, but it is good practice to do that separately from the features:

In [3]:
y = df_numeric.SalePrice
df_numeric.drop('SalePrice', axis=1, inplace=True)

## 1. Removing Features with Small Variance

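First of all, we will remove the columns with very little variance. A feature whose values are nearly the same for all houses gives a model little to tell them apart, so it usually has little predictive power. Keep in mind that variance depends on the scale of a feature, so a single fixed threshold treats features measured in different units differently.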

For most of our variable selection, we can use methods from sklearn:

In [4]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(0.1)
df_transformed = vt.fit_transform(df_numeric)

In [5]:
df_transformed.shape # 10 columns deleted

(1458, 50)
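To make the thresholding concrete, here is a minimal sketch on a hypothetical toy data-frame (the column names and values are made up and are not part of df_numeric). It shows which columns survive a threshold, and that rescaling a column changes its variance even though the information in it is the same.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data, not taken from df_numeric.
toy = pd.DataFrame({
    "nearly_constant": [0, 0, 0, 1],   # population variance 0.1875
    "price": [1, 5, 9, 13],            # population variance 20
})
# Same information as "price", but on a smaller scale (variance 0.002).
toy["price_rescaled"] = toy["price"] / 100

vt_toy = VarianceThreshold(threshold=0.2).fit(toy)

# variances_ holds the variance computed for each input column.
variances = pd.Series(vt_toy.variances_, index=toy.columns)
# get_support() marks the columns whose variance is above the threshold.
kept = list(toy.columns[vt_toy.get_support()])

print(variances)
print(kept)
```

Only the price column is kept here. The rescaled copy is dropped purely because of its units, which is why some people standardize or otherwise normalize features before picking a variance threshold.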

### Adding column names back for readability

In [6]:
# columns we have selected
# get_support() is a method of VarianceThreshold; it returns a boolean NumPy array
# marking which input columns were kept.
selected_columns = df_numeric.columns[vt.get_support()]

# fit_transform returned a NumPy array, so wrap it in a data-frame and restore the column labels
df_transformed = pd.DataFrame(df_transformed, columns=selected_columns)
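As a possible alternative to rebuilding the data-frame from an array, the boolean mask from get_support() can be used to index the original data-frame directly. The sketch below uses a hypothetical toy data-frame (its columns are assumptions for illustration); indexing this way keeps both the column labels and the original dtypes.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with one constant column.
toy = pd.DataFrame({
    "flag": [1, 1, 1, 1],                 # constant, will be dropped
    "rooms": [3, 5, 4, 8],                # integer column
    "area": [70.5, 120.0, 95.2, 210.9],   # float column
})

vt_toy = VarianceThreshold(threshold=0.1).fit(toy)

# Select the kept columns straight from the data-frame using the boolean mask.
selected = toy.loc[:, vt_toy.get_support()]

print(selected.dtypes)
```

With this approach the integer column stays an integer column, whereas a round trip through a mixed-type NumPy array would upcast it to float.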

## 2. Removing Correlated Features

The goal of this part is to remove one feature from each highly correlated pair.

We are going to do this in 3 steps:

* Calculate a correlation matrix
* Get pairs of highly correlated features
* Remove correlated columns

In [7]:
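# step 1: absolute pairwise correlations
df_corr = df_transformed.corr().abs()

# step 2: (row, column) labels of the pairs above the threshold;
# x < y keeps each pair only once and skips the diagonal
indices = np.where(df_corr > 0.8)
indices = [(df_corr.index[x], df_corr.columns[y])
           for x, y in zip(*indices)
           if x < y]

# step 3: drop the second feature of each pair
for idx in indices:
    # a column can appear in several pairs, so it may already have been dropped
    try:
        df_transformed.drop(idx[1], axis=1, inplace=True)
    except KeyError:
        pass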

In [8]:
print(indices)

[('TotalBsmtSF', '1stFlrSF'), ('GrLivArea', 'TotRmsAbvGrd'), ('GrLivArea', '1stFlr_2ndFlr_SF'), ('TotRmsAbvGrd', '1stFlr_2ndFlr_SF'), ('GarageCars', 'GarageArea'), ('GarageQual', 'GarageCond')]
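The same three steps can also be sketched with an upper-triangle mask, which avoids the explicit loop over pairs. This is only one possible formulation, shown on a hypothetical toy data-frame (column names and values are made up); like the cell above, it drops the later column of every pair whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a_scaled" is a linear function of "a".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "a_scaled": [3, 5, 7, 9, 11, 13],   # 2 * a + 1, correlation 1 with "a"
    "b": [2, -1, 3, 0, 1, -2],          # only moderately correlated with "a"
})

corr = toy.corr().abs()

# Keep only the entries strictly above the diagonal, so every pair is seen once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Which member of a pair gets dropped is decided by column order alone, both here and in the loop above. Keeping the feature that is more strongly related to the target would be a reasonable refinement.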


## 3. Selecting the K Best Features

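So far we have removed the low-variance features and one feature from each highly correlated pair.

The last thing we will do before modeling is to select the k best features in terms of their relationship with the target variable.

We will use SelectKBest with the f_regression score for that. Note that this is a univariate filter, which scores each feature against the target independently, rather than a forward (wrapper) selection that adds features one at a time based on model performance: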

In [9]:
from sklearn.feature_selection import f_regression, SelectKBest
skb = SelectKBest(f_regression, k=10)
X = skb.fit_transform(df_transformed, y)
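SelectKBest scores every feature on its own, while a forward wrapper adds features one at a time according to how much they improve a model. For contrast, a minimal sketch of forward selection with sklearn's SequentialFeatureSelector follows. The synthetic data, the LinearRegression estimator, and the choice of two features are all assumptions made for illustration and are not part of this notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumption): 6 features, 2 of them informative.
X_toy, y_toy = make_regression(
    n_samples=100, n_features=6, n_informative=2, noise=5.0, random_state=0
)

# Forward selection: start from no features and greedily add the one that
# most improves the cross-validated score, until 2 features are selected.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X_toy, y_toy)

selected_mask = sfs.get_support()
print(selected_mask)
```

Forward selection refits the model many times, so it is noticeably slower than the univariate filter, but it can account for features that are only useful in combination.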

fit_transform again returned X as a NumPy array.

Convert X back to a data-frame and restore the correct column names.

In [11]:
# boolean mask marking the 10 selected columns
mask = skb.get_support()

# column names of the selected features
selected_columns = df_transformed.columns[mask]
X = pd.DataFrame(X, columns=selected_columns)
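To see why particular columns were chosen, it can help to inspect the scores that SelectKBest computed. The sketch below does this on synthetic data; the data and the feature names f0 to f4 are hypothetical, and only scores_ and get_support() come from sklearn itself.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data (assumption) with made-up feature names.
X_arr, y_toy = make_regression(
    n_samples=100, n_features=5, n_informative=2, noise=10.0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

skb_toy = SelectKBest(f_regression, k=2).fit(X_toy, y_toy)

# scores_ holds one F-statistic per input feature; higher means a stronger
# linear relationship with the target when the feature is considered alone.
scores = pd.Series(skb_toy.scores_, index=X_toy.columns).sort_values(ascending=False)
chosen = list(X_toy.columns[skb_toy.get_support()])

print(scores)
print(chosen)
```

The k features with the highest scores are exactly the ones marked by get_support().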

In [13]:
X

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade
0,7.0,2003.0,4.0,4.0,856.0,1710.0,2.0,4.0,2.0,35.0
1,6.0,1976.0,3.0,4.0,1262.0,1262.0,2.0,3.0,2.0,48.0
2,7.0,2001.0,4.0,4.0,920.0,1786.0,2.0,4.0,2.0,35.0
3,7.0,1915.0,3.0,3.0,756.0,1717.0,1.0,4.0,3.0,35.0
4,8.0,2000.0,4.0,4.0,1145.0,2198.0,2.0,4.0,3.0,40.0
...,...,...,...,...,...,...,...,...,...,...
1453,6.0,1999.0,3.0,4.0,953.0,1647.0,2.0,3.0,2.0,30.0
1454,6.0,1978.0,3.0,4.0,1542.0,2073.0,2.0,3.0,2.0,36.0
1455,7.0,1941.0,5.0,3.0,1152.0,2340.0,2.0,4.0,1.0,63.0
1456,5.0,1950.0,3.0,3.0,1078.0,1078.0,1.0,4.0,1.0,30.0
