# Variable Selection


We will now practice **variable selection** on our _House Prices_ dataset.

## Variable Selection Tutorial

We can continue in the same notebook as in the previous activity, Feature Engineering.


We are going to implement elements for _filter feature selectors_ based on the following criteria:

- Small variance
- High pairwise correlation: from each pair of features whose absolute correlation exceeds a threshold x, we keep only one

Before doing any transformations, we will extract our target variable so that it stays unchanged. The target can be transformed as well, but it is good practice to do that separately:


In [9]:
import pandas as pd

df = pd.read_csv('df_final.csv')

In [15]:
# drop the columns that are not used as features
df = df.drop(['Unnamed: 0', 'Home'], axis=1)

In [16]:
# keep the target separate from the features
y = df['Price']
df = df.drop('Price', axis=1)


In [17]:
df

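| | SqFt | Bedrooms | Bathrooms | Offers | Number_Rooms | Brick_No | Brick_Yes | Neighborhood_East | Neighborhood_North | Neighborhood_West |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1790 | 2 | 2 | 2 | 4 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2030 | 4 | 2 | 3 | 6 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1740 | 3 | 2 | 1 | 5 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1980 | 3 | 2 | 3 | 5 | 1 | 0 | 1 | 0 | 0 |
| 4 | 2130 | 3 | 3 | 3 | 6 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 1900 | 3 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 0 |
| 124 | 2160 | 4 | 3 | 3 | 7 | 0 | 1 | 1 | 0 | 0 |
| 125 | 2070 | 2 | 2 | 2 | 4 | 1 | 0 | 0 | 1 | 0 |
| 126 | 2020 | 3 | 3 | 1 | 6 | 1 | 0 | 0 | 0 | 1 |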



### Part 1: Removing Features With Small Variance

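First, we will remove the columns with very little variance. A feature whose values are nearly identical for all houses carries little information for telling them apart, so it has little predictive power. Keep in mind that variance depends on the scale of a feature, so a fixed threshold treats features measured in different units differently.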
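To get a feeling for what a threshold of `0.1` means, here is a minimal sketch on a made-up toy frame (the column names and values are hypothetical, not taken from the House Prices data). For a 0/1 dummy column with a share p of ones, the variance is p(1 - p), so a threshold of `0.1` removes dummies where one of the two values appears in fewer than roughly 11% of the rows.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with 10 rows (not the House Prices dataset)
toy = pd.DataFrame({
    "sqft": list(range(1500, 2500, 100)),  # large variance because of its scale
    "brick": [0, 1] * 5,                   # balanced dummy: 0.5 * 0.5 = 0.25
    "rare_flag": [1] + [0] * 9,            # rare dummy: 0.1 * 0.9 = 0.09
    "constant": [1] * 10,                  # no variation at all: 0
})

vt_toy = VarianceThreshold(threshold=0.1).fit(toy)

# variances_ holds the per-column variance computed during fit
variances = pd.Series(vt_toy.variances_, index=toy.columns)
print(variances)

# only columns with variance strictly above the threshold are kept
kept = toy.columns[vt_toy.get_support()].tolist()
print(kept)
```

Note that `sqft` survives mostly because it is measured in large units. If the features had very different scales, it could make sense to think about the threshold per feature rather than globally.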


For most of our variable selection, we can use methods from `sklearn`:


In [19]:
from sklearn.feature_selection import VarianceThreshold

# keep only the features whose variance is greater than 0.1
vt = VarianceThreshold(threshold=0.1)
df_transformed = vt.fit_transform(df)

In [20]:
df_transformed

array([[1790,    2,    2, ...,    1,    0,    0],
       [2030,    4,    2, ...,    1,    0,    0],
       [1740,    3,    2, ...,    1,    0,    0],
       ...,
       [2070,    2,    2, ...,    0,    1,    0],
       [2020,    3,    3, ...,    0,    0,    1],
       [2250,    3,    3, ...,    0,    1,    0]])

> #### Instruction
> Check the number of variables in the table and find out how many features we have deleted.

<!-- -->

> #### Warning
> As previously mentioned, `fit_transform()` in `sklearn` by default returns a `numpy.ndarray` even when we pass in a DataFrame, so the column names are lost and we need a small trick to get them back!

<!-- -->

> #### Note
> We don't need column names for modeling but it helps with the interpretation of modeling results.


In [21]:
# columns we have selected
# get_support() is a method of VarianceThreshold; it returns a boolean mask with one entry per input column (True = kept)
selected_columns = df.columns[vt.get_support()]
# rebuild a data-frame from the array and reattach the column labels
df_transformed = pd.DataFrame(df_transformed, columns=selected_columns)
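As an alternative sketch, we can skip the round trip through a NumPy array: fit the selector only, then use the boolean mask from `get_support()` to index the original DataFrame. The toy frame below is hypothetical and only stands in for `df`.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame standing in for `df`
toy = pd.DataFrame({
    "SqFt": [1790, 2030, 1740, 1980],
    "Bedrooms": [2, 4, 3, 3],
    "Constant": [1, 1, 1, 1],
})

# fit only; we do not need the transformed array
vt_toy = VarianceThreshold(threshold=0.1).fit(toy)
mask = vt_toy.get_support()

# boolean indexing keeps the result a DataFrame with its labels and dtypes
toy_selected = toy.loc[:, mask]

# the inverted mask tells us which columns were removed
removed = toy.columns[~mask].tolist()
print(f"removed {len(removed)} feature(s): {removed}")
```

The same inverted mask is a convenient way to count how many features a selector has deleted.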



### Part 2: Removing Correlated Features

The goal of this part is to remove one feature from each highly correlated pair.


We are going to do this in 3 steps:

1. Calculate a correlation matrix
2. Get pairs of highly correlated features
3. Remove correlated columns


In [24]:
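import numpy as np

# step 1: absolute correlation matrix
df_corr = df_transformed.corr().abs()

# step 2: pairs above the threshold; x < y keeps each pair once and skips the diagonal
indices = np.where(df_corr > 0.8)
indices = [(df_corr.index[x], df_corr.columns[y])
           for x, y in zip(*indices)
           if x < y]

# step 3: drop the second feature of each pair
for idx in indices:
    try:
        df_transformed.drop(idx[1], axis=1, inplace=True)
    except KeyError:
        # the column was already dropped as part of an earlier pair
        pass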


The code above drops one column (the second one) from each pair whose absolute correlation is greater than `0.8`. A column can appear in more than one pair, in which case the second attempt to drop it raises a `KeyError`. The try-except block lets the loop continue when that happens.
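The same three steps can also be sketched without the try-except block by looking only at the upper triangle of the correlation matrix. This is one possible variant, shown on a hypothetical toy frame (the values are made up. Only the column names echo our dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Number_Rooms is almost a multiple of Bedrooms
toy = pd.DataFrame({
    "Bedrooms":     [1, 2, 3, 4, 5, 6],
    "Number_Rooms": [2, 4, 6, 8, 10, 13],
    "Offers":       [3, 1, 2, 3, 1, 2],
})

# step 1: absolute correlation matrix
corr = toy.corr().abs()

# step 2: keep only the entries above the diagonal, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# step 3: a column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Because every column is collected at most once in `to_drop`, a single `drop` call is enough and no `KeyError` can occur.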

We can check the correlated columns by printing the indices:


In [25]:
print(indices)

[('Bedrooms', 'Number_Rooms'), ('Brick_No', 'Brick_Yes')]


> #### Instruction
> Check the number of variables in the table and find out how many features we have deleted.


### Part 3: Forward Regression

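We have so far removed the features with **very little information** and the **correlated features**. The last thing we will do before modeling is to select the k best features in terms of their relationship with the `target` variable. We will use `SelectKBest` with the `f_regression` score for that. Strictly speaking, this is a univariate _filter_ method rather than a forward _wrapper_ method: each feature is scored on its own against the target, and the k highest-scoring features are kept: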

In [31]:
from sklearn.feature_selection import f_regression, SelectKBest
skb = SelectKBest(f_regression, k=3)
X = skb.fit_transform(df_transformed, y)

In [32]:
X

array([[1790,    0,    0],
       [2030,    0,    0],
       [1740,    0,    0],
       [1980,    0,    0],
       [2130,    0,    0],
       [1780,    1,    0],
       [1830,    0,    1],
       [2160,    0,    1],
       [2110,    0,    0],
       [1730,    0,    0],
       [2030,    0,    0],
       [1870,    0,    0],
       [1910,    1,    0],
       [2150,    1,    0],
       [2590,    0,    1],
       [1780,    0,    1],
       [2190,    0,    0],
       [1990,    1,    0],
       [1700,    0,    0],
       [1920,    0,    1],
       [1790,    0,    0],
       [2000,    1,    0],
       [1690,    1,    0],
       [1820,    1,    0],
       [2210,    0,    0],
       [2290,    1,    0],
       [2000,    0,    1],
       [1700,    0,    0],
       [1600,    1,    0],
       [2040,    0,    1],
       [2250,    0,    1],
       [1930,    1,    0],
       [2250,    0,    0],
       [2280,    0,    0],
       [2000,    1,    0],
       [2080,    1,    0],
       [1880,    1,    0],
       ...


We need to import `SelectKBest`. We also have to decide which scoring function to use for the actual selection. Since our `target` variable is continuous, we imported `f_regression`, which computes an F-statistic for the linear relationship between each individual feature and the target. We would use a different scoring function, such as `f_classif` or `chi2`, if the `target` variable were categorical.
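To see what `SelectKBest` bases its decision on, we can inspect the `scores_` and `pvalues_` attributes of the fitted selector. The sketch below uses synthetic data (the column names, coefficients and random seed are assumptions made purely for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends strongly on one feature and weakly on another
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_y = 5.0 * toy_X["strong"] + 0.5 * toy_X["weak"] + rng.normal(size=n)

skb_toy = SelectKBest(score_func=f_regression, k=2).fit(toy_X, toy_y)

# one F-statistic and one p-value per input feature
scores = pd.DataFrame(
    {"F": skb_toy.scores_, "p_value": skb_toy.pvalues_},
    index=toy_X.columns,
).sort_values("F", ascending=False)
print(scores)

# the k features with the highest F-statistics are the ones that are kept
selected = toy_X.columns[skb_toy.get_support()].tolist()
print(selected)
```

Looking at the full score table, and not only at the selected columns, helps to judge whether the choice of `k` is sensible.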

> #### Note
> We assigned our `target` variable `Price` to `y` at the beginning of this tutorial.

<!-- -->

> #### Warning
> The type of `X` was again changed to a NumPy array.

<!-- -->

> #### Instruction
> Convert `X` back to a data-frame and assign back the correct column names.
>
> HINT: Use the method `get_support()` from the `SelectKBest` instance to find the features that were selected.
>
> **Try to do it before looking at the solution below.**




In [35]:
# boolean mask marking the 3 selected columns
skb.get_support()
# names of the selected columns
df_transformed.columns[skb.get_support()]
# rebuild the data-frame with the correct column names
X = pd.DataFrame(X, columns=df_transformed.columns[skb.get_support()])


In [36]:
X

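| | SqFt | Neighborhood_North | Neighborhood_West |
|---|---|---|---|
| 0 | 1790 | 0 | 0 |
| 1 | 2030 | 0 | 0 |
| 2 | 1740 | 0 | 0 |
| 3 | 1980 | 0 | 0 |
| 4 | 2130 | 0 | 0 |
| ... | ... | ... | ... |
| 123 | 1900 | 0 | 0 |
| 124 | 2160 | 0 | 0 |
| 125 | 2070 | 1 | 0 |
| 126 | 2020 | 0 | 1 |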


Now `X` consists of the 3 features with the strongest univariate linear relationship to our target variable, `Price`.
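`SelectKBest` scores every feature on its own, so it can keep two features that carry nearly the same information. For comparison, a wrapper-style forward selection adds features one at a time and judges each candidate together with the features already chosen. A minimal sketch, assuming scikit-learn 0.24 or newer (where `SequentialFeatureSelector` is available) and using synthetic data with hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest,
    SequentialFeatureSelector,
    f_regression,
)
from sklearn.linear_model import LinearRegression

# Synthetic data: "strong_copy" is a noisy duplicate of "strong"
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
toy_X = pd.DataFrame({
    "strong": strong,
    "strong_copy": strong + rng.normal(scale=0.1, size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_y = 5.0 * toy_X["strong"] + 2.0 * toy_X["weak"] + rng.normal(size=n)

# filter: ranks each feature individually by its F-statistic
filter_sel = SelectKBest(score_func=f_regression, k=2).fit(toy_X, toy_y)
filter_cols = toy_X.columns[filter_sel.get_support()].tolist()

# wrapper: greedily adds the feature that most improves the cross-validated score
forward_sel = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(toy_X, toy_y)
forward_cols = toy_X.columns[forward_sel.get_support()].tolist()

print("filter: ", filter_cols)
print("forward:", forward_cols)
```

In this toy setup the filter keeps both near-duplicate columns, while the forward selection picks up `weak` as its second feature because a duplicate adds almost nothing once its twin is in the model. The wrapper approach is more expensive, since it refits the model many times.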
