# Feature selection
Just because you have a dataset of 30 features (30 variables on the right-hand side of your equation) doesn't mean you have to use all 30 in your model.  Can you think of reasons why it might be beneficial to drop certain variables?

Let's use our breast cancer dataset to experiment with feature selection.

In [76]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [77]:
df = pd.read_csv("../../assets/breast-cancer.csv", header=None)
df.iloc[:,1] = df.iloc[:,1] == 'M'
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,True,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,True,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,True,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,True,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,True,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


# First, perform a logistic regression on all of the features
(Remember, the first column is just the patient ID--you can ignore that.)

In [78]:
y = df.iloc[:,1]
X = df.iloc[:,2:]
model = linear_model.LogisticRegression()
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

0.95957820738137078

# But do we need all of the features?
What sort of strategy might one take to drop features?  What if we used the correlation between the x variables and the y variable?

In [79]:
corr = df.corr()  # correlation matrix of every column against every other column
yXCorr = corr.iloc[1, 2:]  # row 1 is y (the diagnosis); columns 2 onward are the features, so the ID column is skipped
yXCorr = abs(yXCorr)  # take absolute values so strong negative correlations rank alongside strong positive ones
yXCorr = pd.DataFrame(yXCorr)  # convert the Series to a DataFrame
yXCorr.sort_values(by=yXCorr.columns[0], inplace=True)  # sort ascending, so the strongest correlations end up at the bottom
X = df.iloc[:, yXCorr.index[-3:]]  # keep the 3 features most correlated with y (column labels equal positions because header=None)
model.fit(X, y)  # fit the model on just those three features, not the whole set
yHat = model.predict(X)  # predict y from the three selected features
model.score(X, y)

0.92091388400702989
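
The same ranking can be written more compactly. The following is a minimal sketch, not the notebook's own method: `top_k_by_abs_corr` is a hypothetical helper name, and the data is a synthetic stand-in rather than the breast cancer file.

```python
import numpy as np
import pandas as pd

def top_k_by_abs_corr(X, y, k=3):
    """Return the k column labels of X with the largest |Pearson r| against y."""
    abs_corr = X.corrwith(y.astype(float)).abs()
    return abs_corr.sort_values(ascending=False).head(k).index.tolist()

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
y_demo = pd.Series(rng.randint(0, 2, size=200))
X_demo = pd.DataFrame({
    "strong": y_demo + rng.normal(scale=0.1, size=200),  # tracks y closely
    "medium": y_demo + rng.normal(scale=0.5, size=200),  # tracks y loosely
    "noise": rng.normal(size=200),                       # unrelated to y
})

top_two = top_k_by_abs_corr(X_demo, y_demo, k=2)
print(top_two)
```

With the notebook's data, the equivalent call would be `top_k_by_abs_corr(df.iloc[:, 2:], df.iloc[:, 1], k=3)`.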

In [80]:
yXCorr.tail(3)

Unnamed: 0,1
9,0.776614
24,0.782914
29,0.793566


In [81]:
X.head()

Unnamed: 0,9,24,29
0,0.1471,184.6,0.2654
1,0.07017,158.8,0.186
2,0.1279,152.5,0.243
3,0.1052,98.87,0.2575
4,0.1043,152.2,0.1625


### Let's look at the correlations between the "three best features" according to our "most correlated with y" approach.
What can you say about how these features are correlated with each other?

In [82]:
#The three features are highly correlated with one another.
#Features 9 and 29 have a correlation of about 0.91, so you may be able to drop one of them,
#since it's possible they're just telling you the same information
X.corr()

Unnamed: 0,9,24,29
9,1.0,0.855923,0.910155
24,0.855923,1.0,0.816322
29,0.910155,0.816322,1.0
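
One way to make the "are these features redundant?" check systematic is to list every pair whose absolute correlation exceeds a cutoff. This is a sketch only: `pairs_above` is a hypothetical helper, and the 0.9 threshold is an arbitrary assumption, not a rule. The matrix below is copied from the output above.

```python
import pandas as pd

def pairs_above(corr, threshold=0.9):
    """List (label_a, label_b, |r|) for each distinct pair with |r| above threshold."""
    labels = list(corr.columns)
    pairs = []
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):  # upper triangle only, so no duplicates
            r = abs(float(corr.iloc[a, b]))
            if r > threshold:
                pairs.append((labels[a], labels[b], r))
    return pairs

# Correlation matrix of features 9, 24 and 29, as printed above
corr_demo = pd.DataFrame(
    [[1.0, 0.855923, 0.910155],
     [0.855923, 1.0, 0.816322],
     [0.910155, 0.816322, 1.0]],
    index=[9, 24, 29], columns=[9, 24, 29],
)

redundant = pairs_above(corr_demo, threshold=0.9)
print(redundant)
```

On the real data you would pass `X.corr()` directly instead of a hand-typed matrix.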


### Let's also add our y variable, and look at its correlation numbers, for future reference.

In [83]:
#X['y'] = y

yDf = pd.DataFrame(y)
yX = yDf.join(X)
yX.corr()

Unnamed: 0,1,9,24,29
1,1.0,0.776614,0.782914,0.793566
9,0.776614,1.0,0.855923,0.910155
24,0.782914,0.855923,1.0,0.816322
29,0.793566,0.910155,0.816322,1.0


As the score of about 0.92 from the three most correlated features shows, our model score doesn't decrease by much even though we are using only three features.  To drive home the point that the correlation matters, let's repeat our test with the three least correlated variables.

In [85]:
X = df.iloc[:,yXCorr.index[:3]]
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

0.62741652021089633
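
A useful reference point for a score like this is the accuracy of always predicting the most common class. The sketch below assumes a class split of 357 benign to 212 malignant, which is an assumption about this file that you should confirm with `y.value_counts()`; `majority_class_rate` is a hypothetical helper name.

```python
import pandas as pd

def majority_class_rate(y):
    """Accuracy obtained by always predicting the most common class."""
    return y.value_counts(normalize=True).max()

# Assumed class counts (357 False, 212 True); verify against the real y
y_demo = pd.Series([False] * 357 + [True] * 212)
baseline = majority_class_rate(y_demo)
print(baseline)
```

If the real split matches that assumption, the baseline is about 0.6274, which would suggest the three least correlated features add essentially nothing beyond guessing the majority class.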

In [86]:
yXCorr.head(3)

Unnamed: 0,1
20,0.006522
13,0.008303
11,0.012838


# Use SelectKBest to get the best 3 features using chi2.
Refit the model, predict again, and print the score.

In [88]:
#Don't forget to reset X to the original variables (all of the features)
#X is a list of all of my predictors
X = df.iloc[:,2:]

In [89]:
# Perform feature selection
selector = SelectKBest(chi2, k=3)  # keep a reference to the selector so its p-values can be read afterwards
X = selector.fit_transform(X, y)  # keep only the 3 features with the highest chi^2 scores

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

In [91]:
model.fit(X,y)
yHat2 = model.predict(X)
model.score(X,y)

0.95957820738137078
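
Because `fit_transform` returns a plain array, the identity of the chosen columns is lost. A fitted selector can report which columns it kept through `get_support`. The sketch below uses synthetic count data as a stand-in (an assumption for illustration); note that `chi2` expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative data: only the middle column depends on the class
rng = np.random.RandomState(0)
y_demo = np.repeat([0, 1], 100)
informative = rng.poisson(lam=np.where(y_demo == 1, 20, 5))
noise_a = rng.poisson(lam=5, size=200)
noise_b = rng.poisson(lam=5, size=200)
X_demo = np.column_stack([noise_a, informative, noise_b])

selector_demo = SelectKBest(chi2, k=1).fit(X_demo, y_demo)
selected = selector_demo.get_support(indices=True)  # positions of the kept columns
print(selected)
print(selector_demo.scores_)  # one chi^2 score per input column
```

With the notebook's data, calling `get_support(indices=True)` on the fitted selector gives positions within `df.iloc[:,2:]`, so add 2 to recover the original column labels.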

# Can we make an improvement in the score?
Let's try a "brute force search" (exhaustive) to see if we can find three features which give us a better score.  So, go through every combination of x variables (limiting to three x's per run) and fit and score your model, keeping track of the best score and the best x's.

In [92]:
# Don't forget to reset X to the original variables (all of the features)

X = df.iloc[:,2:]
bestScore = 0 
#start bestScore at 0; the loop updates it whenever a combination scores higher than the previous bestScore
bestFeatures = [-1,-1,-1] #placeholder values, replaced by the labels of the best features found
for i in X.columns:
    for j in X.columns:
        if (j <= i):
            continue #skip unless j > i, so no feature is repeated and we
                     #don't try the same combination in a different order.
        for k in X.columns:
            if (k <= j):
                continue
            Xtemp = df.loc[:,[i,j,k]]
            model.fit(Xtemp, y)
            currentScore = model.score(Xtemp, y)
            if (currentScore > bestScore):
                bestScore = currentScore #this combination beat the previous best, so remember its score
                bestFeatures = [i,j,k] #and the names of the three features that gave us that score
print(bestScore)
print(bestFeatures)
#Notice that the best-scoring set of 3 variables includes variables with lower correlations with y than our highest 3

0.952548330404
[22, 23, 28]
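
The nested loops above can also be expressed with `itertools.combinations`, which yields each unordered subset exactly once; with 30 features there are 4,060 three-feature subsets. This is a minimal sketch on synthetic data, and `best_subset` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

def best_subset(X, y, size=3):
    """Fit on every column subset of the given size; return (score, columns, n_tried)."""
    model = LogisticRegression()
    best_score, best_cols, n_tried = -1.0, None, 0
    for cols in combinations(range(X.shape[1]), size):
        X_sub = X[:, list(cols)]
        model.fit(X_sub, y)
        score = model.score(X_sub, y)  # training accuracy, as in the loop above
        n_tried += 1
        if score > best_score:  # strict '>' keeps the first subset on ties
            best_score, best_cols = score, cols
    return best_score, best_cols, n_tried

# Synthetic stand-in: y depends only on columns 0 and 2
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 2] > 0).astype(int)

best_score, best_cols, n_tried = best_subset(X_demo, y_demo, size=2)
print(best_cols, n_tried)
```

To apply it to the notebook's data you could pass `df.iloc[:,2:].values` and `y`, remembering that the returned positions are offset by 2 from the original column labels.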


# How do these best features compare with the first three features we found?
Originally, we chose the three x's that were most correlated with y.  We obtained a solid score, but now we have obtained a better score.  Why do you think that is?
## Let's examine the correlation matrix of our new three best features and y