# Feature selection
Just because you have a dataset of 30 features (30 variables on the right-hand side of your equation) doesn't mean you have to use all 30 in your model. Can you think of reasons why it might be beneficial to drop certain variables?

Let's use our breast cancer dataset to experiment with feature selection.

In [27]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from patsy import dmatrices
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [47]:
df = pd.read_csv("../../assets/breast-cancer.csv", header=None)
df.iloc[:,1] = df.iloc[:,1] == 'M'  # Elementwise comparison: each row becomes True if the diagnosis is 'M', otherwise False.
                                    # You'd rather have the boolean than the string, since your correlation table
                                    # will not like categorical values like 'M' or 'B'.
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,True,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,True,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,True,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,True,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,True,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


# First, perform a logistic regression on all of the features
(Remember, the first column is just the patient ID--you can ignore that.)

In [29]:
#columns = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

In [30]:
#target = df[1].values
#target = df['diagnosis']
#features = df[[x for x in columns if x not in [0, 1]]].values
#features = df[['meanRadius', 'meanTexture', 'meanPerimeter', 'meanArea', 'meanSmoothness',
#          'meanCompactness', 'meanConcavity', 'meanConcavePoints', 'meanSymmetry', 'meanFractalDimension',
#          'radiusSE', 'textureSE', 'perimeterSE', 'areaSE', 'smoothnessSE',
#          'compactnessSE', 'concavitySE', 'concavePointsSE', 'symmetrySE', 'fractalDimensionSE', 
#           'worstRadius', 'worstTexture', 'worstPerimeter', 'worstArea', 'worstSmoothness',
#          'worstCompactness', 'worstConcavity', 'worstConcavePoints', 'worstSymmetry', 'worstFractalDimension']]
#features.shape

In [31]:
# instantiate a logistic regression model, and fit with X and y
#model = LogisticRegression()
#model = model.fit(features, target)

In [32]:
#model.score(features, target)

In [48]:
y = df.iloc[:,1]
X = df.iloc[:,2:]
model=linear_model.LogisticRegression()
model.fit(X,y)
yHat=model.predict(X)
model.score(X,y)

0.95957820738137078

# But do we need all of the features?
What sort of strategy might one take to drop features?

In [49]:
#Build a correlation matrix. You don't have statsmodels to print out a summary report, so rank the features by their correlation with the target instead.
corr = df.corr()
yXCorr = corr.iloc[1,2:]
yXCorr = abs(yXCorr)  # We don't really care whether a relationship is negative or positive, just its overall strength
yXCorr = pd.DataFrame(yXCorr)
yXCorr.sort_values(by=yXCorr.columns[0], inplace=True)

In [50]:
yXCorr

Unnamed: 0,1
20,0.006522
13,0.008303
11,0.012838
16,0.067016
21,0.077972
18,0.25373
17,0.292999
31,0.323872
10,0.330499
6,0.35856
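
The ranking above can also be written more compactly with `DataFrame.corrwith`, which correlates each feature column with the target directly instead of building the full correlation matrix. This is only a sketch: the helper name `rank_by_abs_corr` and the toy data are hypothetical, made up for illustration, and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def rank_by_abs_corr(X, y):
    """Sort features by absolute Pearson correlation with y, weakest first."""
    # Cast the target to float so the result does not depend on how the
    # True/False column happens to be stored.
    return X.corrwith(y.astype(float)).abs().sort_values()

# Toy data (hypothetical): one column tied to the target, one pure noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy_y = pd.Series(signal > 0)
toy_X = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

ranking = rank_by_abs_corr(toy_X, toy_y)
print(ranking)
```

With the notebook's variables, `rank_by_abs_corr(X, y)` should give the same kind of table as `yXCorr`, with the weakest features at the top and the strongest at the bottom.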


In [52]:
#So above we see which variables have the weakest correlation with the target; take the three weakest
uncorrX = df.iloc[:,yXCorr.index[0:3]]
uncorrX.head()

Unnamed: 0,20,13,11
0,0.03003,0.9053,0.07871
1,0.01389,0.7339,0.05667
2,0.0225,0.7869,0.05999
3,0.05963,1.156,0.09744
4,0.01756,0.7813,0.05883


In [54]:
model2=linear_model.LogisticRegression()
model2.fit(uncorrX,y)
yHat2=model2.predict(uncorrX)
model2.score(uncorrX,y)
#So we get a low score with just these three variables
#(about 0.627 is what you would get by always predicting benign: 357 of the 569 rows)

0.62741652021089633

In [55]:
corrX = df.iloc[:,yXCorr.index[-3:]]
corrX.head()

Unnamed: 0,9,24,29
0,0.1471,184.6,0.2654
1,0.07017,158.8,0.186
2,0.1279,152.5,0.243
3,0.1052,98.87,0.2575
4,0.1043,152.2,0.1625


In [56]:
model3=linear_model.LogisticRegression()
model3.fit(corrX,y)
yHat3=model3.predict(corrX)
model3.score(corrX,y)

0.92091388400702989

So we drop certain features because:

* A simpler model has less tendency to overfit, which can reduce error on new data
* Irrelevant variables add noise and can lower your predictive power
* Huge datasets with a bunch of useless variables cost computational time to process

You can do this manually, as above, but you don't have to:

* LASSO (L1-regularized) regression can shrink the coefficients of weak variables all the way to 0 on its own, which handles feature selection for us
* sklearn also has feature selection tools (the `sklearn.feature_selection` module), so you don't have to do this manually
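
The two automatic options in the list above can be sketched roughly as follows. Several things here are assumptions rather than part of the original notebook: scikit-learn's bundled copy of the breast cancer data stands in for the CSV so the example is self-contained, the penalty strength `C=0.1` and `k=3` are arbitrary choices, and an L1-penalized logistic regression is used as the classification counterpart of LASSO.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for ../../assets/breast-cancer.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Option 1: L1-penalized (LASSO-style) logistic regression.
# Features are standardized first so the penalty treats them comparably.
# C is the inverse penalty strength; a smaller C zeroes out more coefficients.
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
l1_model.fit(X_demo, y_demo)
coefs = l1_model.named_steps["logisticregression"].coef_.ravel()
n_kept = int((coefs != 0).sum())
print("features with nonzero coefficients:", n_kept, "of", coefs.size)

# Option 2: univariate selection, keeping the k features with the
# highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_top3 = selector.fit_transform(X_demo, y_demo)
print("column positions chosen:", selector.get_support(indices=True))
```

The first option selects features as a side effect of fitting the model, while the second scores each feature on its own before any model is fit, much like the manual correlation ranking earlier.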