## Feature Selection ##

In this notebook, we will see how to choose the best predictors for a machine learning model.

We'll use the Titanic dataset available on Kaggle.

There are 3 methods for feature selection:

1. [Filter method](#filter): Choose features based on how statistically significant they are in determining the response variable.

2. [Wrapper method](#wrapper): Instead of statistical tests, a model is trained on different subsets of features, and the subset that gives the best results is kept. We'll see how to form these subsets in a while.

3. [Embedded method](#embedded): Feature selection happens as part of model training itself, combining traits of both the filter and wrapper methods.

#### Read the Titanic data ####

In [1]:
import pandas as pd
train = pd.read_csv('data/train.csv')

In [2]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


1. PassengerId
2. Survived: Survived or Not
3. Pclass: Class of Travel
4. Name: Name of Passenger
5. Sex: Gender
6. Age: Age of Passengers
7. SibSp: Number of Sibling/Spouse aboard
8. Parch: Number of Parent/Child aboard
9. Ticket
10. Fare
11. Cabin
12. Embarked: Port at which the passenger embarked (C = Cherbourg, S = Southampton, Q = Queenstown)

Let's pick **Survived** (whether the passenger survived the accident) as our target/response variable.

We also remove the columns that are obviously not useful for prediction.

In [3]:
train.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)
y = train['Survived']
train.drop(['Survived'],axis=1, inplace=True)

In [4]:
train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,22.0,1,0,7.25,,S
1,1,female,38.0,1,0,71.2833,C85,C
2,3,female,26.0,0,0,7.925,,S
3,1,female,35.0,1,0,53.1,C123,S
4,3,male,35.0,0,0,8.05,,S


Time for some **pre-processing**
1. We can't pass string values to models and statistical functions, so we'll have to encode them as numbers.
2. Similarly, most models can't handle NaNs, so we'll have to deal with them. We can either drop those records or replace the missing values with the most frequently occurring value. Let's replace them, as the dataset is already fairly small.
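
The two steps above can be sketched on a toy frame first. The values below are made up for illustration and are not taken from the Titanic data; `toy` and `most_frequent` are names chosen just for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Titanic dataset)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Embarked': ['S', np.nan, 'S', 'C'],
})

# 1. Encode strings as integers with an explicit mapping
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# 2. Replace missing values with the most frequent value (the mode);
#    mode() ignores NaN by default
most_frequent = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_frequent)

print(toy)
```

An explicit mapping keeps the encoding readable for a two-valued column; for columns with many distinct values, an encoder object is usually more convenient.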

In [5]:
print(train['Sex'].value_counts(dropna=False))
train['Sex'] = train['Sex'].map({'male':0,'female':1})
print(train['Sex'].value_counts(dropna=False))

male      577
female    314
Name: Sex, dtype: int64
0    577
1    314
Name: Sex, dtype: int64


In [6]:
train['Pclass'].unique()

array([3, 1, 2])

In [7]:
print(train['Cabin'].value_counts(dropna=False)[:10]) #Showing top 10 most frequent values
train['Cabin'] = train['Cabin'].fillna('G6')

NaN            687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
F33              3
D                3
E101             3
F2               3
C93              2
Name: Cabin, dtype: int64


In [8]:
from sklearn.preprocessing import LabelEncoder
cabinEncoder = LabelEncoder()
train['Cabin'] = cabinEncoder.fit_transform(train['Cabin'])

In [9]:
print(train['Embarked'].value_counts(dropna=False))
train['Embarked'] = train['Embarked'].fillna('S')

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64


In [10]:
embarkedEncoder = LabelEncoder()
train['Embarked'] = embarkedEncoder.fit_transform(train['Embarked'])
print(train['Embarked'].value_counts(dropna=False))

2    646
0    168
1     77
Name: Embarked, dtype: int64


In [11]:
train[['Age','SibSp','Parch','Fare']].describe()

Unnamed: 0,Age,SibSp,Parch,Fare
count,714.0,891.0,891.0,891.0
mean,29.699118,0.523008,0.381594,32.204208
std,14.526497,1.102743,0.806057,49.693429
min,0.42,0.0,0.0,0.0
25%,20.125,0.0,0.0,7.9104
50%,28.0,0.0,0.0,14.4542
75%,38.0,1.0,0.0,31.0
max,80.0,8.0,6.0,512.3292


In [12]:
train['Age'].isna().value_counts()

False    714
True     177
Name: Age, dtype: int64

As Age is a continuous variable, we can impute it with a summary statistic of its distribution, such as the mean, median, min or max. We pick the mean.

In [13]:
train['Age'] = train['Age'].fillna(train['Age'].mean())
train['Age'].isna().value_counts()

False    891
Name: Age, dtype: int64
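
The mean is sensitive to outliers, so the median is a common alternative for imputation. Below is a minimal sketch comparing the two on made-up ages, assuming scikit-learn's `SimpleImputer` is available; the `ages` frame is hypothetical and not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with one missing value and one large value (80)
ages = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, 80.0]})

# Mean imputation, as done in the notebook
mean_filled = ages['Age'].fillna(ages['Age'].mean())

# Median imputation with SimpleImputer (expects a 2-D input)
median_imputer = SimpleImputer(strategy='median')
median_filled = median_imputer.fit_transform(ages[['Age']]).ravel()

print('Mean-imputed value:  ', mean_filled.iloc[2])
print('Median-imputed value:', median_filled[2])
```

An imputer object has the added benefit that the value learned on the training data can later be reused on unseen data through `transform`.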

## 1. Filter method ##
<a id="filter"></a>

For classification tasks, common filter techniques are Linear Discriminant Analysis (LDA) and the chi-square test. We use the chi-square test here.

In [82]:
from sklearn.feature_selection import chi2
model = chi2(train[['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']],y) 

In [83]:
print('Chi2 of each predictor: {}'.format(model[0]))

Chi2 of each predictor: [3.08736994e+01 1.70348127e+02 2.46879258e+01 2.58186538e+00
 1.00974991e+01 4.51831909e+03 5.47557842e+02 1.02025247e+01]


In [84]:
print('P-values of each predictor: {}'.format(model[1]))

P-values of each predictor: [2.75378563e-008 6.21058490e-039 6.74051416e-007 1.08094210e-001
 1.48470676e-003 0.00000000e+000 4.27819679e-121 1.40248517e-003]


Observation:

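**Fare** is the most significant predictor, followed by **Cabin** and then **Sex**. They have the highest chi2 scores (~4518, ~547 and ~170) and the smallest p-values (0, ~4.3e-121 and ~6.2e-39). The p-value is the probability of seeing a statistic at least this large if the predictor and the response variable were actually independent, so a very small p-value suggests the relationship did not occur by chance. One caveat: scikit-learn's `chi2` expects non-negative features such as counts or booleans, and its score grows with a feature's scale, so the scores of continuous or label-encoded columns such as **Fare** and **Cabin** should be read with some caution.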
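
To make the statistic less of a black box, here is a minimal NumPy sketch of how a chi-square score of this kind can be computed. It reflects our understanding of the computation behind `chi2` (each feature is treated as a count that is summed per class), not the library's actual code. The function name `chi2_scores` and the data are made up for illustration.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each non-negative feature column against the class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                    # per-class sum of each feature
    class_prob = Y.mean(axis=0)           # share of samples in each class
    feature_total = X.sum(axis=0)         # overall sum of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Made-up data: column 0 is a binary flag, column 1 is a fare-like amount
X = np.array([[1, 10.0], [1, 12.0], [0, 50.0], [0, 60.0], [1, 11.0], [0, 55.0]])
y = np.array([0, 0, 1, 1, 0, 1])

scores = chi2_scores(X, y)
# Same data, but the second column is expressed in units 100 times smaller
scaled = chi2_scores(X * np.array([1, 100]), y)

print('Original scores:', scores)
print('Rescaled scores:', scaled)
```

The score of the rescaled column grows by the same factor of 100 while the binary column is unaffected, which is why raw chi2 scores of features measured on different scales are hard to compare directly.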

## 2. Wrapper method ##
<a id="wrapper"></a>

a. **Forward Selection**: A trainable model is used as a wrapper to suggest significant features. Let's use a simple k-nearest neighbours classifier. We start with the single most significant feature and keep adding one feature at a time.

In [42]:
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']

In [77]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

seed = 7

# Split the train dataset into train and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(train[features], y, random_state=seed)

desired_feature_subset_size = 3
feature_subset = []

for i in range(desired_feature_subset_size):
    max_accuracy = 0
    # Try each remaining feature and keep the one that gives the best validation accuracy
    for feature in features:
        if feature in feature_subset:
            continue

        candidate = feature_subset + [feature]
        clf = KNeighborsClassifier()
        clf.fit(X_train[candidate], y_train)

        predictions = clf.predict(X_validation[candidate])
        accuracy = metrics.accuracy_score(y_validation, predictions)

        if accuracy > max_accuracy:
            max_accuracy = accuracy
            best_feature = feature

    feature_subset.append(best_feature)
    print('Subset Size: {}. Features: {}. Accuracy: {}'.format(i+1, feature_subset, max_accuracy))

Subset Size: 1. Features: ['Pclass']. Accuracy: 0.7443946188340808
Subset Size: 2. Features: ['Pclass', 'Fare']. Accuracy: 0.7443946188340808
Subset Size: 3. Features: ['Pclass', 'Fare', 'Sex']. Accuracy: 0.8026905829596412


Observation:
1. According to forward selection with a nearest-neighbour model as the wrapper, **Pclass, Fare and Sex** are the 3 most significant features.
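
The same greedy search is also available ready-made. Below is a minimal sketch using `SequentialFeatureSelector`, assuming scikit-learn 0.24 or later. It differs from the loop above in that it scores candidates with cross-validation rather than a single validation split, and the data here is synthetic (only columns 0 and 2 carry signal), so it is an illustration and not a rerun of the Titanic experiment.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(7)

# Synthetic data (hypothetical): the label depends only on columns 0 and 2
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(),
    n_features_to_select=2,
    direction='forward',   # 'backward' starts from all features instead
    cv=5,                  # 5-fold cross-validated accuracy for each candidate
)
selector.fit(X, y)

# Indices of the columns that were kept
selected = np.flatnonzero(selector.get_support())
print('Selected column indices:', selected)
```

Cross-validation makes the choice less dependent on one particular train/validation split, at the cost of fitting more models.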

b. **Backward Selection**

We can follow the reverse of the forward approach. To be precise, we start with all the features and, for each subset size, go through every feature and discard the worst-performing predictor.

*Recursive Feature Elimination* is a popular backward feature selection algorithm. How does it work? It trains a model on all the features, finds the importance of each feature and discards the least important one. This process is repeated until the desired number of important features remains.
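
Recursive Feature Elimination as described above can be sketched with scikit-learn's `RFE` class. The data below is synthetic and only for illustration. Note that `RFE` needs an estimator that exposes `coef_` or `feature_importances_`, so a decision tree is used here instead of the nearest-neighbour model from the forward selection example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(7)

# Synthetic data (hypothetical): the label depends only on columns 0 and 2
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=7),
    n_features_to_select=2,   # stop when 2 features remain
    step=1,                   # discard one feature per round
)
rfe.fit(X, y)

print('Kept features (mask):', rfe.support_)
print('Ranking:', rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```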

## 3. Embedded Method ##
<span id="embedded"></span>

Essentially, this combines ideas from both the filter and wrapper methods: features are selected while the model is being trained.

Certain types of classifiers follow their own search strategy. Some use regularization to shrink the coefficients
of less important features, while others assign importance rankings based on the entropy/information gain the features offer.

*Decision Tree* is one such classifier.

In [80]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 7

# Split the train dataset into train and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(train[features], y, random_state=seed)

# Fit a decision tree on all features; it ranks them internally while training
clf = DecisionTreeClassifier()
clf.fit(X_train[features], y_train)

predictions = clf.predict(X_validation[features])
accuracy = metrics.accuracy_score(y_validation, predictions)
print('Accuracy: {}'.format(accuracy))

Accuracy: 0.7443946188340808


In [81]:
[(features[index], score) for index, score in enumerate(clf.feature_importances_)]

[('Pclass', 0.07311698963103617),
 ('Sex', 0.3468378688684211),
 ('Age', 0.187042418500458),
 ('SibSp', 0.07450516823077591),
 ('Parch', 0.023583566802144865),
 ('Fare', 0.18277028351802715),
 ('Cabin', 0.09011474619023904),
 ('Embarked', 0.022028958258897717)]

Observation:

The decision tree classifier treats **Sex, Age and Fare** as the 3 most important features.
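
The regularization-based flavour of the embedded method mentioned earlier can be sketched as follows. This is a minimal example under stated assumptions: an L1-penalised logistic regression (not used elsewhere in this notebook) is wrapped in scikit-learn's `SelectFromModel`, and the data is synthetic, with only columns 1 and 4 carrying signal. The value `C=0.05` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(7)

# Synthetic data (hypothetical): the label depends only on columns 1 and 4
X = rng.normal(size=(300, 6))
y = (2 * X[:, 1] - 3 * X[:, 4] > 0).astype(int)

# Put features on a common scale so the penalty treats them evenly
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty pushes coefficients of unhelpful features to exactly zero
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05)
selector = SelectFromModel(l1_model, threshold=1e-5).fit(X_scaled, y)

mask = selector.get_support()
print('Coefficients:', selector.estimator_.coef_.ravel())
print('Kept features (mask):', mask)
```

A smaller `C` means stronger regularization and therefore fewer surviving features, so in practice it is usually tuned rather than fixed.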