<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Feature Selection on the Titanic Dataset

_Authors: Joseph Nelson (SF)_

---

In this lab you will explore several feature selection methods in scikit-learn, using the Titanic dataset.

You can load the Titanic data as follows:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

Alternatively, load the dataset from the local folder:

    ./datasets/titanic_train.csv
    

## Some useful feature selection resources

---

- Michigan State overview of [feature selection](http://www.cse.msu.edu/~cse802/Feature_selection.pdf) and (bonus) Texas A&M on [bidirectional feature selection](http://research.cs.tamu.edu/prism/lectures/pr/pr_l11.pdf)
- Sklearn documentation on [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)
- Side-by-side comparison of [feature selection tactics](http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Import the data and perform EDA. Engineer any features you think are predictive of survival.

We'll be working with the Titanic dataset. Import it from the datasets folder (or query for it as described above).

In [2]:
df = pd.read_csv('./datasets/titanic_train.csv')

In [3]:
# A:
# drop rows where 'Age' is null
df.dropna(subset=['Age'], inplace=True)

In [4]:
# encode 'Sex' as 1 for male and 0 for female
df['Sex'] = (df['Sex'] == 'male').astype(int)

In [5]:
# dropping column 'Cabin' because there are too many nulls
df.drop('Cabin', axis = 1, inplace = True)

In [6]:
# dropping rows where column 'Embarked' is null
df.dropna(subset=['Embarked'], inplace=True)

In [7]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,S


In [8]:
df_embarked = pd.get_dummies(df.Embarked)
df_pclass = pd.get_dummies(df.Pclass)
df2 = pd.concat([df, df_embarked, df_pclass], axis = 1 )
df2.drop('Pclass', inplace = True, axis = 1)
df2.drop('Embarked', inplace = True, axis = 1)

In [9]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 15 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Name           712 non-null object
Sex            712 non-null int64
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Ticket         712 non-null object
Fare           712 non-null float64
C              712 non-null uint8
Q              712 non-null uint8
S              712 non-null uint8
1              712 non-null uint8
2              712 non-null uint8
3              712 non-null uint8
dtypes: float64(2), int64(5), object(2), uint8(6)
memory usage: 59.8+ KB


In [10]:
df2.rename( columns = {1: 'Pclass_1', 2: 'Pclass_2', 3: 'Pclass_3'}, inplace = True)

In [11]:
df2.head(1)

Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,C,Q,S,Pclass_1,Pclass_2,Pclass_3
0,1,0,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,0,0,1,0,0,1
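
The heading for this step also asks for engineered features. A minimal sketch of what that could look like, assuming three hypothetical new columns (`FamilySize`, `IsAlone`, `Title`) that are not part of the original lab, and using three rows copied from the `df.head()` output above so that the block runs on its own:

```python
import pandas as pd

# Three rows copied from the df.head() output, enough to show the idea.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Allen, Mr. William Henry"],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
})

# Hypothetical feature: siblings/spouses + parents/children + the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Hypothetical feature: flag passengers travelling without family.
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

# Hypothetical feature: the honorific between the comma and the first period.
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(sample[["FamilySize", "IsAlone", "Title"]])
```

Whether any of these columns actually help is something the selection methods in the following steps can test; a categorical column such as `Title` would need dummy coding first, as was done for `Embarked`.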


### 2. Set up predictor and target matrices

- target should be `Survived`
- predictor matrix will be all other variables

In [12]:
# A:
y = df2.Survived
X = df2[['Sex', 'Age', 'SibSp', 'Fare', 'Parch', 'C', 'Q', 'S', 'Pclass_1', 'Pclass_2', 'Pclass_3']]

### 3. Feature selection

Let's use the `SelectKBest` class in scikit-learn to find the top 5 features. Also load the `f_classif` and `chi2` functions, which will be the scoring functions that decide what makes a variable the "best".

```python
from sklearn.feature_selection import SelectKBest, f_classif, chi2
```

- What are the top 5 features for `X` using `f_classif`?
- What are the top 5 features for `X` using `chi2`?


> The F-test statistic is the explained variance divided by the unexplained variance. It is high when the variance explained by the feature (what we know) is much greater than the unexplained variance (what we don't know). The chi-squared (`chi2`) goodness-of-fit statistic is the sum, over classes, of the squared difference between observed and expected values, divided by the expected value.
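
To make the two definitions concrete, the sketch below computes both statistics by hand with NumPy and compares them with scikit-learn's `f_classif` and `chi2`. The data are synthetic stand-ins (not the Titanic data), and the helper names `manual_f` and `manual_chi2` are made up for this illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 3 non-negative features, binary target.
y = rng.integers(0, 2, size=200)
X = rng.poisson(lam=3.0, size=(200, 3)).astype(float)
X[:, 0] += 2 * y  # make the first feature depend on the class


def manual_f(x, y):
    """One-way ANOVA F: between-class mean square / within-class mean square."""
    classes = np.unique(y)
    grand_mean = x.mean()
    ss_between = sum(x[y == c].size * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = x.size - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


def manual_chi2(x, y):
    """Sum over classes of (observed - expected)^2 / expected."""
    classes = np.unique(y)
    observed = np.array([x[y == c].sum() for c in classes])   # feature total per class
    class_prob = np.array([(y == c).mean() for c in classes])  # share of rows per class
    expected = class_prob * x.sum()                            # total if independent of class
    return ((observed - expected) ** 2 / expected).sum()


f_scores, _ = f_classif(X, y)
chi2_scores, _ = chi2(X, y)
manual_f_scores = np.array([manual_f(X[:, j], y) for j in range(X.shape[1])])
manual_chi2_scores = np.array([manual_chi2(X[:, j], y) for j in range(X.shape[1])])

print(np.allclose(f_scores, manual_f_scores))
print(np.allclose(chi2_scores, manual_chi2_scores))
```

Note that `chi2` sums raw feature values per class, so it only makes sense for non-negative features such as counts or dummy variables.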

In [13]:
df2.columns.values

array(['PassengerId', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'C', 'Q', 'S', 'Pclass_1', 'Pclass_2', 'Pclass_3'], dtype=object)

In [15]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# A:
feature_names =['Sex', 'Age', 'SibSp', 'Fare', 'Parch', 'C', 'Q', 'S', 'Pclass_1', 'Pclass_2', 'Pclass_3']
selector = SelectKBest(f_classif, k = 5)
selector.fit(X, y)
mask = selector.get_support(indices=False)
# avoid shadowing the built-in `bool` with the loop variable
selected_features_a = [feature for keep, feature in zip(mask, feature_names) if keep]
selected_features_a

['Sex', 'Fare', 'C', 'Pclass_1', 'Pclass_3']
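
The cell above answers the `f_classif` question only. One way to answer both questions with the same code is a small helper that takes the scoring function as an argument. This is a sketch, not the lab's reference solution: `top_k_features` is a hypothetical helper, and `X_demo` / `y_demo` are synthetic stand-ins so that the block runs without the CSV. In the notebook you would pass `X` and `y` instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif


def top_k_features(X, y, score_func, k=5):
    """Return the k column names with the highest univariate scores, best first."""
    selector = SelectKBest(score_func, k=k).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns)
    return scores[selector.get_support()].sort_values(ascending=False).index.tolist()


# Synthetic stand-in data: 8 non-negative features, two of them tied to the target.
rng = np.random.default_rng(42)
n = 300
y_demo = pd.Series(rng.integers(0, 2, size=n))
X_demo = pd.DataFrame(rng.poisson(3.0, size=(n, 8)),
                      columns=[f"feat_{i}" for i in range(8)])
X_demo["feat_0"] += 3 * y_demo
X_demo["feat_1"] += 2 * y_demo

kbest_f = top_k_features(X_demo, y_demo, f_classif, k=5)
kbest_chi2 = top_k_features(X_demo, y_demo, chi2, k=5)
print(kbest_f)
print(kbest_chi2)
```

Comparing the two lists shows whether the choice of scoring function changes which features make the cut.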

### 4. Recursive Feature Elimination (RFE)

scikit-learn also offers recursive feature elimination with cross-validation as a class named `RFECV`. Use it in combination with a logistic regression model to see which features this method would keep.

When instantiating the `RFECV`:
- `step` sets how many features to remove at each iteration: an integer of 1 or more is a count, a float between 0 and 1 is a fraction of the features.
- `cv` sets the number of cross-validation folds used to score each candidate feature subset and choose how many features to keep.

Store the columns in a variable called `rfecv_columns`.
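
A minimal sketch of this step, assuming synthetic stand-in data from `make_classification` in place of the Titanic predictors (`X_demo` and `y_demo` are hypothetical names; in the notebook you would fit on `X` and `y`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, 3 of them informative.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Remove one feature per iteration and score each subset with 10-fold CV.
estimator = LogisticRegression(max_iter=1000)
selector = RFECV(estimator, step=1, cv=10)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
rfecv_columns = X_demo.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X_demo.columns).sort_values()

print(rfecv_columns)
print(ranking)
```

Features with a ranking above 1 were eliminated, and higher numbers were dropped earlier.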

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# A:

### 5. Feature elimination using the lasso penalty

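The L1 penalty is a popular method for feature selection. As the regularization strength increases, more coefficients shrink to exactly zero, which removes those features from the model. Note that in scikit-learn `C` is the inverse of the regularization strength, so smaller values of `C` regularize more.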

Load the `LogisticRegressionCV` class.

1. Standardize your predictor matrix (needed so the penalty treats all coefficients on the same scale).
2. Create a logistic regression cross-validator object:
    - Set `penalty='l1'` (Lasso).
    - Set `Cs=100` (search 100 different regularization strengths).
    - Set `solver='liblinear'` (it supports the L1 penalty; the default `lbfgs` solver does not).
    - Set `cv=10` for 10 cross-validation folds.
3. Fit on the target and standardized predictors.
4. Sort the logistic regression coefficients by absolute value. Do the top 5 correspond to those selected by `f_classif` and `chi2`?



Choose which ones you would keep and store them in a variable called `lasso_columns`.
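
The steps above can be sketched as follows. The data are again a synthetic stand-in (`X_demo`, `y_demo`), and keeping every feature with a non-zero coefficient is only one possible rule for choosing columns. The result is stored as `lasso_columns`, the name used in the comparison step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, 3 of them informative.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# 1. Standardize so the penalty treats every coefficient on the same scale.
Xs = StandardScaler().fit_transform(X_demo)

# 2-3. Search 100 values of C with 10-fold CV using the L1 penalty, then fit.
lasso_cv = LogisticRegressionCV(penalty="l1", Cs=100, solver="liblinear", cv=10)
lasso_cv.fit(Xs, y_demo)

# 4. Sort the coefficients by absolute value, largest first.
coefs = pd.Series(lasso_cv.coef_[0], index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# One possible rule (an assumption): keep features with non-zero coefficients.
lasso_columns = coefs[coefs != 0].index.tolist()

print(ranked)
print(lasso_columns)
```

The selected value of `C` is available as `lasso_cv.C_[0]` after fitting, which is useful when the same model is reused later.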

In [None]:
from sklearn.preprocessing import StandardScaler

# A:

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# A:

### 6. Compare feature sets

Use the optimized logistic regression from the previous question on the features selected by each method:
- `kbest_columns`
- `rfecv_columns`
- `lasso_columns`
- `all_columns`

**Questions:**
- Which scores the highest? (use `cross_val_score`)
- Is the difference significant?
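
One way to organize the comparison is to loop over a dictionary of column lists and collect the cross-validated accuracy for each. The sketch below assumes synthetic stand-in data and two made-up column sets; `score_feature_sets` is a hypothetical helper, and `C=1.0` is a placeholder for the value chosen in the previous step (the `C_` attribute of a fitted `LogisticRegressionCV`).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, 3 of them informative.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Made-up column sets; in the lab these come from the earlier steps.
feature_sets = {
    "all_columns": list(X_demo.columns),
    "first_four": list(X_demo.columns[:4]),
}


def score_feature_sets(X, y, feature_sets, C=1.0, cv=10):
    """Cross-validate an L1 logistic regression on each named column set."""
    rows = {}
    for name, cols in feature_sets.items():
        # Scaling inside the pipeline keeps test folds out of the scaler's fit.
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty="l1", C=C, solver="liblinear"),
        )
        scores = cross_val_score(model, X[cols], y, cv=cv)
        rows[name] = {"mean": scores.mean(), "std": scores.std(),
                      "n_features": len(cols)}
    return pd.DataFrame(rows).T.sort_values("mean", ascending=False)


results = score_feature_sets(X_demo, y_demo, feature_sets)
print(results)
```

As a rough check on significance, compare the gap between two means with the fold-to-fold standard deviations: a gap much smaller than the spread is weak evidence that one feature set is better.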


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [None]:
# A:

### 7. [Bonus] Display the lasso logistic regression coefficients with a bar chart.

Start from the most negative on the left.
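
A minimal plotting sketch, assuming the coefficients are held in a pandas Series indexed by feature name. The values below are hypothetical and only illustrate the ordering; in the notebook you would build the Series from the fitted model's `coef_` and the predictor column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical coefficient values, for illustration only.
coefs = pd.Series({"feat_a": -1.2, "feat_b": 0.4, "feat_c": 0.0,
                   "feat_d": -0.3, "feat_e": 0.9})

# Ascending sort puts the most negative coefficient on the left.
ordered = coefs.sort_values()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(ordered.index, ordered.values)
ax.axhline(0, color="black", linewidth=0.8)  # reference line at zero
ax.set_ylabel("Coefficient")
ax.set_title("Lasso logistic regression coefficients")
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
```

Features whose bars have zero height are the ones the lasso penalty removed.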

In [None]:
# A: