## Class 8 Exercise: Predicting Survival on the Titanic

This assignment uses data from Kaggle's [Titanic](https://www.kaggle.com/c/titanic/data) competition. `titanic.csv` is in the repo, so there is no need to download the data from the Kaggle website.

**Tasks:**

1. Read `titanic.csv` into a DataFrame.
2. What is the null accuracy rate for predicting survival? (Recall that this is the accuracy you would get by always predicting the most frequent class, either survived or did not survive.)
3. Can you think of some variables in the dataset that might help predict survival?
4. Define Pclass and Parch as the features, and Survived as the response.
5. Split the data into training and testing sets. (Hint: use `train_test_split` from sklearn.)
6. Fit a logistic regression model and examine the coefficients to confirm that they make intuitive sense.
7. Make predictions on the testing set and calculate the accuracy.
8. Create a confusion matrix and document the model's sensitivity and specificity. (remember you should run metrics on your test classes!)
9. Also include Age as a feature, and calculate the testing accuracy. There will be a small issue you'll have to deal with. What is it? How will you deal with it?
10. Try to make up a new column (be creative!) that you think might be helpful, and include it in your model. For example, one student made a column called "is_married" by combining SibSp with the Name column.
11. In any of your models, were you able to beat the null accuracy rate?



Always remember to fit your model on the training data and run metrics on the test set.

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
%matplotlib inline

### 1. Read `titanic.csv` into a DataFrame.

In [35]:
df = pd.read_csv('https://raw.githubusercontent.com/sinanuozdemir/sfdat22/master/data/titanic.csv',
                index_col = 'PassengerId')

### 2. What is the null accuracy rate for predicting survival?

In [36]:
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [37]:
df.describe()

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [38]:
df.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

The null accuracy rate for predicting survival (always predicting the most frequent class, here 0 = did not survive) is 549/891 = 0.6162.
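The same figure can be computed directly rather than by hand. The following is a minimal sketch: it rebuilds a stand-in `Survived` column from the value counts shown above (an assumption made so the snippet runs on its own), whereas in the notebook you would use `df.Survived`.

```python
import pandas as pd

# Stand-in for df.Survived, rebuilt from the value counts above (549 zeros, 342 ones)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Share of each class; the largest share is the null accuracy
class_shares = survived.value_counts(normalize=True)
null_accuracy = class_shares.max()

print(round(null_accuracy, 4))
```

Using `normalize=True` avoids hard-coding the row count, so the same two lines keep working if rows are later dropped.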

### 3. Can you think of some variables in the dataset that might help predict survival?

In [39]:
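# Restrict to numeric columns; recent pandas versions raise an error on text columns such as Name
df.select_dtypes(include='number').corr()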

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| Survived | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


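The variables that are most likely related to survival are Pclass, Sex, and Age. In the table above, Pclass has the strongest linear correlation with Survived (-0.34), followed by Fare (0.26). Age is only weakly correlated (-0.08), and Sex does not appear because it is not a numeric column, so both are included here on intuition rather than on the correlation matrix.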

### 4. Define Pclass and Parch as the features, and Survived as the response.

In [40]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
feature_cols = ['Pclass', 'Parch']

X = df[feature_cols]
y = df.Survived

### 5. Split the data into training and testing sets. (Hint: use `train_test_split` from sklearn.)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### 6. Fit a logistic regression model and examine the coefficients to confirm that they make intuitive sense.

In [42]:
logreg.fit(X_train, y_train)

print(logreg.intercept_)
print(logreg.coef_)        # one row of coefficients, ordered as in feature_cols: Pclass, Parch
print(logreg.coef_[0, 1])  # coefficient for Parch

[ 1.23165951]
[[-0.84439049  0.3412417 ]]
0.341241699732


In [43]:
# Exponentiating a logistic regression coefficient gives an odds ratio
odds_pclass = np.exp(logreg.coef_[0, 0])
odds_parch = np.exp(logreg.coef_[0, 1])

print(odds_pclass, odds_parch)

0.429819255772 1.40669319715


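The coefficients make intuitive sense. Each one-step move to a lower class (for example, from 1st to 2nd) multiplies the odds of surviving by about 0.43, a decrease of roughly 57%, while each additional parent or child on board multiplies the odds by about 1.41, an increase of roughly 41%. One plausible explanation for the latter is that passengers travelling with family had more people looking out for them.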
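To make the interpretation concrete, here is a small sketch that turns the fitted intercept and coefficients into odds ratios, percentage changes in the odds, and predicted probabilities. The numbers are copied from the outputs above; the helper `survival_probability` is a hypothetical name introduced only for this illustration.

```python
import math

# Values copied from the fitted model output above
intercept = 1.23165951
coefs = {"Pclass": -0.84439049, "Parch": 0.3412417}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
pct_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

def survival_probability(pclass, parch):
    """Predicted probability of survival from the logistic model."""
    log_odds = intercept + coefs["Pclass"] * pclass + coefs["Parch"] * parch
    return 1 / (1 + math.exp(-log_odds))

for name in coefs:
    print(name, round(odds_ratios[name], 2), round(pct_change[name], 1))

# Compare a 1st-class and a 3rd-class passenger travelling without parents or children
print(survival_probability(pclass=1, parch=0))
print(survival_probability(pclass=3, parch=0))
```

Note that an odds ratio describes a change in the odds, not in the probability, which is why the sketch also prints probabilities for two example passengers.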

### 7. Make predictions on the testing set and calculate the accuracy.

In [44]:
y_pred = logreg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))

0.668161434978


### 8. Create a confusion matrix and document the model's sensitivity and specificity. (remember you should run metrics on your test classes!)

In [45]:
# Rows are the actual classes, columns are the predicted classes
print(metrics.confusion_matrix(y_test, y_pred))

[[105  23]
 [ 51  44]]


In [46]:
# Accuracy    = (105 + 44) / 223   = 0.6682
# Sensitivity = 44 / (51 + 44)     = 0.4632
# Specificity = 105 / (105 + 23)   = 0.8203
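The arithmetic above can also be checked in code. This is a minimal sketch that hard-codes the confusion matrix printed earlier (rows are actual classes, columns are predicted classes); in the notebook you would pass the array returned by `metrics.confusion_matrix` instead.

```python
# Confusion matrix from the test set above: rows are actual classes, columns are predicted
confusion = [[105, 23],
             [51, 44]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of actual survivors predicted correctly
specificity = tn / (tn + fp)   # share of actual non-survivors predicted correctly

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The low sensitivity relative to specificity suggests the model errs toward predicting "did not survive".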

### 9. Also include Age as a feature, and calculate the testing accuracy. There will be a small issue you'll have to deal with. What is it? How will you deal with it?

In [51]:
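# The issue: Age is missing for 177 of the 891 passengers (its count is 714),
# and LogisticRegression cannot be fit on NaN values. Here the affected rows are dropped.
df_clean = df.dropna(subset=['Pclass', 'Parch', 'Age'])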

df_clean.describe()

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 |
| mean | 0.406162 | 2.236695 | 29.699118 | 0.512605 | 0.431373 | 34.694514 |
| std | 0.49146 | 0.83825 | 14.526497 | 0.929783 | 0.853289 | 52.91893 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 20.125 | 0.0 | 0.0 | 8.05 |
| 50% | 0.0 | 2.0 | 28.0 | 0.0 | 0.0 | 15.7417 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 1.0 | 33.375 |
| max | 1.0 | 3.0 | 80.0 | 5.0 | 6.0 | 512.3292 |


In [52]:
feature_cols = ['Pclass', 'Parch', 'Age']

X = df_clean[feature_cols]
y = df_clean.Survived

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [53]:
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [55]:
list(zip(feature_cols, logreg.coef_[0]))  # list() is needed in Python 3, where zip returns an iterator

[('Pclass', -1.0261720595607464),
 ('Parch', 0.22448364565799289),
 ('Age', -0.025327415744739689)]

In [54]:
print(metrics.accuracy_score(y_test, y_pred))

0.754189944134
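Dropping rows is only one way to handle the missing ages. As a hedged alternative, the sketch below contrasts dropping with filling in the median age. The tiny DataFrame `toy` is made-up illustration data, not the Titanic set, and median imputation is a swapped-in technique rather than what this notebook does.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up rows, not the Titanic data)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Parch": [0, 1, 0, 2],
    "Age": [38.0, np.nan, 26.0, np.nan],
})

# Option A (used above): drop rows with a missing Age
dropped = toy.dropna(subset=["Age"])

# Option B: keep every row and fill missing ages with the median age
median_age = toy["Age"].median()
imputed = toy.assign(Age=toy["Age"].fillna(median_age))

print(len(dropped), len(imputed), median_age)
```

Imputation keeps all rows, which matters when a fifth of the data would otherwise be lost. In a real workflow the median should be computed on the training set only and then applied to the test set, so that no information leaks from the test data.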


### 10. Try to make up a new column (be creative!) that you think might be helpful, and include it in your model. For example, one student made a column called "is_married" by combining SibSp with the Name column.
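One possible starting point is sketched below, using the five passengers shown by `df.head()` earlier. The rule for `is_married` is a crude, hypothetical heuristic invented for this illustration (it is not the student's actual definition): a "Mrs" title counts as married, and a "Mr" with a sibling or spouse aboard is guessed to be married.

```python
import pandas as pd

# The five rows shown by df.head() above, limited to the columns used here
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "SibSp": [1, 1, 0, 1, 0],
})

# Title is the text between the comma and the first period, e.g. "Mrs"
sample["title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Hypothetical heuristic: "Mrs" implies married; "Mr" with a sibling or spouse aboard might be
sample["is_married"] = (
    (sample["title"] == "Mrs")
    | ((sample["title"] == "Mr") & (sample["SibSp"] > 0))
).astype(int)

print(sample[["title", "SibSp", "is_married"]])
```

Because SibSp counts siblings as well as spouses, the "Mr" branch will produce false positives. Whether the column helps should be judged by the testing accuracy after adding it to `feature_cols`.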

### 11. In any of your models, were you able to beat the null accuracy rate?
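Based on the outputs above, both models appear to beat the null accuracy. The sketch below lines the numbers up. It assumes 290 survivors among the 714 rows with a known Age, a figure derived from the mean of 0.406162 in the `df_clean.describe()` output rather than printed directly.

```python
# Figures taken from the outputs above
n_total, n_died = 891, 549            # full dataset (features: Pclass, Parch)
n_clean, n_clean_survived = 714, 290  # rows with a known Age; 290 is derived from 0.406162 * 714

results = {
    "Pclass + Parch": {"accuracy": 0.668161434978, "null": n_died / n_total},
    "Pclass + Parch + Age": {
        "accuracy": 0.754189944134,
        "null": (n_clean - n_clean_survived) / n_clean,
    },
}

for name, r in results.items():
    r["beats_null"] = r["accuracy"] > r["null"]
    print(name, round(r["accuracy"], 4), round(r["null"], 4), r["beats_null"])
```

Strictly speaking, the fairest baseline is the majority-class share within the test set itself. For the first model, the confusion matrix shows 128 non-survivors among 223 test rows, about 0.574, so the conclusion holds there as well.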