# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](../data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

## Read the data into a Pandas dataframe

In [72]:
# Read the data into a pandas DataFrame and display the first two rows, using PassengerId as the index column
import pandas as pd
df = pd.read_csv("../data/titanic.csv", index_col="PassengerId")
df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [73]:
import numpy as np
df["Pclass_1st"] = np.where(df.Pclass<2,1,0)

## 1a: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [74]:
feature_cols = ["Pclass","Parch"]
X = df[feature_cols]
y = df.Survived

## 1b: Split the data into training and testing sets

In [85]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
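
To see what this split produces, recall that `train_test_split` holds out 25% of the rows when no `test_size` is given. The sketch below is only an illustration: it uses a random stand-in frame with 891 rows (assumed here to match the size of `titanic.csv`), and every `demo_` name is hypothetical rather than part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 891 rows of random values, not the real passengers
rng = np.random.RandomState(0)
demo_X = pd.DataFrame({
    "Pclass": rng.randint(1, 4, size=891),
    "Parch": rng.randint(0, 3, size=891),
})
demo_y = pd.Series(rng.randint(0, 2, size=891), name="Survived")

# Same call as in the exercise: default test_size, fixed random_state
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=100)

# Row counts of the training and testing portions
print(demo_X_train.shape, demo_X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the accuracy figures later in the notebook can be compared across runs.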

## 2: Choose and import the estimator

In [86]:
from sklearn.linear_model import LogisticRegression

## 3: Instantiate into a variable

In [87]:
logreg = LogisticRegression()

## 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.

In [97]:
logreg.fit(X_train, y_train)
# print() and list() keep this cell working under Python 3, where zip returns an iterator
print(logreg.intercept_)
list(zip(feature_cols, logreg.coef_[0]))

[ 1.22478177]


[('Pclass', -0.83556069712088044), ('Parch', 0.23408381691656688)]
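
One way to check that these coefficients make intuitive sense is to exponentiate them, which turns each log-odds coefficient into an odds ratio. The sketch below copies the two coefficients from the output above; the names `coefs` and `odds_ratios` are introduced only for this illustration.

```python
import numpy as np

# Coefficients copied from the fitted model's output above
coefs = {"Pclass": -0.83556069712088044, "Parch": 0.23408381691656688}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the other feature fixed
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

The ratio for `Pclass` is roughly 0.43, so moving one class number higher (for example from 1st to 2nd class) is associated with less than half the odds of survival. The ratio for `Parch` is roughly 1.26, so each additional parent or child aboard is associated with somewhat higher odds.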

## 5: Make predictions on the testing set

In [89]:
# class predictions
y_pred = logreg.predict(X_test)

## 6: Calculate the accuracy using the testing set predictions

In [90]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.67264573991


## Compare your testing accuracy to the null accuracy

In [91]:
# null accuracy: the accuracy achieved by always predicting the most frequent class in the testing set
y_test.value_counts().head(1)/len(y_test)

0    0.569507
Name: Survived, dtype: float64
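
The same baseline can also be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"` on a small made-up label vector; `toy_X`, `toy_y`, and `baseline` are hypothetical names, and the toy labels are not Titanic data.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): six 0s and four 1s
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_X = np.zeros((len(toy_y), 1))  # the dummy model ignores feature values

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

null_accuracy = metrics.accuracy_score(toy_y, baseline.predict(toy_X))
print(null_accuracy)
```

In the exercise, this would mean fitting the dummy model on `X_train, y_train` and scoring it on `X_test, y_test`. That matches the `value_counts` calculation above as long as the majority class is the same in both sets.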

## Bonus problem: Build a K Nearest Neighbor model and compare the accuracy

In [99]:
from sklearn.neighbors import KNeighborsClassifier
k_range = range(1, 101)
testing_error_rate = []

for k in k_range:

    # instantiate the model with the current K value
    knn = KNeighborsClassifier(n_neighbors=k)

    # fit on the training set, then calculate the testing error
    knn.fit(X_train, y_train)
    y_pred_class = knn.predict(X_test)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
    testing_error_rate.append(1 - testing_accuracy)


In [100]:
# lowest testing error and the K that achieves it
min(zip(testing_error_rate, k_range))

(0.29147982062780264, 9)

In [101]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.708520179372
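
Because K was picked using the testing set, the accuracy above is likely a little optimistic. A possible alternative, sketched below, is to choose K by cross-validation on the training data alone. The data here is a synthetic stand-in, and `demo_X`, `demo_y`, `cv_scores`, and `best_k` are hypothetical names, so the printed numbers say nothing about the Titanic results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (hypothetical, not Titanic data)
rng = np.random.RandomState(0)
demo_X = rng.randint(0, 4, size=(200, 2))
demo_y = (demo_X[:, 0] + rng.randint(0, 2, size=200) > 2).astype(int)

cv_scores = {}
for k in range(1, 31):
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 folds, computed without touching a testing set
    fold_scores = cross_val_score(knn_cv, demo_X, demo_y, cv=5, scoring="accuracy")
    cv_scores[k] = fold_scores.mean()

# K with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```

With this approach the testing set is used only once, for the final comparison against logistic regression.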
