# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](../data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

## Step 1: Read the data into a Pandas dataframe

In [2]:
# Read the data into a pandas DataFrame and display the head of the file.
# The exercise suggests PassengerId as the index_col; this cell leaves it as a regular column.
import pandas as pd
path = '../data/'
url = path + 'titanic.csv'
df = pd.read_csv(url)

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [13]:
df[['Survived','Pclass', 'Parch']]


Unnamed: 0,Survived,Pclass,Parch
0,0,3,0
1,1,1,0
2,1,3,0
3,1,1,0
4,0,3,0
5,0,3,0
6,0,1,0
7,0,3,1
8,1,3,2
9,1,2,0
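The read step mentions `index_col`, which the cell above does not actually pass. As a minimal sketch of what that option does, the snippet below reads a small inline CSV; the three rows and the name `sample` are hypothetical stand-ins for `titanic.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row extract using a few of the titanic.csv column names
csv_text = """PassengerId,Survived,Pclass,Parch
1,0,3,0
2,1,1,0
3,1,3,0
"""

# index_col tells read_csv to use PassengerId as the row labels
# instead of adding a default 0..n-1 integer index
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(sample.index.name)     # PassengerId
print(list(sample.columns))  # ['Survived', 'Pclass', 'Parch']
```

With the index set this way, `sample.loc[2]` looks up the passenger whose `PassengerId` is 2 rather than the row at position 2.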


## Step 2: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [22]:
from sklearn.linear_model import LogisticRegression

feature_cols = ['Pclass', 'Parch']
X = df[feature_cols]
y = df.Survived
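scikit-learn expects the features as a two-dimensional table and the response as a one-dimensional vector. A quick way to confirm this is to check the shapes; the sketch below does so on a tiny made-up frame (`demo` and its four rows are assumptions, not the Titanic data).

```python
import pandas as pd

# Hypothetical four-row frame with the same column names as the exercise
demo = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 2],
    'Parch':    [0, 0, 1, 2],
})

feature_cols = ['Pclass', 'Parch']
X_demo = demo[feature_cols]   # list of columns -> DataFrame (2-D)
y_demo = demo.Survived        # single column   -> Series (1-D)

print(X_demo.shape)  # (4, 2)
print(y_demo.shape)  # (4,)
```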


## Step 3: Split the data into training and testing sets

In [6]:
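# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split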
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
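When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below illustrates the resulting sizes on placeholder arrays; the row count of 891 is an assumption about `titanic.csv`, and `X_demo`/`y_demo` are dummy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumed number of rows in titanic.csv
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # dummy two-column features
y_demo = np.arange(n_rows) % 2                     # dummy 0/1 labels

# Same call pattern as the exercise: default 75/25 split, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

print(len(X_tr), len(X_te))  # 668 223
```

Fixing `random_state` makes the split reproducible, so the coefficients and accuracy obtained in later steps do not change from run to run.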

## Step 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.

In [23]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.coef_)



[[-0.78656827  0.27138557]]
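One way to check that the coefficients make intuitive sense is to exponentiate them, which turns each log-odds coefficient into an odds ratio. The sketch below uses the two values printed above (Pclass first, Parch second, following the order of `feature_cols`); the variable names are only illustrative.

```python
import math

# Coefficients copied from the logreg.coef_ output above
coef_pclass = -0.78656827
coef_parch = 0.27138557

# exp(coefficient) = multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the other fixed
odds_ratio_pclass = math.exp(coef_pclass)
odds_ratio_parch = math.exp(coef_parch)

print(round(odds_ratio_pclass, 3))  # 0.455
print(round(odds_ratio_parch, 3))   # 1.312
```

Read this way, moving one class down (for example from 1st to 2nd) multiplies the odds of survival by roughly 0.46, while each additional parent or child aboard multiplies them by roughly 1.31.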


## Step 5: Make predictions on the testing set and calculate the accuracy

In [24]:
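# class predictions for every row of X (training and testing rows alike),
# stored in a new column so they can be viewed next to the true Survived values
df['test_set'] = logreg.predict(X)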

In [25]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,test_set
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0
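The cells in this step store predictions for all rows but never compute the accuracy named in the heading. A minimal sketch of that calculation, restricted to the held-out rows, could look like the following. It runs on synthetic data (`X_demo`, `y_demo`, and `model` are stand-ins), so the number it prints says nothing about the Titanic model.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two integer features, label driven by the first one
rng = np.random.RandomState(0)
X_demo = rng.randint(1, 4, size=(200, 2))
y_demo = (X_demo[:, 0] == 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)
model = LogisticRegression().fit(X_tr, y_tr)

# Predict only on the testing rows, then compare with the true labels
y_pred = model.predict(X_te)
acc = metrics.accuracy_score(y_te, y_pred)
print(acc)
```

In the notebook itself, the equivalent would presumably be `logreg.predict(X_test)` followed by `metrics.accuracy_score(y_test, ...)`, which keeps the training rows out of the evaluation.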


In [26]:
# predicted probability of survival (class 1) for every row of X
df['test_set_prob'] = logreg.predict_proba(X)[:, 1]
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,test_set,test_set_prob
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0.241917
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1,0.606098
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,0,0.241917
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1,0.606098
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0,0.241917
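The probabilities in the table can be tied back to the coefficient from Step 4. All five rows shown have `Parch` equal to 0, so the intercept and the `Parch` term cancel when comparing a 1st-class row with a 3rd-class row, and the gap in log-odds should equal the Pclass coefficient times the difference in class. The check below is a sketch using the rounded values printed above.

```python
import math

def logit(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1 - p))

coef_pclass = -0.78656827   # from logreg.coef_ in Step 4
p_first_class = 0.606098    # test_set_prob for Pclass=1, Parch=0
p_third_class = 0.241917    # test_set_prob for Pclass=3, Parch=0

# The intercept and the Parch term cancel, leaving coef * (1 - 3)
observed_gap = logit(p_first_class) - logit(p_third_class)
expected_gap = coef_pclass * (1 - 3)

print(round(observed_gap, 3), round(expected_gap, 3))  # 1.573 1.573
```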


## Step 6: Compare your testing accuracy to the null accuracy

In [27]:
y_test.value_counts().head(1) / len(y_test)

0    0.623318
Name: Survived, dtype: float64
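Null accuracy is the accuracy obtained by always predicting the most frequent class. The sketch below shows two equivalent ways of computing it for a 0/1 response; `y_demo` is a hypothetical label vector, not `y_test`.

```python
import pandas as pd

# Hypothetical labels: five 0s and three 1s
y_demo = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name='Survived')

# Share of the most frequent class (value_counts sorts by frequency, descending)
null_accuracy = y_demo.value_counts(normalize=True).iloc[0]

# For 0/1 labels the mean is the share of 1s, so this is equivalent
null_accuracy_binary = max(y_demo.mean(), 1 - y_demo.mean())

print(null_accuracy, null_accuracy_binary)  # 0.625 0.625
```

A model is only useful if its testing accuracy beats this baseline, which for `y_test` above is about 0.623.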

In [31]:
import numpy as np
# create a NumPy array with the same shape as y_test
y_null = np.zeros_like(y_test, dtype=float)

# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
y_null

array([ 0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37668161,
        0.37668161,  0.37668161,  0.37668161,  0.37668161,  0.37