# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](../data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

## Step 1: Read the data into a Pandas dataframe

In [2]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
# read the data into a pandas DataFrame, using PassengerId as the index_col, then check its shape
path = '../data/'
url = path + 'titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

## Step 2: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [4]:
feature_cols = ['Pclass', 'Parch']
X = titanic[feature_cols]
y = titanic.Survived
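
Selecting with a list of column names returns a two-dimensional DataFrame, while attribute access returns a one-dimensional Series, which is the shape scikit-learn expects for `X` and `y` respectively. A minimal sketch on a made-up three-row frame (the `demo` values are illustrative assumptions, not rows from `titanic.csv`):

```python
import pandas as pd

# made-up stand-in for the titanic DataFrame (values are illustrative only)
demo = pd.DataFrame({'Pclass': [3, 1, 3],
                     'Parch': [0, 0, 1],
                     'Survived': [0, 1, 1]})

X_demo = demo[['Pclass', 'Parch']]   # list of columns -> DataFrame (2-D)
y_demo = demo.Survived               # single column -> Series (1-D)

print(X_demo.shape, y_demo.shape)

# count missing values in the features before modeling
print(X_demo.isnull().sum().sum())
```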

## Step 3: Split the data into training and testing sets

In [5]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
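
When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. For 891 rows that means 668 training rows and 223 testing rows. The sketch below checks those sizes on a synthetic frame with the same number of rows (the random values are an assumption, not the Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in with the same number of rows as titanic.csv (values are made up)
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'Pclass': rng.randint(1, 4, size=891),
                       'Parch': rng.randint(0, 3, size=891)})
y_demo = pd.Series(rng.randint(0, 2, size=891), name='Survived')

# default test_size is 0.25; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)
```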

## Step 4: Fit a logistic regression model and examine the coefficients

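Confirm that the coefficients make intuitive sense: a negative coefficient means that higher values of the feature lower the predicted log-odds of survival, and a positive coefficient raises them.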

In [6]:
from sklearn.linear_model import LogisticRegression
# C is the inverse of regularization strength, so a very large C makes regularization negligible
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
# wrap zip in list() so the pairs are displayed under Python 3
list(zip(feature_cols, logreg.coef_[0]))

[('Pclass', -0.88188860564511296), ('Parch', 0.34239215857498861)]
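
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A small sketch using the two coefficients printed above (the `coefs` dictionary simply copies those values):

```python
import numpy as np

# coefficients copied from the output above (log-odds scale)
coefs = {'Pclass': -0.88188860564511296, 'Parch': 0.34239215857498861}

# exponentiating a log-odds coefficient gives the multiplicative change in odds
# for a one-unit increase in that feature, holding the other feature fixed
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

An odds ratio below 1 for Pclass (about 0.41) says that moving one class down (for example, from 1st to 2nd) multiplies the odds of survival by roughly 0.41, while the ratio above 1 for Parch (about 1.41) says each additional parent or child aboard raises the odds.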

## Step 5: Make predictions on the testing set and calculate the accuracy

In [7]:
# class predictions (not predicted probabilities)
y_pred_class = logreg.predict(X_test)


In [8]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.668161434978
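
To make the distinction between class predictions and predicted probabilities concrete, the sketch below fits a model on a tiny made-up data set (`X_demo` and `y_demo` are assumptions for illustration). It shows that, for a binary problem, `predict` agrees with thresholding the class-1 probability at 0.5, and that accuracy is simply the fraction of predictions that match the true labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# small made-up data set: one feature, binary response (not perfectly separable)
X_demo = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression(C=1e9)
model.fit(X_demo, y_demo)

# predicted probability of class 1 for each row
proba_class1 = model.predict_proba(X_demo)[:, 1]

# class predictions, and the same thing obtained by thresholding at 0.5
pred_class = model.predict(X_demo)
thresholded = (proba_class1 > 0.5).astype(int)

# accuracy is the fraction of predictions equal to the true labels
manual_accuracy = (pred_class == y_demo).mean()
print(manual_accuracy, metrics.accuracy_score(y_demo, pred_class))
```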


## Step 6: Compare your testing accuracy to the null accuracy

In [9]:
# null accuracy: the accuracy achieved by always predicting the most frequent class
# (this shortcut only works for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())

0.5739910313901345
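
The model's testing accuracy (about 0.668) beats the null accuracy (about 0.574), so Pclass and Parch carry some predictive signal. For problems that are not coded as 0/1, one possible approach is to take the largest class proportion directly, or to score a `DummyClassifier` that always predicts the most frequent class. A sketch on made-up labels (`y_demo` and `X_placeholder` are illustrative assumptions):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# made-up multi-class labels (illustrative only)
y_demo = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

# null accuracy = proportion of the most frequent class
null_accuracy = y_demo.value_counts(normalize=True).max()

# the same baseline via scikit-learn; the features are ignored by this strategy
X_placeholder = [[0]] * len(y_demo)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_placeholder, y_demo)
dummy_accuracy = dummy.score(X_placeholder, y_demo)

print(null_accuracy, dummy_accuracy)
```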