# Logistic Regression Modeling

We've explored the Titanic data set by looking at a variety of features, and learned that gender and age had the biggest influence on the probability of survival.

### Why Logistic Regression?

First and foremost, we must decide what we want out of the data set before we begin (the whole supervised learning thing, right?). In this case, we want to predict whether a passenger survived the sinking of the Titanic based on features like 'Age', 'Sex', 'Class', etc., so this is a classification problem. That leaves several possible algorithms, such as decision trees and ensemble methods, but I wanted to start with Logistic Regression. Despite its name, the way we will be using it makes it essentially a classification algorithm.

### How does it work?

The Logistic Function is defined as follows \begin{equation*}
P(t)= \frac{1}{1+e^{-t} }
\end{equation*}

It takes any input from negative infinity to infinity and returns a value strictly between 0 and 1. To see why, let's look at the parameter 't'. If t is very negative, the e^(-t) term in the denominator becomes so large that P is close to 0. Likewise, if t is large and positive, e^(-t) shrinks toward 0, the denominator approaches 1, and P gets close to 1. In other words, P increases as t increases, and at t = 0 it is exactly 0.5.
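As a quick numerical check of the behavior described above, here is a minimal sketch of the logistic function using only the standard library. The function name `logistic` and the sample values of t are chosen for illustration.

```python
import math

def logistic(t):
    """Logistic (sigmoid) function: maps any real t into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Evaluate P(t) at a few illustrative points, from very negative to very positive
for t in (-10, -1, 0, 1, 10):
    print(f"t = {t:>3}  ->  P(t) = {logistic(t):.5f}")
```

The printed values rise steadily from near 0 to near 1 and pass through 0.5 at t = 0.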

### How does this help us?

We can substitute 't' with a linear combination of our features. With a single feature x, we would have: \begin{equation*} P(t)= \frac{1}{1+e^{- (  \beta_{0}+\beta_{1}x) } } \end{equation*}

This lets us use a linear combination of our predictors, much like we would with linear regression, BUT the model's output is always between 0 and 1, unlike traditional linear regression, so it can be read as a probability. This is perfect for our classification problem. So let's get started!
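Before moving on, here is a small sketch of how a single-feature model would turn an input into a probability and then a class. The coefficients `beta0` and `beta1` are hypothetical values picked for illustration, not values fitted to the Titanic data, and the 0.5 cutoff is the usual default threshold rather than anything specific to this notebook.

```python
import math

def predict_proba(x, beta0, beta1):
    """P(t) with t replaced by the linear combination beta0 + beta1 * x."""
    t = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients, chosen only for illustration (NOT fitted to the data)
beta0, beta1 = 1.0, -0.05

for age in (5, 20, 60):
    p = predict_proba(age, beta0, beta1)
    label = 1 if p >= 0.5 else 0  # classify as "survived" when the probability is at least 0.5
    print(f"age = {age:>2}  ->  P = {p:.3f}, predicted class = {label}")
```

Fitting the model amounts to finding the coefficients that make these probabilities match the observed outcomes as closely as possible.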

### Imports

In [33]:
# Data Imports
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

# Math
import math

# Plot imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# Note: plotly.plotly moved to the separate chart_studio package in plotly 4; it is unused here, so it is left out
from plotly.graph_objs import *
import plotly.graph_objs as go

# Machine Learning Imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# For evaluating our ML results
from sklearn import metrics

# Dataset Import
titanic_df = pd.read_csv('train.csv')

### Preparing the data 

Let's take a look at what we have:

In [26]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Let's take care of any missing ages.

In [34]:
# Use the built-in pandas method "interpolate" to fill in the missing values (linear interpolation by default)

titanic_df['Age'] = titanic_df['Age'].interpolate()
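To make clear what `interpolate` does to the gaps, here is a toy sketch on a short, made-up series of ages (the values are illustrative, not taken from the data set). With the default linear method, each run of missing values is filled with evenly spaced values between its nearest known neighbors.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps, for illustration only
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 35.0])

# Default method is linear: gaps are filled evenly between the surrounding known values
filled = ages.interpolate()
print(filled.tolist())
```

Note that the filled values depend on which rows happen to sit next to each other, so the result changes if the rows are reordered.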

### Extracting titles from the Names variable

Let's extract the title from the Name variable to aid our model building.

In [35]:
def get_title(name):
    if '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    else:
        return 'Unknown'
    
titanic_df['Title'] = titanic_df['Name'].map(get_title)
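The names follow the pattern "Surname, Title. Given names", so the title is whatever sits between the comma and the first period. As an alternative sketch of the same idea, the extraction can be written with a regular expression. The helper name `get_title_regex` is hypothetical and the sample names are the ones shown in the data preview.

```python
import re

# Capture the text between the first comma and the following period
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def get_title_regex(name):
    """Return the title embedded in a 'Surname, Title. Given names' string."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
]
titles = [get_title_regex(n) for n in names]
print(titles)
```

One difference from the split-based version: a name that contains a period but no comma falls back to 'Unknown' here instead of raising an IndexError.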

Here we'll group the titles together based on similarities:

In [36]:
def replace_titles(x):
    title = x['Title']
    if title in ['Capt', 'Col', 'Don', 'Jonkheer', 'Major', 'Rev', 'Sir','Master']:
        return 'Sir'
    elif title in ['the Countess', 'Mme', 'Lady']:
        return 'Madam'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title =='Dr':
        if x['Sex']=='male':
            return 'Sir'
        else:
            return 'Mrs'
    else:
        return title
    
titanic_df['Title'] = titanic_df.apply(replace_titles, axis=1)

Which leaves us with the following titles:

In [37]:
print(titanic_df.Title.value_counts())

Mr       517
Miss     185
Mrs      126
Sir       60
Madam      3
Name: Title, dtype: int64


### Creating Dummy Variables

We'll need to convert Sex, Pclass, and Title into numerical variables, since Scikit-learn only handles numeric values. For this we will use dummy variables; that is, encoding each category as its own binary column. For example, our Sex column becomes two new columns, male and female: male is 1 for male passengers and 0 otherwise, and female is 1 for female passengers and 0 otherwise.
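Here is a toy sketch of the encoding described above, on a four-row frame invented for illustration. The `dtype=int` argument is only there so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One binary column per category; columns come out in sorted order
sex_dummies = pd.get_dummies(toy["Sex"], dtype=int)
print(sex_dummies)
```

Each row has exactly one 1 across the two columns, so either column can be derived from the other. `pd.get_dummies` also accepts `drop_first=True` to omit that redundant column, though this notebook keeps all of them.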

In [38]:
dummies = []
cols = ['Pclass','Sex','Title']
for col in cols:
    dummies.append(pd.get_dummies(titanic_df[col]))

In [39]:
dummy_vars = pd.concat(dummies, axis=1)

In [40]:
dummy_vars.head()

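| | 1 | 2 | 3 | female | male | Madam | Miss | Mr | Mrs | Sir |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |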


Now we append these dummy columns to our titanic_df data frame.

In [44]:
titanic_df = pd.concat((titanic_df,dummy_vars),axis=1)

### Let's see what we have now:

In [45]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | 1 | 2 | 3 | female | male | Madam | Miss | Mr | Mrs | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


### Selecting features to include in our model

We'll copy our data frame and drop the columns that we won't be including in our model.

In [156]:
X = titanic_df.copy()  # explicit copy, so later changes to X cannot touch titanic_df

In [157]:
X = X.drop(['Pclass','Sex','Embarked','Name','Ticket','Fare','Survived','Cabin','Title'],axis=1)

Why did we drop Fare? Because we must look out for multicollinearity, that is, two or more features within a data set being correlated with each other. This is bad for us because when an increase in one predictor tends to come with an increase in another, we can't tell how much each one contributes on its own. For example, if Y is regressed on X and Z but X and Z are highly correlated, it is hard to see how much of the effect on Y is due to X and how much is due to Z, since they behave very similarly. Fare is correlated with class, since higher-class passengers paid higher fares while lower-class passengers paid lower fares.
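As a rough illustration of the Fare and class relationship, the sketch below computes the Pearson correlation for just the five passengers shown in the data preview earlier. Five rows are far too few to estimate the correlation for the whole data set, so treat this only as a demonstration of the check. The helper `pearson` is written out by hand for clarity.

```python
import math

# Pclass and Fare for the five passengers shown in the preview above
pclass = [3, 1, 3, 1, 3]
fare = [7.25, 71.2833, 7.925, 53.1, 8.05]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

r = pearson(pclass, fare)
print(f"correlation between Pclass and Fare (5 rows): {r:.2f}")
```

The coefficient is strongly negative because class 1 is the most expensive: a lower class number goes with a higher fare. On the full data frame the same check could be run with `titanic_df['Pclass'].corr(titanic_df['Fare'])`.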

In [61]:
# Set the Y as the target variable

Y = titanic_df.Survived

In [62]:
Y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

### Now we need to flatten the target into a 1-dimensional array for Scikit-learn

In [63]:
# Flatten array
Y = np.ravel(Y)

# Logistic Regression with SciKit-Learn

Fantastic! Now we can finally model our data. Let's get started.

### Cross Validation

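We will split our data into training and test subsets. Holding out a test set is the simplest form of cross validation: it doesn't prevent overfitting by itself, but scoring the model on data it never saw during fitting lets us detect it. We will use Scikit-learn's built-in train_test_split function, which by default holds out 25% of the rows for testing.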

In [64]:
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# Make logistic regression model
log_model = LogisticRegression()

# Fit the model
log_model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
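A single train/test split gives one accuracy estimate that depends on which rows happened to land in the test set. As a hedged sketch of the fuller technique, k-fold cross validation repeats the split k times and averages the scores. The example below uses synthetic data from `make_classification` so that it runs on its own; the names `X_demo` and `y_demo` are placeholders, and `max_iter=1000` is just a generous setting to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the Titanic data), for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross validation: fit on 4/5 of the rows, score on the remaining 1/5, five times
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With the notebook's own data, the same call would take `X` and `Y` in place of the synthetic arrays.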

### Now let's use our model to predict on the test set

In [65]:
# Predict the classes of the testing data set
class_predict = log_model.predict(X_test)

# Compare the predicted classes to the actual test classes
print(metrics.accuracy_score(Y_test, class_predict))

0.829596412556
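For reference, the accuracy score is simply the fraction of predictions that match the true labels. Here is a minimal sketch of that calculation on a handful of made-up labels (illustrative values, not the notebook's actual predictions).

```python
# Made-up true labels and predictions, for illustration only
actual    = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# Count the positions where the prediction matches the true label
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct} of {len(actual)} correct -> accuracy = {accuracy}")
```

Accuracy treats both kinds of mistakes the same, so it says nothing about whether the model errs more often on survivors or on non-survivors.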


## Model Accuracy: 83% 

So our logistic regression model gave us an accuracy of about 83%. This is very good! Note, however, that this was measured ONLY on a held-out part of the training set that Kaggle supplied. Let's run it on their test set.

# Applying our logistic regression model to the Titanic Kaggle test set

In [137]:
test_df = pd.read_csv('test.csv')



In [138]:
#fill in missing ages
test_df['Age'] = test_df['Age'].interpolate()

#get titles and group them together
test_df['Title'] = test_df['Name'].map(get_title)
test_df['Title'] = test_df.apply(replace_titles, axis=1)

In [139]:
dummies_test = []
cols = ['Pclass','Sex','Title']
for col in cols:
    dummies_test.append(pd.get_dummies(test_df[col]))

In [140]:
dummy = pd.concat(dummies_test, axis=1)
dummy.head()

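| | 1 | 2 | 3 | female | male | Dona | Miss | Mr | Mrs | Sir |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |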


In [141]:
test_df = pd.concat((test_df,dummy),axis=1)

In [146]:
test_data = test_df

In [148]:
test_data = test_data.drop(['Pclass','Sex','Embarked','Name','Ticket','Fare','Cabin','Title'],axis=1)

In [149]:
test_data.head()

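| | PassengerId | Age | SibSp | Parch | 1 | 2 | 3 | female | male | Dona | Miss | Mr | Mrs | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 34.5 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 893 | 47.0 | 1 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 894 | 62.0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 895 | 27.0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 896 | 22.0 | 1 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |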


### Apply our model!

In [151]:
kaggle_predict = log_model.predict(test_data)
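One thing to watch here: the test set's dummy columns contain 'Dona' where the training columns had 'Madam', so the model is only lining the columns up by position. Recent scikit-learn versions also check column names when a model was fitted on a data frame, which can turn this mismatch into an error. A hedged sketch of one way to align the columns is shown below on toy frames. Treating 'Dona' as belonging to the 'Madam' group is an assumption made for this example, and the values are made up.

```python
import pandas as pd

# Column order the model was trained on (shortened for this toy example)
train_cols = ["Age", "Madam", "Miss", "Mr", "Mrs", "Sir"]

# Toy test features with a 'Dona' column instead of 'Madam' (made-up values)
test_features = pd.DataFrame({
    "Age": [34.5, 39.0],
    "Dona": [0, 1],
    "Miss": [0, 0],
    "Mr": [1, 0],
    "Mrs": [0, 0],
    "Sir": [0, 0],
})

# Assumption: 'Dona' is folded into the 'Madam' group used during training
test_features = test_features.rename(columns={"Dona": "Madam"})

# Reorder to the training layout; any column missing from the test set is filled with 0
aligned = test_features.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

In the notebook itself, the list of training columns would come from `X.columns`, and the same rename could instead be handled by adding 'Dona' to the 'Madam' branch of `replace_titles`.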

#### Create a data frame to store our new prediction

In [152]:
submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": kaggle_predict
    })

Export our submission to a CSV file so we can upload it to Kaggle.

In [153]:
submission.to_csv('log_reg_model_submission.csv', index=False)

# Outcome

After submitting our model to Kaggle, it returned an accuracy of 77%, which isn't bad. Keep in mind, however, that the leaderboards are calculated with only 50% of the data. I will be trying other algorithms as well, so look out for those notebooks. Thanks for following along!