...welcome back!

# TITANIC: Machine Learning from Disaster

## Introduction to Scikit-Learn

This tutorial is based on the Kaggle Competition,
"Predicting Survival Aboard the Titanic"    
https://www.kaggle.com/c/titanic    

As well as the following tutorials:    
https://www.kaggle.com/mlchang/titanic/logistic-model-using-scikit-learn/run/91385    
https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests    
https://github.com/savarin/pyconuk-introtutorial/tree/master/notebooks    

See also:    
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html    
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html    
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html    
http://scikit-learn.org/stable/modules/svm.html    

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas.io.sql as pd_sql
import sqlite3 as sql

%matplotlib inline

In [73]:
# re-establish database connection
con = sql.connect("titanic.db") 

# extract everything from the 'training_data' table (or whatever you called it) into a dataframe
df = pd_sql.read_sql('select * from training_data', con, index_col='index')

In [74]:
#Is it all still here?
df.head(5)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,8.05,,S


### Let's start by creating a model that predicts purely based on gender

In [75]:
# Set some variables
number_passengers = df.shape[0] 
number_survived = len(df[df.Survived == 1])

# What proportion of the passengers survived?
proportion_survived = float(number_survived) / number_passengers
print('The proportion of passengers who survived is %s.' % proportion_survived)

The proportion of passengers who survived is 0.383838383838.


In [76]:
# How can we determine what proportion of the women and of the men survived?
# Let's start by segregating the men and women
women = df[df.Sex == "female"]
men = df[df.Sex != "female"]

# Determine the proportion of women who survived
proportion_women_survived = float(len(women[women.Survived == 1])) / len(women)
print('The proportion of women who survived is %s.' % proportion_women_survived)

# Determine the proportion of men who survived
proportion_men_survived = float(len(men[men.Survived == 1])) / len(men)
print('The proportion of men who survived is %s.' % proportion_men_survived)

The proportion of women who survived is 0.742038216561.
The proportion of men who survived is 0.188908145581.


So we know that women were MUCH more likely to survive, and we _could_ just say that our model is:
    - if female, survived = 1
    - if male, survived = 0
    
But that means our predictions are going to be wrong for about a quarter of the women and a fifth of the men. Let's use the Python library Scikit-learn to see if we can do a little better than that!
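
To put a single number on that rule of thumb, here is a minimal sketch of how a gender-only model could be scored. The helper names and the toy records are made up for illustration; they are not the real Titanic rows.

```python
def predict_by_gender(sex):
    """Predict survival (1) for women and death (0) for everyone else."""
    return 1 if sex == "female" else 0


def accuracy(records):
    """Fraction of (sex, survived) records that the gender rule gets right."""
    correct = sum(1 for sex, survived in records
                  if predict_by_gender(sex) == survived)
    return correct / float(len(records))


# Toy, made-up records (NOT the real Titanic data): (sex, survived)
toy_records = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0), ("male", 0),
]

print(accuracy(toy_records))
```

Against the real table, the same idea could probably be written as `((df.Sex == "female").astype(int) == df.Survived).mean()`, as long as it runs while 'Sex' is still stored as text.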

In [77]:
# But first, we need to do some more wrangling. 'Sex' is stored as a text value. We 
# should convert (or 'map') it into numeric binaries so it will be ready for scikit-learn.
df['Sex'] = df['Sex'].map({'male': 0,'female': 1})

In [78]:
# Scikit-learn won't be tolerant of the missing values. In the last class, we dropped
# the 'Ticket' column. Let's also drop the 'Cabin' and 'Embarked' columns
df = df.drop(['Cabin'], axis=1)
df = df.drop(['Embarked'], axis=1)

In [79]:
# Let's also drop the 'Name' column for now (though I can think of some interesting 
# data that might be embedded in those salutations...)
df = df.drop(['Name'], axis=1)

Ok, we've got a table of purely numeric data with no null values. We're ready to go.

## Using Scikit-learn
### LOGISTIC REGRESSION

https://www.kaggle.com/mlchang/titanic/logistic-model-using-scikit-learn/run/91385

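A logistic regression models the probability of each outcome (here, survival) as an S-shaped function of a weighted sum of the predictors. Cutting that probability off at 0.5 gives a decision boundary between the two possibilities, and that boundary is a straight line (or a flat plane when there are more than two predictors). The weights are chosen so that the boundary fits the training data as closely as possible.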
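
As a rough illustration of what happens under the hood, the sketch below turns a weighted sum of features into a survival probability and then into a 0/1 prediction. The weights and intercept are invented for the example; scikit-learn learns the real ones when we call `fit`.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def survival_probability(features, weights, intercept):
    """Logistic model: sigmoid of the intercept plus the weighted sum of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, intercept, threshold=0.5):
    """Predict 1 (survived) when the probability reaches the threshold."""
    probability = survival_probability(features, weights, intercept)
    return 1 if probability >= threshold else 0


# Hypothetical weights for [Pclass, Sex]; these values are made up, not fitted
example_weights = [-1.0, 2.5]
example_intercept = 1.0

print(predict([3, 0], example_weights, example_intercept))  # a third-class man
print(predict([1, 1], example_weights, example_intercept))  # a first-class woman
```

The decision boundary sits wherever the weighted sum equals zero, which is why it comes out as a straight line.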

In [80]:
from sklearn.linear_model import LogisticRegression
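# Note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions use sklearn.model_selection
from sklearn import cross_validation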

In [82]:
# Load the test data
test = pd.read_csv("../titanic/data/test.csv") 
test["Age"] = test["Age"].fillna(df["Age"].median())

test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1

test = test.drop(['Cabin'], axis=1)
test = test.drop(['Embarked'], axis=1)
test = test.drop(['Name'], axis=1)
test = test.drop(['Ticket'], axis=1)

In [90]:
# Initialize our algorithm
alg = LogisticRegression(random_state=1)

In [93]:
# Define our predictors
predictors = ["Pclass", "Sex", "Age", "SibSp"]

# Train the algorithm using all the training data
alg.fit(df[predictors], df["Survived"])

# Make predictions using the test set
predictions = alg.predict(test[predictors])

In [94]:
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": predictions
    })
submission.head(5)

,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


Well that was easy! The trouble is that, since we didn't evaluate the success of 
our predictive model, we don't know how good our predictions will be.    

Let's go back and do some cross-validation splits.    

More on cross-validation tools inside Scikit-learn here:    
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
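
To make the idea concrete, here is a minimal sketch of what a train/test split does: shuffle the rows, then set a fraction aside for testing. This is not scikit-learn's implementation, and the function name is made up. It rounds the test share up, which for 891 rows and a 20% test size gives the same 712/179 sizes this notebook reports for its own split.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.2, seed=None):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Round the test share up to a whole number of rows
    n_test = int(math.ceil(test_size * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# 891 stand-in row numbers, the same size as the training table
train_rows, test_rows = simple_train_test_split(range(891), test_size=0.2, seed=0)
print(len(train_rows))
print(len(test_rows))
```

The model never sees the held-out rows during training, so its score on them is a fairer guess at how it will do on new passengers.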

In [101]:
from sklearn.cross_validation import train_test_split

In [120]:
# Select the predictor columns (X) and the target column (y)
X = df[["Pclass", "Sex", "Age", "SibSp"]]
y = df["Survived"]

In [121]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size = 0.2)

In [122]:
X_train.shape, y_train.shape

((712, 4), (712,))

In [123]:
X_test.shape, y_test.shape

((179, 4), (179,))

In [124]:
clf = alg.fit(X_train, y_train)
clf.score(X_test, y_test)

0.79329608938547491

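So how good is our model? On this split it classified about 79% of the held-out passengers correctly, though the exact score will change with each random split. Some models will work better than others. Let's try another one.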

### RANDOM FOREST

In [107]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import LabelEncoder

We want our machine learning algorithm to look for patterns and break our dataset into pieces,
so we need to pick a classifier. Let's begin with a random forest, which is a 'meta estimator'. 
It will fit a number of decision trees (we'll have to tell it how many) on various sub-samples 
of the dataset. Then it will use averaging to improve the predictive accuracy and control over-fitting.

Read more about Random Forests here:    
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
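
The sub-sampling and averaging steps can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not how scikit-learn builds its forest: the function names are made up, and the per-tree probabilities are invented rather than produced by real decision trees.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's sub-sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, drawn):
    """Rows this tree never saw; they can act as its private test set."""
    drawn_set = set(drawn)
    return [i for i in range(n_rows) if i not in drawn_set]


def average_vote(tree_probabilities, threshold=0.5):
    """Average the per-tree survival probabilities and threshold the mean."""
    mean = sum(tree_probabilities) / float(len(tree_probabilities))
    return 1 if mean >= threshold else 0


rng = random.Random(0)
sample = bootstrap_indices(10, rng)
oob = out_of_bag(10, sample)
print(sample)
print(oob)

# Invented probabilities from three imaginary trees for one passenger
print(average_vote([0.9, 0.4, 0.7]))
```

Because each tree trains on a different sub-sample, their individual mistakes tend to differ, and averaging smooths them out. The rows left out of a tree's sub-sample are the 'out-of-bag' samples.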

In [109]:
# We'll select 50 trees and opt for 'out-of-bag' samples to estimate the generalization error.
clf = RandomForestClassifier(n_estimators=50, oob_score=True) 

How many sub-samples should we specify?     

The training samples are constructed by splitting our original dataset into more than one part. But what if certain chunks of our data have more variance than others? We want to ensure that our model is accurate regardless of the particular way the data are divided up.    
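
One common way to check that a model's accuracy does not hinge on a single lucky division of the data is k-fold cross-validation, where every part takes a turn as the test set. This notebook uses a single train/test split instead, so treat the sketch below as a related technique rather than what the cells here do. The function name and the 24-row toy size are assumptions for illustration.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    # Spread any leftover rows across the first few folds
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 24 toy rows divided into 12 parts
folds = list(k_fold_indices(24, 12))
print(len(folds))
print(folds[0][1])
```

Training and scoring the model once per fold, then comparing the scores, shows how much the result depends on which rows were held out.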

In [112]:
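# Let's divide the data into 12 parts (because that's Ben's favorite number) and hold one
# part out for testing. Note that test_size is a fraction (1/12 is about 0.083), not a percentage.
test_size_percent = 0.083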

In [113]:
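# Next, split up the data with the 'train_test_split' function from the cross-validation module.
# 'signals' and 'labels' are not defined earlier in this notebook; we assume here that they
# are the same predictors and target used for the logistic regression.
signals = df[["Pclass", "Sex", "Age", "SibSp"]]
labels = df["Survived"]
train_signals, test_signals, train_labels, test_labels = train_test_split(signals, labels, test_size=test_size_percent)

# ...and then run the 'fit' method to build a forest of trees
clf.fit(train_signals, train_labels)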

RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=50, n_jobs=1, oob_score=True,
            random_state=None, verbose=0)

In [114]:
# We can use the Random Forest 'predict probabilities' method on our test signals:
predictions = clf.predict_proba(test_signals)[:,1]

How did we do?
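
One way to answer that question is to turn the predicted probabilities into 0/1 guesses and count how many match the true labels. The sketch below does that with made-up numbers standing in for `predictions` and `test_labels`; the function name is hypothetical.

```python
def accuracy_from_probabilities(probabilities, true_labels, threshold=0.5):
    """Turn probabilities into 0/1 predictions and return the fraction correct."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for guess, truth in zip(predicted, true_labels)
                  if guess == truth)
    return correct / float(len(true_labels))


# Made-up probabilities and labels, standing in for `predictions` and `test_labels`
toy_probabilities = [0.9, 0.2, 0.6, 0.4, 0.1]
toy_labels = [1, 0, 0, 1, 0]

print(accuracy_from_probabilities(toy_probabilities, toy_labels))
```

In the notebook itself, `clf.score(test_signals, test_labels)` reports the same kind of accuracy directly. Because the forest was built with `oob_score=True`, `clf.oob_score_` should also hold the out-of-bag estimate.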

### SVM

https://github.com/rebeccabilbro/pyconuk-introtutorial/blob/master/notebooks/Section%202-2%20-%20SVM%20with%20Parameter%20Tuning.ipynb