## What I want to achieve

Based on my current knowledge of Python and machine learning, I want to be able to run an ensemble of algorithms to predict Titanic survival. Gosh, we're still into that. I hope I don't ever have to touch this or the iris dataset in a year's time... Thanks to [Bugra Akyildiz](http://bugra.github.io/work/notes/2014-11-22/an-introduction-to-supervised-learning-scikit-learn/) for a great blog post on supervised learning. Check it out and weep with joy!

OK, so what do we need to do? Binary classification of survival: for each passenger, predict whether `Survived` is 0 or 1.

## Admin stuff

In [9]:
import pandas as pd
pd.set_option('display.max_columns', None) # Display any number of columns

import numpy as np
from matplotlib import pyplot
%matplotlib inline

# Set seaborn aesthetic parameters to defaults
import seaborn as sns
sns.set()

## Read the training dataset

In [10]:
train = pd.read_csv('train.csv')
train.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |


## Data wrangling from the lessons learned during the exploratory session

### Quick explorations

In [30]:
# check the class distribution of the response variable
train.Survived.value_counts()/train.shape[0]*100

0    61.616162
1    38.383838
Name: Survived, dtype: float64

Roughly a 62:38 split between non-survivors and survivors. Looks cool to me: the classes are only mildly imbalanced, so I won't bother with resampling.
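One thing the split does tell us is the score to beat. As a rough sketch: a model that always predicts the majority class (did not survive) would already be right about 61.6% of the time. The counts below (549 and 342 out of 891) are back-calculated from the percentages above, and the name `baseline_accuracy` is just mine for illustration.

```python
import pandas as pd

# Class counts implied by the percentages above: 549 zeros and 342 ones out of 891.
survived = pd.Series([0] * 549 + [1] * 342)

# Share of each class as a fraction of all rows.
shares = survived.value_counts(normalize=True)

# Always predicting the most common class is right this often.
majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print("Majority-class baseline accuracy: %.1f%%" % (baseline_accuracy * 100))
```

Any ensemble worth keeping should comfortably beat this number.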

In [11]:
# quick test to see if passenger id is indeed unique and can be used as the index
print len(train.PassengerId), len(train.PassengerId.unique())

891 891


In [12]:
# fill up age variable with righteous values i.e. median age
print train.Age.isnull().sum()
train['Age'].fillna(train['Age'].median(), inplace = True)
print train.Age.isnull().sum()

177
0


In [17]:
# check for null values in the SibSp, Parch, Fare and Embarked features
print train.SibSp.isnull().sum()
print train.Parch.isnull().sum()
print train.Fare.isnull().sum()
print train.Embarked.isnull().sum()

0
0
0
2


In [19]:
train.Fare.describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [20]:
train.Embarked.value_counts()

# most of them seem to be from Southampton S, so encode the 2 remaining null values as 'S'

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [31]:
# check how many rows have cabin level information
print train.Cabin.isnull().sum(), "pax don't have cabin level information"

# what about ticket info
print train.Ticket.isnull().sum(), "pax are travelling ticketless. However,...."

# however, multiple people could be travelling on the same ticket. Let's see if that's the case
print len(train.Ticket) - len(train.Ticket.drop_duplicates()), "pax are travelling on another passenger's ticket."

687 pax don't have cabin level information
0 pax are travelling ticketless. However,....
210 pax are travelling on another passenger's ticket.
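A note on what the 210 figure measures: `len(...) - len(drop_duplicates())` counts rows whose ticket has already appeared earlier in the data, not everyone who shares a ticket. The sketch below shows the difference on a made-up series of ticket codes (the values and the `group_size` name are my own, not from the dataset), and also derives a per-passenger group size that might be a more useful feature than the raw ticket string.

```python
import pandas as pd

# Made-up ticket codes: 'A' is shared by three people, 'B' by two, 'C' and 'D' are solo.
tickets = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])

# Rows whose ticket already appeared earlier (same as len - len(drop_duplicates)).
repeat_rows = int(tickets.duplicated().sum())

# Rows whose ticket is shared with at least one other passenger.
shared_rows = int(tickets.duplicated(keep=False).sum())

# Number of passengers on each row's ticket.
group_size = tickets.map(tickets.value_counts())

print("repeat rows: %d, rows on a shared ticket: %d" % (repeat_rows, shared_rows))
```

On this toy data the first count is 3 and the second is 5, so the first method undercounts the passengers who travel on a shared ticket.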


### Create a function to do the data cleansing for us

In [27]:
def titanic_wrangling(raw_dataset):
    from sklearn import preprocessing

    # convert the PassengerId column into the index. I've already checked that this field is unique per row.
    clean_dataset = raw_dataset.set_index('PassengerId', drop=True)

    # there's heaps of null ages. I got this from the exploratory session in the previous workbooks. Fill with
    # the median age of the passengers
    clean_dataset['Age'] = clean_dataset['Age'].fillna(clean_dataset['Age'].median())

    # there are 2 null Embarked rows. I'm coding these as 'S' for the reasons described above
    clean_dataset['Embarked'] = clean_dataset['Embarked'].fillna('S')

    # Encode string values into discrete integers
    label_encoder = preprocessing.LabelEncoder()
    # fit & transform SEX variable into discrete integer values
    clean_dataset['Sex'] = label_encoder.fit_transform(clean_dataset['Sex'])
    # fit & transform EMBARKED variable into discrete integer values
    clean_dataset['Embarked'] = label_encoder.fit_transform(clean_dataset['Embarked'])
    # fit & transform TICKET variable into discrete integer values. Given 210 pax have dup tickets, this could be
    # predictive.
    clean_dataset['Ticket'] = label_encoder.fit_transform(clean_dataset['Ticket'])

    # Get the response variable as a numpy array
    label = clean_dataset['Survived'].values.astype(int)

    # Drop redundant columns from the clean dataset
    # note that I'm removing the cabin information for now, because there is too much scope for overfitting here
    # notice also that I'm dropping Survived because I've already created the response variable
    clean_dataset = clean_dataset.drop(['Name', 'Cabin', 'Survived'], axis=1)

    # Get the remaining columns as a float numpy array
    features = clean_dataset.values.astype(float)

    return features, label

__Future Improvement #1:__ Note that encoding categorical features as integers has introduced an implicit ordering in the Embarked and Ticket variables (Sex is binary, so a 0/1 code is harmless there). Strings are not inherently ordered, but integers are. We could transform these string features by creating dummy features instead. This can be achieved by: <br/> a) using sklearn.preprocessing.OneHotEncoder, or <br/> b) using the pandas function pd.get_dummies()
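A minimal sketch of option b), assuming a toy frame with made-up values rather than the real `train` data. `pd.get_dummies` creates one indicator column per category, so no order is implied between S, C and Q.

```python
import pandas as pd

# Toy stand-in for a few Titanic columns; the rows are for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# Replace each listed string column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'])

encoded_columns = sorted(encoded.columns)
print(encoded_columns)
```

This works well for Sex and Embarked, which have only a handful of categories. Ticket is a different story: with 891 rows and 210 repeats there are 681 distinct tickets, so dummy-encoding it would add 681 mostly empty columns.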

### Use the data wrangling function to give us our features and labels

In [28]:
X, y = titanic_wrangling(train)
print X.shape, y.shape

(891, 8) (891,)


In [31]:
X

array([[   3.    ,    1.    ,   22.    , ...,  523.    ,    7.25  ,    2.    ],
       [   1.    ,    0.    ,   38.    , ...,  596.    ,   71.2833,    0.    ],
       [   3.    ,    0.    ,   26.    , ...,  669.    ,    7.925 ,    2.    ],
       ..., 
       [   3.    ,    0.    ,   28.    , ...,  675.    ,   23.45  ,    2.    ],
       [   1.    ,    1.    ,   26.    , ...,    8.    ,   30.    ,    0.    ],
       [   3.    ,    1.    ,   32.    , ...,  466.    ,    7.75  ,    1.    ]])

The features are clearly in need of some scaling and standardisation: Sex is 0 or 1, while the encoded Ticket values run into the hundreds and Fare goes up to 512.
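As a sketch of what that next step might look like, here is `sklearn.preprocessing.StandardScaler` applied to the Age, Ticket and Fare values of the first three rows printed above. This is one possible approach, not necessarily the one this workbook will end up using, and the `sample` array is just a small excerpt for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age, encoded Ticket and Fare for the first three rows of X shown above.
sample = np.array([
    [22.0, 523.0, 7.25],
    [38.0, 596.0, 71.2833],
    [26.0, 669.0, 7.925],
])

# Centre each column on 0 and rescale it to unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(sample)

column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates just because of its units. When a test set comes into play, the scaler should be fitted on the training data only and then reused via `scaler.transform`.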

TBC...