# "Titanic Prediction"
> "Who will survive... let's find out!"
- toc: false
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

The purpose of this project is to see what score I can get on the Titanic Prediction Kaggle competition with a very simple random forest. The only things I will do are deal with missing values, convert categorical variables to numbers and do some simple feature engineering.

Load in libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Load in the train and test datasets

In [3]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [4]:
df = pd.concat([train,test],keys=['train','test'],sort=False)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 108.0+ KB
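
Concatenating with `keys` gives the combined frame a two-level index, so the same cleaning steps can be applied to both files and the rows split apart again afterwards. A minimal sketch with toy frames (the values below are made up and are not taken from the competition files):

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values).
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"PassengerId": [4, 5]})

# keys= adds an outer index level that records which file each row came from.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"], sort=False)

# Selecting on the outer level recovers each part with its original row index.
train_part = combined.loc["train"]
test_part = combined.loc["test"]

print(combined)
print(train_part.shape, test_part.shape)
```

The test rows have no `Survived` value, so the column is stored as float with NaN, which matches the `float64` dtype in the `df.info()` output above.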


In [6]:
df.isnull().mean()

PassengerId    0.000000
Survived       0.319328
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.200917
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000764
Cabin          0.774637
Embarked       0.001528
dtype: float64

20% of the Age values are missing. That is a lot of data to lose, so I will start by filling them with the median value. The single missing Fare gets the same treatment. I'll ignore the Cabin field for now; it's likely to be useful as a social class identifier but needs work first. I'll one-hot encode the Sex and Embarked variables, including an indicator column for missing values, as the missingness itself may carry useful information.

In [7]:
df.Age = df.Age.fillna(df.Age.median())
df.Fare = df.Fare.fillna(df.Fare.median())

In [8]:
df = pd.get_dummies(df,columns=['Sex','Embarked'],drop_first=True,dummy_na=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 15 columns):
PassengerId     1309 non-null int64
Survived        891 non-null float64
Pclass          1309 non-null int64
Name            1309 non-null object
Age             1309 non-null float64
SibSp           1309 non-null int64
Parch           1309 non-null int64
Ticket          1309 non-null object
Fare            1309 non-null float64
Cabin           295 non-null object
Sex_male        1309 non-null uint8
Sex_nan         1309 non-null uint8
Embarked_Q      1309 non-null uint8
Embarked_S      1309 non-null uint8
Embarked_nan    1309 non-null uint8
dtypes: float64(3), int64(4), object(3), uint8(5)
memory usage: 104.1+ KB
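
To make the effect of `drop_first=True` and `dummy_na=True` concrete, here is a small sketch on a made-up `Embarked` column (the four values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical column with three ports and one missing value.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# dummy_na=True adds an indicator column for NaN;
# drop_first=True drops the first category ("C") as the reference level.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dummy_na=True)
print(encoded)
```

`Embarked_C` is dropped as the reference level, so a row of all zeros means "C". Since Sex has no missing values in this dataset, `Sex_nan` is zero everywhere.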


### Train Random Forest

In [10]:
X = df[['Pclass','Age','SibSp','Parch','Fare','Sex_male','Sex_nan','Embarked_Q','Embarked_S','Embarked_nan']]
y = df['Survived']

In [11]:
# keep only the labelled rows (outer index level 'train')
X_train, X_test, y_train, y_test = train_test_split(X.loc['train'],y.loc['train'],test_size=0.3)

In [12]:
clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [13]:
clf.score(X_train,y_train)

0.9662921348314607

In [14]:
clf.score(X_test,y_test)

0.7873134328358209
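
The two scores above will change from run to run, since neither `train_test_split` nor `RandomForestClassifier` is given a seed. A hedged sketch of pinning `random_state`, using synthetic data from `make_classification` rather than the Titanic set (the helper `holdout_score` is hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

def holdout_score(seed):
    # The same seed fixes both the split and the forest's internal randomness.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    return model.fit(X_tr, y_tr).score(X_val, y_val)

# Two runs with the same seed agree; a different seed may give a different score.
print(holdout_score(0), holdout_score(0))
print(holdout_score(1))
```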

In [15]:
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

| | importance |
|---|---|
| Fare | 0.305101 |
| Age | 0.253913 |
| Sex_male | 0.217189 |
| Pclass | 0.077806 |
| SibSp | 0.051517 |
| Parch | 0.045349 |
| Embarked_S | 0.026794 |
| Embarked_Q | 0.021469 |
| Embarked_nan | 0.000862 |
| Sex_nan | 0.0 |
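
These are impurity-based importances, which are known to favour continuous, high-cardinality features such as Fare and Age. Permutation importance on held-out rows is one way to cross-check the ranking. A sketch on synthetic data, assuming scikit-learn 0.22 or later for `sklearn.inspection.permutation_importance`; the feature names `f0` to `f5` are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=cols)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the validation rows and record the drop in accuracy.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=cols).sort_values(ascending=False)
print(perm)
```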


In [17]:
test['Survived']= clf.predict(X.loc['test']).astype(int)

In [18]:
submission = test[['PassengerId','Survived']]

In [19]:
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 1 |
| 4 | 896 | 0 |


In [None]:
submission.to_csv('./data/submission.csv',index=False)

This submission received an accuracy score of 0.75598, which puts it in the top 85%! Not so great, but it's a point to start from.
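
The holdout score of 0.787 was more optimistic than the leaderboard score. A single 70/30 split of 891 rows gives a fairly noisy estimate, and k-fold cross-validation is one common way to average that noise out. A minimal sketch on synthetic data (again `make_classification`, not the Titanic features; `n_estimators=100` is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same number of rows as the labelled Titanic data.
X_demo, y_demo = make_classification(
    n_samples=891, n_features=10, n_informative=5, random_state=0
)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# Five folds: every row is used for validation exactly once.
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The standard deviation across folds gives a rough idea of how much a single holdout score could move by chance.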

### Feature Engineering

Age is the second most important variable in the baseline model, and 20% of its values were missing. I used the median of the whole dataset to fill in these values; however, I think it's worth checking whether another variable is a good predictor of age and would give a better estimate.

In [73]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
df = pd.concat([train,test],keys=['train','test'],sort=False)

In [74]:
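# numeric columns only: recent pandas versions raise an error on text columns
corrMatrix = df.select_dtypes('number').corr()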

In [75]:
corrMatrix

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.038354 | 0.028814 | -0.055224 | 0.008942 | 0.031428 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.038354 | -0.338481 | 1.0 | -0.408106 | 0.060832 | 0.018322 | -0.558629 |
| Age | 0.028814 | -0.077221 | -0.408106 | 1.0 | -0.243699 | -0.150917 | 0.17874 |
| SibSp | -0.055224 | -0.035322 | 0.060832 | -0.243699 | 1.0 | 0.373587 | 0.160238 |
| Parch | 0.008942 | 0.081629 | 0.018322 | -0.150917 | 0.373587 | 1.0 | 0.221539 |
| Fare | 0.031428 | 0.257307 | -0.558629 | 0.17874 | 0.160238 | 0.221539 | 1.0 |


In [53]:
df.head().transpose()

| | (train, 0) | (train, 1) | (train, 2) | (train, 3) | (train, 4) |
|---|---|---|---|---|---|
| PassengerId | 1 | 2 | 3 | 4 | 5 |
| Survived | 0 | 1 | 1 | 1 | 0 |
| Pclass | 3 | 1 | 3 | 1 | 3 |
| Name | Braund, Mr. Owen Harris | Cumings, Mrs. John Bradley (Florence Briggs Th... | Heikkinen, Miss. Laina | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Allen, Mr. William Henry |
| Sex | male | female | female | female | male |
| Age | 22 | 38 | 26 | 35 | 35 |
| SibSp | 1 | 1 | 0 | 1 | 0 |
| Parch | 0 | 0 | 0 | 0 | 0 |
| Ticket | A/5 21171 | PC 17599 | STON/O2. 3101282 | 113803 | 373450 |
| Fare | 7.25 | 71.2833 | 7.925 | 53.1 | 8.05 |


The variables with the strongest correlations to Age are Pclass, SibSp and Parch. The Name variable also contains titles, which might be a useful predictor of age.

The title begins 2 characters after the ',' and ends before the '.', so let's use this to extract the title from the name.

In [99]:
df['Title'] = df['Name'].str.split(', ').str[1]
df['Title'] = df.Title.str.split('.').str[0]
df['Title']

train  0          Mr
       1         Mrs
       2        Miss
       3         Mrs
       4          Mr
       5          Mr
       6          Mr
       7      Master
       8         Mrs
       9         Mrs
       10       Miss
       11       Miss
       12         Mr
       13         Mr
       14       Miss
       15        Mrs
       16     Master
       17         Mr
       18        Mrs
       19        Mrs
       20         Mr
       21         Mr
       22       Miss
       23         Mr
       24       Miss
       25        Mrs
       26         Mr
       27         Mr
       28       Miss
       29         Mr
               ...  
test   388        Mr
       389    Master
       390        Mr
       391       Mrs
       392    Master
       393        Mr
       394        Mr
       395       Mrs
       396        Mr
       397       Mrs
       398        Mr
       399        Mr
       400      Miss
       401        Mr
       402      Miss
       403        Mr
       404   
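
The same extraction can also be written as a single regular expression. A minimal sketch on three names from the data preview above, assuming every name follows the "Surname, Title. Given names" pattern:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture everything between the comma and the first full stop.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Names that do not match the pattern come back as NaN rather than raising an error, which makes malformed rows easy to spot.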

In [100]:
df.groupby('Title')['Age'].median()

Title
Capt            70.0
Col             54.5
Don             40.0
Dona            39.0
Dr              49.0
Jonkheer        38.0
Lady            48.0
Major           48.5
Master           4.0
Miss            22.0
Mlle            24.0
Mme             24.0
Mr              29.0
Mrs             35.5
Ms              28.0
Rev             41.5
Sir             49.0
the Countess    33.0
Name: Age, dtype: float64

In [96]:
df['Age2'] = df.groupby(['Pclass','SibSp'])['Age'].transform('median')
df['Age3'] = df.groupby(['Pclass','SibSp','Parch'])['Age'].transform('median')
df['Age4'] = df.groupby(['Title'])['Age'].transform('median')
df['Age5'] = df.groupby(['Title','Pclass'])['Age'].transform('median')
df['Age6'] = df.groupby(['Title','Pclass','SibSp'])['Age'].transform('median')
df['Age7'] = df.groupby(['Title','Pclass','SibSp','Parch'])['Age'].transform('median')

In [97]:
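# numeric columns only: recent pandas versions raise an error on text columns
corrMatrix = df.select_dtypes('number').corr()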

In [98]:
corrMatrix

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Age2 | Age3 | Age4 | Age5 | Age6 | Age7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.038354 | 0.028814 | -0.055224 | 0.008942 | 0.031428 | 0.070793 | 0.068732 | 0.032094 | 0.050506 | 0.062236 | 0.067255 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 | 0.315558 | 0.198084 | -0.088973 | 0.034404 | 0.039313 | 0.01717 |
| Pclass | -0.038354 | -0.338481 | 1.0 | -0.408106 | 0.060832 | 0.018322 | -0.558629 | -0.867278 | -0.749287 | -0.203441 | -0.699249 | -0.656503 | -0.597002 |
| Age | 0.028814 | -0.077221 | -0.408106 | 1.0 | -0.243699 | -0.150917 | 0.17874 | 0.47709 | 0.582066 | 0.537804 | 0.644282 | 0.676792 | 0.729296 |
| SibSp | -0.055224 | -0.035322 | 0.060832 | -0.243699 | 1.0 | 0.373587 | 0.160238 | -0.402071 | -0.351738 | -0.252419 | -0.199532 | -0.32369 | -0.295579 |
| Parch | 0.008942 | 0.081629 | 0.018322 | -0.150917 | 0.373587 | 1.0 | 0.221539 | -0.151732 | -0.178619 | -0.154623 | -0.139961 | -0.177089 | -0.18028 |
| Fare | 0.031428 | 0.257307 | -0.558629 | 0.17874 | 0.160238 | 0.221539 | 1.0 | 0.459531 | 0.386105 | 0.019448 | 0.347054 | 0.295632 | 0.278135 |
| Age2 | 0.070793 | 0.315558 | -0.867278 | 0.47709 | -0.402071 | -0.151732 | 0.459531 | 1.0 | 0.850933 | 0.334066 | 0.730562 | 0.743608 | 0.675899 |
| Age3 | 0.068732 | 0.198084 | -0.749287 | 0.582066 | -0.351738 | -0.178619 | 0.386105 | 0.850933 | 1.0 | 0.429045 | 0.750214 | 0.769857 | 0.813155 |
| Age4 | 0.032094 | -0.088973 | -0.203441 | 0.537804 | -0.252419 | -0.154623 | 0.019448 | 0.334066 | 0.429045 | 1.0 | 0.780736 | 0.74289 | 0.69644 |


Age7 has the highest correlation with Age (0.73), so let's use it to fill in the missing values.
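
One thing worth checking before relying on that ranking: each group median is computed from the same rows it is then correlated with, so the correlation tends to rise as the groups get smaller, even when the grouping says nothing about age. The sketch below illustrates this on synthetic data where the group labels are pure noise. The column names `coarse` and `fine` and both helper functions are hypothetical, and this is not a claim about how the Titanic groupings behave:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "Age": rng.normal(30, 14, n),
    # Group labels drawn at random, so they carry no information about Age.
    "coarse": rng.integers(0, 5, n),
    "fine": rng.integers(0, 300, n),
})

def in_sample_corr(frame, key):
    # Group median computed on the same rows it is compared against.
    return frame["Age"].corr(frame.groupby(key)["Age"].transform("median"))

def held_out_corr(frame, key, n_fit=700):
    # Medians come from the first n_fit rows and are scored on the remaining rows.
    fit, held = frame.iloc[:n_fit], frame.iloc[n_fit:]
    medians = fit.groupby(key)["Age"].median()
    return held["Age"].corr(held[key].map(medians))

for key in ["coarse", "fine"]:
    print(key, in_sample_corr(demo, key), held_out_corr(demo, key))
```

Scoring the medians on rows that were not used to compute them, as `held_out_corr` does, is one way to compare groupings such as Age4 to Age7 on a more equal footing.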

In [109]:
df.groupby('Title')['Survived'].mean()

Title
Capt            0.000000
Col             0.500000
Don             0.000000
Dona                 NaN
Dr              0.428571
Jonkheer        0.000000
Lady            1.000000
Major           0.500000
Master          0.575000
Miss            0.697802
Mlle            1.000000
Mme             1.000000
Mr              0.156673
Mrs             0.792000
Ms              1.000000
Rev             0.000000
Sir             1.000000
the Countess    1.000000
Name: Survived, dtype: float64

In [108]:
df['Deck'] = df.Cabin.str[0]  # first character of the cabin code
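
The cell above treats the first character of Cabin as the deck. A small sketch of how that behaves with missing and multi-cabin entries, using made-up cabin strings:

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including a missing one and a multi-cabin entry.
cabins = pd.Series(["C85", np.nan, "E46", "B57 B59 B63 B66"])

# .str[0] takes the first character and leaves NaN untouched.
deck = cabins.str[0]
print(deck.tolist())
```

Missing cabins stay NaN, so the later `dummy_na=True` encoding can turn them into their own indicator column.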

### Data Cleaning 2

In [112]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 20 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Title          1309 non-null object
Age2           1309 non-null float64
Age3           1307 non-null float64
Age4           1309 non-null float64
Age5           1308 non-null float64
Age6           1301 non-null float64
Age7           1298 non-null float64
Deck           295 non-null object
dtypes: float64(9), int64(4), object(7)
memory usage: 179.6+ KB


In [123]:
df.Age = df.Age.fillna(df.Age7)
df.Age = df.Age.fillna(df.Age6)
df.Age = df.Age.fillna(df.Age5)
df.Age = df.Age.fillna(df.Age4)
df.Age = df.Age.fillna(df.Age.median())
df.Fare = df.Fare.fillna(df.Fare.median())
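
The chain of `fillna` calls works from the most specific grouping to the least specific: a coarser median is only used when the finer group had no known ages at all. A toy sketch of the same idea with two levels (the rows are made up, and `fine` and `coarse` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title":  ["Mr", "Mr", "Mr", "Miss", "Miss", "Dr"],
    "Pclass": [3, 3, 1, 2, 2, 1],
    "Age":    [22.0, 30.0, np.nan, 18.0, np.nan, np.nan],
})

# Median age per group, broadcast back to every row of that group.
fine = toy.groupby(["Title", "Pclass"])["Age"].transform("median")
coarse = toy.groupby("Title")["Age"].transform("median")

# Try the finest grouping first, then fall back step by step.
filled = toy["Age"].fillna(fine).fillna(coarse).fillna(toy["Age"].median())
print(filled.tolist())  # [22.0, 30.0, 26.0, 18.0, 18.0, 22.0]
```

The first-class "Mr" has no peers in his (Title, Pclass) group, so he gets the median of all "Mr" rows (26), while the "Dr" row falls through to the overall median (22).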

In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 20 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Title          1309 non-null object
Age2           1309 non-null float64
Age3           1307 non-null float64
Age4           1309 non-null float64
Age5           1308 non-null float64
Age6           1301 non-null float64
Age7           1298 non-null float64
Deck           295 non-null object
dtypes: float64(9), int64(4), object(7)
memory usage: 179.6+ KB


In [125]:
df = pd.get_dummies(df,columns=['Sex','Embarked','Title','Deck'],drop_first=True,dummy_na=True)

In [127]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 47 columns):
PassengerId           1309 non-null int64
Survived              891 non-null float64
Pclass                1309 non-null int64
Name                  1309 non-null object
Age                   1309 non-null float64
SibSp                 1309 non-null int64
Parch                 1309 non-null int64
Ticket                1309 non-null object
Fare                  1309 non-null float64
Cabin                 295 non-null object
Age2                  1309 non-null float64
Age3                  1307 non-null float64
Age4                  1309 non-null float64
Age5                  1308 non-null float64
Age6                  1301 non-null float64
Age7                  1298 non-null float64
Sex_male              1309 non-null uint8
Sex_nan               1309 non-null uint8
Embarked_Q            1309 non-null uint8
Embarked_S            1309 non-null uint8
Embarked_nan      

### Train Random Forest 2

In [129]:
X = df.drop(['PassengerId','Survived','Name','Ticket','Cabin','Age2','Age3','Age4','Age5','Age6','Age7'],axis=1)
y = df['Survived']

In [130]:
X.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 36 columns):
Pclass                1309 non-null int64
Age                   1309 non-null float64
SibSp                 1309 non-null int64
Parch                 1309 non-null int64
Fare                  1309 non-null float64
Sex_male              1309 non-null uint8
Sex_nan               1309 non-null uint8
Embarked_Q            1309 non-null uint8
Embarked_S            1309 non-null uint8
Embarked_nan          1309 non-null uint8
Title_Col             1309 non-null uint8
Title_Don             1309 non-null uint8
Title_Dona            1309 non-null uint8
Title_Dr              1309 non-null uint8
Title_Jonkheer        1309 non-null uint8
Title_Lady            1309 non-null uint8
Title_Major           1309 non-null uint8
Title_Master          1309 non-null uint8
Title_Miss            1309 non-null uint8
Title_Mlle            1309 non-null uint8
Title_Mme             1309 non-nu

In [131]:
# keep only the labelled rows (outer index level 'train')
X_train, X_test, y_train, y_test = train_test_split(X.loc['train'],y.loc['train'],test_size=0.3)
clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [132]:
clf.score(X_train,y_train)

0.9823434991974318

In [133]:
clf.score(X_test,y_test)

0.8022388059701493

In [134]:
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

| | importance |
|---|---|
| Fare | 0.25319 |
| Age | 0.226708 |
| Title_Mr | 0.131942 |
| Sex_male | 0.085591 |
| Pclass | 0.049262 |
| SibSp | 0.049056 |
| Title_Miss | 0.043433 |
| Parch | 0.0415 |
| Title_Mrs | 0.023684 |
| Deck_nan | 0.021211 |


In [135]:
test['Survived']= clf.predict(X.loc['test']).astype(int)
submission = test[['PassengerId','Survived']]
submission.to_csv('./data/submission.csv',index=False)

This model ranked in the top 15% on Kaggle. I'm pretty happy with that given the amount of effort involved!