# Kaggle Titanic Prediction

The purpose of this project is to see what score I can get on the Kaggle Titanic prediction competition with a very simple random forest. The only preprocessing I will do is handle missing values and convert categorical variables to numbers.

Load in libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Load in the train and test datasets

In [43]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [44]:
df = pd.concat([train,test],keys=['train','test'])



In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 108.0+ KB
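
The `MultiIndex` reported above comes from the `keys` argument to `pd.concat`, which labels every row with the frame it came from. As a minimal sketch on made-up data (`toy_train`, `toy_test` and their values are hypothetical, not the Kaggle files), this shows how the two partitions can be pulled back out with `.loc`:

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
toy_train = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})
toy_test = pd.DataFrame({"Age": [34.5]})

# keys= adds an outer index level naming the source frame.
# The test rows get NaN in Survived because that column is absent there.
toy_df = pd.concat([toy_train, toy_test], keys=["train", "test"], sort=False)

# Selecting on the outer level recovers each partition with its original row index.
train_part = toy_df.loc["train"]
test_part = toy_df.loc["test"]

print(train_part.shape)
print(test_part["Survived"].isnull().all())
```

Concatenating first means every preprocessing step is applied identically to both sets, at the cost of `Survived` becoming a float column with missing values for the test rows.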


In [46]:
df.isnull().sum()/df.isnull().count()

Age            0.200917
Cabin          0.774637
Embarked       0.001528
Fare           0.000764
Name           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.319328
Ticket         0.000000
dtype: float64
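
The ratio computed above is the share of missing values per column. A shorter spelling, sketched here on a made-up frame (`toy` is hypothetical), is to take the mean of the boolean mask, since `True` counts as 1:

```python
import numpy as np
import pandas as pd

# Made-up data: half of Age and a quarter of Fare are missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# The form used in the notebook: missing count divided by row count.
ratio = toy.isnull().sum() / toy.isnull().count()

# Equivalent: the mean of a boolean mask is the fraction of True values.
share = toy.isnull().mean()

print(share)
```

Both expressions give the same per-column fractions; the second just avoids building the mask twice.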

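About 20% of the Age values are missing. That is too much data to drop, so I will start by filling the gaps with the median age. Fare has a single missing value, which gets the same treatment. I'll ignore the Cabin field for now: it is likely to be useful as a social class indicator, but it needs work before a model can use it. I'll one-hot encode the Sex and Embarked variables, with an extra indicator column for missing values, since missingness itself may carry useful information. The 32% missing in Survived is simply the unlabeled test rows.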
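
To make the effect of median filling and of `get_dummies` with `drop_first=True` and `dummy_na=True` concrete, here is a small sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", np.nan, "C"],
})

# The median ignores NaN, so the gap is filled with the median of 20 and 40.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# drop_first removes the first category of each variable (female, C);
# dummy_na adds a *_nan indicator column even when nothing is missing.
encoded = pd.get_dummies(
    toy, columns=["Sex", "Embarked"], drop_first=True, dummy_na=True
)

print(list(encoded.columns))
```

Note that `Sex_nan` is all zeros here because `Sex` has no missing values, so an indicator like that cannot contribute anything to a model.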

In [47]:
df.Age = df.Age.fillna(df.Age.median())
df.Fare = df.Fare.fillna(df.Fare.median())

In [48]:
df = pd.get_dummies(df,columns=['Sex','Embarked'],drop_first=True,dummy_na=True)

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1309 entries, (train, 0) to (test, 417)
Data columns (total 15 columns):
Age             1309 non-null float64
Cabin           295 non-null object
Fare            1309 non-null float64
Name            1309 non-null object
Parch           1309 non-null int64
PassengerId     1309 non-null int64
Pclass          1309 non-null int64
SibSp           1309 non-null int64
Survived        891 non-null float64
Ticket          1309 non-null object
Sex_male        1309 non-null uint8
Sex_nan         1309 non-null uint8
Embarked_Q      1309 non-null uint8
Embarked_S      1309 non-null uint8
Embarked_nan    1309 non-null uint8
dtypes: float64(3), int64(4), object(3), uint8(5)
memory usage: 104.1+ KB


### Train Random Forest

In [50]:
X = df[['Pclass','Age','SibSp','Parch','Fare','Sex_male','Sex_nan','Embarked_Q','Embarked_S','Embarked_nan']]
y = df['Survived']

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X.loc['train'],y.loc['train'],test_size=0.3)

In [52]:
clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [53]:
clf.score(X_train,y_train)

0.9743178170144462

In [54]:
clf.score(X_test,y_test)

0.8470149253731343
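
The gap between the training score and the held-out score suggests the forest is memorising some of the training rows, and a single 70/30 split gives an estimate that depends on which rows happen to land in each part. One common way to get a steadier number is k-fold cross-validation. The sketch below is an illustration only, not part of the original notebook: it uses synthetic data from `make_classification` rather than the Titanic features, and the names `X_toy`, `y_toy` and `clf_cv` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and labels (not the Titanic data).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Same kind of model as above; random_state is fixed only to make the sketch repeatable.
clf_cv = RandomForestClassifier(n_estimators=10, random_state=0)

# Fit and score the model on 5 different train/validation partitions.
scores = cross_val_score(clf_cv, X_toy, y_toy, cv=5)

print(scores.mean(), scores.std())
```

The mean of the fold scores is the accuracy estimate, and the standard deviation gives a rough sense of how much a single split could vary.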

In [55]:
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

| | importance |
|---|---|
| Fare | 0.281139 |
| Age | 0.252484 |
| Sex_male | 0.243593 |
| Pclass | 0.092596 |
| SibSp | 0.052443 |
| Parch | 0.038899 |
| Embarked_S | 0.029844 |
| Embarked_Q | 0.008794 |
| Embarked_nan | 0.000208 |
| Sex_nan | 0.0 |


In [92]:
test['Survived'] = clf.predict(X.loc['test']).astype(int)

In [93]:
submission = test[['PassengerId','Survived']]

In [94]:
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [95]:
submission.to_csv('./data/submission.csv',index=False)

This submission received an accuracy score of 0.75598, which put it in the top 85% of the leaderboard. Not so great, but a point to start from.