# Titanic dataset

In this notebook, I practise the techniques for building a classification model. I will use the Titanic dataset to try and predict whether or not a passenger survives the disaster based on some features.

This dataset can be found at https://www.kaggle.com/c/titanic

To begin with, let's import the training and test sets and have a look at the training data. Kaggle provides them already split, so we don't have to do it ourselves.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train = pd.read_csv("datasets/titanic/train.csv")
test = pd.read_csv("datasets/titanic/test.csv")

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


We see that there are 11 feature columns and 1 label column (Survived), with 891 instances. All the feature columns are complete except for Age, Cabin and Embarked. Since Age (714 of 891 values) and Embarked (889 of 891) are almost complete, we will try to fill them in using some of the other features (by looking for correlations or by using the mean values). Since the Cabin feature is only about 23% full (204 of 891), we will drop it for now.

Now let's separate the training data into the features and labels:
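The fill rates discussed above can be checked directly from the non-null counts printed by `train.info()`. The following is a minimal sketch using only those counts; the variable names are illustrative and not part of the original notebook.

```python
# Non-null counts copied from the train.info() output above
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}
total_rows = 891

# Fraction of rows that have a value for each incomplete column
fractions = {name: n / total_rows for name, n in non_null_counts.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
# Age: 80.1%
# Cabin: 22.9%
# Embarked: 99.8%
```

On the DataFrame itself, `train.notna().mean()` should give the same fractions for every column at once.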

In [13]:
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']

In [14]:
X_train.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


The Sex and Embarked categories will need to be encoded: a binary encoding for Sex (the LabelEncoder used below sorts the classes alphabetically, so female becomes 0 and male becomes 1) and one-hot encoding for Embarked. Also, PassengerId is not an attribute we want to train the model on, so we will drop it along with Cabin. Since the Ticket number is seemingly random and unstructured, we will drop that column too.

So the final dataset has the attributes:

Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
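To make the two encodings mentioned above concrete, here is a small pandas-only sketch on a hypothetical three-row frame (`toy` is made up for illustration). It is not the approach used later in this notebook, which relies on scikit-learn encoders, but it produces the same kind of output. Note that encoders which sort the categories assign 0 to female and 1 to male.

```python
import pandas as pd

# Hypothetical rows, only for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary encoding: category codes follow sorted order (female -> 0, male -> 1)
sex_codes = toy["Sex"].astype("category").cat.codes

# One-hot encoding: one indicator column per port of embarkation
embarked_onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked").astype(int)

print(sex_codes.tolist())             # [1, 0, 0]
print(list(embarked_onehot.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```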

In [15]:
X_train_reduced = X_train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
X_train_reduced.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


Let's create a class to select the attributes we want, so that we can put it into a pipeline for the test set. We will also drop any instances with NaN values in the categorical data columns (Sex and Embarked):

In [152]:
from sklearn.base import BaseEstimator, TransformerMixin

X_attr = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
y_attr = ['Survived']

class AttributeSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, attributes):
        self.attributes = attributes
        
    def fit(self, X, y=None):
        # Nothing to learn; accept X and y to follow the scikit-learn API
        return self
    
    def transform(self, X, y=None):
        return X[self.attributes]
    
    def fit_transform(self, X, y=None):
        return X[self.attributes]
    

train.dropna(subset=['Sex', 'Embarked'], inplace=True)
train.info()

attrib_select_X = AttributeSelector(X_attr)
attrib_select_y = AttributeSelector(y_attr)

X_train = attrib_select_X.transform(train)
y_train = attrib_select_y.transform(train)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            712 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Cabin          202 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.3+ KB


In [138]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 7 columns):
Pclass      889 non-null int64
Sex         889 non-null object
Age         712 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 55.6+ KB


Now we need to deal with missing values. For the age, we will simply use the mean value:

In [139]:
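from sklearn.preprocessing import StandardScaler
try:
    # scikit-learn >= 0.20 provides SimpleImputer; alias it to the old name
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    # Older scikit-learn versions
    from sklearn.preprocessing import Imputer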

num_attributes = X_train[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
imputer = Imputer(strategy="mean")

num_attributes = imputer.fit_transform(num_attributes)

scaler = StandardScaler()

num_attributes = scaler.fit_transform(num_attributes)
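As a rough illustration of what the imputation and scaling steps above do to a single column, here is a NumPy-only sketch on hypothetical age values. It assumes mean imputation followed by standardisation with the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical ages with one missing value
ages = np.array([22.0, 38.0, np.nan, 35.0])

# Mean imputation: replace NaN with the mean of the observed values
mean_age = np.nanmean(ages)
ages_filled = np.where(np.isnan(ages), mean_age, ages)

# Standardisation: subtract the mean, divide by the standard deviation (ddof=0)
ages_scaled = (ages_filled - ages_filled.mean()) / ages_filled.std()

print(mean_age)
print(ages_scaled)
```

The imputed entry ends up at 0 after scaling, because it sits exactly at the column mean.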

We also need to encode the categorical data:

In [140]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

sex = X_train['Sex'].values
sex_encoder = LabelEncoder()
sex_encoded = sex_encoder.fit_transform(sex)

embarked = X_train['Embarked'].values
embarked_encoder = LabelEncoder()
embarked_encoded = embarked_encoder.fit_transform(embarked)

onehot = OneHotEncoder(sparse=False)
embarked_encoded = onehot.fit_transform(embarked_encoded.reshape(-1, 1))[:, :-1]
# Drop the last one-hot column: it is redundant, because that category
# is implied whenever the remaining columns are all 0.

In [141]:
embarked_encoded

array([[0., 0.],
       [1., 0.],
       [0., 0.],
       ...,
       [0., 0.],
       [1., 0.],
       [0., 1.]])
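The reason one one-hot column can be dropped without losing information can be sketched with NumPy alone. The label codes below are hypothetical and assume the sorted order C=0, Q=1, S=2.

```python
import numpy as np

# Hypothetical label codes, assuming C=0, Q=1, S=2
codes = np.array([2, 0, 2, 1])

# Full one-hot encoding: one column per category
full = np.eye(3)[codes]

# Drop the last column (S), as done above
reduced = full[:, :-1]

# The dropped column is recoverable: it is 1 exactly when the others are all 0
recovered_last = 1.0 - reduced.sum(axis=1)

print(reduced)
print(np.array_equal(recovered_last, full[:, -1]))  # True
```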

In [156]:
X_train_clean = np.c_[num_attributes, sex_encoded, embarked_encoded]
X_train_clean[0, :]

array([ 0.82520863, -0.58961986,  0.43135024, -0.47432585, -0.50023975,
        1.        ,  0.        ,  0.        ])

In [143]:
print(X_train_clean.shape, y_train.shape)

(889, 8) (889, 1)


OK, now the training data is fully prepared, and we can start to choose our models. This is a binary classification task. The models I will investigate are:

1. SGDClassifier
2. RandomForestClassifier

I will use precision, recall and accuracy to evaluate each model and to tune the hyperparameters.
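For reference, the three metrics can be computed from the counts of true/false positives and negatives. The sketch below uses hypothetical labels and predictions, not results from any model in this notebook.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = survived, 0 = did not survive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found
accuracy = (tp + tn) / len(y_true)  # overall fraction of correct predictions

print(precision, recall, accuracy)  # 0.6 0.75 0.625
```

In practice, `sklearn.metrics` offers `precision_score`, `recall_score` and `accuracy_score`, which compute the same quantities.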

To begin with, I will combine all the data prep stages into a single pipeline so preparing the test set will be easy.

#### Pipeline

In [170]:
from sklearn.pipeline import Pipeline, FeatureUnion

num_attr = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
sex_attr = ['Sex']
emb_attr = ['Embarked']

# Create a transformer to drop instances with NaN in the given attributes
class DropInstance(BaseEstimator, TransformerMixin):

    def __init__(self, attributes):
        # Columns to check for NaN; set here so transform(X) follows the API
        self.attributes = attributes

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.dropna(subset=self.attributes)
    

# Modified label encoder to feed into onehot encoder
class ModifiedLabelEncoder(LabelEncoder):

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y.values.ravel()).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y.values.ravel()).reshape(-1, 1)
    

# Modified onehot encoder to delete dummy variable
class ModifiedOneHot(OneHotEncoder):
    
    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y)[:, :-1]

    def transform(self, y, *args, **kwargs):
        return super().transform(y)[:, :-1]

    
num_pipeline = Pipeline([
    ('Selector', AttributeSelector(num_attr)),
    ('imputer', Imputer(strategy="mean")),
    ('scaler', StandardScaler()),
])

sex_pipeline = Pipeline([
    ('Selector', AttributeSelector(sex_attr)),
    ('encoder', ModifiedLabelEncoder()),
])

emb_pipeline = Pipeline([
    ('Selector', AttributeSelector(emb_attr)),
    ('encoder', ModifiedLabelEncoder()),
    ('onehotencoder', ModifiedOneHot(sparse=False)),
])

full_pipeline_X = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('sex_pipeline', sex_pipeline),
    ('emb_pipeline', emb_pipeline),
])

titanic_prepared = full_pipeline_X.fit_transform(X_train)
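As a point of comparison, and not the approach taken in this notebook, recent scikit-learn versions can express the same preparation with `ColumnTransformer`, without custom encoder subclasses. The sketch below assumes scikit-learn 0.21 or later (for the `drop` argument of `OneHotEncoder`) and runs on a hypothetical four-row frame named `toy`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical rows with the same columns as X_train
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 30.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", "Q", "S"],
})

num_attr = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

# Numeric columns: mean imputation, then standardisation
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric, num_attr),
        ("sex", OrdinalEncoder(), ["Sex"]),               # female -> 0, male -> 1
        ("emb", OneHotEncoder(drop=["S"]), ["Embarked"]),  # keep C and Q columns
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)  # (4, 8)
```

Dropping the `S` category here mirrors the `[:, :-1]` slice used above, since `S` is the last category in sorted order.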

#### SGDClassifier: