In [1]:
# Use the Azure Machine Learning data source package
from azureml.dataprep import datasource

# classifier models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# modules to handle data
import pandas as pd
import numpy as np

Kaggle provides a train and a test dataset. The training data includes a Survived column, which is 1 if the passenger survived and 0 if they did not. This is the target we are ultimately trying to predict, so the test set does not have this column.

Because I’m lazy and don’t like doing things twice, I first loaded the data into a train and a test variable and then created a titanic variable by appending test to train, so that I can create new features for both data sets at the same time. I also stored an index for each of train and test so that I can split the combined data back into its train and test parts later.

I. Data Wrangling and Preprocessing

In [2]:
# load data 
train = datasource.load_datasource('train.dsource')
test = datasource.load_datasource('test.dsource')

# save PassengerId for final submission
passengerId = test.PassengerId

# merge train and test
titanic = train.append(test, ignore_index=True)

# create indexes to separate data later on
train_idx = len(train)
test_idx = len(titanic) - len(test)
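
The combine-then-split pattern can be sketched on a toy frame. This is a minimal sketch, not the notebook's own code: it assumes pd.concat in place of DataFrame.append (which pandas 2.0 removed), and toy_train and toy_test are hypothetical stand-ins for the Kaggle files.

```python
import pandas as pd

# Hypothetical toy frames standing in for the Kaggle train and test sets
toy_train = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Fare": [7.25, 71.28, 7.93]}
)
toy_test = pd.DataFrame({"PassengerId": [4, 5], "Fare": [8.05, 53.10]})

# Stack the two frames; rows that came from toy_test get NaN in Survived
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# The first len(toy_train) rows are the training rows, the rest are test rows
split_at = len(toy_train)
train_part = combined.iloc[:split_at]
test_part = combined.iloc[split_at:]

print(combined)
print(len(train_part), len(test_part))
```

Because the test rows are appended after the training rows, a single split position is enough to recover both parts.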

In [3]:
# view head of data 
titanic.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0.0,1.0,3.0,male,1.0,0.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.0,2.0,1.0,female,1.0,1.0,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0.0,3.0,3.0,female,0.0,1.0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.0,4.0,1.0,female,1.0,1.0,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0.0,5.0,3.0,male,0.0,0.0,373450


In [6]:
# get info on features
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null float64
PassengerId    1309 non-null float64
Pclass         1309 non-null float64
Sex            1309 non-null object
SibSp          1309 non-null float64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(7), object(5)
memory usage: 122.8+ KB


This shows us all the features (or columns) in the data frame along with the count of non-null values. Looking at the RangeIndex we see that there are 1309 total entries, but Age, Fare, and Survived have fewer than that, which means those columns contain null, or NaN, values. Cabin and Embarked report 1309 non-null values here, yet the head of the data above shows blank Cabin entries, which suggests that the gaps in those columns were loaded as empty strings rather than NaN. This is a dirty dataset and we either need to drop the rows with missing values or fill in the gaps by leveraging the data in the dataset to estimate what those values could have been. We will choose the latter and fill in the gaps rather than lose observations. One thing to note is that the Survived column does not need filling: its count of 891 corresponds to the labels from the train data. Remember that Survived is what we are trying to predict, so the test rows do not have it at all.

Even though this is technically the “Data Wrangling” section, before we do any data wrangling and address any missing values, I first want to create a Title feature which simply extracts the honorific from the Name feature. Simply put, an honorific is the title or rank of a given person such as “Mrs” or “Miss”. The following code takes a value like “Braund, Mr. Owen Harris” from the Name column and extracts “Mr”:

In [7]:
# create a new feature to extract title names from the Name column
titanic['Title'] = titanic.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())
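
To make the string handling explicit, here is the same split logic as a small standalone function. It is a sketch for illustration: extract_title is a hypothetical helper name, and the third sample name is made up to show a multi-word honorific.

```python
def extract_title(name):
    """Return the honorific between the comma and the first period after it."""
    after_comma = name.split(",")[1]       # e.g. " Mr. Owen Harris"
    before_period = after_comma.split(".")[0]  # e.g. " Mr"
    return before_period.strip()


samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, the Countess. of (Jane)",  # hypothetical name with a multi-word title
]
titles = [extract_title(n) for n in samples]
print(titles)
```

The approach relies on every name having the form "Surname, Title. Given names"; a name without a comma would raise an IndexError.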

In [8]:
titanic.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0.0,1.0,3.0,male,1.0,0.0,A/5 21171,Mr
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.0,2.0,1.0,female,1.0,1.0,PC 17599,Mrs
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0.0,3.0,3.0,female,0.0,1.0,STON/O2. 3101282,Miss
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.0,4.0,1.0,female,1.0,1.0,113803,Mrs
4,35.0,,S,8.05,"Allen, Mr. William Henry",0.0,5.0,3.0,male,0.0,0.0,373450,Mr


In [13]:
titanic.Title.unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer', 'Dona'], dtype=object)

After viewing the unique Titles that were pulled, we see that we have 18 different titles. Several of them are rare, so we will normalize them into broader groups that should generalize better. To do this, we will create a dictionary that maps the 18 titles to 6 broader categories and then map that dictionary back to the Title feature.

In [14]:
# normalize the titles
normalized_titles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}

# map the normalized titles to the current titles 
titanic.Title = titanic.Title.map(normalized_titles)

# view value counts for the normalized titles
print(titanic.Title.value_counts())

Mr         757
Miss       262
Mrs        200
Master      61
Officer     23
Royalty      6
Name: Title, dtype: int64
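
One thing to keep in mind with Series.map and a dictionary is that any value missing from the dictionary becomes NaN. The sketch below uses a hypothetical subset of the mapping and a made-up title ("Sgt") to show how unmapped values can be spotted, and one possible fallback.

```python
import pandas as pd

# Hypothetical subset of the mapping and a toy Series of raw titles
mapping = {"Mr": "Mr", "Mlle": "Miss", "Capt": "Officer"}
raw = pd.Series(["Mr", "Mlle", "Capt", "Sgt"])  # "Sgt" is deliberately unmapped

mapped = raw.map(mapping)

# Titles that the dictionary does not cover show up as NaN after mapping
unmapped = raw[mapped.isna()].unique()
print(unmapped)

# One possible fallback: keep the original value when no mapping exists
safe = mapped.fillna(raw)
print(safe.tolist())
```

In the notebook above all 18 titles are covered, which is why the value counts add up to 1309.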


The reason I wanted to create the Title feature first was so that I could use it to estimate the missing ages just a little bit better. The next step is to estimate the missing Age values. To do this, we will group the dataset by Sex, Pclass (Passenger Class), and Title.

In [15]:
# group by Sex, Pclass, and Title 
grouped = titanic.groupby(['Sex','Pclass', 'Title'])  

# view the median Age by the grouped features 
grouped.Age.median()

Sex     Pclass  Title  
female  1.0     Miss       30.0
                Mrs        45.0
                Officer    49.0
                Royalty    39.0
        2.0     Miss       20.0
                Mrs        30.0
        3.0     Miss       18.0
                Mrs        31.0
male    1.0     Master      6.0
                Mr         41.5
                Officer    52.0
                Royalty    40.0
        2.0     Master      2.0
                Mr         30.0
                Officer    41.5
        3.0     Master      6.0
                Mr         26.0
Name: Age, dtype: float64

Instead of simply filling in the missing Age values with the mean or median age of the dataset, by grouping the data by a passenger’s sex, class, and title, we can drill down a bit deeper and get a closer approximation of what a passenger’s age might have been. Using the grouped.Age variable, we can fill in the missing values for Age.

In [16]:
# fill each missing Age with the median Age of its (Sex, Pclass, Title) group;
# transform returns a result aligned to the original row index
titanic.Age = grouped.Age.transform(lambda x: x.fillna(x.median()))
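
The group-wise imputation is easier to see on a toy frame. This is a minimal sketch with made-up ages, grouping on two columns instead of three; it uses transform("median") so the group medians line up with the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age per group
toy = pd.DataFrame(
    {
        "Sex": ["male", "male", "male", "female", "female", "female"],
        "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
        "Age": [30.0, np.nan, 40.0, 18.0, 22.0, np.nan],
    }
)

# Median age of each (Sex, Title) group, broadcast back to every row
group_median = toy.groupby(["Sex", "Title"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

A group in which every age is missing has no median, so its rows would stay NaN and need a separate fallback.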

Next, we move on to the remaining features with missing values: Cabin, Embarked, and Fare. For these, we won’t be doing anything too fancy. We will fill Cabin with “U” for unknown, fill Embarked with the most frequent port of embarkation, and, since Fare has only one missing value, fill it with the median fare of the dataset:

In [17]:
# fill Cabin NaN with U for unknown
titanic.Cabin = titanic.Cabin.fillna('U')

# find most frequent Embarked value and store in variable
most_embarked = titanic.Embarked.value_counts().index[0]

# fill NaN with most_embarked value
titanic.Embarked = titanic.Embarked.fillna(most_embarked)

# fill NaN with median fare
titanic.Fare = titanic.Fare.fillna(titanic.Fare.median())

# view changes
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null float64
PassengerId    1309 non-null float64
Pclass         1309 non-null float64
Sex            1309 non-null object
SibSp          1309 non-null float64
Survived       891 non-null float64
Ticket         1309 non-null object
Title          1309 non-null object
dtypes: float64(7), object(6)
memory usage: 133.0+ KB


Everything looks good now. As expected, Survived still has missing values but since we are going to eventually be splitting the data back to train and test, we can ignore that for now.
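
One caveat worth checking: fillna only touches real NaN values. If blanks were loaded as empty strings, as the blank Cabin entries in the head output suggest, they count as non-null and are left as they are. The following is a sketch on made-up data, assuming the blanks are exactly the empty string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame where missing values arrived as empty strings
toy = pd.DataFrame({"Cabin": ["C85", "", "E46"], "Embarked": ["S", "", "S"]})

# Empty strings are not NaN, so the null count is zero everywhere
print(toy.isna().sum())

# Convert empty strings to NaN first, then fill as usual
cleaned = toy.replace("", np.nan)
cleaned["Cabin"] = cleaned["Cabin"].fillna("U")
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
print(cleaned)
```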

II. Feature Engineering

We will quickly create two more features before we begin our modeling. The first is family size per passenger, since a larger family may have found it harder to secure spots on a lifeboat than an individual passenger or a small family. We can derive family size from the SibSp and Parch features, which count each passenger’s siblings/spouses and parents/children aboard, respectively.

In [18]:
# size of families (including the passenger)
titanic['FamilySize'] = titanic.Parch + titanic.SibSp + 1

The last feature we will create leverages the Cabin feature: we extract the first letter of the cabin, which identifies the section of the ship where the room was located. This is potentially relevant since some cabins may have been closer to the lifeboats, and passengers in those cabins may have had a better chance of securing a spot.
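
The cells shown here do not include this step, and the dummy columns further down (Cabin_F2, Cabin_F33, and so on) suggest that full cabin strings were encoded. A minimal sketch of the extraction described above, on made-up values and assuming that blank cabins should fall back to “U”, might look like this:

```python
import pandas as pd

# Hypothetical cabin values, including an unknown marker and a blank entry
cabins = pd.Series(["C85", "U", "F33", "", "B57 B59"])

# Take the first character as the section letter; blanks fall back to "U"
deck = cabins.str[:1].replace("", "U")
print(deck.tolist())
```

Applied to the real column, this would reduce the number of Cabin dummy variables to one per section letter.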

In [19]:
titanic.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title,FamilySize
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0.0,1.0,3.0,male,1.0,0.0,A/5 21171,Mr,2.0
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.0,2.0,1.0,female,1.0,1.0,PC 17599,Mrs,2.0
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0.0,3.0,3.0,female,0.0,1.0,STON/O2. 3101282,Miss,1.0
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.0,4.0,1.0,female,1.0,1.0,113803,Mrs,2.0
4,35.0,,S,8.05,"Allen, Mr. William Henry",0.0,5.0,3.0,male,0.0,0.0,373450,Mr,1.0


The last step to perform before we can begin our modeling is to convert all our categorical features to numbers, as our algorithms can only take an array of numbers as input, not names or letters. As you may have noticed from the previous output, we have a few columns to convert. We use the pd.get_dummies() function from pandas, which converts categorical features into dummy variables.

In [20]:
# Convert the male and female groups to integer form
titanic.Sex = titanic.Sex.map({"male": 0, "female":1})

# create dummy variables for categorical features
pclass_dummies = pd.get_dummies(titanic.Pclass, prefix="Pclass")
title_dummies = pd.get_dummies(titanic.Title, prefix="Title")
cabin_dummies = pd.get_dummies(titanic.Cabin, prefix="Cabin")
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix="Embarked")

# concatenate dummy columns with main dataset
titanic_dummies = pd.concat([titanic, pclass_dummies, title_dummies, cabin_dummies, embarked_dummies], axis=1)

# drop categorical fields
titanic_dummies.drop(['Pclass', 'Title', 'Cabin', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)

titanic_dummies.head()

Unnamed: 0,Age,Fare,Parch,PassengerId,Sex,SibSp,Survived,FamilySize,Pclass_1.0,Pclass_2.0,...,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_,Embarked_C,Embarked_Q,Embarked_S
0,22.0,7.25,0.0,1.0,0,1.0,0.0,2.0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,38.0,71.2833,0.0,2.0,1,1.0,1.0,2.0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,26.0,7.925,0.0,3.0,1,0.0,1.0,1.0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,35.0,53.1,0.0,4.0,1,1.0,1.0,2.0,1,0,...,0,0,0,0,0,0,0,0,0,1
4,35.0,8.05,0.0,5.0,0,0.0,0.0,1.0,0,0,...,0,0,0,0,0,0,0,0,0,1
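
The dummy encoding used above can be sketched on a single column. The toy frame is hypothetical; the point is only to show that each category becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame(
    {"Embarked": ["S", "C", "Q", "S"], "Fare": [7.25, 71.28, 7.75, 8.05]}
)

# One indicator column per category, named with the given prefix
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Replace the original categorical column with its indicator columns
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(encoded.columns.tolist())
```

Depending on the pandas version, the indicator columns hold 0/1 integers or booleans; both work as numeric input to scikit-learn.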


In [21]:
titanic.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title,FamilySize
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0.0,1.0,3.0,0,1.0,0.0,A/5 21171,Mr,2.0
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.0,2.0,1.0,1,1.0,1.0,PC 17599,Mrs,2.0
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0.0,3.0,3.0,1,0.0,1.0,STON/O2. 3101282,Miss,1.0
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.0,4.0,1.0,1,1.0,1.0,113803,Mrs,2.0
4,35.0,,S,8.05,"Allen, Mr. William Henry",0.0,5.0,3.0,0,0.0,0.0,373450,Mr,1.0


Perfect! Our data is now in the format we need to perform some modeling. Let’s separate it back into train and test data frames using the train_idx and test_idx we created at the beginning of the exercise. We will also separate our training data into X for the predictor variables and y for our response variable, which in this case is the Survived labels.

In [22]:
# create train and test data (copies, so the assignment below
# does not trigger a SettingWithCopyWarning)
train = titanic_dummies[:train_idx].copy()
test = titanic_dummies[test_idx:].copy()

# convert Survived back to int
train.Survived = train.Survived.astype(int)

# create X and y for data and target values 
X = train.drop('Survived', axis=1).values 
y = train.Survived.values

# create array for test set
X_test = test.drop('Survived', axis=1).values

III. Modeling

I tested both a logistic regression model, used here as a binary classifier, and a random forest classifier, which fits a number of decision tree classifiers on the data. I used GridSearchCV to pass in a range of parameters and have it return the best score and the associated parameters.

The logistic regression model returned a best score of ~82%, while the random forest model got a best score of ~84%, so the random forest is the model I ended up using for my predictions. As a result, I will only cover the random forest model in this section.

GridSearchCV needs the estimator argument, which in this case is the random forest model, and a param_grid, which is a dictionary of parameters for the estimator. To prevent this post from being longer than it needs to be, I will let you look up the documentation for the random forest classifier to find out what the parameters do.

First, I created my dictionary of parameters with different ranges:

In [23]:
# create param grid object
forrest_params = dict(
    max_depth=list(range(9, 14)),
    min_samples_split=list(range(4, 11)),
    min_samples_leaf=list(range(2, 5)),
    n_estimators=list(range(10, 60, 10)),
)
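
It helps to know how much work this grid implies before fitting. The sketch below counts the parameter combinations with plain Python; the fold count of 5 is an assumption taken from the cv=5 shown in the printed GridSearchCV output.

```python
from itertools import product

# Same ranges as the parameter grid above
grid = {
    "max_depth": list(range(9, 14)),
    "min_samples_split": list(range(4, 11)),
    "min_samples_leaf": list(range(2, 5)),
    "n_estimators": list(range(10, 60, 10)),
}

# Every combination of one value per parameter is one candidate model
n_combinations = len(list(product(*grid.values())))

n_folds = 5  # assumption: matches cv=5 in the printed GridSearchCV output
total_fits = n_combinations * n_folds  # not counting the final refit

print(n_combinations, total_fits)
```

With a few thousand random forest fits, it is not surprising that the search takes several minutes.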

Next, I instantiate the random forest classifier:

In [24]:
# instantiate Random Forest model
forrest = RandomForestClassifier()

Lastly, we build the GridSearchCV and fit the model:

In [25]:
# build and fit model 
forest_cv = GridSearchCV(estimator=forrest, param_grid=forrest_params, cv=5)
forest_cv.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': [9, 10, 11, 12, 13], 'n_estimators': [10, 20, 30, 40, 50], 'min_samples_split': [4, 5, 6, 7, 8, 9, 10], 'min_samples_leaf': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

Once this finishes (and it will take quite a few minutes depending on your computer’s speed) you can use the best_score_ and best_estimator_ attributes to retrieve the best score and the fitted estimator whose parameters led to that score:

In [26]:
print("Best score: {}".format(forest_cv.best_score_))
print("Optimal params: {}".format(forest_cv.best_estimator_))

Best score: 0.8383838383838383
Optimal params: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=13, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            n_estimators=40, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
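
If only the winning parameter values are of interest, the best_params_ attribute returns them as a dictionary. The sketch below shows this on synthetic data with a much smaller grid so that it runs in seconds; the data, grid, and random_state values are assumptions for illustration and are unrelated to the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid: 2 x 2 = 4 candidate models
small_grid = {"max_depth": [3, 5], "n_estimators": [10, 20]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),  # fixed seed for repeatability
    param_grid=small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

# best_params_ holds only the tuned values; best_score_ is the mean CV accuracy
print(search.best_params_)
print(round(search.best_score_, 3))
```

Setting random_state on the classifier makes repeated runs comparable, which the notebook's default of random_state=None does not guarantee.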


Now we are ready to predict and submit! Remember that we saved the test set under X_test, so we can simply do the following:

In [27]:
# random forest prediction on test set
forrest_pred = forest_cv.predict(X_test)

forrest_pred is a one-dimensional array of 418 predictions for the Survived values, one per row of the test set. When loading the data, I placed the PassengerId column from the original test data into its own variable that I named passengerId. For our final submission, all we have to do is combine passengerId with forrest_pred into a data frame and output it to a CSV file. The following code does this:

In [29]:
# dataframe with predictions
kaggle = pd.DataFrame({'PassengerId': passengerId, 'Survived': forrest_pred})

# save to csv
kaggle.to_csv('titanic_pred.csv', index=False)
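
Before uploading, it can be worth checking the shape of the file. The sketch below builds a tiny submission from made-up IDs and predictions and inspects the CSV text in memory; casting both columns to int is a precaution in case the IDs or labels were loaded as floats.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical IDs and predictions standing in for passengerId and forrest_pred
ids = pd.Series([892.0, 893.0, 894.0], name="PassengerId")
preds = np.array([0, 1, 0])

# Cast to int so the file contains 892 rather than 892.0
submission = pd.DataFrame(
    {"PassengerId": ids.astype(int), "Survived": preds.astype(int)}
)

# Write to an in-memory buffer instead of a file to inspect the result
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
header = lines[0]

print(header)
print(lines[1])
```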