# Please **upvote** if you like this notebook!

# Summary
Hi everyone! This is my first time joining a competition and sharing my work with others.

In this notebook I will share my approach to the Titanic dataset using interactive plots. Here is the link to a Plotly tutorial: https://www.kaggle.com/desalegngeb/plotly-guide-customize-for-better-visualizations

This notebook will be divided into these stages:
1. Exploratory Data Analysis (EDA) and Feature Engineering
1. Choosing the Best Model
1. Hyperparameter Tuning

Don't forget to upvote this notebook if you enjoyed it, and if you have any thoughts please feel free to comment below!

# Section 1: Exploratory Data Analysis (EDA) and Feature Engineering

## Section 1.1: Import relevant libraries and peek raw data

In [None]:
from matplotlib import pyplot as plt
import string
import missingno as miss
import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor, plot_tree

In [None]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
test_passenger_id = test.PassengerId

In [None]:
# Read first few rows
train.head()

## Section 1.2: Missing values
So far so good. It seems that some data is missing in the Age and Cabin columns. We will investigate further by generating a missing value matrix. Referenced from: https://www.kaggle.com/rushikeshdarge/handle-missing-values-only-notebook-you-need

In [None]:
# Missing data matrix
miss.matrix(train);

This matrix shows that there are some missing values for Age, Cabin and Embarked. We can check the missing values and percentage with the following function.

In [None]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Column for dtypes
    dtype = df.dtypes

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, dtype], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values', 2: 'Data Types'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
                                                              "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [None]:
train_miss = missing_values_table(train)
train_miss

From the above table we can see that nearly **80%** of the Cabin values are missing! We can do a similar analysis for the test set.

In [None]:
# Check for test cases
miss.matrix(test)
test_miss = missing_values_table(test)
test_miss

## Section 1.3: Analysis by columns

### Section 1.3.1: PassengerId
The PassengerId column can be dropped since it is clearly irrelevant to the prediction of survival. However, we keep the test set's PassengerId for future use.

In [None]:
train.drop(columns='PassengerId', inplace=True)
test.drop(columns='PassengerId', inplace=True)

### Section 1.3.2: Survived
After some exploration we can plot the survival distribution in the training set.

In [None]:
# Survived is 0/1, so its mean is the fraction of survivors
survived_percentage = train['Survived'].mean() * 100
not_survived_percentage = 100 - survived_percentage

In [None]:
# The bars are built from two precomputed values, so no data frame is passed to px.bar
fig = px.bar(x=["Survived", "Not Survived"], y=[survived_percentage, not_survived_percentage], color=["Survived", "Not Survived"],
                width=600,height=350,
                color_discrete_map={ 
                    "Survived": "mediumturquoise", "Not Survived": "lightsalmon"
                },
                labels=dict(x = "Survived or Not", y="Percentage", color="Outcome"),
                )

fig.update_layout(
    title={
        'text': "Survival Rate Distribution",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})


fig.show();

You can hover on the bar and see the actual percentage for each category. Sadly most of the people could not survive according to the training set :(

### Section 1.3.3: Pclass
Pclass refers to the class that passenger belongs to.
* `Pclass = 1` refers to first class
* `Pclass = 2` refers to second class
* `Pclass = 3` refers to third class

As shown in the following plot, first class passengers had a higher survival rate than second class passengers, who in turn had a higher survival rate than third class passengers.

In [None]:
pclass=['First Class(1)', 'Second class(2)', 'Third Class(3)']

survived_count = train.groupby('Pclass')['Survived'].sum()
total_count = train.groupby('Pclass')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
                    go.Bar(name='Survived', x=pclass, y=survived_count, marker_color='mediumturquoise'),
                    go.Bar(name='Not Survived', x=pclass, y=not_survived, marker_color='lightsalmon'),],)
# Change the bar mode
fig.update_layout(width=700,
                  height=350,
                  barmode='group',
                  xaxis = dict(title="Pclass"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Pclass",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

It is obvious that Passenger Class is a categorical value, so we could use One-Hot Encoding by adding three different columns, each representing whether that Passenger is in that Class.

In [None]:
# One Hot Encoding function
def categorical_encode(name, training_set, test_set):
    df = pd.concat([training_set[name], test_set[name]])
    encoder = OneHotEncoder(handle_unknown='ignore')
    features = encoder.fit_transform(df.values.reshape(-1, 1)).toarray()
    
    # Find unique number of encodings
    n = df.nunique()
    cols = ['{}_{}'.format(name, n) for n in range(1, n + 1)]
    
    # Create a new dataframe and re-indexing
    encoded_df = pd.DataFrame(features, columns=cols)
    encoded_df.index = df.index
    for col in encoded_df.columns:
        encoded_df = encoded_df.astype({col: 'object'})
    
    training_set = pd.concat([training_set, encoded_df[:training_set.shape[0]]], axis=1)
    test_set = pd.concat([test_set, encoded_df[training_set.shape[0]:]], axis=1)
    return training_set, test_set

In [None]:
train, test = categorical_encode('Pclass', train, test)
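
To make the encoding concrete, here is a minimal sketch on a toy column (the data is made up for illustration, not taken from the competition files). It uses `pd.get_dummies`, a pandas alternative to the `OneHotEncoder`-based helper above, so it is not the helper itself. Note that the helper numbers its columns by position in sorted category order, which for Pclass happens to coincide with the class values.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

# One indicator column per class, in sorted category order
dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass').astype(int)
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded)
```

Each row has exactly one indicator set to 1, which is what "one-hot" means.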

Great! Let's investigate the next column.

### Section 1.3.4: Sex
From the survival count we can see that female passengers generally had a higher survival rate than male passengers. If you have seen the movie Titanic, you will remember that places on the lifeboats were prioritized for women and children!

In [None]:
# Sort the labels so they line up with the alphabetical order of the groupby results
sex = sorted(pd.concat([train['Sex'], test['Sex']]).unique())

survived_count = train.groupby('Sex')['Survived'].sum()
total_count = train.groupby('Sex')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=sex, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=sex, y=not_survived, marker_color='lightsalmon'),
])
# Change the bar mode
fig.update_layout(width=600,
                  height=350,
                  barmode='group',
                  xaxis = dict(title="Sex"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Sex",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

In [None]:
train, test = categorical_encode('Sex', train, test)

### Section 1.3.5: SibSp/Parch

These two feature columns can be merged into a single `Family_members` column. When we plot survival against `Family_members`, we can see a distinct pattern where some family sizes have a higher survival rate than others.

We will create a new column named `Family_members`, calculated as the sum of SibSp, Parch and 1. We add 1 to account for the passenger themselves.

In [None]:
train['Family_members'] = train['SibSp'] + train['Parch'] + 1
test['Family_members'] = test['SibSp'] + test['Parch'] + 1

In [None]:
members = sorted(pd.concat([train['Family_members'], test['Family_members']]).unique())

survived_count = train.groupby('Family_members')['Survived'].sum()
total_count = train.groupby('Family_members')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=1000,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Family size"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Family Size",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

Let's also investigate the survival rate for each family size.

In [None]:
percentage_survived = survived_count / total_count * 100

fig = go.Figure(go.Bar(name='Survived', 
                       x=members, 
                       y=percentage_survived, 
                       marker={
                            'color': percentage_survived,
                            'colorscale': 'Viridis'
                        }))
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=600,
                  height=400,
                  barmode='group',
                  xaxis = dict(title="Family size"),
                  yaxis = dict(title="Survival Percentage"))
fig.update_layout(
    title={
        'text': "Survival Rate with respect to Family Size",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

From the bar plot above, we can divide family size into 3 groups, each with a distinct survival rate:
* `Single` with `Family_members = 1`
* `Small` with `Family_members` between `2` and `4`
* `Large` with `Family_members` greater than or equal to `5`

Another reason for this choice is that when the family size is greater than or equal to 4, there are only a handful of sample points, so it is reasonable to group them together.


Now we can transform this column into a new feature, `Family_cat`.

In [None]:
train['Family_cat'] = [('Single' if member == 1 else (
                        'Large' if member >= 5 else 'Small')) for member in train['Family_members']]
test['Family_cat'] = [('Single' if member == 1 else (
                        'Large' if member >= 5 else 'Small')) for member in test['Family_members']]
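
As an aside, the same binning can be sketched with `pd.cut`, which maps numeric ranges to labels. The family sizes below are made up for illustration; the bin edges follow the three groups described above.

```python
import pandas as pd

# Toy family sizes (made up for illustration)
sizes = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins: (0, 1] -> Single, (1, 4] -> Small, (4, inf) -> Large
family_cat = pd.cut(sizes, bins=[0, 1, 4, float('inf')],
                    labels=['Single', 'Small', 'Large'])
print(family_cat.tolist())
```

`pd.cut` returns a categorical column; call `.astype(str)` on it if plain strings are preferred.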

After merging we can plot the survival rate again:

In [None]:
members = sorted(pd.concat([train['Family_cat'], test['Family_cat']]).unique())

survived_count = train.groupby('Family_cat')['Survived'].sum()
total_count = train.groupby('Family_cat')['Survived'].count()

percentage_survived = survived_count / total_count * 100

fig = go.Figure(go.Bar(name='Survived', 
                       x=members, 
                       y=percentage_survived, 
                       marker={
                            'color': percentage_survived,
                            'colorscale': 'Viridis'
                        }))
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=600,
                  height=400,
                  barmode='group',
                  xaxis = dict(title="Family Category"),
                  yaxis = dict(title="Survival Percentage"))
fig.update_layout(
    title={
        'text': "Survival Rate with respect to Family Category",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

Last but not least we will transform the categorical labels into one hot encoding.

In [None]:
train, test = categorical_encode('Family_cat', train, test)

### Section 1.3.6: Embarked
This field records the port at which the passenger boarded the Titanic. There are three possible entries:
* `Embarked = S`: The passenger embarked at Southampton
* `Embarked = C`: The passenger embarked at Cherbourg
* `Embarked = Q`: The passenger embarked at Queenstown

Note that there are two missing `Embarked` entries in the training set. One possible way to deal with them is to replace the missing values with the most frequent value, i.e. `Embarked = S`.

Then we will plot the Survival Count against Embarked.

In [None]:
indicies = train[train['Embarked'].isnull()].index.tolist()
train.loc[indicies,'Embarked'] = 'S'
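
The cell above hard-codes `'S'`. A minimal sketch of the same idea that looks up the most frequent value instead, on a made-up toy column (not the competition data):

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (made up for illustration)
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan]})

# mode() ignores missing values by default; take the first mode if there are ties
most_frequent = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_frequent)
print(toy['Embarked'].tolist())
```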

In [None]:
members = sorted(pd.concat([train['Embarked'], test['Embarked']]).unique())

survived_count = train.groupby('Embarked')['Survived'].sum()
total_count = train.groupby('Embarked')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=600,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Port Embarked"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Port Embarked",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

It turns out that the port of embarkation is not a redundant feature: `Embarked = S` has a lower survival rate than the other groups. As usual, we will transform the categorical labels into a one-hot encoding.

In [None]:
train, test = categorical_encode('Embarked', train, test)

In [None]:
train.columns

### Section 1.3.7: Age
This column represents the age of a passenger. Note that around **20%** of the test rows and **20%** of the train rows have missing age. In order to impute the missing value, we will train a shallow decision tree with age as the target.

In [None]:
df = train.copy()
df.reset_index(inplace=True)
df.drop(columns='index', inplace=True)

# Also drop Name, Ticket and Cabin and redundant columns
df.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'Pclass', 'Sex', 'SibSp', 
                 'Parch', 'Embarked', 'Family_members', 'Family_cat'], inplace=True)

In [None]:
# Preparation work to build a decision tree
temp_df = df.copy()
temp_df.dropna(inplace=True)
age_Y = temp_df.Age
temp_train = temp_df.drop(columns=['Age'])

In [None]:
# Depth of 5 is a hyperparameter
age_model = DecisionTreeRegressor(max_depth=5)
age_model.fit(temp_train, age_Y)

In [None]:
fig = plt.figure(figsize=(25,20))
# Pass the feature names as a plain list, which every scikit-learn version accepts
plot_tree(age_model, fontsize=18, max_depth=3, impurity=False, feature_names=list(temp_train.columns));

As a reference, 
* `Pclass_1` is the encoder for first class
* `Pclass_2` is the encoder for second class
* `Pclass_3` is the encoder for third class
* `Sex_1` is the encoder for female
* `Sex_2` is the encoder for male
* `Embarked_1` is the encoder for Cherbourg
* `Family_cat_1` is the encoder for Large Family Size
* `Family_cat_2` is the encoder for Single Family
* `Family_cat_3` is the encoder for Small Family

And the result from Decision Tree matches our intuition. For instance, at the root if the passenger is in first class the average age would be 38.233, else the average age would be 26.693.

Now we can use the tree to predict the missing ages.

In [None]:
na_age_index = train[train['Age'].isna()]
na_age_index = na_age_index.drop(columns=['Survived', 'Age', 'Name', 'Ticket', 'Cabin', 'Pclass', 'Sex', 
                                          'SibSp', 'Parch', 'Embarked', 'Family_members', 'Family_cat'])
age_na_pred = age_model.predict(na_age_index)
age_fill_na = train[train['Age'].isna()].index
train.loc[age_fill_na,'Age'] = age_na_pred

na_age_index_test = test[test['Age'].isna()]
na_age_index_test = na_age_index_test.drop(columns=['Age', 'Name', 'Ticket', 'Cabin', 'Pclass', 'Sex', 
                                                    'SibSp', 'Parch', 'Embarked', 'Family_members', 'Family_cat'])
age_na_pred_test = age_model.predict(na_age_index_test)
age_fill_na_test = test[test['Age'].isna()].index
test.loc[age_fill_na_test,'Age'] = age_na_pred_test
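
The whole pattern (fit on rows where the target is known, predict on rows where it is missing) can be sketched on a toy frame. The data, the feature list and the `max_depth=2` setting below are all made up for illustration and are not the notebook's actual inputs.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up): age loosely follows class and fare; two ages are missing
toy = pd.DataFrame({
    'Pclass_1': [1, 1, 0, 0, 1, 0],
    'Fare':     [80.0, 75.0, 8.0, 7.5, 90.0, 8.5],
    'Age':      [40.0, 38.0, 22.0, 24.0, np.nan, np.nan],
})
features = ['Pclass_1', 'Fare']
known = toy['Age'].notna()

# Fit only on rows where Age is known
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(toy.loc[known, features], toy.loc[known, 'Age'])

# Predict Age for the remaining rows and write the values back
toy.loc[~known, 'Age'] = model.predict(toy.loc[~known, features])
print(toy)
```

A boolean mask keeps the fit and predict steps aligned on the same index, which avoids rebuilding the feature frame by hand for each split.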

After filling in the missing values, we can plot the distribution of age in both train set and test set, and check if they are similar.

In [None]:
train_age = train['Age']
test_age = test['Age']

group_labels = ['Train Set', 'Test Set']

fig = make_subplots(
    rows=1, cols=2, subplot_titles=("Age Distribution", "Survival Distribution by Age")
)

fig2 = ff.create_distplot([train_age, test_age],
                         group_labels, 
                         show_hist=False, 
                         show_rug=False,
                         )

fig.add_trace(go.Scatter(fig2['data'][0],
                           marker_color='blue'
                          ), row=1, col=1)
fig.add_trace(go.Scatter(fig2['data'][1],
                           marker_color='red'
                          ), row=1, col=1)

fig.update_xaxes(title_text="Age", row=1, col=1)
fig.update_xaxes(title_text="Age", row=1, col=2)
fig.update_yaxes(title_text="Probability Distribution", row=1, col=1)
fig.update_yaxes(title_text="Probability Distribution", row=1, col=2)

surv = train[train['Survived'] == 1]['Age']
vict = train[train['Survived'] == 0]['Age']

group_labels = ['Survived', 'Not Survived']

fig3 = ff.create_distplot([surv, vict],
                         group_labels, 
                         show_hist=False, 
                         show_rug=False,
                         )

fig.add_trace(go.Scatter(fig3['data'][0],
                           marker_color='orange'
                          ), row=1, col=2)
fig.add_trace(go.Scatter(fig3['data'][1],
                           marker_color='green'
                          ), row=1, col=2)
fig.show();


### Section 1.3.8: Cabin


For the Cabin column, there are a lot of missing values in both the training and test sets: around **80%** of the entries are missing. To deal with this, we first have a look at the cross section of the Titanic.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Olympic_%26_Titanic_cutaway_diagram.png/330px-Olympic_%26_Titanic_cutaway_diagram.png)

As we can see, the first character of a Cabin entry represents the deck, so we can extract it as a feature. Imputing the missing entries is almost impossible since most of the cabin data is missing, so we will replace each missing value with deck 'M', which stands for missing deck.

In [None]:
# Use the first character of Cabin as the deck; 'M' marks a missing cabin
train['Deck'] = ['M' if pd.isnull(cabin) else cabin[0] for cabin in train['Cabin']]
test['Deck'] = ['M' if pd.isnull(cabin) else cabin[0] for cabin in test['Cabin']]
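
For comparison, a vectorized sketch of the same extraction using the pandas `.str` accessor. The cabin values below are made up to mimic the column's format.

```python
import numpy as np
import pandas as pd

# Toy cabins (made up for illustration); NaN marks a missing cabin
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', 'B57 B59', np.nan]})

# .str[0] takes the first character and leaves NaN untouched,
# so the missing rows can be filled with 'M' afterwards
toy['Deck'] = toy['Cabin'].str[0].fillna('M')
print(toy['Deck'].tolist())
```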

In [None]:
decks = pd.concat([train['Deck'], test['Deck']])
decks.value_counts()

It seems that `Deck = T` is not shown in the above diagram. After some investigation, Deck T is in fact the boat deck, so it would make sense if we merge deck T with deck A.

In [None]:
train['Deck'] = train['Deck'].replace(['T'],'A')

Let's plot the survival count and survival rate against Deck.

In [None]:
members = sorted(pd.concat([train['Deck'], test['Deck']]).unique())

survived_count = train.groupby('Deck')['Survived'].sum()
total_count = train.groupby('Deck')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=1000,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Deck"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Deck",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

In [None]:
percentage_survived = survived_count / total_count * 100

fig = go.Figure(go.Bar(name='Survived', 
                       x=members, 
                       y=percentage_survived, 
                       marker={
                            'color': percentage_survived,
                            'colorscale': 'Viridis'
                        }))
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=600,
                  height=400,
                  barmode='group',
                  xaxis = dict(title="Deck"),
                  yaxis = dict(title="Survival Percentage"))
fig.update_layout(
    title={
        'text': "Survival Rate with respect to Deck",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

Creating a new class 'M' seems to make sense: most passengers in Deck 'M' did not survive. Before merging the decks into broader categories, we can look at the passenger class distribution for each deck.

In [None]:
temp_df = train.groupby(['Deck', 'Pclass'], as_index=False)['Survived'].count()
temp_df['Total_Survived'] = [total_count[deck] for deck in temp_df['Deck']]
temp_df['Percentage'] = temp_df['Survived'] / temp_df['Total_Survived'] * 100

In [None]:
fig = px.bar(temp_df, x="Deck", y="Percentage", color="Pclass")
fig.update_layout(width=600,
                  height=400,
                  barmode='group',
                  xaxis = dict(title="Deck"),
                  yaxis = dict(title="Percentage"))
fig.update_layout(
    title={
        'text': "Pclass Distribution with respect to Deck",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

We can see how to group the Decks:
* 'Deck = A', 'Deck = B', 'Deck = C', 'Deck = D', 'Deck = E' can be grouped into upper deck, since most of the passengers are first class.
* 'Deck = F' can be grouped into middle deck, since most of the passengers are second class.
* 'Deck = G' can be grouped into lower deck, since most of the passengers are third class.
* 'Deck = M' can be grouped into missing deck.

In [None]:
train['Deck_cat'] = ['upper' if (deck >= 'A' and deck <= 'E') else
                     ('middle' if deck == 'F' else
                     ('lower' if deck == 'G' else 'missing'))
                     for deck in train['Deck']]
test['Deck_cat'] = ['upper' if (deck >= 'A' and deck <= 'E') else
                     ('middle' if deck == 'F' else
                     ('lower' if deck == 'G' else 'missing'))
                     for deck in test['Deck']]
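
The same grouping can also be written as a lookup table, which keeps the rule in one place for both sets. This is only a sketch on made-up deck letters; the name `deck_to_group` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical lookup table mirroring the grouping listed above
deck_to_group = {'A': 'upper', 'B': 'upper', 'C': 'upper', 'D': 'upper', 'E': 'upper',
                 'F': 'middle', 'G': 'lower'}

# Toy deck letters (made up for illustration)
decks = pd.Series(['A', 'F', 'G', 'M', 'C'])

# Anything not in the table (here 'M') falls back to 'missing'
deck_cat = decks.map(deck_to_group).fillna('missing')
print(deck_cat.tolist())
```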

In [None]:
train, test = categorical_encode('Deck_cat', train, test)

### Section 1.3.9: Fare

The Fare column is much like Age: both are continuous variables with missing values. We can reuse the code from Age and do a similar analysis. Using a Decision Tree is certainly overkill for estimating a single missing value, but reusing code is more convenient :)

In [None]:
df = train.copy()
df.reset_index(inplace=True)
df.drop(columns='index', inplace=True)

# Also drop Name, Ticket and Cabin and redundant columns
df.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'Pclass', 'Sex', 'SibSp', 
                 'Parch', 'Embarked', 'Family_members', 'Family_cat', 'Deck', 'Deck_cat'], inplace=True)

# Preparation work to build a decision tree
temp_df = df.copy()
temp_df.dropna(inplace=True)
fare_Y = temp_df.Fare
temp_train = temp_df.drop(columns=['Fare'])

In [None]:
# Depth of 5 is a hyperparameter
fare_model = DecisionTreeRegressor(max_depth=5)
fare_model.fit(temp_train, fare_Y)

In [None]:
na_fare_index_test = test[test['Fare'].isna()]
na_fare_index_test = na_fare_index_test.drop(columns=['Fare', 'Name', 'Ticket', 'Cabin', 'Pclass', 'Sex', 
                                                    'SibSp', 'Parch', 'Embarked', 'Family_members', 'Family_cat', 'Deck', 'Deck_cat'])
fare_na_pred_test = fare_model.predict(na_fare_index_test)
fare_fill_na_test = test[test['Fare'].isna()].index
test.loc[fare_fill_na_test,'Fare'] = fare_na_pred_test

We can then plot the Fare Distribution and Survival Distribution against Fare.

In [None]:
train_fare = train['Fare']
test_fare = test['Fare']

group_labels = ['Train Set', 'Test Set']

fig = make_subplots(
    rows=1, cols=2, subplot_titles=("Fare Distribution", "Survival Distribution by Fare")
)

fig2 = ff.create_distplot([train_fare, test_fare],
                         group_labels, 
                         show_hist=False, 
                         show_rug=False,
                         )

fig.add_trace(go.Scatter(fig2['data'][0],
                           marker_color='blue'
                          ), row=1, col=1)
fig.add_trace(go.Scatter(fig2['data'][1],
                           marker_color='red'
                          ), row=1, col=1)

fig.update_xaxes(title_text="Fare", row=1, col=1)
fig.update_xaxes(title_text="Fare", row=1, col=2)
fig.update_yaxes(title_text="Probability Distribution", row=1, col=1)
fig.update_yaxes(title_text="Probability Distribution", row=1, col=2)

surv = train[train['Survived'] == 1]['Fare']
vict = train[train['Survived'] == 0]['Fare']

group_labels = ['Survived', 'Not Survived']

fig3 = ff.create_distplot([surv, vict],
                         group_labels, 
                         show_hist=False, 
                         show_rug=False,
                         )

fig.add_trace(go.Scatter(fig3['data'][0],
                           marker_color='orange'
                          ), row=1, col=2)
fig.add_trace(go.Scatter(fig3['data'][1],
                           marker_color='green'
                          ), row=1, col=2)
fig.show();


We can see that the higher the fare, the higher the survival probability. It is likely that the fare is correlated with the passenger class, which in turn affects the survival rate.

### Section 1.3.10: Ticket

After careful inspection we can see duplicate ticket values. This shows that some passengers travelled on the same ticket, so they belong to the same group. We can use this information to create a new column called `Ticket_count`.

In [None]:
df = pd.concat([train.drop(columns='Survived'), test], ignore_index=True)
ticket_group = df.groupby('Ticket').size()
ticket_group.name = 'Ticket_count'

In [None]:
df = df.join(ticket_group, on='Ticket')
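
The group-size-then-join step above can also be sketched in a single line with `groupby(...).transform('count')`, which returns one value per row. The ticket strings below are made up for illustration.

```python
import pandas as pd

# Toy tickets (made up for illustration)
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1', 'B2']})

# For each row, count how many rows share the same ticket
toy['Ticket_count'] = toy.groupby('Ticket')['Ticket'].transform('count')
print(toy)
```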

In [None]:
train_column = train.columns
test_column = test.columns
train = pd.concat([train.Survived, df[:train.shape[0]]], ignore_index=True, axis=1)
train.columns = train_column.append(pd.Index(['Ticket_count']))
# Copy the slice so that later column assignments on test do not act on a view of df
test = df[train.shape[0]:].copy()

In [None]:
train['Ticket_count'].value_counts()

In [None]:
members = sorted(df['Ticket_count'].unique())

survived_count = train.groupby('Ticket_count')['Survived'].sum()
total_count = train.groupby('Ticket_count')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=1000,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Ticket Count"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Ticket Count",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

From the survival count plot, it seems that the ticket count is highly correlated with the family size. Let's calculate the correlation.

In [None]:
corr = df['Ticket_count'].corr(df['Family_members'])
print("The correlation is: {}".format(corr))
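
For reference, `Series.corr` computes the Pearson correlation coefficient by default. A minimal sketch on made-up numbers that reproduces it by hand:

```python
import numpy as np
import pandas as pd

# Toy series (made up for illustration)
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = pd.Series([1, 1, 2, 4, 4], dtype=float)

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
builtin = x.corr(y)
print(manual, builtin)
```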

It seems the correlation is not close to perfect. We can transform the ticket count into a categorical variable as we did in the Family section.

In [None]:
train['Ticket_cat'] = [('Single' if member == 1 else (
                        'Large' if member >= 5 else 'Small')) for member in train['Ticket_count']]
test['Ticket_cat'] = [('Single' if member == 1 else (
                        'Large' if member >= 5 else 'Small')) for member in test['Ticket_count']]

In [None]:
train, test = categorical_encode('Ticket_cat', train, test)

### Section 1.3.11: Name
The Name column holds the name of each passenger. We should not drop this column, since we can extract the title and infer social status from it; for example, a person with the title Dr. might have a different survival rate compared to a person with the title Mr.

We can extract the title of the passenger and, if possible, the marital status; the family name is extracted further below. This code is inspired by this post: https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook

In [None]:
# The title sits between the comma and the first period of the name
train['Title'] = train['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
# Assign the whole column at once; chained .loc indexing may not write back to the frame
train['Is_Married'] = (train['Title'] == 'Mrs').astype(int)

test['Title'] = test['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
test['Is_Married'] = (test['Title'] == 'Mrs').astype(int)
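
A regular expression is another way to sketch the title extraction. The pattern below is an assumption about the name format ("Surname, Title. Given names"), shown on a few names written in that format.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
is_married = (titles == 'Mrs').astype(int)
print(titles.tolist(), is_married.tolist())
```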

In [None]:
members = sorted(train['Title'].unique())

survived_count = train.groupby('Title')['Survived'].sum()
total_count = train.groupby('Title')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=1000,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Title"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Title",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

We can now merge the titles into four groups: `Mr`, `Miss/Mrs/Ms`, `Master` and `Dr/Military/Noble/Clergy`. `Master` is kept separate since it has a higher survival rate than the `Mr` class.

In [None]:
train['Title'] = train['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
train['Title'] = train['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')

test['Title'] = test['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
test['Title'] = test['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')

In [None]:
members = sorted(train['Title'].unique())

survived_count = train.groupby('Title')['Survived'].sum()
total_count = train.groupby('Title')['Survived'].count()
not_survived = total_count - survived_count

fig = go.Figure(data=[
    go.Bar(name='Survived', x=members, y=survived_count, marker_color='mediumturquoise'),
    go.Bar(name='Not Survived', x=members, y=not_survived, marker_color='lightsalmon'),
])
fig.update_xaxes(type='category')
# Change the bar mode
fig.update_layout(width=1000,
                  height=450,
                  barmode='group',
                  xaxis = dict(title="Title"),
                  yaxis = dict(title="Passenger Count"))
fig.update_layout(
    title={
        'text': "Survival Count with respect to Title",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show();

In [None]:
train, test = categorical_encode('Title', train, test)

Next we will encode the family and ticket survival rates, as suggested in this post: https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook#2.-Feature-Engineering

After this encoding, there are a few new columns:
* Ticket_Survival_Rate
* Ticket_Survival_Rate_NA
* Family_Survival_Rate
* Family_Survival_Rate_NA
* Survival_Rate
* Survival_Rate_NA
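
To make the idea behind these columns concrete, here is a minimal sketch of the family encoding on made-up data. The frames `toy_train` and `toy_test` and the variable `overall_rate` are hypothetical, and the sketch uses `map`/`isin` instead of explicit loops, so treat it as an illustration of the logic rather than the exact code used below.

```python
import pandas as pd

# Hypothetical toy frames, NOT the competition data
toy_train = pd.DataFrame({
    'Family':         ['Smith', 'Smith', 'Jones', 'Brown', 'Brown'],
    'Family_members': [2, 2, 1, 2, 2],
    'Survived':       [1, 1, 0, 0, 0],
})
toy_test = pd.DataFrame({'Family': ['Smith', 'Jones', 'Lee']})

# Fallback value for passengers without a usable family rate
overall_rate = toy_train['Survived'].mean()

# Keep only families seen in both sets that have more than one member
stats = toy_train.groupby('Family')[['Survived', 'Family_members']].median()
shared = set(toy_train['Family']) & set(toy_test['Family'])
rates = {fam: row['Survived'] for fam, row in stats.iterrows()
         if fam in shared and row['Family_members'] > 1}

# Map the rate onto the test rows; the _NA flag is 1 when a rate was found
toy_test['Family_Survival_Rate'] = toy_test['Family'].map(rates).fillna(overall_rate)
toy_test['Family_Survival_Rate_NA'] = toy_test['Family'].isin(list(rates)).astype(int)
print(toy_test)
```

Note that the `_NA` flag is 1 when a family-specific rate is available and 0 when the overall mean was used as a fallback. The ticket columns follow the same pattern, and `Survival_Rate` is the average of the two.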

In [None]:
df_all = pd.concat([train.drop(columns='Survived'), test], ignore_index=True)
train_column = train.columns
test_column = test.columns

In [None]:
def extract_surname(data):    
    
    families = []
    
    for i in range(len(data)):        
        name = data.iloc[i]

        if '(' in name:
            name_no_bracket = name.split('(')[0] 
        else:
            name_no_bracket = name
            
        # The surname is everything before the first comma
        family = name_no_bracket.split(',')[0]
        
        for c in string.punctuation:
            family = family.replace(c, '').strip()
            
        families.append(family)
            
    return families

df_all['Family'] = extract_surname(df_all['Name'])
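# Split the combined frame back into train and test rows.
# .copy() avoids SettingWithCopyWarning when new columns are added later.
n_train = train.shape[0]
df_train = df_all.iloc[:n_train].copy()
df_test = df_all.iloc[n_train:].copy()

# Re-attach the target by name so that column labels cannot get misaligned
df_train.insert(0, 'Survived', train['Survived'].values)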

dfs = [df_train, df_test]

In [None]:
# Families and tickets that appear in both the training and the test set
non_unique_families = set(df_train['Family']) & set(df_test['Family'])
non_unique_tickets = set(df_train['Ticket']) & set(df_test['Ticket'])

# Median of Survived and of the group size per family/ticket.
# Only numeric columns are selected, since recent pandas versions raise on the median of a string column.
df_family_survival_rate = df_train.groupby('Family')[['Survived', 'Family_members']].median()
df_ticket_survival_rate = df_train.groupby('Ticket')[['Survived', 'Ticket_count']].median()

family_rates = {}
ticket_rates = {}

for i in range(len(df_family_survival_rate)):
    # Keep a family only if it appears in both sets and has more than one member
    if df_family_survival_rate.index[i] in non_unique_families and df_family_survival_rate.iloc[i, 1] > 1:
        family_rates[df_family_survival_rate.index[i]] = df_family_survival_rate.iloc[i, 0]

for i in range(len(df_ticket_survival_rate)):
    # Keep a ticket only if it appears in both sets and is shared by more than one passenger
    if df_ticket_survival_rate.index[i] in non_unique_tickets and df_ticket_survival_rate.iloc[i, 1] > 1:
        ticket_rates[df_ticket_survival_rate.index[i]] = df_ticket_survival_rate.iloc[i, 0]

In [None]:
mean_survival_rate = np.mean(df_train['Survived'])

train_family_survival_rate = []
train_family_survival_rate_NA = []
test_family_survival_rate = []
test_family_survival_rate_NA = []

for i in range(len(df_train)):
    if df_train['Family'][i] in family_rates:
        train_family_survival_rate.append(family_rates[df_train['Family'][i]])
        train_family_survival_rate_NA.append(1)
    else:
        train_family_survival_rate.append(mean_survival_rate)
        train_family_survival_rate_NA.append(0)
        
for i in range(len(df_test)):
    if df_test['Family'].iloc[i] in family_rates:
        test_family_survival_rate.append(family_rates[df_test['Family'].iloc[i]])
        test_family_survival_rate_NA.append(1)
    else:
        test_family_survival_rate.append(mean_survival_rate)
        test_family_survival_rate_NA.append(0)
        
df_train['Family_Survival_Rate'] = train_family_survival_rate
df_train['Family_Survival_Rate_NA'] = train_family_survival_rate_NA
df_test['Family_Survival_Rate'] = test_family_survival_rate
df_test['Family_Survival_Rate_NA'] = test_family_survival_rate_NA

train_ticket_survival_rate = []
train_ticket_survival_rate_NA = []
test_ticket_survival_rate = []
test_ticket_survival_rate_NA = []

for i in range(len(df_train)):
    if df_train['Ticket'][i] in ticket_rates:
        train_ticket_survival_rate.append(ticket_rates[df_train['Ticket'][i]])
        train_ticket_survival_rate_NA.append(1)
    else:
        train_ticket_survival_rate.append(mean_survival_rate)
        train_ticket_survival_rate_NA.append(0)
        
for i in range(len(df_test)):
    if df_test['Ticket'].iloc[i] in ticket_rates:
        test_ticket_survival_rate.append(ticket_rates[df_test['Ticket'].iloc[i]])
        test_ticket_survival_rate_NA.append(1)
    else:
        test_ticket_survival_rate.append(mean_survival_rate)
        test_ticket_survival_rate_NA.append(0)
        
df_train['Ticket_Survival_Rate'] = train_ticket_survival_rate
df_train['Ticket_Survival_Rate_NA'] = train_ticket_survival_rate_NA
df_test['Ticket_Survival_Rate'] = test_ticket_survival_rate
df_test['Ticket_Survival_Rate_NA'] = test_ticket_survival_rate_NA

In [None]:
for df in [df_train, df_test]:
    df['Survival_Rate'] = (df['Ticket_Survival_Rate'] + df['Family_Survival_Rate']) / 2
    df['Survival_Rate_NA'] = (df['Ticket_Survival_Rate_NA'] + df['Family_Survival_Rate_NA']) / 2    
train = df_train
test = df_test

### Section 1.3.11: Conclusion

To conclude, we can plot the correlation matrix of the features and then drop the columns that are no longer needed.

In [None]:
train_corr = train[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Cabin', 'Family_members', 'Ticket_count', 'Title', 'Is_Married']]
corr = train_corr.corr()
fig = ff.create_annotated_heatmap(
    z=corr.to_numpy().round(2),
    x=list(corr.index.values),
    y=list(corr.columns.values),       
    xgap=3, ygap=3,
    zmin=-1, zmax=1,
    colorscale='YlGnBu',
    colorbar_thickness=30,
    colorbar_ticklen=3,
)
fig.update_layout(title_text='Correlation Matrix (train set)',
                  title_x=0.5,
                  title_font={'size': 24},
                  width=550, height=550,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',                   
                  paper_bgcolor=None,
                  )
fig.show()

As expected, `Family_members` and `Ticket_count` are strongly positively correlated (0.82), whereas `Pclass` and `Fare` are negatively correlated (-0.55).
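
The values in the heatmap come from `DataFrame.corr()`, which computes the Pearson correlation by default. As a small illustration of why a class/fare pair ends up negative, here is a sketch on made-up numbers (`toy` is hypothetical and does not reproduce the -0.55 above):

```python
import pandas as pd

# Hypothetical toy data: a higher class number (3rd class) goes with a lower fare
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Fare':   [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})

corr = toy.corr()  # Pearson correlation by default
pclass_fare = corr.loc['Pclass', 'Fare']
print(round(pclass_fare, 2))
```

The sign is negative because `Pclass` counts upwards from first class, while fares go down.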

In [None]:
test_corr = test[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Cabin', 'Family_members', 'Ticket_count', 'Title', 'Is_Married']]
corr = test_corr.corr()
fig = ff.create_annotated_heatmap(
    z=corr.to_numpy().round(2),
    x=list(corr.index.values),
    y=list(corr.columns.values),       
    xgap=3, ygap=3,
    zmin=-1, zmax=1,
    colorscale='YlGnBu',
    colorbar_thickness=30,
    colorbar_ticklen=3,
)
fig.update_layout(title_text='Correlation Matrix (test set)',
                  title_x=0.5,
                  title_font={'size': 24},
                  width=550, height=550,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',                   
                  paper_bgcolor=None,
                  )
fig.show()

In [None]:
train.drop(columns=['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'Family_members', 'Family_cat', 'Deck', 'Deck_cat',
            'Ticket_count', 'Ticket_cat', 'Title', 'Family'], inplace=True)
test.drop(columns=['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'Family_members', 'Family_cat', 'Deck', 'Deck_cat',
            'Ticket_count', 'Ticket_cat', 'Title', 'Family'], inplace=True)

# Section 2: Choose the best model

In [None]:
# Import related libraries

from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid

In [None]:
# Define scoring and StratifiedKFold
scoring = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']
cv = StratifiedKFold(5, shuffle=True, random_state=42)

In [None]:
# Separate the target column and scale all features to the [0, 1] range
y = train.Survived
train.drop(columns='Survived', inplace=True)

min_max_scaler = MinMaxScaler()
train = min_max_scaler.fit_transform(train)
test = min_max_scaler.transform(test)

In [None]:
# Dictionaries for plotting graph
accuracy = {}
f1 = {}
precision = {}
recall = {}
roc_auc = {}

In [None]:
def add_scores(model, scoring):
    accuracy[model] = scoring['test_accuracy'].mean()
    f1[model] = scoring['test_f1'].mean()
    precision[model] = scoring['test_precision'].mean()
    recall[model] = scoring['test_recall'].mean()
    roc_auc[model] = scoring['test_roc_auc'].mean()

After some preparation work, we can now fit different models to the data. The models considered in this notebook are:
* SVC
* RandomForestClassifier
* LogisticRegression
* DecisionTreeClassifier
* GradientBoostingClassifier
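
Before running all five models, it may help to see what `cross_validate` returns for a single one. This is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `cv_demo` and `demo_scores` are hypothetical stand-ins for the scaled Titanic features and the objects defined above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the scaled features and target (assumption, not Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_validate(
    LogisticRegression(solver='liblinear'),
    X_demo, y_demo,
    cv=cv_demo,
    scoring=['accuracy', 'roc_auc'],
)

# One array entry per fold under 'test_<metric>'; averaging gives one number per metric
print({k: v.mean() for k, v in demo_scores.items() if k.startswith('test_')})
```

The keys `test_accuracy`, `test_roc_auc` and so on are the ones that `add_scores` averages for each model.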

In [None]:
models = [SVC(max_iter=5000000), RandomForestClassifier(), LogisticRegression(solver='liblinear'), DecisionTreeClassifier(), GradientBoostingClassifier()]
model_name = ['SVC', 'Random_Forest', 'Logistic', 'Decision_Tree', 'Gradient_Boosting']

In [None]:
# Try different Models with Cross Validation

for i in range(len(models)):
    model = models[i]
    scores = cross_validate(model, train, y, cv=cv, scoring=scoring)
    add_scores(model_name[i], scores)

We can plot a graph and see the metrics across different models.

In [None]:
fig = make_subplots(
    rows=2, cols=3, subplot_titles=("Accuracy", "F1 Score", "Precision", "Recall", "ROC AUC")
)

fig.add_trace(go.Bar(x=list(accuracy.values()),
                     y=model_name, 
                     marker_color='mediumseagreen',
                     orientation='h',
                          ), row=1, col=1)

fig.add_trace(go.Bar(x=list(f1.values()),
                     y=model_name, 
                     marker_color='mediumseagreen',
                     orientation='h',
                          ), row=1, col=2)

fig.add_trace(go.Bar(x=list(precision.values()),
                     y=model_name, 
                     marker_color='mediumseagreen',
                     orientation='h',
                          ), row=1, col=3)

fig.add_trace(go.Bar(x=list(recall.values()),
                     y=model_name, 
                     marker_color='mediumseagreen',
                     orientation='h',
                          ), row=2, col=1)

fig.add_trace(go.Bar(x=list(roc_auc.values()),
                     y=model_name, 
                     marker_color='mediumseagreen',
                     orientation='h',
                          ), row=2, col=2)

fig.update_xaxes(range=[0.8, 0.9], row=1, col=1)
fig.update_xaxes(range=[0.75, 0.85], row=1, col=2)
fig.update_xaxes(range=[0.75, 0.85], row=1, col=3)
fig.update_xaxes(range=[0.7, 0.8], row=2, col=1)
fig.update_xaxes(range=[0.75, 0.95], row=2, col=2)

fig.update_layout(showlegend=False)
fig.show();


It seems that Gradient Boosting and SVC are good candidates, even without any hyperparameter tuning. Next we will fit all of the models with different hyperparameters and choose the best tuned version of each.

# Section 3: Hyperparameter Tuning

We now select the best hyperparameters for each model with a grid search.
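
As a warm-up, here is a small sketch of what a single grid search does, again on synthetic data. The grid `demo_grid` and the names `X_demo`, `y_demo`, `cv_demo` and `search` are hypothetical and much smaller than the grids used below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and target (assumption, not Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3 x 2 = 6 parameter combinations, each evaluated with 5-fold cross validation
demo_grid = {'max_depth': [1, 2, 3], 'min_samples_leaf': [2, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid,
                      cv=cv_demo, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)   # combination with the highest mean CV accuracy
print(search.best_score_)    # that mean CV accuracy
```

Since the cost grows with the product of all list lengths times the number of folds, the larger grids below can take a while to run.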

In [None]:
# Best estimator found by the grid search, keyed by model name
best_models = {}

In [None]:
param_grid =  {'Gradient_Boosting': {"n_estimators":[5, 50, 250, 500, 1000, 2000],"max_depth":[1,3,5],"learning_rate":[0.001, 0.005, 0.01,]},
               'SVC': {'C': [0.01, 0.025, 0.05, 0.075, 0.1], 'gamma': [1,0.5, 0.1,0.01],'kernel': ['rbf', 'poly', 'sigmoid']},
               'Decision_Tree': {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 5, 10, 20, 50, None], 'min_samples_leaf': [2, 3, 5, 10, 20, 50, 100]},
               'Logistic': {'penalty': ['l1', 'l2'], 'C': [0.5, 0.1, 0.05, 0.01]},
               'Random_Forest': {'max_depth': [1, 3, 5, 7], 'min_samples_leaf': [1, 2, 4, 6, 8], 'min_samples_split': [2, 4, 5, 6, 7, 10],
                                 'n_estimators': [5, 10 ,20, 25, 50, 100]}}
for i in range(len(models)):
    model = models[i]
    grid = GridSearchCV(model, param_grid[model_name[i]], cv=cv, scoring='accuracy', verbose=1)
    grid.fit(train, y)
    best_models[model_name[i]] = grid.best_estimator_
    print(grid.best_estimator_)
    print(grid.best_score_)

In [None]:
# Dictionaries for plotting graph
accuracy = {}
f1 = {}
precision = {}
recall = {}
roc_auc = {}

for name, model in best_models.items():
    scores = cross_validate(model, train, y, cv=cv, scoring=scoring)
    add_scores(name, scores)

These are the metric scores after hyperparameter tuning:

In [None]:
fig = make_subplots(
    rows=2, cols=3, subplot_titles=("Accuracy", "F1 Score", "Precision", "Recall", "ROC AUC")
)

fig.add_trace(go.Bar(x=list(accuracy.values()),
                     y=model_name, 
                     marker_color='lightsalmon',
                     orientation='h',
                          ), row=1, col=1)

fig.add_trace(go.Bar(x=list(f1.values()),
                     y=model_name, 
                     marker_color='lightsalmon',
                     orientation='h',
                          ), row=1, col=2)

fig.add_trace(go.Bar(x=list(precision.values()),
                     y=model_name, 
                     marker_color='lightsalmon',
                     orientation='h',
                          ), row=1, col=3)

fig.add_trace(go.Bar(x=list(recall.values()),
                     y=model_name, 
                     marker_color='lightsalmon',
                     orientation='h',
                          ), row=2, col=1)

fig.add_trace(go.Bar(x=list(roc_auc.values()),
                     y=model_name, 
                     marker_color='lightsalmon',
                     orientation='h',
                          ), row=2, col=2)

fig.update_xaxes(range=[0.8, 0.9], row=1, col=1)
fig.update_xaxes(range=[0.75, 0.85], row=1, col=2)
fig.update_xaxes(range=[0.8, 0.9], row=1, col=3)
fig.update_xaxes(range=[0.7, 0.8], row=2, col=1)
fig.update_xaxes(range=[0.75, 0.95], row=2, col=2)

fig.update_layout(showlegend=False)
fig.show();

After some tuning, most of the models perform better than before. We will apply all of the models to the test set and see the results. After some experimenting, the hyperparameters below worked best for me.

In [None]:
model_results = []
best_models = {'SVC' : SVC(max_iter=5000000, C=0.05, gamma=0.1, kernel='poly'),
               'Random_Forest': RandomForestClassifier(max_depth=1, min_samples_leaf=2, min_samples_split=5, n_estimators=20),
               'Logistic': LogisticRegression(solver='liblinear', C=0.1, penalty='l2'),
               'Decision_Tree': DecisionTreeClassifier(criterion='gini', max_depth=2, min_samples_leaf=2),
               'Gradient_Boosting': GradientBoostingClassifier(learning_rate=0.001, max_depth=1, n_estimators=2000)}
for name, model in best_models.items():
    model.fit(train, y)
    predicted_survival = model.predict(test)
    model_results.append(predicted_survival)

In [None]:
model_results = np.asarray(model_results)
model_results = model_results.sum(axis=0)

In [None]:
voting_df = pd.DataFrame(model_results, columns=['Vote'])
voting_df.value_counts()

We can now apply voting to decide whether each passenger in the test set survived. A passenger is marked as survived if 4 or more of the 5 models predict survival. You could experiment with different thresholds, but in my experiments this value performed best.
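
To illustrate how the threshold changes the outcome, here is a minimal sketch with made-up predictions (`demo_preds` is hypothetical, with one row per model and one column per passenger):

```python
import numpy as np

# Hypothetical predictions from five models for four passengers
demo_preds = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
])

votes = demo_preds.sum(axis=0)         # number of models voting "survived" per passenger
strict = (votes >= 4).astype(int)      # threshold used in this notebook
majority = (votes >= 3).astype(int)    # simple majority, for comparison
print(votes, strict, majority)
```

The third passenger gets 3 votes, so a simple majority marks them as survived while the stricter threshold of 4 does not. A higher threshold trades recall for precision on the survived class.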

In [None]:
prediction = [1 if count >= 4 else 0 for count in model_results]

In [None]:
result = pd.DataFrame(prediction, columns=['Survived'])
result['PassengerId'] = test_passenger_id
result.set_index('PassengerId', inplace=True)
result.head()
result.to_csv('train_test.csv')