# The Titanic and its survivors

### The dataset that I will be exploring today is the Titanic dataset, which contains information on passengers who were aboard the RMS Titanic during its fatal voyage

The task is to build a machine learning model that predicts, as accurately as possible, how likely a given passenger was to survive. We are supplied with some training data to train the model on, and some test data to check the model's accuracy.
Before I jump in, it's probably a good idea to do a little research into the historical event in question.

* The RMS Titanic sank in the early hours of the morning on the 15th of April 1912.
* This resulted in an estimated 1500 deaths (between 1490 and 1635 people) out of the 2224 people on board.
* The Titanic only had enough lifeboats to carry about half of those on board.
* Even with all lifeboats available, if the full capacity of the ship had been met (3339) there would only have been enough lifeboats to save about 1/3 of those on board.
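
As a quick sanity check, the arithmetic behind these figures can be worked through in a few lines. This is only a sketch: the lifeboat capacity of 1178 is an assumption based on commonly cited figures rather than anything in the dataset, and the variable names are purely illustrative.

```python
# Figures quoted in the list above
on_board = 2224
estimated_deaths = 1500
full_capacity = 3339

# Assumption: commonly cited total lifeboat capacity, not part of the dataset
lifeboat_capacity = 1178

survival_rate = (on_board - estimated_deaths) / on_board
lifeboat_share_actual = lifeboat_capacity / on_board
lifeboat_share_full = lifeboat_capacity / full_capacity

print(f"Overall survival rate: {survival_rate:.1%}")
print(f"Lifeboat seats per person on board: {lifeboat_share_actual:.1%}")
print(f"Lifeboat seats per person at full capacity: {lifeboat_share_full:.1%}")
```

Under that assumption the lifeboats cover a little over half of those actually on board and roughly a third of the ship's full capacity.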

It's clear that they were not prepared for a potential disaster. Thomas E. Bonsall, a historian of the disaster, has commented that the evacuation was so badly organised that "even if they had the number of lifeboats they needed, it is impossible to see how they could have launched them". The lack of lifeboats meant that priority had to be given to certain people. Our intuition tells us that women and children would most likely be first, and therefore most likely to survive. Let's investigate whether that's true and what other factors played a role in people's survival.

# Read in the data

I am going to load in the data using Pandas, a data manipulation library.

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 30)

train = pd.read_csv("/kaggle/input/titanic/train.csv")
test  = pd.read_csv("/kaggle/input/titanic/test.csv")
both = [train,test]

# Preliminary investigation of the data

In [1]:
train.head()

### The contents
* PassengerId - This isn't going to be helpful
* Survived - What we are trying to predict. 0 = No, 1 = Yes
* Pclass - The passenger class. 1 = 1st, 2 = 2nd, 3 = 3rd
* Name - Full name including title
* Sex - male, female
* Age - In years
* SibSp - The number of siblings or spouses related to this passenger on board
* Parch - The number of parents or children related to this passenger on board
* Ticket - Ticket number. Doesn't appear to be that useful at first glance
* Fare - How much was paid for the ticket
* Cabin - Letter and number indicating the position of the passenger on the ship
* Embarked - C = Cherbourg, Q = Queenstown, S = Southampton

In [1]:
print("-"*15 +"Train"+"-"*15)
print(train.info())
print("-"*15 +"Test"+"-"*15)
print(test.info())

Excluding "Survived" (as that's what I am trying to predict in test) we have the same number of columns with the same datatypes. However, there seem to be quite a few nulls throughout, especially in "Cabin".
Let's take a better look at which null values are present.

# Missing values

In [1]:
percent_missing = train.isnull().mean() *100
missing_value_df = pd.DataFrame({"missing":train.isnull().sum(),
                                 'percent_missing': percent_missing})
print(missing_value_df)

In [1]:
percent_missing = test.isnull().mean() *100
missing_value_df = pd.DataFrame({"missing":test.isnull().sum(),
                                 'percent_missing': percent_missing})
print(missing_value_df)

There is a huge amount of missing data for "Cabin". It might be a good idea to drop this column when it comes to feature engineering, but we'll see with further analysis.
We might be able to glean further insights on the missing values using the missingno library (imported as msno).

In [1]:
import missingno as msno
msno.matrix(train)

In [1]:
train.describe()

* Out of the 891 people in this dataset only 38% survived, compared to roughly 32% of everyone who was actually on board.
* The ages range from as young as 5 months to as old as 80, with the average age being close to 30.
* The minimum fare is 0. This could be due to people sneaking aboard, taking very poor accommodation, or most likely that the crew are part of the dataset.

# Visualisations

Now, we have some initial assumptions about which features might predict survival, such as age, sex etc. Let's explore those, and see what else we can find out.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [1]:
sns.set_style("darkgrid") # Style that Seaborn will use for the figures

## Sex

In [1]:
sex_surv = train.groupby("Sex")["Survived"].mean() # Select the column first so text columns are never averaged
sex_surv

In [1]:
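sex_surv = train.groupby("Sex")["Survived"].mean() # Survival rate per sex

# Index by label rather than by position so the female and male rates cannot be mixed up
f_1 = round(sex_surv["female"] * 100,1)
f_0 = round((1-sex_surv["female"]) * 100,1)
m_1 = round(sex_surv["male"] * 100,1)
m_0 = round((1-sex_surv["male"]) * 100,1)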

m_surv = [m_0,m_1]
f_surv = [f_0,f_1]

male_df = train[train.Sex=="male"]
female_df = train[train.Sex=="female"]

fig,ax = plt.subplots(1,2,sharey=True)

axis_1 = sns.countplot(data=male_df, x="Sex",hue="Survived",ax=ax[0])
axis_2 = sns.countplot(data=female_df, x="Sex",hue="Survived",ax=ax[1])

def percent_label(ax,surv):
    
    for c,p in enumerate(ax.patches[:]):
        h = p.get_height()
        x = p.get_x()+p.get_width()/2.
        ax.annotate(str(surv[c])+"%", xy=(x,h), xytext=(0,4), 
                    textcoords="offset points", ha="center", va="bottom")

percent_label(axis_1,m_surv)
percent_label(axis_2,f_surv)

fig.suptitle("Sex vs survival",y=1.07, fontsize=15)
fig.tight_layout(pad=0)
axis_1.set_ylim(0,500)
axis_1.legend_.remove()
axis_2.set_ylabel("")

for ax in [axis_1,axis_2]:
    ax.set_xlabel("");

* Sex seems to be highly correlated with survival
* 74% of women survived compared to 19% of men
* Women were roughly 4x as likely to survive as men
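
To make the "times as likely" comparison concrete, here is a minimal sketch on a small, made-up table (not the real Titanic rows). It relies on the fact that the mean of a 0/1 column is the share of 1s; the names `toy`, `rates` and `ratio` are only illustrative.

```python
import pandas as pd

# Toy data (hypothetical): 4 women of whom 3 survived, 5 men of whom 1 survived
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate of each group
rates = toy.groupby("Sex")["Survived"].mean()

# How many times as likely women were to survive compared to men
ratio = rates["female"] / rates["male"]

print(rates)
print(f"Women were {ratio:.2f}x as likely to survive as men")
```

Applied to the rates quoted above (74% and 19%), the same division gives a ratio of just under 4.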

## Passenger Class

In [1]:
fig,ax = plt.subplots(1,2,figsize=(9,5))

ax[0].set_title("Count of survival",{'fontsize': 13},y=1.01)
ax[1].set_title("Survival likelihood",{"fontsize": 13},y=1.01)

sns.countplot(data=train,x="Pclass",hue="Survived",ax=ax[0])
sns.barplot(data=train,x="Pclass",y="Survived",ax=ax[1]);

* Class is correlated with survival
* Greater than 60% chance of survival if the passenger was in 1st class
* Only around 25% of 3rd class passengers lived

## Age

In [1]:
#fig,ax = plt.subplots()

ax = sns.FacetGrid(data=train,col="Survived",height=4)
ax.map(sns.distplot,"Age",bins=16)
ax.fig.suptitle("Age of survival",y=1.1,fontsize=15)
ax.set(xlim=(0,80));

In [1]:
ax = sns.FacetGrid( train,hue ="Survived",height=4,aspect=1.5 )
ax.map(sns.kdeplot, "Age", shade= True )
ax.set(xlim=(0 , train["Age"].max()))
ax.fig.suptitle("Age of survival",y=1.05,fontsize=15)
ax.add_legend();

* Confirms the intuition that children take precedence over adults
* Children under 10 are more likely to survive than other ages
* The early twenties seem to be the worst age for survival

We've seen how the features age, sex and class each correlate with survival individually. Now let's see what we notice when we combine these features in a single figure.

In [1]:
ax = sns.FacetGrid(train, row = "Sex", col = "Pclass", hue = "Survived")
ax.map(plt.hist, "Age",alpha=0.6,edgecolor="none",histtype="stepfilled")
ax.fig.suptitle("Age,sex and class vs survival",y=1.07,fontsize=20)
ax.add_legend();

* 1st class females have the greatest chance of living; age doesn't appear to be that influential for this group
* Young 3rd class men are at the greatest risk of perishing
* Likelihood of survival decreases with passenger class for men, but not for women
* The saying "women and children first" rings true here

## Fare

In [1]:
for df in both:
    df["Fare_bins"] = pd.qcut(df["Fare"],4)

plt.title("Fare vs survival",fontsize=15)
sns.pointplot(data=train,x='Fare_bins',y='Survived');

* As you might expect, fare positively correlates with survival
* But is this only to do with the money, or is it ultimately to do with class?
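
The fare bins above come from `pd.qcut`, which splits on quantiles so every bin holds roughly the same number of passengers. The sketch below contrasts that with `pd.cut` (equal-width bins) on a handful of made-up, right-skewed fares; the values are hypothetical and only meant to show why quantile bins suit a skewed column like "Fare".

```python
import pandas as pd

# Hypothetical, right-skewed fares: most are cheap, one is very expensive
fares = pd.Series([5, 7, 8, 10, 13, 20, 30, 500])

# qcut: quartile bins, so each bin holds the same number of passengers
quantile_bins = pd.qcut(fares, 4)

# cut: four equal-width intervals, so the single outlier stretches the range
width_bins = pd.cut(fares, 4)

print(quantile_bins.value_counts(sort=False))
print(width_bins.value_counts(sort=False))
```

With equal-width bins almost every fare lands in the first interval, which would leave the point plot with very little to compare.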

## Embarked

In [1]:
embarked_df = train[['Embarked', 'Survived']].groupby(['Embarked'],as_index=False).mean().sort_values("Survived",ascending=False)
print(embarked_df)

plt.title("Embarked vs survival",fontsize=15)
sns.barplot(data=train, x="Embarked",y="Survived");

* Passengers who embarked at Cherbourg had a 55% survival rate
* The wealth of these ports of embarkation may be a factor in the passengers' rate of survival

## Class and embarked

In [1]:
ax = sns.FacetGrid(train, col = "Embarked",height=4)
ax.map(sns.pointplot, "Pclass", "Survived", "Sex", order=None,hue_order=None, palette = "deep")
ax.fig.suptitle("Class & Embarked",y=1.05,fontsize=15)
ax.add_legend();

* Regardless of where you embarked, the lower your class the less likely you were to live
* The exception to this is Queenstown, where the survival rate of men increases slightly from 2nd to 3rd class
* However, this may be because there were fewer data points for 1st and 2nd class passengers from Queenstown

## Fare class survival

In [1]:
ax = sns.FacetGrid(data=train, col="Pclass",hue="Pclass",height=4)
ax.map(sns.pointplot, "Fare_bins","Survived",order=None);

* For 1st and 2nd class passengers, paying more money may have led to a greater chance of surviving
* However, for 3rd class it doesn't appear that the fare you paid was likely to help your survival; perhaps the stigma or position of 3rd class rooms was too great a detriment

## Class age survival

In [1]:
ax = sns.FacetGrid(data=train, col="Pclass",hue="Survived",height=4,aspect=1.3)
ax.map(sns.kdeplot,"Age",shade=True)
ax.set(xlim=(0 , train['Age'].max()))
ax.add_legend();

* The benefit of being a 1st class passenger seems to outweigh the negative of being an adult
* A greater number of children survived in classes 2 and 3, perhaps because upper class passengers had fewer children on board, which affects the results?

## Title

The name column on its own isn't any use to us, however within the entries are titles which we can extract into a new column

In [1]:
import re

for df in both:
    df["Title"] = df["Name"].apply(lambda x: re.split(r"(?<=, )(.*?)(?=\.)",x)[1]) # Raw string so that "\." reaches the regex engine unchanged

    df["Title"] = df["Title"].replace(["Don","Rev","Dr","Major","Sir","Col","Capt","Jonkheer","Lady","the Countess"],"Unique")
    df["Title"] = df["Title"].replace("Dona","Mrs")
    df["Title"] = df["Title"].replace(["Ms","Mlle","Mme"],"Miss")

title_surv = train[['Title', 'Survived']].groupby(['Title'],as_index=False).mean().sort_values("Survived",ascending=False)

sns.barplot(data=train,x="Title",y="Survived",order=["Mrs","Miss","Master","Unique","Mr"]);

* This graph fits well with our intuitions
* Women rank higher than men, with married women (who probably had a higher social status than unmarried women) above unmarried women
* "Master", i.e. young boys, is next, which makes sense as the young are usually given priority in life or death situations
* Those with a unique title e.g. Dr, Rev etc. (who are usually men) come second last, probably helped by their social status
* Lastly, regular men have a less than 20% chance of survival
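
For readers unfamiliar with the regular expression used to pull out the titles, here is a small stand-alone sketch of the same idea using `re.search`. The names follow the dataset's "Surname, Title. Given names" format; `extract_title` and `TITLE_PATTERN` are illustrative names, not part of the notebook.

```python
import re

# A few names in the "Surname, Title. Given names" format
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

# Capture everything between ", " and the first full stop
TITLE_PATTERN = re.compile(r", (.*?)\.")

def extract_title(name):
    """Return the title found in a name, or None if the format differs."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = [extract_title(n) for n in names]
print(titles)
```

Returning `None` for names that don't match is a design choice of this sketch; it avoids an IndexError if a name ever lacks a title.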

In [1]:
ax = sns.FacetGrid(data=train,col="Pclass",height=4)
ax.map(sns.barplot,"Title","Survived",palette="deep",order=None);

* Class greatly boosts men's likelihood of survival
* Interesting to see only 10% of 2nd class "Miss" survived

## Alone

Two new columns we can make are "Alone" and "Family_size" which use "SibSp" and "Parch"

In [1]:
for df in both:
    df["Alone"] = ((df["SibSp"] + df["Parch"]) == 0).astype(int) # 1 if travelling without relatives, otherwise 0
    df["Family_size"] = df["SibSp"] + df["Parch"] + 1 # The passenger plus any relatives on board

alone_df = train.groupby("Alone")["Survived"].mean()
print(alone_df)

In [1]:
alone_surv = train.groupby("Alone")["Survived"].mean() # Survival rate for passengers with (0) and without (1) relatives on board

not_a_1 = round(alone_surv[0] * 100,1)
not_a_0 = round((1-alone_surv[0]) * 100,1)
a_1 = round(alone_surv[1] * 100,1)
a_0 = round((1-alone_surv[1]) * 100,1)

alone_order = [not_a_0,not_a_1,a_0,a_1]
alone_order

ax = sns.countplot(data=train,x="Alone",hue="Survived")

for c,p in enumerate(ax.patches[:]):
        h = p.get_height()
        x = p.get_x()+p.get_width()/2.
        if h != 0:
            ax.annotate(str(alone_order[c])+"%", xy=(x,h), xytext=(0,4), 
                       textcoords="offset points", ha="center", va="bottom")

ax.set_ylim(0,400)
plt.title("Alone vs survival",y=1.01,fontsize=15);

## Who's alone

In [1]:
# Share of passengers travelling alone within each group
alone_sex = train.groupby("Sex")["Alone"].mean()
alone_embarked = train.groupby("Embarked")["Alone"].mean()
alone_class = train.groupby("Pclass")["Alone"].mean()

print(alone_sex,"\n")
print(alone_embarked,"\n")
print(alone_class,"\n")

ax = sns.FacetGrid( train,hue ="Alone",height=4,aspect=2)
ax.map(sns.kdeplot, "Age", shade= True)
ax.set(xlim=(0,train.Age.max()))
ax.add_legend()
ax.fig.suptitle("Alone age",y=1.05);

## Alone vs class,sex,embarked

In [1]:
fig,ax = plt.subplots(1,3,figsize=(20,5))
sns.barplot(data=train,x="Sex",y="Survived",hue="Alone",ax=ax[0])
sns.barplot(data=train,x="Pclass",y="Survived",hue="Alone",ax=ax[1])
sns.barplot(data=train,x="Embarked",y="Survived",hue="Alone",ax=ax[2]);

* You were 1.5 times more likely to die if you were alone
* Women who were alone had better odds than women who were not alone
* It doesn't appear to matter what class you are, if you are alone your chances aren't that highly affected

## Family size

In [1]:
ax = sns.pointplot(data=train,x="Family_size",y="Survived");

In [1]:
cf = sns.FacetGrid(data=train,col="Pclass",height=4,hue="Pclass")
cf.map(sns.pointplot,"Family_size","Survived",order=None)

ef = sns.FacetGrid(data=train,col="Embarked",height=4,hue="Embarked")
ef.map(sns.pointplot,"Family_size","Survived",order=None);

## Highest mortality rate

One last one for fun, let's look for the group with the highest mortality rate by creating a Treemap!

In [1]:
import matplotlib
from matplotlib import rcParams
import squarify

def label(n):
    # Look values up by column name; positional access such as n[0] on a labelled row is deprecated in pandas
    ordinal = {1:"1st",2:"2nd",3:"3rd"}[n["Pclass"]]
    age = n["Age_bins"] # Interval produced by pd.cut
    return "{0} class {1} {2} to {3}\n{4}% Died".format(ordinal,n["Sex"],int(round(age.left)),int(round(age.right)),round(n["Died"]*100))
    
def fix_zeros(n):
    if n == 0:
        return 0.01
    else:
        return n

for df in both:
    df["Age_bins"] = pd.cut(df["Age"],4)

comb_df = train.groupby(["Sex","Pclass","Age_bins"])["Survived"].mean().dropna().reset_index() # Survival rate per sex, class and age band

comb_df["Died"] = 1-comb_df["Survived"]

comb_df["Label"] = comb_df.apply(label,axis=1)

comb_df["Survived"] = comb_df["Survived"].apply(fix_zeros)
comb_df["Died"] = comb_df["Died"].apply(fix_zeros)

comb_df = comb_df.sort_values("Died",ascending=False)

plt.figure(figsize=(15,7))

norm = matplotlib.colors.Normalize(vmin=min(comb_df.Died), vmax=max(comb_df.Died)) # Normalise over the same column that is used for the colours
colors = [matplotlib.cm.Blues(norm(value)) for value in comb_df.Died]

squarify.plot(sizes=comb_df["Died"], label=comb_df["Label"], color=colors, alpha=.8,pad=True)
plt.axis('off');

* As we have come to expect, older men in the lower classes fare very poorly in life or death situations

The lower end of the graph becomes very difficult to read, so let's create another Treemap, but this time reduce the number of groups and show the lowest mortality rates.

## Lowest mortality rate

In [1]:
def label_2(n):
    # Look values up by column name rather than by position
    ordinal = {1:"1st",2:"2nd",3:"3rd"}[n["Pclass"]]
    return "{0} class {1}\n{2}% survived".format(ordinal,n["Sex"],round(n["Survived"]*100))
    
comb_df2 = train.groupby(["Sex","Pclass"])["Survived"].mean().dropna().reset_index()\
                                                        .sort_values("Survived",ascending=False)

comb_df2["Label"] = comb_df2.apply(label_2,axis=1)
comb_df2["Survived"] = comb_df2["Survived"].apply(fix_zeros)

plt.figure(figsize=(10,7))

norm = matplotlib.colors.Normalize(vmin=min(comb_df2.Survived), vmax=max(comb_df2.Survived))
colors = [matplotlib.cm.Blues(norm(value)) for value in comb_df2.Survived]


squarify.plot(sizes=comb_df2["Survived"], label=comb_df2["Label"], color=colors, alpha=.8,pad=False)
plt.axis('off');

### Conclusions
Some of the features that seem to have the greatest effect on survival are:
* Sex
* Pclass
* Age
* Alone
* Fare
* Embarked

# Feature engineering

This is how the data is currently looking. During the visualisation process we created some new features that we could train our model on. We can create some more now.

In [1]:
train.head()

Looking at the cabin column, we can see that the values are composed of letters and numbers. After a little Googling, it appears that the letter could be the deck and the number could be the room. So we could engineer new features by splitting the cabin column into two new columns, "Deck" and "Room".

Below are some images of the layout of the ship. These new columns could be important in predicting survival, as deck and room indicate a person's position on the ship, and therefore their potential class or proximity to the impact of the iceberg.

<img src="attachment:Titanic_cutaway_diagram.png" width="400">

In [1]:
from IPython.display import Image
Image("../input/class-layout/titanic class system layout.jpg")

In [1]:
def split_cabin(n):
    try:
        return n[0]
    except TypeError:
        return np.nan
    
def room(n):
    try:
        return n[1:]
    except TypeError:
        return np.nan

for df in both:
    df["Deck"] = df["Cabin"].apply(split_cabin)
    df["Room"] = df["Cabin"].apply(room)

Now we have our two new columns. The values for deck seem to be fine; however, when we look at the unique values for room we can see that we have some odd values. There are items in this list that seem to contain multiple rooms. This could potentially be a new feature: if a passenger has bought multiple rooms, they are likely wealthier and could be of a higher class, which we know from our visualisations correlates positively with survival.

In [1]:
train.Deck.unique()

In [1]:
train.Room.unique()

In [1]:
import math
from statistics import mean

def multi_rooms(n):
    try:
        if len(n["Room"])>3:
            return 1
        else:
            return 0
    except TypeError: # Room is NaN (a float, which has no len) when the cabin is unknown
        return np.nan
    
# Function to fix multiple rooms

def fix_rooms(n):
    try:
        if len(n["Room"])>3:
            r_list = [int(v) for v in re.findall(r"\d+",n["Room"])] # create a list of the multiple rooms
            av = math.ceil(mean(r_list))                           # Take the average room number
            return str(min(r_list, key=lambda x:abs(x-av)))        # Find the closest real room to the average
        elif len(n["Room"]) == 0:
            return np.nan
        else:
            return n["Room"]
    except TypeError:
        return np.nan
    
for df in both:    
    df["Multi_room"] = df.apply(multi_rooms,axis=1)
    df["Room"] = df.apply(fix_rooms,axis=1)
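
To show what the multi-room fix does to a single value, here is a stand-alone sketch of the same "closest real room to the average" idea. The function name `representative_room` is hypothetical and the inputs are example strings in the shape the "Room" column takes after the deck letter has been stripped.

```python
import math
import re
from statistics import mean

def representative_room(room):
    """Return one room number (as a string) for a possibly multi-room value."""
    numbers = [int(v) for v in re.findall(r"\d+", room)]
    if not numbers:
        return None                      # e.g. a cabin that is only a deck letter
    average = math.ceil(mean(numbers))   # average room number, rounded up
    # Choose the real room that is closest to that average
    return str(min(numbers, key=lambda x: abs(x - average)))

print(representative_room("23 C25 C27"))  # rooms 23, 25 and 27
print(representative_room("85"))          # a single room is returned unchanged
```

Picking an existing room rather than the raw average keeps the value inside the set of real room numbers, which matters once the column is encoded as a category.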

In [1]:
train.head()

In [1]:
train.Multi_room.value_counts()

In [1]:
train.Room.unique()

# Dealing with missing values

Most machine learning algorithms cannot handle missing or non-numerical values, so before we can use them we need to encode the data and impute any nulls.
There are multiple ways of doing this. We can:
* Drop the columns which contain nulls
* Impute missing values using the mean, median or mode
* Impute using an ML algorithm
* Add a marker column indicating that a value is missing
* Use a combination of the above

Let's try a few different ways and see how they play out.

We'll make 4 different versions of the dataframe to test:
* knn_marker - Uses KNNImputer, which finds the "k" nearest neighbours of a row and fills each missing value with the (optionally distance-weighted) mean of that feature across those neighbours. We also specify "add_indicator=True", which adds a marker column
* Impute - Impute using SimpleImputer (median, mode etc)
* Impute_marker - Impute + marker
* Drop - Drops the columns. We'll define this dataframe later
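
Before applying these to the real data, here is a minimal sketch of how the two imputers behave on a tiny, made-up array. The numbers are hypothetical and chosen so the results are easy to check by hand; only `SimpleImputer` and `KNNImputer` from scikit-learn are used.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data: the second row is missing its second feature
X_toy = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [4.0, 8.0]])

# Median imputation plus a marker column flagging which rows were filled
median_marked = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X_toy)

# KNN imputation: mean of the missing feature over the 2 nearest rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

print(median_marked)
print(knn_filled)
```

The median of the observed values (2, 6, 8) is 6, whereas KNN uses only the two rows closest in the first feature (values 2 and 6) and fills in their mean, 4. The marker column is 1 only for the row that was imputed.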

In [1]:
knn_marker = train.copy() 
    
impute = train.copy()
impute_marker = train.copy() 

impute_test = [knn_marker,impute,impute_marker] # train will be knn without a marker

In [1]:
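train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0]) # Fill missing values using the mode as there are very few (2) nulls in this column
test["Fare"] = test["Fare"].fillna(test["Fare"].median()) # Assign back rather than using inplace=True on a column, which newer pandas may not propagate to the dataframe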

for df in both:
    df["Fare_cats"] = df["Fare_bins"].cat.codes # Use cat.codes to label encode 
    df["Age_cats"] = df["Age_bins"].cat.codes.replace(-1, np.nan) # cat.codes encodes nan as -1 so replace with np.nan
    df.drop(columns=["PassengerId","Name","Age","Ticket","Fare","Cabin","Age_bins","Fare_bins"],inplace=True)

# Encode data

### OrdinalEncoder (label encoding)
Label encoding takes categorical data and assigns a numerical value to each unique item. For example, if you had the categories "1st Place", "2nd Place", "3rd Place", the encoder would assign values like 0, 1, 2. The code below uses scikit-learn's OrdinalEncoder, which applies this kind of encoding to feature columns (LabelEncoder is intended for the target). I decided to use it for these columns as they have an implicit order to them.
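
The "1st Place" example can be sketched directly with scikit-learn's OrdinalEncoder, the class used in the code below. Passing `categories` explicitly is my own addition here, so that the codes follow the intended order instead of relying on alphabetical sorting.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One column of categorical values (the encoder expects a 2D array)
places = np.array([["2nd Place"], ["1st Place"], ["3rd Place"], ["1st Place"]])

# Passing the categories explicitly fixes the order of the codes
encoder = OrdinalEncoder(categories=[["1st Place", "2nd Place", "3rd Place"]])
codes = encoder.fit_transform(places)

print(codes.ravel())
```

Each category is replaced by its position in the list, so "1st Place" becomes 0, "2nd Place" becomes 1 and "3rd Place" becomes 2.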

In [1]:
Image("../input/encoding/encoding.png")

In [1]:
from sklearn.preprocessing import OrdinalEncoder

le= OrdinalEncoder() # Instantiate estimator
encode_cols = ["SibSp","Parch","Deck","Room"]

def encode(data):
    nonulls = np.array(data.dropna()) # df of non-null values
    impute_reshape = nonulls.reshape(-1,1) # df has to be reshaped before the label encoder can be fit
    impute_ordinal = le.fit_transform(impute_reshape)
    data.loc[data.notnull()] = np.squeeze(impute_ordinal) # np.squeeze removes single-dimensional entries from the shape of an array

for df in both:
    for col in encode_cols:
        encode(df[col])

### OneHotEncoder
OneHotEncoder is similar to label encoding; however, it is more useful for data that has no order, such as "Red", "Yellow", "Blue".
It also differs in that instead of creating one new column containing numbers ranging from 0 to n-1 (where n is the number of unique items), it creates n new columns, each with a value of either 0 or 1.
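
As a small illustration of the colour example, the sketch below one-hot encodes a single made-up column. It uses the encoder's default sparse output and converts it with `toarray()`, which is a choice made for this sketch so it does not depend on version-specific arguments.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One unordered categorical column (hypothetical values)
colours = np.array([["Red"], ["Yellow"], ["Blue"], ["Red"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it a dense array
one_hot = encoder.fit_transform(colours).toarray()

print(encoder.categories_[0])  # the new columns follow the sorted category order
print(one_hot)
```

Three unique colours give three new columns, and every row has exactly one 1 in the column for its colour.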

In [1]:
Image("../input/encoding/OHE.png")

In [1]:
from sklearn.preprocessing import OneHotEncoder

sub = ["Sex","Embarked","Title"] # Subset of df columns to encode values
ohe = OneHotEncoder(handle_unknown = "ignore",sparse=False) # Instantiate estimator

for i,df in enumerate(both):
    ohe_df = pd.DataFrame(ohe.fit_transform(both[i][sub]))  # Fit and transform OHE to the data then convert to dataframe as it returns an array
    
    ohe_df.columns = ["Female","Male","C","Q","S","Master","Miss","Mr","Mrs","Unique"] # Reassign column names as OHE removes them
    
    both[i] = pd.concat([both[i],ohe_df],axis=1) # Concatenate the two dataframes
    
    both[i].drop(columns=["Sex","Embarked","Title"],inplace=True)

# Impute missing data
To fill in the missing values for the remaining columns I'm going to use KNNImputer which imputes using KNN.

In [1]:
from sklearn.impute import KNNImputer

knn_imp = KNNImputer(n_neighbors=5,weights="distance")

for i,df in enumerate(both):
    both[i] = pd.DataFrame(knn_imp.fit_transform(both[i]),columns = both[i].columns)

The final step is going to be scaling our data. Depending on which algorithm we use the inconsistent scales of our data will lead to certain features having a much greater effect than others.

In [1]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()

for i,df in enumerate(both):
    d = mm.fit_transform(both[i])
    both[i] = pd.DataFrame(d,columns=both[i].columns)
    
train = both[0]
test = both[1]
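
Min-max scaling maps each value to (x - min) / (max - min). The cell above fits a separate scaler on each dataframe; a common alternative, sketched here on made-up numbers rather than the real columns, is to fit the scaler on the training data only and reuse it for the test data so that both share the same scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_values = np.array([[0.0], [5.0], [10.0]])  # hypothetical training values
test_values = np.array([[2.5], [15.0]])          # hypothetical unseen values

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # learns min=0 and max=10
test_scaled = scaler.transform(test_values)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that unseen values outside the training range can fall outside [0, 1], which is expected when the scaler is reused in this way.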

Reassign our data to the standard variable names: X (features) and y (target).

In [1]:
X = train.drop(columns="Survived")
y = train["Survived"]

Our final fully wrangled data is ready.

In [1]:
X.head()

We'll repeat the previous steps for the other imputation methods that we decided on earlier.

In [1]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

for df in impute_test:
    df["Fare_cats"] = df["Fare_bins"].cat.codes 
    df["Age_cats"] = df["Age_bins"].cat.codes.replace(-1, np.nan)
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0]) # Assign back rather than relying on inplace=True on a column

mode = ["Deck","Room","Multi_room"]
median = ["Age_cats"]

c = mode+median+list(impute.columns) # impute Column names
c2 = mode+[n+"_missing" for n in mode]+median+[n+"_missing" for n in median]+list(impute_marker.columns) # Impute_marker column names

trans = [("mode",SimpleImputer(strategy="most_frequent"),mode),("median",SimpleImputer(strategy="median"),median)] # Estimators that we pass to our column transfomer
trans_m = [("mode",SimpleImputer(strategy="most_frequent",add_indicator=True),mode),("median",SimpleImputer(strategy="median",add_indicator=True),median)]

t = ColumnTransformer(transformers=trans,remainder="passthrough")
t_m = ColumnTransformer(transformers=trans_m,remainder="passthrough")

impute_test[1] = pd.DataFrame(t.fit_transform(impute),columns=list(dict.fromkeys(c)))
impute_test[2] = pd.DataFrame(t_m.fit_transform(impute_marker),columns=list(dict.fromkeys(c2)))

for df in impute_test:
    for col in encode_cols:
        encode(df[col])
               
for i,df in enumerate(impute_test):
    ohe_df = pd.DataFrame(ohe.fit_transform(impute_test[i][sub]))
    ohe_df.columns = ["Female","Male","C","Q","S","Master","Miss","Mr","Mrs","Unique"]
    impute_test[i] = pd.concat([impute_test[i],ohe_df],axis=1)
    
for df in impute_test:
    df.drop(columns=["PassengerId","Survived","Name","Ticket","Sex","Age","Fare","Embarked","Title","Age_bins","Fare_bins","Cabin"],inplace=True)
    
knn_imp = KNNImputer(n_neighbors=5,weights="distance",add_indicator=True)
impute_test[0] = pd.DataFrame(knn_imp.fit_transform(impute_test[0]),columns = list(impute_test[0].columns) + ["Deck_missing","Room_missing","Multi_room_missing","Age_cats_missing"])
    
for i in range(len(impute_test)):
    d = mm.fit_transform(impute_test[i])
    impute_test[i] = pd.DataFrame(d,columns=impute_test[i].columns)
    
knn_marker = impute_test[0]
impute = impute_test[1]
impute_marker = impute_test[2]

drop = X.drop(columns=["Deck","Room","Multi_room"])

imp_dict = {"KNN":X,"KNN + marker":knn_marker,"SimpleImpute":impute,"SimpleImpute + marker":impute_marker,"Drop":drop}

# Baseline performance
Let's choose a variety of algorithms that might be relevant for this classification problem, loop through them and create a table based on their performance.

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

algs = [RandomForestClassifier(random_state=0),
       GaussianNB(),
       KNeighborsClassifier(),
       SVC(probability=True,random_state=0),
       XGBClassifier(random_state=0),
       LogisticRegression(random_state=0),
       DecisionTreeClassifier(random_state=0)]

In [1]:
from sklearn.model_selection import cross_validate,ShuffleSplit,StratifiedKFold

shuf = ShuffleSplit(n_splits=10,test_size=0.25,train_size=0.75,random_state=0)
stratk = StratifiedKFold(n_splits=10)
baseline_df = pd.DataFrame()

row=0
for a in algs:
    
    results = cross_validate(a,X,y,cv=stratk,return_train_score=True)
    
    name = a.__class__.__name__
    baseline_df.loc[row,"Name"] = name
    baseline_df.loc[row,"Feature_count"] = len(X.columns)
    baseline_df.loc[row,"Train_score"] = results["train_score"].mean()
    baseline_df.loc[row,"Test_score"] = results["test_score"].mean()
    baseline_df.loc[row,"Time"] = results["fit_time"].mean()
    row+=1
    
baseline_df.sort_values("Test_score",ascending=False)

From this initial test we can see that the tree-based algorithms seem to perform better, but they also overfit the data to a larger degree than the others.

There are different cross-validation methods that we can use, so let's see if any stand out.

In [1]:
cv_methods_df = pd.DataFrame()
cvs = {"ShuffleSplit":ShuffleSplit(n_splits=10,test_size=0.25,train_size=0.75,random_state=0),
       "StratifiedKFold":StratifiedKFold(n_splits=10)} 

rfc = RandomForestClassifier(random_state=0)
i=0
for k,cv in cvs.items():
       
    results = cross_validate(rfc,X,y,cv=cv,return_train_score=True)
    
    name = list(cvs.keys())[i]
    cv_methods_df.loc[i,"cv_method"] = name
    cv_methods_df.loc[i,"Train_score"] = results["train_score"].mean()
    cv_methods_df.loc[i,"Test_score"] = results["test_score"].mean()
    cv_methods_df.loc[i,"Time"] = results["fit_time"].mean()
    i+=1
        
cv_methods_df.sort_values("Test_score",ascending=False)
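
To see how the two splitters differ, here is a small sketch on a made-up, imbalanced target. StratifiedKFold keeps the class ratio the same in every fold, while ShuffleSplit draws random test sets, so its ratio can vary from split to split. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for X and y.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedKFold

# Hypothetical imbalanced target: 6 negatives and 4 positives
y_toy = np.array([0] * 6 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # the feature values are irrelevant for splitting

strat = StratifiedKFold(n_splits=2)
strat_rates = [y_toy[test_idx].mean() for _, test_idx in strat.split(X_toy, y_toy)]

shuffle = ShuffleSplit(n_splits=2, test_size=0.5, random_state=0)
shuffle_rates = [y_toy[test_idx].mean() for _, test_idx in shuffle.split(X_toy)]

print("StratifiedKFold positive rate per fold:", strat_rates)
print("ShuffleSplit positive rate per split:", shuffle_rates)
```

Every stratified fold contains 40% positives, matching the full target, which is why stratification is often preferred when the classes are not balanced.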

Now let's see how our various imputation methods ultimately performed.

In [1]:
imp_methods = pd.DataFrame()
i=0
for a in algs:
    n=0
    for k,df in imp_dict.items():

        results = cross_validate(a,df,y,cv=stratk,return_train_score=True)

        name = list(imp_dict.keys())[n]
        imp_methods.loc[i,"Method"] = name
        imp_methods.loc[i,"Algorithm"] = a.__class__.__name__
        imp_methods.loc[i,"Test_score"] = results["test_score"].mean()
        imp_methods.loc[i,"Time"] = results["fit_time"].mean()
        i+=1
        n+=1
    
imp_methods.groupby("Algorithm").apply(lambda x: x.sort_values("Test_score", ascending=False)).drop(columns="Algorithm")

KNN looks to have an edge compared to the other options.

# Feature elimination methods
Choosing the right features, and the right number of them, is an important stage to get right in ML. Having too few features means you don't know enough about your data, while having too many means that your model will be worse at generalising to unseen data: it will have a high train score but a much lower test score.

### Different feature selection methods include:

**Variance thresholds**
* Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value.

**Correlation thresholds**
* Correlation thresholds remove features that are highly correlated with others (i.e. its values change very similarly to another's). These features provide redundant information

**PCA**
* PCA is a dimensionality reduction technique that aims to find the directions of maximal variance. It is unsupervised, so it ignores class labels.

**LDA**
* Similar to PCA, LDA is a linear transformation technique; however, LDA attempts to find a feature subspace that maximizes class separability, and it requires labelled data.

**Recursive Feature elimination**
* Searches for the best combination of features and removes the ones that don't perform well.
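
The first two methods in this list are not explored further in this notebook, so here is a minimal pandas-only sketch of how they might work. The data, the variance cutoff of 1e-8 and the correlation cutoff of 0.95 are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical features: "constant" never changes, "b" is an exact multiple of "a"
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

# Variance threshold: drop columns whose variance is (near) zero
low_variance = [col for col in features.columns if features[col].var() < 1e-8]
kept = features.drop(columns=low_variance)

# Correlation threshold: drop the later column of any highly correlated pair
corr = kept.corr().abs()
redundant = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i][col] > 0.95).any()
]
selected = kept.drop(columns=redundant)

print("Dropped for low variance:", low_variance)
print("Dropped as redundant:", redundant)
print("Remaining:", list(selected.columns))
```

Only comparing each column with the columns before it means one member of every correlated pair is always kept.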

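I'm going to explore PCA, LDA and RFECV, so let's take a look at how these methods perform.
RFECV only works with estimators that expose "coef_" or "feature_importances_" (tree-based and linear models, for example), so for the ones that don't I fall back to using all of the columns, the same as "None".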
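
As a rough sketch of that requirement, the snippet below checks two estimators for the attributes RFECV relies on and then runs RFECV on synthetic data. `X_demo`, `y_demo` and the helper `supports_rfe` are hypothetical stand-ins, not the notebook's variables, and the number of features kept will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

def supports_rfe(estimator):
    """Check whether a fitted estimator exposes coef_ or feature_importances_."""
    fitted = estimator.fit(X_demo, y_demo)
    return hasattr(fitted, "coef_") or hasattr(fitted, "feature_importances_")

print(supports_rfe(LogisticRegression(max_iter=1000)))  # linear model
print(supports_rfe(KNeighborsClassifier()))             # has neither attribute

# Recursively drop the weakest feature, scoring each subset with cross-validation
selector = RFECV(LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc")
X_reduced = selector.fit_transform(X_demo, y_demo)
print(X_reduced.shape[1], "of", X_demo.shape[1], "features kept")
```

Checking for the attributes up front is one alternative to catching the exception that RFECV raises for unsupported estimators.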

In [1]:
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate,ShuffleSplit
from sklearn.feature_selection import RFECV

FSM = ["None","RFECV","PCA","LDA"]
performance_df = pd.DataFrame()

alg_row=0
fsm_row=0

for a in algs:
    
    for fsm in FSM:
        
        if fsm == "None":
            x = X
            
        elif fsm == "RFECV":
            try:
                rfe = RFECV(a,scoring="roc_auc")
                x = rfe.fit_transform(X,y)   

            except (RuntimeError,ValueError): # The error raised for unsupported estimators depends on the scikit-learn version
                x = X
                
        elif fsm=="PCA":
            pca = PCA(n_components=0.95,random_state=0)
            x = pca.fit_transform(X)
            
        elif fsm=="LDA":
            lda = LinearDiscriminantAnalysis()
            x = lda.fit_transform(X,y)
    
        results = cross_validate(a,x,y,cv=stratk,return_train_score=True)

        performance_df.loc[fsm_row,"Name"] = a.__class__.__name__
        performance_df.loc[fsm_row,"FSM"] = fsm
        performance_df.loc[fsm_row,"Feature_count"] = x.shape[1]
        #performance_df.loc[fsm_row,"Train_score"] = results["train_score"].mean()
        performance_df.loc[fsm_row,"Test_score"] = results["test_score"].mean()
        performance_df.loc[fsm_row,"Test_improvement"] = results["test_score"].mean() - baseline_df.loc[alg_row,"Test_score"]
        performance_df.loc[fsm_row,"Time"] = results["fit_time"].mean()
        fsm_row+=1
        
    alg_row+=1
    
performance_df.groupby("Name").apply(lambda x: x.sort_values("Test_score", ascending=False)).drop(columns="Name")

It seems that either None or RFECV performs best.

I think I'll go with RFECV.
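
Since RFECV is the method carried forward, here is a minimal sketch of what it does in isolation. It uses synthetic data from `make_classification` rather than the Titanic features, and the decision tree and 5-fold split are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 features, only 3 of which carry information about the label
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFECV repeatedly fits the estimator, drops the weakest feature (step=1),
# and uses cross-validation to decide how many features to keep
selector = RFECV(
    DecisionTreeClassifier(random_state=0),
    step=1,
    cv=StratifiedKFold(5),
    scoring="roc_auc",
)
selector.fit(X_demo, y_demo)

n_selected = selector.n_features_   # number of features kept
mask = selector.support_            # boolean mask over the original columns
X_reduced = selector.transform(X_demo)

print(n_selected, mask)
```

The boolean `support_` mask is what lets us index the original columns, which is how `rfe_cols` is built later on.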

# Hyper-parameter tuning
The estimators that we choose have lots of parameters that affect our model's performance. We can search for the optimal combination using sklearn's GridSearchCV, which takes a grid of parameters and tries every combination, keeping the one that performs best. This is a very computationally expensive process and will take a long time to run.
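
To get a feel for the cost, the sketch below counts how many candidates a grid produces using sklearn's `ParameterGrid`. The grid here is a hypothetical one, similar in shape to a random forest grid, and the 3 folds are an assumption for the arithmetic.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 * 2 * 4 * 3 * 3 * 3 combinations
demo_grid = {
    "n_estimators": [100, 300, 500],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 6, 10, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

n_candidates = len(ParameterGrid(demo_grid))  # every combination is one candidate
n_folds = 3
n_fits = n_candidates * n_folds               # each candidate is fitted once per fold

print(n_candidates, n_fits)  # 648 1944
```

Every extra value in any list multiplies the total, which is why grids grow so quickly.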

In [1]:
from sklearn.model_selection import GridSearchCV

randomforest = {"n_estimators":[100,300,500],
                "criterion":["gini","entropy"],
                "max_depth":[2,6,10,None],
                "random_state":[0],
                # "auto" meant the same as "sqrt" for classifiers and was removed in newer sklearn
                "max_features":["sqrt","log2"],
                "min_samples_split":[2,5,10],
                "min_samples_leaf":[1,5,10]}

gaussianNB = {}  # nothing to tune

Kneighbors = {"n_neighbors":[5,10,15],
              "leaf_size":[30,35,40]}

svc = {"gamma":["scale","auto"],
       "kernel":["linear", "poly", "rbf", "sigmoid"]}

xgb = {"max_depth":[2,4,6,8,10,None],
       "random_state":[0]}

# Not every penalty/solver pair is valid; GridSearchCV scores failed fits as NaN by default.
# Newer sklearn releases spell the "none" penalty as None.
logisticregression = {"penalty":["l1", "l2", "elasticnet", "none"],
                      "random_state":[0],
                      "solver":["newton-cg","lbfgs","liblinear","sag","saga"]}

decisiontree = {"criterion":["gini","entropy"],
                "splitter":["best","random"],
                "max_depth":[2,6,10,None],
                "max_features":["sqrt","log2"],
                "random_state":[0],
                "min_samples_split":[2,5,10],
                "min_samples_leaf":[1,5,10]}

params =[randomforest,gaussianNB,Kneighbors,svc,xgb,logisticregression,decisiontree]

# Pair each estimator with its grid; this assumes `algs` is in the same order as `params`
alg_params={}
for a, grid in zip(algs, params):
    h_params = GridSearchCV(a,param_grid=grid,scoring='roc_auc',cv=3,n_jobs=-1)
    h_params.fit(X,y)
    alg_params[a.__class__.__name__] = h_params.best_params_
        
alg_params

Now that we have our new parameters, let's have a look at how our models currently perform.

In [1]:
rfc_best = RandomForestClassifier(**alg_params["RandomForestClassifier"])
nb_best = GaussianNB()
knn_best = KNeighborsClassifier(**alg_params["KNeighborsClassifier"])
svc_best = SVC(probability=True,**alg_params["SVC"])
xgb_best = XGBClassifier(**alg_params["XGBClassifier"])
lr_best = LogisticRegression(**alg_params["LogisticRegression"])
dt_best = DecisionTreeClassifier(**alg_params["DecisionTreeClassifier"])

best_algs = [rfc_best,nb_best,knn_best,svc_best,xgb_best,lr_best,dt_best]

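# RFECV needs an estimator with coef_ or feature_importances_. The original relied on the
# leftover loop variable `a`, which at this point is the last estimator in `algs`
rfe = RFECV(algs[-1],scoring="roc_auc")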
rfe.fit(X,y)
rfe_cols = X.columns[rfe.support_]

final = pd.DataFrame()
row=0
for a in best_algs:
    
    results = cross_validate(a,X[rfe_cols],y,cv=stratk,return_train_score=True)
    
    name = a.__class__.__name__
    final.loc[row,"Name"] = name
    final.loc[row,"Feature_count"] = len(X[rfe_cols].columns)
    final.loc[row,"Train_score"] = results["train_score"].mean()
    final.loc[row,"Test_score"] = results["test_score"].mean()
    final.loc[row,"Time"] = results["fit_time"].mean()
    row+=1
    
final.sort_values("Test_score",ascending=False)

# Voting classifier
We don't have to rely on just one model; we can take the opinion of several using sklearn's VotingClassifier. A collection of models working together on a single dataset is called an ensemble, and it works by combining the predictions from multiple machine learning algorithms. The final prediction is made according to one of two voting strategies, hard or soft. Setting the voting parameter to "hard" takes the majority vote of the models' predicted classes. Setting it to "soft" averages the predicted probabilities from each model and picks the class with the highest average.
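
The difference between the two strategies is easiest to see with a small worked example. The probabilities below are made up for illustration and do not come from the models in this notebook.

```python
import numpy as np

# Made-up predicted probabilities of survival for one passenger from three models
p_survived = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts a class vote, and the majority wins
hard_votes = (p_survived >= 0.5).astype(int)                    # two votes for 0, one for 1
hard_prediction = int(hard_votes.sum() > len(hard_votes) / 2)

# Soft voting: average the probabilities, then pick the more likely class
soft_prediction = int(p_survived.mean() >= 0.5)                 # mean is 0.6

print(hard_prediction, soft_prediction)  # 0 1
```

Two models lean slightly towards "did not survive", so hard voting predicts 0, but the third model is confident enough to pull the average probability above 0.5, so soft voting predicts 1.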

In [1]:
from sklearn.ensemble import VotingClassifier

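# The original relied on the leftover loop variable `a`, which after the previous
# cell is the last estimator in `best_algs`, i.e. the tuned decision tree
rfe = RFECV(dt_best,scoring="accuracy")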
rfe.fit(X,y)
rfe_cols = X.columns[rfe.support_]

est = [('rfc', rfc_best), ('xgb', xgb_best),
('knn', knn_best), ('svc',svc_best),('lr',lr_best),("nb",nb_best),("dt",DecisionTreeClassifier())]

vc_hard = VotingClassifier(estimators=est,voting='hard')
vc_hard_cv = cross_validate(vc_hard,X[rfe_cols],y,cv=stratk,return_train_score=True)
vc_hard.fit(X[rfe_cols], y)
print("Voting classifier hard",vc_hard_cv['test_score'].mean())

vc_soft = VotingClassifier(estimators=est,voting='soft')
vc_soft_cv = cross_validate(vc_soft,X[rfe_cols],y,cv=stratk,return_train_score=True)
vc_soft.fit(X[rfe_cols], y)
print("Voting classifier soft",vc_soft_cv['test_score'].mean())

Our final cross-validated accuracy appears to be around 87%.

In [1]:
test_s = pd.read_csv("/kaggle/input/titanic/test.csv")
test_s['Survived'] = vc_soft.predict(test[rfe_cols]).astype(int)

submission = test_s[['PassengerId','Survived']]
submission.to_csv("../working/submission.csv", index=False)