<h0><center><font size='6'>Titanic Dataset - Take 5</font></center></h0>

___

Table Of Contents
=

______

- <a href='#Link-SecA'>Introduction</a>
- <a href='#Link-SecB'>Exploratory Data Analysis</a>
    - <a href='#Link-SecB-a'>Training dataset</a>
    - <a href='#Link-SecB-b'>Checking for missing values</a>
    - <a href='#Link-SecB-c'>Target feature: Survival</a>
    - <a href='#Link-SecB-d'>Sex vs Survival rate</a>
    - <a href='#Link-SecB-e'>Pclass vs Survival rate</a>
    - <a href='#Link-SecB-f'>Parch vs Survival rate</a>
    - <a href='#Link-SecB-g'>SibSp vs Survival rate</a>
    - <a href='#Link-SecB-h'>Combining Parch and SibSp</a>
    - <a href='#Link-SecB-i'>Summarizing demographics analysis</a>
    - <a href='#Link-SecB-j'>Embarked vs Survival rate</a>
    - <a href='#Link-SecB-k'>Fare vs Pclass</a>
    - <a href='#Link-SecB-l'>Cabin vs Survival rate</a>
    - <a href='#Link-SecB-m'>Name vs Survival rate</a>
- <a href='#Link-SecB-n'>Test dataset</a>
    - <a href='#Link-SecB-o'>Checking for missing values</a>
- <a href='#Link-SecC'>Feature engineering and preparation</a>
    - <a href='#Link-SecC-a'>Cabin feature</a>
    - <a href='#Link-SecC-b'>Age feature</a>
    - <a href='#Link-SecC-c'>Embarked feature </a> 
    - <a href='#Link-SecC-d'>Fare feature </a>
    - <a href='#Link-SecC-e'>Name feature - extracting title </a>
    - <a href='#Link-SecC-f'>Parch and SibSp - The family </a>
    - <a href='#Link-SecC-g'>Tidying up</a>
- <a href='#Link-SecD'>Machine Learning</a>
    - <a href='#Link-SecD-a'>Preprocessing</a>
    - <a href='#Link-SecD-b'>Model Training</a>
    - <a href='#Link-SecD-h'>Model Evaluation </a>
    - <a href='#Link-SecD-i'>Kaggle Submission </a>

---

# <a id='Link-SecA'>Introduction</a>

___

Hello all!

This is my take on the classic Titanic dataset using different ML models: `Logistic Regression`, `Decision Trees`, `Random Forest`, `K-Nearest Neighbors` and `Extreme Gradient Boosting Classifier`.

This notebook was built after a lot of study and with help from the Kaggle community and courses. Feel free to copy the code to study and to develop your own knowledge of ML models. If you have any comments, do not hesitate to contact me.

___

# <a id='Link-SecB'>Exploratory Data Analysis</a>

___

In [None]:
# importing basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## <a id='Link-SecB-a'>Training dataset</a>

In [None]:
train = pd.read_csv('../input/titanic/train.csv',index_col='PassengerId')

In [None]:
train.head()

In [None]:
print(f'There are {train.shape[0]} rows in the data frame')
print(f'There are {train.shape[1]} columns in the data frame')

In [None]:
train.info()

### <a id='Link-SecB-b'>Checking for missing values</a> 

In [None]:
train.isnull().values.any()

In [None]:
train.isnull().sum()

In [None]:
def missing_value(df):
    number = df.isnull().sum().sort_values(ascending=False)
    number = number[number > 0]
    percentage = df.isnull().sum() *100 / df.shape[0]
    percentage = percentage[percentage > 0].sort_values(ascending=False)
    return  pd.concat([number,percentage],keys=["Total","Percentage"],axis=1)

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis');

Three features in the train data set have missing values: `Cabin` (77% missing), `Age` (20% missing) and `Embarked` (only 2 observations missing). They will be investigated further in the feature engineering section.

In [None]:
missing_value(train)

### <a id='Link-SecB-c'>Target feature: Survival</a>  

We will start with the `Survived` column, the target feature that the ML algorithms will try to predict. A good first step is to look at the balance between its two classes.

In [None]:
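# rename_axis + reset_index(name=...) yields the columns 'Survived' and 'percent' on both older and newer pandas versions
surv = train['Survived'].value_counts(normalize=True).mul(100).rename_axis('Survived').reset_index(name='percent')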

In [None]:
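# Map the labels by value instead of by row position, so the result does not depend on the sort order
surv['Survived'] = surv['Survived'].map({0: 'No', 1: 'Yes'})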

In [None]:
plt.figure(figsize=(10,6))
sns.set_style('darkgrid')
sns.set_context('paper',font_scale=1.5)

plt.suptitle('Survival probability')
sns.barplot(x='Survived',y='percent',data=surv);

Over 60% of the individuals in the train data set did not survive.
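
The class balance also gives a useful reference point for later: a model that always predicts the majority class is right as often as that class occurs. A minimal sketch on a made-up `survived` series (the values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical labels for illustration only: 0 = did not survive, 1 = survived
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Share of each class; the largest share equals the accuracy of always predicting that class
class_share = survived.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier trained later should at least beat this kind of baseline to be worth keeping.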

Next we will look at the survival rate in the different features.

### <a id='Link-SecB-d'>Sex vs Survival rate</a>

In [None]:
sex=train.groupby('Sex')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
sex

In [None]:
sex_dist = train['Sex'].value_counts(normalize=True).mul(100).rename_axis('Sex').reset_index(name='percent')

In [None]:
sex_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)
plt.suptitle('Sex feature distribution')
axs[0].set_title('Passengers per Sex')
axs[1].set_title('Survival rate per Sex')

sns.barplot(ax=axs[0],x='Sex',y='percent',data=sex_dist)
sns.barplot(ax=axs[1],x='Sex',y='percent',hue='Survived',data=sex);

From the figure above it becomes clear that male passengers represented over 60% of the individuals in the train dataset, and most of them (about 80%) did not survive the disaster. It is also interesting to look at the age distribution of males and females in the figure below: passengers over 60 years old were mainly male, while the distribution of ages between 0 and 60 is similar for both sexes.

In [None]:
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=2)

grid = sns.FacetGrid(data=train,
                     col='Sex',
                     margin_titles=True,height=5,aspect=1.6,legend_out=True,despine=False)

grid.map(plt.hist,'Age',alpha=0.5,bins=30)
grid.add_legend();

Now, if we split the age distribution one more level and look at it across the `Survived` label, it seems that most male passengers over 60 did not survive the disaster.

In [None]:
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

grid = sns.FacetGrid(data=train,
                     col='Survived',
                     row='Sex',
                     margin_titles=True,height=5,aspect=1.5,legend_out=True,despine=False)

grid.map(plt.hist,'Age',alpha=0.5,bins=30)
grid.add_legend();

### <a id='Link-SecB-e'>Pclass vs Survival rate</a>

In [None]:
pclass=train.groupby('Pclass')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
pclass

In [None]:
pclass_dist = train['Pclass'].value_counts(normalize=True).mul(100).rename_axis('Pclass').reset_index(name='percent')

In [None]:
pclass_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Pclass feature distribution')
axs[0].set_title('Passengers per Pclass')
axs[1].set_title('Survival rate per Pclass')

sns.barplot(ax=axs[0],x='Pclass',y='percent',data=pclass_dist);
sns.barplot(ax=axs[1], x='Pclass',y='percent',data=pclass,hue='Survived');

Most passengers were traveling in third class (~55%), and more than 75% of them did not survive. First class passengers had the highest chance of survival (over 50%), while passengers in second class had a roughly even chance of surviving or not.

We could now add `Age` to the mix and investigate the age structure in the different `Pclass` values, to build a bit more of our intuition on how the different features interact with each other.

In [None]:
sns.set_style('darkgrid')

grid = sns.FacetGrid(data=train,
                     #col='Survived',
                     col='Pclass',
                     margin_titles=True,height=5,aspect=1,legend_out=True,despine=False)

grid.map(plt.hist,'Age',alpha=0.5,bins=30)
grid.add_legend();

Most of the youngsters were traveling in third class, while the older passengers seem to be concentrated in first class. At least, first class shows far fewer passengers under 20 years old than second and third class.

Next, we look at the `Age` structure per `Pclass` against `Survived` to confirm the trend. Again, the figure below suggests that third class had a higher percentage of youngsters who did not survive.

In [None]:
sns.set_style('darkgrid')

grid = sns.FacetGrid(data=train,
                     col='Survived',
                     row='Pclass',
                     margin_titles=True,height=5,aspect=1.5,legend_out=True,despine=False)

grid.map(plt.hist,'Age',alpha=0.5,bins=20)
grid.add_legend();

Perhaps we should also look at the proportion of males and females per `Pclass`.

In [None]:
sex_pclass=train.groupby('Pclass')['Sex'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
sex_pclass

In [None]:
fig, axs = plt.subplots(figsize=(6,5))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Sex proportion per Pclass')

sns.barplot(x='Pclass',y='percent',hue='Sex',data=sex_pclass,ci=False)
plt.legend(bbox_to_anchor=(1.1,1));


As expected, males were the majority in every class, and the imbalance is largest in third class.
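
The plot above normalizes within each class, which answers "what share of a class is male?" rather than "in which class did most males travel?". A small sketch of the two normalizations with `pd.crosstab`, on a made-up `toy` frame (not the real passengers):

```python
import pandas as pd

# Hypothetical toy passengers, for illustration only
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 3, 3, 3, 3, 3],
    'Sex': ['male', 'male', 'female', 'male', 'male', 'male', 'male', 'female', 'female'],
})

# Share of each sex within a class (every row sums to 1): the view used in the bar plot
within_class = pd.crosstab(toy['Pclass'], toy['Sex'], normalize='index')

# Share of each class within a sex (every column sums to 1): where most males travelled
within_sex = pd.crosstab(toy['Pclass'], toy['Sex'], normalize='columns')

print(within_class.round(2))
print(within_sex.round(2))
```

Both views are valid; they simply answer different questions, so it helps to state which one a sentence refers to.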

### <a id='Link-SecB-f'>Parch vs Survival rate</a>

Continuing with our 'demographics' exploration, we will have a look at the `Parch` feature, which contains the number of parents and children each passenger had on board the Titanic.

In [None]:
parch=train.groupby('Parch')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
parch

In [None]:
parch_dist = train['Parch'].value_counts(normalize=True).mul(100).rename_axis('Parch').reset_index(name='percent')

In [None]:
parch_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Parch feature distribution')
axs[0].set_title('Passengers per Parch')
axs[1].set_title('Survival rate per Parch')

sns.barplot(ax=axs[0],x='Parch',y='percent',data=parch_dist);
sns.barplot(ax=axs[1], x='Parch',y='percent',data=parch,hue='Survived');

More than 75% of the passengers had neither parents nor children on board the Titanic, and over 65% of them did not survive. However, it is important to note that passengers with a `Parch` of 4 or more had an even lower survival rate, although very few passengers fall into those groups.

### <a id='Link-SecB-g'>SibSp vs Survival rate</a>

In [None]:
sibsp=train.groupby('SibSp')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
sibsp

In [None]:
sibsp_dist = train['SibSp'].value_counts(normalize=True).mul(100).rename_axis('SibSp').reset_index(name='percent')

In [None]:
sibsp_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('SibSp feature distribution')
axs[0].set_title('Passengers per SibSp')
axs[1].set_title('Survival rate per SibSp')

sns.barplot(ax=axs[0],x='SibSp',y='percent',data=sibsp_dist);
sns.barplot(ax=axs[1], x='SibSp',y='percent',data=sibsp,hue='Survived');

Following the trend observed in the `Parch` feature, most of the passengers also had no spouse or siblings on board the Titanic, and the survival rate of those with a `SibSp` of 3 or more was also extremely low.

### <a id='Link-SecB-h'>Combining Parch and SibSp</a>

As a test, we will combine the features `Parch` and `SibSp` into a new feature called `Family`, which classifies passengers as travelling with family members or not.
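
The same flag can also be computed in a single vectorized step. A minimal sketch on a made-up `toy` frame, as an alternative to the row-wise functions used in this notebook:

```python
import pandas as pd

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({'SibSp': [0, 1, 0, 3], 'Parch': [0, 0, 2, 1]})

# 1 if the passenger has at least one sibling, spouse, parent or child on board, else 0
toy['Family'] = ((toy['SibSp'] + toy['Parch']) > 0).astype(int)

print(toy)
```

Unlike the cells in this section, this version leaves the original `SibSp` and `Parch` counts untouched, which keeps them available for later plots.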

In [None]:
def sib_family(row):
    if row['SibSp'] > 0:
        return 1
    else:
        return 0
train['SibSp'] = train.apply(lambda row: sib_family(row),axis=1)

In [None]:
def par_family(row):
    if row['Parch'] > 0:
        return 1
    else:
        return 0
train['Parch'] = train.apply(lambda row: par_family(row),axis=1)

In [None]:
train.head()

In [None]:
train['Family'] = train['SibSp'] + train['Parch']

In [None]:
train.head()

In [None]:
def family(row):
    if row['Family'] > 0:
        return 1
    else:
        return 0
train['Family'] = train.apply(lambda row: family(row),axis=1)

In [None]:
train.head()

In [None]:
train['Family'].value_counts()

In [None]:
fam_dist = train['Family'].value_counts(normalize=True).mul(100).rename_axis('Family').reset_index(name='percent')

In [None]:
fam_dist

In [None]:
fam = train.groupby('Family')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
fam

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Family feature distribution')
axs[0].set_title('Passengers per Family')
axs[1].set_title('Survival rate per Family')

sns.barplot(ax=axs[0],x='Family',y='percent',data=fam_dist);
sns.barplot(ax=axs[1], x='Family',y='percent',data=fam,hue='Survived');

Excellent. Now we know that "loners" had more than a 60% chance of not surviving, while passengers traveling with family had a roughly equal chance of surviving or not.

### <a id='Link-SecB-i'>Summarizing demographics analysis</a>

The exploration of the demographics of the Titanic passengers has already built some good intuition.
Before we continue, let's summarize what we have learned so far:

- Most passengers did not survive
- Most passengers were men
- Most passengers were young, under 40 years old (men and women)
- Most passengers were traveling in 3rd class (men and women)
- Most passengers were traveling alone
- The chance of survival seems to be related to the number of family members traveling together

Now we will look at the `Embarked` feature, which represents the ports where the passengers boarded the Titanic: Cherbourg (C), Queenstown (Q) and Southampton (S).

### <a id='Link-SecB-j'>Embarked vs Survival rate</a>

In [None]:
embkd = train.groupby('Embarked')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
embkd

In [None]:
embkd_dist = train['Embarked'].value_counts(normalize=True).mul(100).rename_axis('Embarked').reset_index(name='percent')

In [None]:
embkd_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Embarked feature distribution')
axs[0].set_title('Passengers per Embarked')
axs[1].set_title('Survival rate per Embarked')

sns.barplot(ax=axs[0],x='Embarked',y='percent',data=embkd_dist);
sns.barplot(ax=axs[1], x='Embarked',y='percent',data=embkd,hue='Survived');

Most passengers boarded in Southampton, and passengers that embarked in Cherbourg had a higher survival rate.

### <a id='Link-SecB-k'>Fare vs Pclass</a>

Last but not least, the `Fare` feature will be investigated. We want to look into the relationship between `Fare` and `Pclass`, since some sort of pricing difference is to be expected.

In [None]:
fare = train.groupby('Pclass')['Fare'].describe()

In [None]:
fare.transpose()

In [None]:
sns.set_style('darkgrid')

grid = sns.FacetGrid(data=train,
                     #row='Embarked',
                     col='Pclass',
                     margin_titles=True,height=5,aspect=1,legend_out=True,despine=False)

grid.map(plt.hist,'Fare',alpha=.5,bins=20)
grid.add_legend();

It is clear that there is a significant price difference between the classes on board, so the features `Fare` and `Pclass` appear to be largely redundant.
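
One way to put a number on this kind of redundancy is a rank correlation between the two features. A minimal sketch on a made-up `toy` frame (the fares are invented for illustration); Spearman's correlation is computed here as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical fares per class, for illustration only
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Fare': [80.0, 52.0, 26.0, 13.0, 8.05, 7.9, 7.25],
})

# Spearman correlation = Pearson correlation of the ranked values
rank_corr = toy['Pclass'].rank().corr(toy['Fare'].rank())

# The median fare per class shows the same pattern in the original units
median_fare = toy.groupby('Pclass')['Fare'].median()

print(f"Rank correlation between Pclass and Fare: {rank_corr:.2f}")
print(median_fare)
```

A value close to -1 means that a higher class number goes with a lower fare almost without exception, which is the situation in which keeping both features adds little.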

### <a id='Link-SecB-l'>Cabin vs Survival rate</a>

Looking now at the `Cabin` feature where 77% of the data is missing. 

In [None]:
train['Cabin'].unique()

In [None]:
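# Fill missing values with the letter 'U' (for Unknown) to be used in the analysis against Survived
# Assigning the result back avoids the chained in-place fillna, which recent pandas versions warn about
train['Cabin'] = train['Cabin'].fillna('U')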

In [None]:
cabin_letter = train.copy()

In [None]:
cabin_letter['cabin_letter'] = cabin_letter['Cabin'].str[0]

In [None]:
c_letter = cabin_letter.groupby('cabin_letter')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
c_letter

In [None]:
c_letter_dist = cabin_letter['cabin_letter'].value_counts(normalize=True).mul(100).rename_axis('cabin_letter').reset_index(name='percent')

In [None]:
c_letter_dist

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Cabins feature distribution')
axs[0].set_title('Passengers per Cabin class')
axs[1].set_title('Survival rate per Cabin class')

sns.barplot(ax=axs[0],x='cabin_letter',y='percent',data=c_letter_dist);
sns.barplot(ax=axs[1], x='cabin_letter',y='percent',data=c_letter,hue='Survived');

`Cabin` looks like a good candidate to be engineered into just two classes (known and unknown), since passengers with known cabins seem to have had a better chance of surviving.

### <a id='Link-SecB-m'>Name vs Survival rate</a>

Intuitively, your name should not increase or decrease your chance of survival, but the title in the name might be of some use. We will extract the title and check it against the survival rate.
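
The titles sit between the comma and the first dot of each name, so a short regular expression is enough to pull them out. A minimal sketch on a few names written in the dataset's format:

```python
import pandas as pd

# A handful of sample names in the 'Surname, Title. Given names' format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# A space, then letters, then a literal dot: captures the word right before the first '.'
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

print(titles.tolist())
```

Because `str.extract` returns a Series with the same index as `names`, the result stays aligned with the rows it came from.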

In [None]:
# check unique titles (raw string so that '\.' is read as a regex escape)
train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False).value_counts()

In [None]:
c = train['Name']

In [None]:
c = c.tolist()

In [None]:
import re

# Each raw title is mapped to one of four groups; unlisted titles (e.g. 'Jonkheer') fall back to 'Mr'
title_groups = {
    'Mr': 'Mr', 'Dr': 'Mr', 'Rev': 'Mr', 'Col': 'Mr',
    'Capt': 'Mr', 'Major': 'Mr', 'Don': 'Mr', 'Sir': 'Mr',
    'Master': 'Master',
    'Miss': 'Miss', 'Mlle': 'Miss', 'Mme': 'Miss', 'Ms': 'Miss',
    'Mrs': 'Mrs', 'Countess': 'Mrs', 'Lady': 'Mrs', 'Dona': 'Mrs',
}

def ttle(names):
    """Return one grouped title per name, in the same order as `names`."""
    titles = []
    for name in names:
        match = re.search(r' ([A-Za-z]+)\.', name)
        raw_title = match.group(1) if match else None
        titles.append(title_groups.get(raw_title, 'Mr'))
    return titles

d = ttle(c)

In [None]:
# Reuse the PassengerId index of train so that each title lines up with its own passenger
d2 = pd.Series(d, index=train.index)

In [None]:
d2.value_counts()

In [None]:
train['Title'] = d2

In [None]:
train.head()

In [None]:
title_dist = train['Title'].value_counts(normalize=True).mul(100).rename_axis('Title').reset_index(name='percent')

In [None]:
title_dist

In [None]:
title = train.groupby('Title')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
title

In [None]:
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(20,6))
sns.set_style('darkgrid')
#sns.set_context('paper',font_scale=1)

plt.suptitle('Title feature distribution')
axs[0].set_title('Passengers per Title')
axs[1].set_title('Survival rate per Title')

sns.barplot(ax=axs[0],x='Title',y='percent',data=title_dist);
sns.barplot(ax=axs[1], x='Title',y='percent',data=title,hue='Survived');

Once each title is aligned with the right passenger, the survival rate is no longer balanced across titles: `Mr` has by far the lowest survival rate, while `Mrs` and `Miss` have the highest, in line with the `Sex` split seen earlier. We will later drop the name and keep only the title as an engineered feature.

In [None]:
train.head()

We will now have a look at the correlation between categorical features to improve our understanding of how they can 'explain' each other, and to help us decide which ones we should keep, engineer or drop.
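
The `theil_u=True` option used with `associations` below refers to Theil's U, the uncertainty coefficient: the fraction of the uncertainty in one feature that is removed by knowing another. A minimal sketch of the idea, written from the textbook definition $U(x|y) = (H(x) - H(x|y)) / H(x)$ and not taken from the `dython` source; the helper names and toy series are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Shannon entropy (in nats) of a categorical series."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def theils_u(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y (between 0 and 1)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    # Conditional entropy H(x|y): entropy of x inside each y group, weighted by group size
    weights = y.value_counts(normalize=True)
    h_x_given_y = sum(w * entropy(x[y == level]) for level, w in weights.items())
    return (h_x - h_x_given_y) / h_x

# Hypothetical toy data: one label fully determined by sex, one unrelated to it
sex = pd.Series(['male', 'male', 'female', 'female'])
survived_by_sex = pd.Series([0, 0, 1, 1])
survived_random = pd.Series([0, 1, 0, 1])

u_perfect = theils_u(survived_by_sex, sex)
u_none = theils_u(survived_random, sex)
print(u_perfect, u_none)
```

Unlike a plain correlation, this measure is asymmetric: `theils_u(x, y)` and `theils_u(y, x)` can differ, which is why the association plot is not symmetric around its diagonal.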

In [None]:
%pip install dython

In [None]:
from dython.nominal import associations

In [None]:
# sns.set_context('paper',font_scale=1.2)

cat_cols = ['Survived', 'Pclass', 'Sex', 'SibSp',
            'Parch', 'Cabin', 'Embarked']

fig1 = associations(train.drop(['Name','Ticket','Age','Fare','Family','Cabin','Title'],axis=1),
                    figsize=(14,7),
                    theil_u=True, nominal_columns=cat_cols,
                    #title='Fig. 1. Associations between features.',
                    mark_columns=True);

Let's have a quick look at the test data set for missing data only.

## <a id='Link-SecB-n'>Test dataset</a>

In [None]:
test = pd.read_csv('../input/titanic/test.csv')
ids_test = test['PassengerId'].values

In [None]:
test.head()

In [None]:
print(f'There are {test.shape[0]} rows in the data frame')
print(f'There are {test.shape[1]} columns in the data frame')

In [None]:
test.info()

### <a id='Link-SecB-o'>Checking for missing values</a>

In [None]:
test.isnull().values.any()

In [None]:
test.isnull().sum()

In [None]:
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

`Cabin`, `Fare` and `Age` have missing values. That will be rectified in the next section.

_____

# <a id='Link-SecC'>Feature engineering and preparation</a>

___

In [None]:
# Combining the train and test datasets into a single data frame
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
full_set = pd.concat([train,test],axis=0).reset_index(drop=True)

In [None]:
# Set 'PassengerId' as index for future split
full_set.set_index(keys='PassengerId',inplace=True)

In [None]:
full_set.tail()

Function to check missing values and return a data frame with a summary

In [None]:
def missing_value(df):
    number = df.isnull().sum().sort_values(ascending=False)
    number = number[number > 0]
    percentage = df.isnull().sum() *100 / df.shape[0]
    percentage = percentage[percentage > 0].sort_values(ascending=False)
    return  pd.concat([number,percentage],keys=["Total","Percentage"],axis=1)
missing_value(full_set.drop('Survived',axis=1))

### <a id='Link-SecC-a'>Cabin feature</a>

We decided to group the cabins into 'known' and 'unknown' classes:
- all unknown cabin data will be 0
- all known cabin data will be 1
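
The two rules above amount to a simple "is the value present?" flag. A minimal vectorized sketch on a made-up `cabin` series, shown as an alternative to the fill-then-map cells that follow:

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, for illustration only
cabin = pd.Series(['C85', np.nan, 'E46', np.nan])

# True where a cabin is recorded, False where it is missing, then cast to 1 / 0
cabin_known = cabin.notna().astype(int)

print(cabin_known.tolist())
```

This gives the same 0/1 encoding without first filling the missing values with a placeholder.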

In [None]:
# Assign the result back instead of using a chained in-place fillna
full_set['Cabin'] = full_set['Cabin'].fillna(0)

In [None]:
def cab_map(row):
    if row['Cabin'] != 0:
        return 1
    else:
        return row['Cabin']

In [None]:
full_set['Cabin'] = full_set.apply(lambda row: cab_map(row),axis=1)

In [None]:
full_set['Cabin'].head()

### <a id='Link-SecC-b'>Age feature</a>

From the EDA above we built an intuition that age seems to be well connected to the `Pclass` on board the Titanic, maybe as a result of the difference in ticket price across the classes. One could assume that younger people would have less money than older, better-established passengers and would therefore travel in third class instead. We decided to run another check to confirm this theory. See below:

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='Pclass',y='Age',data=full_set);


Above we can see the boxplot of `Age` per `Pclass`, and it does seem that the age distribution differs between the classes.

Let's have a look at a KDE plot to double-check the difference:

In [None]:
fig, axs = plt.subplots(figsize=(10,6))

sns.kdeplot(full_set['Age'][full_set['Pclass'] == 1],legend=True,label='Pclass 1 - mean 39 yrs',shade=True)
sns.kdeplot(full_set['Age'][full_set['Pclass'] == 2],legend=True,label='Pclass 2 - mean 29 yrs',shade=True)
sns.kdeplot(full_set['Age'][full_set['Pclass'] == 3],legend=True,label='Pclass 3 - mean 24 yrs',shade=True)
plt.legend();

In [None]:
full_set.groupby('Pclass')['Age'].describe().transpose()

In [None]:
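def impute_age(cols):
    # Access by label; positional indexing on a labelled Series is deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']

    # Fill a missing age with the rounded mean age of the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return 39
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age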

In [None]:
full_set['Age'] = full_set[['Age','Pclass']].apply(impute_age,axis=1)
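
The hard-coded values above can also be derived from the data itself. A minimal sketch on a made-up `toy` frame that fills each missing age with the median of the passenger's class; using the median instead of the mean is an assumption made here for illustration, not the choice made in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, np.nan, 38.0, 20.0, 28.0, np.nan],
})

# Median age of each class, broadcast back to every row of that class
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Only the missing entries are replaced; known ages are kept as they are
toy['Age'] = toy['Age'].fillna(class_median)

print(toy)
```

This keeps the imputation in sync with the data if the input file or the grouping ever changes.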

### <a id='Link-SecC-c'>Embarked feature </a>

Only 2 values are missing here. Let's check the passengers individually.

In [None]:
full_set[full_set['Embarked'].isnull()]

Filtering the full data set and counting all first-class passengers with a known cabin who paid 80 or more for the ticket:

In [None]:
sim_embarked = full_set[(full_set.Pclass == 1) & 
                            (full_set['Cabin'] == 1) &
                            (full_set['Fare'] >= 80)]
sim_embarked['Embarked'].value_counts()

Decided to impute these 2 individuals as 'C' 

In [None]:
full_set['Embarked'] = full_set['Embarked'].fillna('C')

### <a id='Link-SecC-d'>Fare feature </a>

There is only 1 individual whose `Fare` is unknown.

In [None]:
full_set[full_set['Fare'].isnull()]

Will look at individuals in third class that embarked in Southampton

In [None]:
sim_fare = full_set[(full_set['Pclass']==3)&(full_set['Embarked']=='S')]

In [None]:
sim_fare['Fare'].hist(bins=20);

Will use the mode value here

In [None]:
sim_fare['Fare'].mode()

In [None]:
full_set['Fare'] = full_set['Fare'].fillna(8.05)
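
Instead of typing the mode by hand, it can be read from the filtered subset directly. A minimal sketch on a made-up `toy` frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({
    'Pclass': [3, 3, 3, 3, 1],
    'Embarked': ['S', 'S', 'S', 'S', 'C'],
    'Fare': [8.05, 8.05, 7.25, np.nan, 80.0],
})

# Passengers comparable to the one with the missing fare
similar = toy[(toy['Pclass'] == 3) & (toy['Embarked'] == 'S')]

# mode() returns a Series because ties are possible, so take its first value
fare_mode = similar['Fare'].mode().iloc[0]
toy['Fare'] = toy['Fare'].fillna(fare_mode)

print(fare_mode)
```

Reading the value from the data avoids a mismatch between the number shown by `mode()` and the number typed into `fillna`.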

In [None]:
full_set.drop('Survived',axis=1).isnull().sum()

### <a id='Link-SecC-e'>Name feature - extracting title </a>  

In [None]:
full_set['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False).value_counts()

In [None]:
c = full_set['Name'].tolist()

In [None]:
import re

# Each raw title is mapped to one of four groups; unlisted titles (e.g. 'Jonkheer') fall back to 'Mr'
title_groups = {
    'Mr': 'Mr', 'Dr': 'Mr', 'Rev': 'Mr', 'Col': 'Mr',
    'Capt': 'Mr', 'Major': 'Mr', 'Don': 'Mr', 'Sir': 'Mr',
    'Master': 'Master',
    'Miss': 'Miss', 'Mlle': 'Miss', 'Mme': 'Miss', 'Ms': 'Miss',
    'Mrs': 'Mrs', 'Countess': 'Mrs', 'Lady': 'Mrs', 'Dona': 'Mrs',
}

def ttle(names):
    """Return one grouped title per name, in the same order as `names`."""
    titles = []
    for name in names:
        match = re.search(r' ([A-Za-z]+)\.', name)
        raw_title = match.group(1) if match else None
        titles.append(title_groups.get(raw_title, 'Mr'))
    return titles

d = ttle(c)

In [None]:
d2 = pd.Series(d)

In [None]:
d2.value_counts()

In [None]:
full_set.reset_index(inplace=True)

In [None]:
full_set['title'] = d2

In [None]:
full_set.set_index(keys='PassengerId',inplace=True)

In [None]:
full_set.tail()

### <a id='Link-SecC-f'>Parch and SibSp - The family </a>  

In [None]:
def sib_family(row):
    if row['SibSp'] > 0:
        return 1
    else:
        return 0

def par_family(row):
    if row['Parch'] > 0:
        return 1
    else:
        return 0
def family(row):
    if row['Family'] > 0:
        return 1
    else:
        return 0

In [None]:
full_set['SibSp'] = full_set.apply(lambda row: sib_family(row),axis=1)

In [None]:
full_set['Parch'] = full_set.apply(lambda row: par_family(row),axis=1)

In [None]:
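# Both flags must come from full_set: train has a different index and covers only part of the rows
full_set['Family'] = full_set['SibSp'] + full_set['Parch']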

In [None]:
full_set['Family'] = full_set.apply(lambda row: family(row),axis=1)
full_set['Family'].value_counts()

In [None]:
full_set.head()

### <a id='Link-SecC-g'>Tidying up</a>

In [None]:
full_set = full_set.drop(['Fare','Name','SibSp','Parch','Ticket'],axis=1)

In [None]:
full_set.head()

In [None]:
full_set.info()

In [None]:
full_set.dtypes

Breaking the `full_set` data frame back up into train and test data and exporting them to CSV to be used by the ML algorithms.

In [None]:
n = len(train)

In [None]:
train_data = full_set[:n]

In [None]:
test_data = full_set[n:]
test_data = test_data.drop(['Survived'],axis=1)
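
The positional split above relies on the train rows coming first in the combined frame. A sketch of an alternative that splits on whether the label is present, using a made-up `toy_full` frame (names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: the test rows have no 'Survived' label
toy_full = pd.DataFrame(
    {'Survived': [0.0, 1.0, np.nan, np.nan], 'Age': [22.0, 38.0, 34.5, 47.0]},
    index=pd.Index([1, 2, 3, 4], name='PassengerId'),
)

# Rows with a label belong to train, rows without one belong to test
is_train = toy_full['Survived'].notna()
toy_train = toy_full[is_train]
toy_test = toy_full[~is_train].drop(columns='Survived')

print(toy_train.shape, toy_test.shape)
```

This version does not depend on the row order, only on the fact that the test file carries no `Survived` column.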

In [None]:
test_data.shape

In [None]:
train_data.shape

In [None]:
train_data.to_csv('./train-processed.csv',index=True)

In [None]:
test_data.to_csv('./test-processed.csv',index=True)

___

# <a id='Link-SecD'>Machine Learning</a>

___

In [None]:
import pandas as pd

In [None]:
train_data = pd.read_csv('train-processed.csv',index_col='PassengerId')
test_data = pd.read_csv('test-processed.csv',index_col='PassengerId')

In [None]:
from sklearn.model_selection import train_test_split

X = train_data.drop(['Survived'],axis=1)  # just the features
y = train_data['Survived']                # just the labels

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3)
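
The split above is random, so the rows in `X_train` and `X_test` change on every run. A minimal sketch of a reproducible split that also keeps the class balance, on made-up data; `random_state=0` and `stratify` are assumptions added for illustration and are not used in the cell above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 12 negatives and 8 positives
toy_X = pd.DataFrame({'feature': range(20)})
toy_y = pd.Series([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify keeps both classes represented in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0)

print(len(X_tr), len(X_te))
print(y_tr.value_counts(normalize=True).round(2).to_dict())
```

With a fixed seed, scores from different runs of the notebook can be compared without the split itself adding noise.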

In [None]:
X_train

## <a id='Link-SecD-a'>Preprocessing</a>

In [None]:
# To display sklearn interactive diagrams:
from sklearn import set_config
set_config(display='diagram')

In [None]:
#Load required packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

In [None]:
# Creating lists of numerical and categorical columns based on their dtype
from sklearn.compose import make_column_selector as selector

num_cols = selector(dtype_exclude=object)
cat_cols = selector(dtype_include=object)

num_cols = num_cols(X)
cat_cols = cat_cols(X)

In [None]:
# Categorical and numerical column transformation pipelines
# Encoding categorical columns with scikit-learn's OneHotEncoder
# Scaling the numerical columns with StandardScaler

cat_transformer_onehot = Pipeline(steps=[('onehot_transf',OneHotEncoder(handle_unknown='ignore'))])
num_transformer = Pipeline(steps=[('scaler', StandardScaler())])

#Applying the ColumnTransformer preprocessing for numerical and categorical data
preprocessor = ColumnTransformer([('categoricals', cat_transformer_onehot, cat_cols),
                                  ('numericals', num_transformer, num_cols)],
                                 remainder = 'passthrough')
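
To see what such a `ColumnTransformer` produces, here is a small sketch on a made-up two-column frame (the `toy` data and the `toy_preprocessor` name are assumptions for illustration): the categorical column becomes one indicator column per category, and the numerical column is centred and scaled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

toy_preprocessor = ColumnTransformer([
    ('categoricals', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
    ('numericals', StandardScaler(), ['Age']),
])

# Output columns follow the transformer order: two one-hot columns for 'Sex', then scaled 'Age'
transformed = toy_preprocessor.fit_transform(toy)

print(transformed.shape)
```

Here `handle_unknown='ignore'` means a category that was not seen during fitting is encoded as all zeros instead of raising an error at prediction time.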

In [None]:
num_transformer

In [None]:
cat_transformer_onehot

In [None]:
preprocessor

## <a id='Link-SecD-b'>Model Training</a>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

rfc_model = RandomForestClassifier()

#Bundle preprocessing and modeling code in a pipeline
my_pipeline_RFC = Pipeline(steps=[('preprocessor', preprocessor), 
                                  ('rfc_model', rfc_model)])

#Hyperparameter tuning implementation
param_grid_RFC = {
    'rfc_model__criterion': ['gini','entropy'],
    'rfc_model__n_estimators': [100, 235, 300], 
    'rfc_model__max_depth': [10, 30,50, 100], 
    'rfc_model__min_samples_split': [5, 15, 25],
    'rfc_model__random_state': [10],
    
}

searchCV_RFC = RandomizedSearchCV(my_pipeline_RFC,
                                  param_distributions=param_grid_RFC,
                                  cv=5, scoring='accuracy',n_jobs=-1)


#searchCV_RFC = GridSearchCV(my_pipeline_RFC, 
#                            param_grid=param_grid_RFC,
#                            cv=5, scoring='accuracy',n_jobs=-1)

final = searchCV_RFC.fit(X_train, y_train)

print('Best parameters for the Random Forest Classifier: \n',
      searchCV_RFC.best_params_) 
print('Best accuracy score for the Random Forest Classifier: ',
      searchCV_RFC.best_score_)
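
`RandomizedSearchCV` does not try every combination in `param_grid_RFC`: with its default `n_iter=10` it samples 10 of them. A small sketch that counts how many combinations the grid holds, using scikit-learn's `ParameterGrid` (the `toy_grid` name is only for illustration, and `random_state` is left out because it has a single value):

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in param_grid_RFC, without the pipeline step prefix
toy_grid = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [100, 235, 300],
    'max_depth': [10, 30, 50, 100],
    'min_samples_split': [5, 15, 25],
}

n_combinations = len(ParameterGrid(toy_grid))
n_sampled = 10  # default n_iter of RandomizedSearchCV

print(f"{n_sampled} of {n_combinations} combinations are tried by the randomized search")
```

Since only a random subset is tried and no `random_state` is passed to the search, the reported best parameters can differ from one run to the next.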

## <a id='Link-SecD-h'>Model Evaluation </a>

In [None]:
best_model_RFC = RandomForestClassifier(criterion = 'entropy',
                                        max_depth=30,
                                        min_samples_split= 25,
                                        n_estimators= 300,
                                        random_state= 10,
                                        )


my_pipeline_best_RFC = Pipeline(steps=[('preprocessor', preprocessor), 
                                       ('best_model_RFC', best_model_RFC)])

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)

cv_results = cross_validate(my_pipeline_best_RFC,
                            X_test,
                            y_test,
                            cv=cv,
                            scoring="accuracy",
                            return_train_score=True)

In [None]:
cv_results = pd.DataFrame(cv_results)#[['test_score','train_score']])
cv_results.head()

In [None]:
import matplotlib.pyplot as plt

cv_results[['test_score','train_score']].plot.hist(bins=15, edgecolor="white", density=True,alpha=0.5)
plt.xlabel("Accuracy")
_ = plt.title("Test score distribution")

In [None]:
print(f"Classifier accuracy on the test dataset was: {cv_results['test_score'].mean():.2f} +/- {cv_results['test_score'].std():.2f}")
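
`ShuffleSplit` draws a fresh random train/test partition for every split, so the 100 scores above come from 100 different partitions of the data passed to `cross_validate`. A minimal sketch of that mechanism on 20 made-up rows (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical data: 20 rows with a single feature
toy_rows = np.arange(20).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

sizes = []
for train_idx, test_idx in splitter.split(toy_rows):
    # Within one split, no row appears in both parts
    assert set(train_idx).isdisjoint(test_idx)
    sizes.append((len(train_idx), len(test_idx)))

print(sizes)
```

Different splits may reuse the same rows, so the scores are not independent; their spread is still a useful indication of how much the accuracy depends on the particular partition.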

## <a id='Link-SecD-i'>Kaggle Submission </a>

In [None]:
my_pipeline_final = searchCV_RFC.best_estimator_
predictions = my_pipeline_final.predict(test_data)

In [None]:
final_data = predictions.astype(int)

In [None]:
#Generate output
output = pd.DataFrame({'PassengerId': test_data.index, 
                       'Survived': final_data})
output.to_csv('submission_v12-rfc.csv', index=False)
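
Before uploading, it can be worth reading the file back and checking its layout. A minimal sketch on a made-up three-row frame written to an in-memory buffer (the `toy_output` values are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical predictions, for illustration only
toy_output = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Expected layout: exactly these two columns, integer 0/1 predictions, no missing values
columns_ok = list(check.columns) == ['PassengerId', 'Survived']
values_ok = check['Survived'].isin([0, 1]).all() and not check.isnull().values.any()

print(columns_ok, values_ok, len(check))
```

The same checks can be run on the real file by replacing the buffer with `pd.read_csv('submission_v12-rfc.csv')`.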
print("File saved!")