<a href="https://colab.research.google.com/github/slyofzero/Kaggle-Notebooks/blob/main/Titanic%20Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents

* [Loading the data](#loading-data)
* [Data Cleaning](#data-cleaning)
* [Exploratory Data Analysis](#eda)
    * [Univariate Analysis](#uni)    
        1. [Target Variable (Survived column)](#target)
        2. [Continuous Features (Age and Fare columns)](#cont)
        3. [Discrete Features (Pclass, Sex, and Embarked columns)](#dis)
        4. [Number of relatives (SibSp and Parch columns)](#relations)
    * [Multivariate Analysis](#multi)
        1. [Correlation Heatmap](#correlation)
        2. [Distribution based on Survival (Age and Fare)](#age-and-fare)
            - [Analysis on Age](#age-analysis)
            - [Analysis on Fare](#fare-analysis)
        3. [Distribution based on Survival (Pclass, Sex, and Embarked)](#pclass-sex-embarked)
            - [Analysis on Pclass](#pclass)
            - [Analysis on Sex](#sex)
            - [Analysis on Embarked](#embarked)
        4. [Distribution based on Survival (SibSp and Parch)](#sibsp-parch)
            - [Analysis on SibSp](#sibsp)
            - [Analysis on Parch](#parch)
        5. [EDA Outcomes](#eda-outcomes)
* [Machine Learning models](#ml)
    * [Encoding the string values](#encode)
    * [Splitting the data](#split)
    * [Baseline Models](#baseline)
        1. [Decision Tree Classifier](#dtree1)
        2. [Random Forest Classifier](#rf1)
    * [SMOTE Oversampling](#smote)
        1. [Decision Tree Classifier](#dtree2)
        2. [Random Forest Classifier](#rf2)
    * [Hyperparameter Tuning](#hyperparameter)
        1. [RFE](#rfe)
        2. [RandomizedSearchCV](#randomcv)
* [Final Prediction](#final-pred)

# Loading the data <a id="loading-data"></a>

In [1]:
# Importing all the necessary modules.
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")



In [2]:
# Loading the files.
train_df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


---

# Data Cleaning <a id = "data-cleaning"></a>

Now that we have our data loaded, let's check the types of columns we have here and which of them have null values.

In [3]:
# Checking the datatype in each column.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
# Checking for null values.
round((train_df.isna().sum() / train_df.shape[0]) * 100, 2)

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64

From the above output we can see that the `Age` column has 19.87% null values, the `Cabin` column has 77.1% null values and `Embarked` column has 0.22% of null values.

Because the `Cabin` column has more than 30% null values, we can drop it. For the `Age` column we'll replace the null values with the median, and for the `Embarked` column we'll replace them with the mode (mode rather than median because the column is categorical). We'll also drop `PassengerId`, `Name`, and `Ticket` in the same step, as they aren't used in the analysis below.

In [5]:
# Handling null values.
train_df = train_df.drop(columns = ["PassengerId", "Name", "Cabin", "Ticket"])
train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())
train_df["Embarked"] = train_df["Embarked"].fillna(train_df["Embarked"].mode()[0])

In [6]:
# Checking if the data got cleaned.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
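
The null-handling rule used above (drop a column if too many values are missing, otherwise fill with the median or the mode) can also be written as a small loop. This is only a sketch on a made-up toy frame; the `toy` data and the `NULL_THRESHOLD` name are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for three Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"] + [np.nan] * 6,
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", None, "S", "C"],
})

NULL_THRESHOLD = 30  # percent; the cutoff mentioned in the text

# Percentage of null values per column.
null_pct = toy.isna().mean() * 100

# Drop every column whose null percentage exceeds the threshold.
to_drop = null_pct[null_pct > NULL_THRESHOLD].index.tolist()
cleaned = toy.drop(columns=to_drop)

# Fill what is left: median for numeric columns, mode for the rest.
for column in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[column]):
        cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    else:
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode()[0])

print(to_drop)
print(cleaned.isna().sum())
```

Assigning the result of `fillna` back to the column avoids relying on `inplace = True`, which newer pandas versions discourage on a selected column.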


---

# Exploratory Data Analysis (EDA) <a id="eda"></a>

## Univariate Analysis <a id="uni"></a>

### 1) Target Variable (Survived column) <a id="target"></a>

From the above output we can see that our target column, `Survived`, has categorical data in it. To check its distribution we can plot the counts of the values in it on a bar graph.

In [7]:
# Plotting the counts of the values in the Survived column as a bar graph.
bgcolor = "#0a0a33"
color_sequence = ["#3562e8", "#eb3462", "#a7c8d1", "#ccc504"]
font_color = "white"
border_color = "black"
labels = ["Didn't Survive", "Survived"]

survivor = px.histogram(
    data_frame = train_df, x = "Survived", color = "Survived", color_discrete_sequence = color_sequence,
    title = "Survivor Data"
    )

survivor.update_layout(bargap = 0.6)

survivor.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
survivor.update_layout(
    height = 650, plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, 
    bargap = 0.6, xaxis = dict(tickmode = "array", tickvals = [0, 1], ticktext = labels))

survivor.show()

From the graph above, we can see that most of the people on the ship did not survive: 549 passengers died while 342 survived. Let's look at the factors that may have contributed to a person's chance of survival.
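
To put those two counts on a common scale, we can turn them into shares. The sketch below rebuilds the target column from the counts quoted above instead of reloading the CSV, so the `survived` series is a stand-in, not the real column.

```python
import pandas as pd

# Rebuild the target from the counts quoted above
# (549 non-survivors, 342 survivors).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns proportions instead of raw counts.
shares = survived.value_counts(normalize=True).sort_index()
print((shares * 100).round(2))
```

Roughly 38% of the passengers in the training data survived, so the target is imbalanced but not extremely so.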

### 2) Continuous Features (Age and Fare columns) <a id="cont"></a>

From the dataframe information we got earlier we can infer that `Age` and `Fare` are continuous variables. So let's plot histograms to check their distributions.

In [8]:
# Plotting histograms for Age and Fare.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"], shared_yaxes = True)

# Age Distribution.
ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence)
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(ages.data, rows = 1, cols = 1)

# Fare Distribution.
fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence)
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(fares.data, rows = 1, cols = 2)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

The `Age Distribution` looks roughly normal, but the `Fare Distribution` has a few outliers around 500 that stretch the x-axis and squash the rest of the graph. Let's remove any `Fare` values above 267.5 and plot the graph again.

In [9]:
# Removing unwanted values.
train_df = train_df[train_df["Fare"] <= 267.5]

# Making a subplot.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"], shared_yaxes = True)

# Age Distribution
ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence)
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

# Fare Distribution
fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence)
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fig.add_traces(ages.data, rows = 1, cols = 1)
fig.add_traces(fares.data, rows = 1, cols = 2)
fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

From the graphs above we can infer that most of the people onboard were about 30 years old and paid a low ticket fare. Keep in mind that the peak in `Age` is partly a result of filling the missing ages with the median.
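
Before dropping rows with a hard cutoff like 267.5, it can help to count how many rows the filter actually removes. The sketch below does this on a handful of made-up fares; the values in `fares` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up fares; the two large entries mimic the extreme values in the plot.
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05, 263.00, 512.33, 512.33])

CUTOFF = 267.5  # same threshold as in the cell above

# Boolean mask of the rows that survive the filter.
keep_mask = fares <= CUTOFF

print(f"Rows removed: {(~keep_mask).sum()} of {len(fares)}")
print(f"Max fare before: {fares.max()}, after: {fares[keep_mask].max()}")
```

If the filter removed a large share of the rows, a log-scaled axis might be a gentler option than deleting data.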

### 3) Discrete Features (Pclass, Sex, and Embarked columns) <a id = "dis"></a>

Now let's check the distribution of `Pclass`, `Sex`, and `Embarked`.

In [10]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 3, subplot_titles = [f"Distribution for {column}" for column in ["Pclass", "Sex", "Embarked"]], shared_yaxes = True)

# Plotting graph for Pclass
# rename_axis keeps the "index" column name stable across pandas versions.
pcounts = train_df["Pclass"].value_counts().rename_axis("index").reset_index(name = "P_counts")
pcounts["index"] = pcounts["index"].astype(str)

bar_obj1 = go.Bar(
    x = pcounts["index"], y = pcounts["P_counts"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = pcounts["P_counts"], textposition = "outside")
fig.add_traces(bar_obj1, rows = 1, cols = 1)

# Plotting graph for Sex
sex_counts = train_df["Sex"].value_counts().rename_axis("index").reset_index(name = "Sex")
sex_counts["index"] = sex_counts["index"].astype(str)

bar_obj2 = go.Bar(
    x = sex_counts["index"], y = sex_counts["Sex"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = sex_counts["Sex"], textposition = "outside")

fig.add_traces(bar_obj2, rows = 1, cols = 2)

# Plotting graph for Embarked
embarked_counts = train_df["Embarked"].value_counts().rename_axis("index").reset_index(name = "Embarked")
embarked_counts["index"] = embarked_counts["index"].astype(str)

bar_obj3 = go.Bar(
    x = embarked_counts["index"], y = embarked_counts["Embarked"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = embarked_counts["Embarked"], textposition = "outside")
fig.add_traces(bar_obj3, rows = 1, cols = 3)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Discrete Variables", showlegend = False)
fig.update_layout({f"yaxis{i}": dict(range=[0,750]) for i in range(1, 4)})
fig.show()

From the above graphs we can infer that the ship mostly had passengers who -
1. Had a 3$^{rd}$ class ticket.
2. Were male.
3. Were from Southampton (Embarked = "S").

### 4) Number of relatives (SibSp and Parch columns) <a id = "relations"></a>

Now let's check the distribution of the number of relatives each passenger had onboard.

In [11]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{column} Distribution" for column in ["SibSp", "Parch"]], shared_yaxes = True, x_title = "Number of Relations")

# SibSp distribution.
sibsp = go.Histogram(x = train_df["SibSp"], marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), hovertext = train_df["SibSp"].value_counts())
fig.add_traces(sibsp, rows = 1, cols = 1)

# Parch distribution.
parch = go.Histogram(x = train_df["Parch"], marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), hovertext = train_df["Parch"].value_counts())
fig.add_traces(parch, rows = 1, cols = 2)

fig.update_layout(plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Relations Distribution")
fig.show()

From the above graphs we can infer that most of the people onboard didn't have a relative on the ship.

---

## Multivariate Analysis <a id = "multi"></a>

Now that we know the distribution of each variable, let's check which of these factors affected a person's survival.

### 1) Correlation Heatmap <a id = "correlation"></a>

Let's plot the correlation heatmap for all the columns of the dataframe.

In [12]:
# Encoding the values in Sex and Embarked.
from sklearn.preprocessing import LabelEncoder

temp_df = train_df.copy()
le = LabelEncoder()

def encode_label(column):
  temp_df[column] = le.fit_transform(temp_df[column])
  return {column: dict(enumerate(le.classes_))}

encoding_key = list(map(encode_label, ["Sex", "Embarked"]))
print(f"Encoding key -\n\t{encoding_key}\n{'-' * 100}")

print("Encoded Table - ")
temp_df.head()

Encoding key -
	[{'Sex': {0: 'female', 1: 'male'}}, {'Embarked': {0: 'C', 1: 'Q', 2: 'S'}}]
----------------------------------------------------------------------------------------------------
Encoded Table - 


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2


In [13]:
# Correlation matrix for train_df.
mask = np.triu(np.ones_like(temp_df.corr(), dtype = bool))
corr = round(temp_df.corr(), 2).mask(mask)

corr_matrix = px.imshow(corr, text_auto = True)
corr_matrix.update_layout(height = 700, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, xaxis_showgrid = False, yaxis_showgrid = False)
corr_matrix.show()

From the above graph we can see that `Pclass` and `Fare` have the strongest correlation of all the column pairs. A correlation coefficient of -0.6 means that the higher a passenger's `Pclass` value, the lower the fare they paid. In other words, passengers from higher economic backgrounds (Pclass = 1) paid the highest fares, while passengers from lower economic backgrounds (Pclass = 3) paid the lowest.

\

The correlation coefficient between `Sex` and `Survived` is -0.55, suggesting that passengers with the higher encoded value of `Sex` (Sex = 1, i.e. male) had a much lower survival rate.

\

The correlation coefficient of 0.26 between `Fare` and `Survived` columns suggests that passengers who paid higher fares had higher survival rates.

\

The correlation coefficient of -0.33 between `Pclass` and `Survived` columns suggests that passengers with a higher Pclass value (lower economic background) had lower chances of survival.

\

The correlation coefficient of 0.42 between the `SibSp` and `Parch` columns suggests that passengers who had siblings or a spouse onboard were also more likely to have a parent or child onboard, and vice versa.

\

The correlation coefficient of -0.34 between `Age` and `Pclass` suggests that passengers from a higher economic background tended to be older, while passengers from a lower economic background tended to be younger.
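
To make the sign of these coefficients concrete, here is a small sketch that computes Pearson's r by hand and checks it against `np.corrcoef`. The `pclass` and `fare` arrays are made-up values chosen so that fares shrink as the class number grows; they are not taken from the dataset.

```python
import numpy as np

# Made-up (Pclass, Fare) pairs: fares shrink as the class number grows.
pclass = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
fare = np.array([80.0, 55.0, 70.0, 26.0, 13.0, 21.0, 8.0, 7.5, 9.0])

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
x_centered = pclass - pclass.mean()
y_centered = fare - fare.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 matrix; [0, 1] is the off-diagonal entry.
r_numpy = np.corrcoef(pclass, fare)[0, 1]

print(f"manual r = {r_manual:.3f}, numpy r = {r_numpy:.3f}")
```

A negative r here only says that the two columns move in opposite directions on average. It says nothing about causation, and for label-encoded columns such as `Embarked` the value depends on the arbitrary order of the codes.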

---

### 2) Distribution based on Survival (Age and Fare) <a id = "age-and-fare"></a>

Let's first check the difference in distributions of `Age` and `Fare` columns for people who survived and for people who didn't.

In [14]:
# Making a subplot.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{name} Distribution" for name in ["Age", "Fare"]])

# Plotting the KDE plot for Age.
age_data = [train_df.groupby("Survived")["Age"].get_group(group) for group in train_df["Survived"].unique()]
age_kde = ff.create_distplot(
    hist_data = age_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["lime", "orange"]
    )
fig.add_traces(age_kde.data, rows = 1, cols = 1)

# Plotting the KDE plot for Fare.
fare_data = [train_df.groupby("Survived")["Fare"].get_group(group) for group in train_df["Survived"].unique()]
fare_kde = ff.create_distplot(
    hist_data = fare_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["red", "turquoise"]
    )
fig.add_traces(fare_kde.data, rows = 1, cols = 2)

fig.update_layout(showlegend = False, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution based on Survival")
fig.update_layout({f"xaxis{i}": dict(showgrid = False) for i in range(1,3)})
fig.show()

#### **i) Analysis on Age** <a id = "age-analysis"></a>
From the above graph we can see that the trends in `Age` are relatively the same for both groups.

#### **ii) Analysis on Fare** <a id = "fare-analysis"></a>

The trend in `Fare`, however, tells us that people who paid a higher price for their ticket had a higher chance of survival. This makes sense, as people who paid more may have been given safer cabins and might have been given priority during the evacuation because of their societal status (the strong correlation between `Pclass` and `Fare` supports this argument).

This is further supported by the positive correlation (0.26) between the `Fare` and `Survived` columns.

This also suggests that the people who paid higher ticket fares might belong to a higher economic class. Meaning, if we see that the people belonging to a higher economic class had a higher survival rate, our hypothesis of **"priority during evacuation"** would have more support.

---

### 3) Distribution based on Survival (Pclass, Sex, and Embarked) <a id = "pclass-sex-embarked"></a>

Now let's check whether a passenger's `Pclass`, `Sex`, or `Embarked` value could have affected their survival rate.

In [15]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 3, subplot_titles = [f"Distribution for {column}" for column in ["Pclass", "Sex", "Embarked"]], shared_yaxes = True)

# Grouped distribution for Pclass
survivor_by_pclass = px.histogram(data_frame = train_df, x = "Pclass", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_pclass.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_pclass.data, rows = 1, cols = 1)

# Grouped distribution for Sex
survivor_by_sex = px.histogram(data_frame = train_df, x = "Sex", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_sex.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_sex.data, rows = 1, cols = 2)

# Grouped distribution for Embarked
survivor_by_embarked = px.histogram(data_frame = train_df, x = "Embarked", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_embarked.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_embarked.data, rows = 1, cols = 3)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Variables grouped by survivors")
fig.update_layout({f"yaxis{i}":dict(range = [0,510]) for i in range(1, 4)})
fig.show()

From the above graph we can see that most of the passengers who weren't able to survive were -

1. From a lower economic background.
2. Male.
3. Passengers who boarded the ship at Southampton (Embarked = "S").

#### **i) Analysis on Pclass** <a id = "pclass"></a>
Most of the people from a lower economic background (Pclass = 3) were not able to survive the disaster, while most of the people from a higher economic background (Pclass = 1) were. This is also supported by the negative correlation (-0.33) between `Pclass` and `Survived`.

The above statement also provides support for our hypothesis of **priority during evacuation**. The reason behind such a trend might be the influence these people had in society. It is quite possible that people from a higher economic background were big business owners, politicians, or government officials.

#### **ii) Analysis on Sex** <a id = "sex"></a>

Most of the survivors were women (Sex = female), while most of the non-survivors were men (Sex = male). This is also reflected in the correlation coefficient between `Sex` and `Survived`. It might be because of the unofficial societal rule of "women and children first" during a disaster.

#### **iii) Analysis on Embarked** <a id = "embarked"></a>

Most of the people who boarded the ship at Southampton (Embarked = "S") were not able to survive the disaster, while most of the people who boarded at Cherbourg (Embarked = "C") were. This might be because most of the passengers from Southampton were either male or from a lower economic background.

The above statement is supported, though only weakly, by the correlation matrix: the `Embarked` column has correlation coefficients of 0.15 and 0.11 with `Pclass` and `Sex` respectively. This suggests that higher values of `Embarked` (Embarked = 2 means Southampton in the encoding key above) tend to go with higher values of `Pclass` (Pclass = 3 means lower economic class) and `Sex` (Sex = 1 means male).
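
The grouped bar charts above show raw counts, which mix two things: how large a group is and how likely its members were to survive. A per-group survival rate separates the two. The sketch below uses a tiny made-up frame (`toy` is an assumption, not the real data) to show the idea.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the notebook.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Pclass": [3, 3, 1, 2, 1, 2, 3, 1],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_pclass = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_pclass)
```

Running the same `groupby` on `train_df` with `"Embarked"` would show whether Southampton's high number of non-survivors reflects a lower survival rate or simply the fact that most passengers boarded there.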

---

### 4) Distribution based on Survival (SibSp and Parch) <a id = "sibsp-parch"></a>

Now let's check whether a passenger's `SibSp` or `Parch` value could have affected their survival rate.

In [16]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{column} Distribution" for column in ["SibSp", "Parch"]], shared_yaxes = True, x_title = "Number of Relations")

# Plotting the KDE plot for SibSp.
sibsp_data = [train_df.groupby("Survived")["SibSp"].get_group(group) for group in train_df["Survived"].unique()]
sibsp_kde = ff.create_distplot(
    hist_data = sibsp_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["lime", "red"]
    )
fig.add_traces(sibsp_kde.data, rows = 1, cols = 1)

# Plotting the KDE plot for Parch.
parch_data = [train_df.groupby("Survived")["Parch"].get_group(group) for group in train_df["Survived"].unique()]
parch_kde = ff.create_distplot(
    hist_data = parch_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["orange", "turquoise"]
    )
fig.add_traces(parch_kde.data, rows = 1, cols = 2)

fig.update_layout(plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Relations Distribution")
fig.update_layout({f"xaxis{i}": dict(showgrid = False) for i in range(1,3)})
fig.show()

#### **i) Analysis on SibSp** <a id = "sibsp"></a>

From the above graph we can see that passengers who had 1 sibling or spouse onboard had higher survival rates. This might be because they helped each other out, or it might just be a coincidence.

#### **ii) Analysis on Parch** <a id = "parch"></a>

From the above graph we can see that the trends in `Parch` value were relatively the same for both the groups.

---

## EDA outcomes <a id = "eda-outcomes"></a>

From the EDA we did above, it seems like `Pclass`, `Fare`, and `Sex` would be the most significant features in model building.

---

# Machine Learning models <a id = "ml"></a>

As this is a classification problem, we'll be using **Decision Trees** and **Random Forests** to predict a passenger's survival.

## Encoding the string values <a id = "encode"></a>

In our dataframe, there are two columns with string values: `Sex` and `Embarked`. Neither is an ordinal variable, so encoding the values as 0, 1, 2... wouldn't make much sense. So let's do **one-hot encoding** on them.

In [17]:
# Encoding the string columns.
str_df = train_df.select_dtypes(include = object)
train_df.drop(columns = str_df.columns, inplace = True)
encoded_df = pd.get_dummies(data = str_df, drop_first = True)
train_df = pd.concat([train_df, encoded_df], axis = 1)
train_df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,0,1
3,1,1,35.0,1,0,53.1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,1
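
To see what `drop_first = True` does in the cell above, here is a minimal sketch on a made-up `Embarked` column. The `toy` frame is an assumption for illustration; `dtype=int` is passed only so that the output shows 0/1 instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One column per category.
full = pd.get_dummies(toy, dtype=int)

# The first category (alphabetically "C") is dropped; a row of all
# zeros in the remaining columns now stands for that category.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column loses no information, because the dropped category is implied whenever all the remaining dummy columns are 0. That is why the table above has `Embarked_Q` and `Embarked_S` but no `Embarked_C`.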


## Splitting the data <a id = "split"></a>

Now that the string values have been encoded, let's split our data into training and testing sets. Even though the two files are named `train.csv` and `test.csv`, we can't use `test.csv` to evaluate our model's performance as it doesn't have the `Survived` column in it.

In [18]:
# Splitting the data.
from sklearn.model_selection import train_test_split

features_df = train_df.drop(columns = ["Survived"])
target_df = train_df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(features_df, target_df, test_size = 0.3, random_state = 42)
X_train

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
486,1,35.0,1,0,90.0000,0,0,1
293,3,24.0,0,0,8.8500,0,0,1
172,3,1.0,1,1,11.1333,0,0,1
450,2,36.0,1,2,27.7500,1,0,1
361,2,29.0,1,0,27.7208,1,0,0
...,...,...,...,...,...,...,...,...
106,3,21.0,0,0,7.6500,0,0,1
271,3,25.0,0,0,0.0000,1,0,1
863,3,28.0,8,2,69.5500,0,0,1
436,3,21.0,2,2,34.3750,0,0,1


The original `train_df` had 888 rows after data cleaning. After the 70/30 split, the training data `X_train` has 621 rows and the remaining 267 rows are held out for testing. This might cause some issues, as our predictions could be more accurate if we had more training data.
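
One common way to get more out of a small dataset is k-fold cross-validation, where every row is used for both training and testing across different folds. This is not what the notebook does; the sketch below only illustrates the idea on synthetic data generated with `make_classification`, and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as our features (888 rows, 8 columns).
X, y = make_classification(n_samples=888, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold CV: each row is in the test fold once and in the training folds 4 times.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

The spread of the fold accuracies also gives a rough idea of how much a single train/test split can vary by chance.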

---

## Baseline models <a id = "baseline"></a>

Now that our data has been split into train and test, let's make a baseline model using it. We'll try both Decision Trees and Random Forest for this.

### 1) Decision Tree Classifier <a id = "dtree1"></a>

In [19]:
# Importing the algorithms.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [20]:
# Making a baseline Decision Tree model.
dtree_clf = DecisionTreeClassifier()
dtree_clf.fit(X_train, y_train)
print(f"Training data accuracy = {dtree_clf.score(X_train, y_train) * 100:.2f}%")
print(f"Testing data accuracy = {dtree_clf.score(X_test, y_test)* 100:.2f}%")

Training data accuracy = 98.23%
Testing data accuracy = 77.90%


As we can see above, there is a huge gap between the model's accuracy on the training data and on the testing data. This is a common characteristic of the Decision Tree algorithm: without a depth limit or pruning, it tends to overfit. Let's get a better understanding of the predictions using `confusion_matrix` and `classification_report`.

In [21]:
# Importing the classification_report and confusion_matrix functions.
from sklearn.metrics import classification_report, confusion_matrix

In [22]:
# Printing the confusion matrix and the classification report.
dtree_train_pred = dtree_clf.predict(X_train)
dtree_test_pred = dtree_clf.predict(X_test)

print(f"For Train Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_train, dtree_train_pred))
print(f"\nClassification Report -")
print(classification_report(y_train, dtree_train_pred))

print("-" * 100)

print(f"For Test Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_test, dtree_test_pred))
print(f"\nClassification Report -")
print(classification_report(y_test, dtree_test_pred))

For Train Predictions -

Confusion Matrix -
[[388   1]
 [ 10 222]]

Classification Report -
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       389
           1       1.00      0.96      0.98       232

    accuracy                           0.98       621
   macro avg       0.99      0.98      0.98       621
weighted avg       0.98      0.98      0.98       621

----------------------------------------------------------------------------------------------------
For Test Predictions -

Confusion Matrix -
[[137  23]
 [ 36  71]]

Classification Report -
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       160
           1       0.76      0.66      0.71       107

    accuracy                           0.78       267
   macro avg       0.77      0.76      0.76       267
weighted avg       0.78      0.78      0.78       267
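
The numbers in the test report above can be reproduced by hand from the test confusion matrix. The sketch below copies that matrix from the output and recomputes precision, recall, and accuracy for class 1; nothing else is assumed.

```python
import numpy as np

# Test-set confusion matrix copied from the output above:
# rows = actual class, columns = predicted class.
cm = np.array([[137, 23],
               [36, 71]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)      # of those who survived, how many were found
accuracy = (tp + tn) / cm.sum()

print(f"precision(1) = {precision_1:.2f}")
print(f"recall(1)    = {recall_1:.2f}")
print(f"accuracy     = {accuracy:.4f}")
```

This gives 0.76, 0.66, and 0.7790, matching the classification report and the 77.90% test accuracy printed earlier.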



From the classification reports for the train and test data, we can see that the precision and recall scores are very good for the training data but average for the test data. This shows that our model is overfitting the data heavily.
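
One way to see where the overfitting comes from is to compare an unrestricted tree with a depth-limited one. The sketch below does this on synthetic data with some label noise; the dataset, the `flip_y` value, and `max_depth=3` are all assumptions chosen for illustration, so the numbers will differ from the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so that memorising the
# training set does not carry over to unseen rows.
X, y = make_classification(n_samples=888, n_features=8, flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, clf in [("unrestricted", deep), ("max_depth=3", shallow)]:
    gap = clf.score(X_tr, y_tr) - clf.score(X_te, y_te)
    print(f"{name}: depth={clf.get_depth()}, train-test gap={gap:.3f}")
```

The unrestricted tree keeps splitting until it fits the training rows perfectly, which is the same pattern we saw with the 98% training accuracy above.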

Let's try Random Forest to tackle this issue.

### 2) Random Forest Classifier <a id = "rf1"></a>

In [23]:
# Making a model using Random Forest Classifier.
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
print(f"Training data accuracy = {rf_clf.score(X_train, y_train) * 100:.2f}%")
print(f"Testing data accuracy = {rf_clf.score(X_test, y_test)* 100:.2f}%")

Training data accuracy = 98.23%
Testing data accuracy = 80.52%


In [24]:
# Printing the confusion matrix and the classification report.
rf_train_pred = rf_clf.predict(X_train)
rf_test_pred = rf_clf.predict(X_test)

print(f"For Train Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_train, rf_train_pred))
print(f"\nClassification Report -")
print(classification_report(y_train, rf_train_pred))

print("-" * 100)

print(f"For Test Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_test, rf_test_pred))
print(f"\nClassification Report -")
print(classification_report(y_test, rf_test_pred))

For Train Predictions -

Confusion Matrix -
[[385   4]
 [  7 225]]

Classification Report -
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       389
           1       0.98      0.97      0.98       232

    accuracy                           0.98       621
   macro avg       0.98      0.98      0.98       621
weighted avg       0.98      0.98      0.98       621

----------------------------------------------------------------------------------------------------
For Test Predictions -

Confusion Matrix -
[[139  21]
 [ 31  76]]

Classification Report -
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       160
           1       0.78      0.71      0.75       107

    accuracy                           0.81       267
   macro avg       0.80      0.79      0.79       267
weighted avg       0.80      0.81      0.80       267



Even though we have a better accuracy score for the test data, the overfitting hasn't been resolved. From the classification report we can see that our model predicts class 0 better than class 1 (test recall of 0.87 versus 0.71). That might be because of the target imbalance. Let's try and fix it.

---

## SMOTE Oversampling <a id = "smote"></a>

In [25]:
# Applying SMOTE.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state = 42)
smote_X_train, smote_y_train = smote.fit_resample(X_train, y_train)
smote_y_train.value_counts()

1    389
0    389
Name: Survived, dtype: int64

Now that we have our data balanced, let's train our models on it.
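
For intuition, SMOTE creates each new minority sample by interpolating between an existing minority sample and one of its minority-class nearest neighbours. The sketch below shows only that interpolation step with NumPy on two made-up points; it is a simplification and not the `imblearn` implementation, which also handles the neighbour search and the number of samples to generate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two made-up minority-class points in (Age, Fare) space.
sample = np.array([30.0, 50.0])
neighbour = np.array([40.0, 80.0])

# Pick a random point on the line segment between the two points.
lam = rng.random()  # uniform in [0, 1)
synthetic = sample + lam * (neighbour - sample)

print(f"lambda = {lam:.3f}, synthetic point = {synthetic.round(2)}")
```

Because the new point lies between two real ones, the same interpolation applied to 0/1 columns such as `Sex_male` can produce fractional values when the two points differ in that column.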

### 1) Decision Tree Classifier <a id = "dtree2"></a>

In [26]:
# Training a model on the oversampled data.
dtree_clf2 = DecisionTreeClassifier(random_state = 42)
dtree_clf2.fit(smote_X_train, smote_y_train)
print(f"Training data accuracy = {dtree_clf2.score(smote_X_train, smote_y_train) * 100:.2f}%")
print(f"Testing data accuracy = {dtree_clf2.score(X_test, y_test)* 100:.2f}%")

Training data accuracy = 98.59%
Testing data accuracy = 76.40%


In [27]:
# Printing the confusion matrix and the classification report.
dtree_train_pred2 = dtree_clf2.predict(X_train)
dtree_test_pred2 = dtree_clf2.predict(X_test)

print(f"For Train Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_train, dtree_train_pred2))
print(f"\nClassification Report -")
print(classification_report(y_train, dtree_train_pred2))

print("-" * 100)

print(f"For Test Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_test, dtree_test_pred2))
print(f"\nClassification Report -")
print(classification_report(y_test, dtree_test_pred2))

For Train Predictions -

Confusion Matrix -
[[388   1]
 [ 10 222]]

Classification Report -
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       389
           1       1.00      0.96      0.98       232

    accuracy                           0.98       621
   macro avg       0.99      0.98      0.98       621
weighted avg       0.98      0.98      0.98       621

----------------------------------------------------------------------------------------------------
For Test Predictions -

Confusion Matrix -
[[130  30]
 [ 33  74]]

Classification Report -
              precision    recall  f1-score   support

           0       0.80      0.81      0.80       160
           1       0.71      0.69      0.70       107

    accuracy                           0.76       267
   macro avg       0.75      0.75      0.75       267
weighted avg       0.76      0.76      0.76       267



The test accuracy of the Decision Tree model dropped to 76.40% after oversampling. Let's try Random Forest next.

### 2) Random Forest Classifier <a id = "rf2"></a>

In [28]:
# Training a model on the oversampled data.
rf_clf2 = RandomForestClassifier(random_state = 42)
rf_clf2.fit(smote_X_train, smote_y_train)
print(f"Training data accuracy = {rf_clf2.score(smote_X_train, smote_y_train) * 100:.2f}%")
print(f"Testing data accuracy = {rf_clf2.score(X_test, y_test)* 100:.2f}%")

Training data accuracy = 98.59%
Testing data accuracy = 81.27%


In [29]:
# Printing the confusion matrix and the classification report.
rf_train_pred2 = rf_clf2.predict(X_train)
rf_test_pred2 = rf_clf2.predict(X_test)

print(f"For Train Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_train, rf_train_pred2))
print(f"\nClassification Report -")
print(classification_report(y_train, rf_train_pred2))

print("-" * 100)

print(f"For Test Predictions -\n")
print(f"Confusion Matrix -")
print(confusion_matrix(y_test, rf_test_pred2))
print(f"\nClassification Report -")
print(classification_report(y_test, rf_test_pred2))

For Train Predictions -

Confusion Matrix -
[[384   5]
 [  6 226]]

Classification Report -
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       389
           1       0.98      0.97      0.98       232

    accuracy                           0.98       621
   macro avg       0.98      0.98      0.98       621
weighted avg       0.98      0.98      0.98       621

----------------------------------------------------------------------------------------------------
For Test Predictions -

Confusion Matrix -
[[139  21]
 [ 29  78]]

Classification Report -
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       160
           1       0.79      0.73      0.76       107

    accuracy                           0.81       267
   macro avg       0.81      0.80      0.80       267
weighted avg       0.81      0.81      0.81       267



Even with balanced data our model is still overfitting.

As the Random Forest Classifier performs better than the Decision Tree Classifier, we'll try hyperparameter tuning on Random Forest to improve our accuracy.

---

## Hyperparameter Tuning <a id = "hyperparameter"></a>

Let's first find the maximum depth of the trees and maximum number of leaf nodes in the random forests created.

If the `max_depth` and `max_leaf_nodes` are higher than needed then our model might overfit the data giving us a less generalized model.

In [30]:
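# Finding max_depth and max_leaf_nodes.
# Note: tree_.node_count counts every node (internal nodes and leaves),
# so the value below is an upper bound on the number of leaf nodes.
max_depth = max([estimator.tree_.max_depth for estimator in rf_clf2.estimators_])
max_leaf_nodes = max([estimator.tree_.node_count for estimator in rf_clf2.estimators_])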

print(f"Max depth for the Decision Trees made by the Random Forest was : {max_depth}")
print(f"Max number of terminal nodes for the Decision Trees made by the Random Forest was : {max_leaf_nodes}")

Max depth for the Decision Trees made by the Random Forest was : 24
Max number of terminal nodes for the Decision Trees made by the Random Forest was : 341


Now that we know the `max_depth` and `max_leaf_nodes`, we can use them as upper bounds when searching for better hyperparameter values.
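
One thing to keep in mind is that `tree_.node_count` counts all nodes of a tree, while `get_n_leaves()` counts only the terminal ones. The sketch below compares the two on a synthetic dataset (the data and forest here are assumptions for illustration, not the notebook's `rf_clf2`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used only to grow a small forest.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

max_nodes = max(est.tree_.node_count for est in forest.estimators_)   # internal nodes + leaves
max_leaves = max(est.get_n_leaves() for est in forest.estimators_)    # leaves only
max_depth_seen = max(est.get_depth() for est in forest.estimators_)

print(f"Max depth: {max_depth_seen}")
print(f"Max node count: {max_nodes}")
print(f"Max leaf count: {max_leaves}")
```

Since every split in these binary trees produces exactly two children, a tree with `n` leaves has `2n - 1` nodes. By that relation, a tree with 341 nodes has 171 leaves.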

### RFE <a id = "rfe"></a>

Along with ideal hyperparameter values, removing any redundant feature variables from the training data can also help in tackling overfitting. This can be done using **Recursive Feature Elimination** or `RFE`.

We can create a `Pipeline` containing `RFE` and `RandomForest` and then cross-validate it to find the best-performing configuration.

In [31]:
# Creating a Pipeline with RFE and RandomForest inside it.
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

steps = [
         ("rfe", RFE(estimator = RandomForestClassifier(random_state = 42))),
         ("est", RandomForestClassifier())
]
rf_clf_pl = Pipeline(steps = steps)
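
To get a feel for what the `rfe` step does on its own, here is a minimal sketch on synthetic data (the dataset and the choice of 4 features are assumptions for illustration). `RFE` repeatedly fits the estimator and drops the least important feature until only `n_features_to_select` remain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)

rfe = RFE(estimator=RandomForestClassifier(n_estimators=25, random_state=42),
          n_features_to_select=4)
rfe.fit(X, y)

print(f"Selected feature mask : {rfe.support_}")
print(f"Feature ranking       : {rfe.ranking_}")  # rank 1 means the feature was kept

# transform() keeps only the selected columns.
X_reduced = rfe.transform(X)
print(f"Shape after selection : {X_reduced.shape}")
```

Inside a `Pipeline`, this selection is refitted on each training fold, so the final estimator only ever sees the features chosen from that fold.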

### RandomizedSearchCV <a id = "randomcv"></a>

To find the best number of features and the best values for the `RandomForest` hyperparameters we can use `RandomizedSearchCV`. `GridSearchCV` would be more thorough, but checking every combination of the desired values would take far too many iterations, so we'll go with `RandomizedSearchCV` instead.
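
A rough count shows why an exhaustive grid is impractical here. The sketch below multiplies the number of options per parameter for the search space used in the next cell. It assumes 8 training features (the columns left after cleaning and encoding) and the `max_depth = 24` and `max_leaf_nodes = 341` values printed earlier.

```python
import math

# Assumed inputs: 8 features, and the depth / node values printed earlier.
n_features, max_depth, max_leaf_nodes, cv_folds = 8, 24, 341, 4

# Number of candidate values per parameter in the search space.
options = {
    "rfe__n_features_to_select": len(range(2, n_features + 1)),
    "est__random_state": 5,                                   # np.linspace(0, 42, 5)
    "est__n_estimators": len(range(50, 201, 10)),
    "est__max_depth": 1 + len(range(5, max_depth, 3)),        # +1 for None
    "est__max_leaf_nodes": 1 + len(range(100, max_leaf_nodes, 20)),
}

grid_size = math.prod(options.values())
grid_fits = grid_size * cv_folds       # every combination, fitted once per fold
random_fits = 100 * cv_folds           # n_iter = 100 sampled combinations

print(f"Grid combinations   : {grid_size}")
print(f"Fits with grid      : {grid_fits}")
print(f"Fits with randomized: {random_fits}")
```

Under these assumptions the full grid has 62,720 combinations, or 250,880 pipeline fits with 4 folds, compared with 400 fits for 100 random draws.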

In [32]:
# Applying RandomizedSearchCV on the Pipeline we created.
from sklearn.model_selection import RandomizedSearchCV

params = {
    "rfe__n_features_to_select" : range(2, smote_X_train.shape[1] + 1),
    "est__random_state" : np.linspace(0, 42, 5).astype(int),
    "est__n_estimators" : range(50, 201, 10),
    "est__max_depth" : [None] + list(range(5, max_depth, 3)),
    "est__max_leaf_nodes" : [None] + list(range(100, max_leaf_nodes, 20))
}

rs = RandomizedSearchCV(estimator = rf_clf_pl, cv = 4, param_distributions = params, n_jobs = -1, n_iter = 100, random_state = 42)
rs.fit(smote_X_train, smote_y_train)

RandomizedSearchCV(cv=4,
                   estimator=Pipeline(steps=[('rfe',
                                              RFE(estimator=RandomForestClassifier(random_state=42))),
                                             ('est',
                                              RandomForestClassifier())]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'est__max_depth': [None, 5, 8, 11, 14,
                                                           17, 20, 23],
                                        'est__max_leaf_nodes': [None, 100, 120,
                                                                140, 160, 180,
                                                                200, 220, 240,
                                                                260, 280, 300,
                                                                320, 340],
                                        'est__n_estimators': range(50, 201, 10),
                         

In [33]:
# Checking which values yielded the best results.
rs_results = pd.DataFrame(rs.cv_results_)[["params", "mean_test_score"]]
rs_results = pd.concat([pd.DataFrame(rs_results["params"].tolist()), rs_results["mean_test_score"]], axis = 1)
rs_results = rs_results.loc[rs_results["mean_test_score"].dropna().index, :].sort_values(by = "mean_test_score", ascending = False).reset_index(drop = True)
rs_results.head()

Unnamed: 0,rfe__n_features_to_select,est__random_state,est__n_estimators,est__max_leaf_nodes,est__max_depth,mean_test_score
0,4,0,170,120.0,23.0,0.845797
1,4,0,190,260.0,17.0,0.844522
2,4,10,70,200.0,23.0,0.841944
3,4,10,110,140.0,,0.83938
4,4,10,80,140.0,17.0,0.83938


The best values we get from cross-validation are -

* `rfe__n_features_to_select` = 4
* `est__random_state` = 0
* `est__n_estimators` = 170
* `est__max_leaf_nodes` = 120
* `est__max_depth` = 23

Now that we have the ideal values with us, let's check the accuracy on the test data.
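
As a side note, a fitted search object also exposes the winning combination directly through `best_params_` and `best_score_`, without building a results table. The sketch below shows this on synthetic data with a deliberately tiny search space (both are assumptions chosen to keep the example fast).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a small search space, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": range(10, 51, 10),
                         "max_depth": [None, 3, 5]},
    n_iter=5, cv=4, random_state=42,
)
search.fit(X, y)

print(f"Best parameters    : {search.best_params_}")
print(f"Best mean CV score : {search.best_score_:.4f}")
```

`best_score_` is the mean cross-validated score of the best combination, so it matches the top row of a table sorted by `mean_test_score`.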


In [34]:
# Checking the accuracy on the test data.
rf_clf_rs = rs.best_estimator_

print(f"Training data accuracy = {rf_clf_rs.score(smote_X_train, smote_y_train) * 100:.2f}%")
print(f"Testing data accuracy = {rf_clf_rs.score(X_test, y_test)* 100:.2f}%")

Training data accuracy = 98.07%
Testing data accuracy = 81.27%


As you can see, even after going through so many steps, the test accuracy has barely improved: the tuned model scores 81.27%, the same as the untuned Random Forest trained on the oversampled data. This means that either the data still has a lot of issues, or that `RandomForestClassifier` isn't the right algorithm for the task.

---

# Final Prediction <a id = "final-pred"></a>

Now that we have a model with about 81% accuracy on our held-out test split, let's try to predict on the real test data `test_df`.

First we'll have to clean the data, then we can pass it into the model.

In [35]:
# Creating a function to clean the data.
def clean_data(df):
  # Drop the columns that the model was not trained on.
  df = df.drop(columns = ["Cabin", "Name", "Ticket", "PassengerId"])

  # Fill missing values with the column median.
  # This assumes that only numeric columns have missing values.
  null_columns = df.columns[df.isna().sum().gt(0)]
  for column in null_columns:
    df[column] = df[column].fillna(value = df[column].median())

  # One-hot encode the string columns and join them back to the numeric ones.
  str_df = df.select_dtypes(include = object)
  df = df.drop(columns = str_df.columns)
  encoded_df = pd.get_dummies(data = str_df, drop_first = True)
  df = pd.concat([df, encoded_df], axis = 1)

  return df

cleaned_test_df = clean_data(test_df.copy())
cleaned_test_df.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,3,34.5,0,0,7.8292,1,1,0
1,3,47.0,1,0,7.0,0,0,1
2,2,62.0,0,0,9.6875,1,1,0
3,3,27.0,0,0,8.6625,1,0,1
4,3,22.0,1,1,12.2875,0,0,1
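
Here the encoded columns happen to match the ones the model was trained on. If a category were missing from the test data, `pd.get_dummies` would produce different columns, and `drop_first = True` could even drop the wrong category. One possible safeguard, sketched below on made-up frames, is to encode the test data without `drop_first` and then reindex it to the training columns.

```python
import pandas as pd

# Made-up frames: the test data only contains one of the three ports.
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Fare": [7.83, 9.69], "Embarked": ["S", "S"]})

# Training columns after encoding: Fare, Embarked_Q, Embarked_S.
train_encoded = pd.get_dummies(train, drop_first=True)

# Encode every category present in the test data, then keep exactly the
# training columns; columns absent from the test data are filled with 0.
test_encoded = pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)

print(test_encoded)
```

This keeps the column order and meaning identical to what the model saw during training.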


In [36]:
# Making the prediction.
test_pred = pd.DataFrame(rf_clf_rs.predict(cleaned_test_df), columns = ["Survived"])
test_pred = pd.concat([test_df["PassengerId"], test_pred], axis = 1)
test_pred.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,0


Now that we have the predictions stored in a variable, let's store it into a `.csv` file.

In [37]:
# Converting the dataframe to a csv file.
test_pred.to_csv("Predictions.csv", index = False)

# END OF THE NOTEBOOK

---