
# Loading the data

In [155]:
# !mkdir ~/.kaggle
# !cp /content/drive/MyDrive/C.S/Kaggle/kaggle.json ~/.kaggle/kaggle.json
# !kaggle competitions download -c titanic
# !unzip /content/titanic.zip

In [156]:
# Importing all the necessary modules.
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

In [157]:
# Loading the files.
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")

train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


---

# Data Cleaning

Now that we have our data loaded, let's check the types of columns we have here and which of them have null values.

In [158]:
# Checking the datatype in each column.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [159]:
# Checking for null values.
round((train_df.isna().sum() / train_df.shape[0]) * 100, 2)

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64

From the above output we can see that the `Age` column has 19.87% null values, the `Cabin` column has 77.1% null values, and the `Embarked` column has 0.22% null values.

Because `Cabin` has more than 30% null values, we can drop it. For the `Age` column we'll replace the null values with the column's median, and for the `Embarked` column we'll replace them with the mode (we use the mode instead of the median because the data in the column is categorical). We'll also drop `PassengerId`, `Name`, and `Ticket`, since these identifier-like columns aren't used in the analysis below.
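
The rule described above (drop a column when too many values are missing, otherwise fill with the median or the mode) can be sketched as a small helper. This is only an illustration on a made-up frame: `toy_df`, `NULL_THRESHOLD`, and `clean_nulls` are hypothetical names and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 30.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

NULL_THRESHOLD = 30.0  # percent; hypothetical constant mirroring the "> 30%" rule


def clean_nulls(df, threshold = NULL_THRESHOLD):
    """Drop columns above the null threshold, then impute what is left."""
    null_pct = df.isna().mean() * 100
    cleaned = df.drop(columns = null_pct[null_pct > threshold].index)
    for column in cleaned.columns:
        if cleaned[column].isna().any():
            if pd.api.types.is_numeric_dtype(cleaned[column]):
                fill_value = cleaned[column].median()   # numeric: median
            else:
                fill_value = cleaned[column].mode()[0]  # categorical: mode
            cleaned[column] = cleaned[column].fillna(fill_value)
    return cleaned


cleaned_df = clean_nulls(toy_df)
print(cleaned_df)
```

Here `Cabin` (80% null) is dropped, while the missing `Age` and `Embarked` values are filled in.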

In [160]:
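# Handling null values: drop unused columns, then impute Age (median) and Embarked (mode).
train_df = train_df.drop(columns = ["PassengerId", "Name", "Cabin", "Ticket"])
# Assign the result back instead of using inplace = True on a column selection.
train_df["Age"] = train_df["Age"].fillna(value = train_df["Age"].median())
train_df["Embarked"] = train_df["Embarked"].fillna(value = train_df["Embarked"].mode()[0])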

In [161]:
# Checking if the data got cleaned.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


---

# Exploratory Data Analysis (EDA)

## Univariate Analysis

### 1) Target Variable (Survived column)

From the above output we can see that our target column, `Survived`, has categorical data in it. To check its distribution we can plot the counts of the values in it on a bar graph.

In [162]:
# Plotting the counts of the values in the Survived column as a bar graph.
bgcolor = "#0a0a33"
color_sequence = ["#3562e8", "#eb3462", "#a7c8d1", "#ccc504"]
font_color = "white"
border_color = "black"
labels = ["Didn't Survive", "Survived"]

survivor = px.histogram(
    data_frame = train_df, x = "Survived", color = "Survived", color_discrete_sequence = color_sequence,
    title = "Survivor Data"
    )

survivor.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
survivor.update_layout(
    height = 650, plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, 
    bargap = 0.6, xaxis = dict(tickmode = "array", tickvals = [0, 1], ticktext = labels))

survivor.show()

From the graph above, we can infer that most of the people on the ship did not survive: 549 passengers in the training data died in the disaster, while 342 survived. Let's look at the factors that may have contributed to a passenger's survival.
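
To put the two counts in proportion, they can be turned into percentages. The sketch below rebuilds the counts by hand rather than reading them from `train_df`; `survival_counts` and `survival_pct` are illustrative names.

```python
import pandas as pd

# Counts quoted in the text (0 = didn't survive, 1 = survived).
survival_counts = pd.Series({0: 549, 1: 342}, name = "Survived")

# Share of each outcome as a percentage of all passengers in the training set.
survival_pct = (survival_counts / survival_counts.sum() * 100).round(2)
print(survival_pct)
```

On the real data, `train_df["Survived"].value_counts(normalize = True)` should give the same shares as fractions.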

### 2) Continuous Features (Age and Fare columns)

From the dataframe information we got earlier we can infer that `Age` and `Fare` are continuous variables. So let's plot histograms to check their distribution.

In [163]:
# Plotting histograms for Age and Fare.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"], shared_yaxes = True)

# Age Distribution.
ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence)
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(ages.data, rows = 1, cols = 1)

# Fare Distribution.
fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence)
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(fares.data, rows = 1, cols = 2)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

The `Age Distribution` looks roughly normal, but the `Fare Distribution` has a few outliers around 500 that stretch the x-axis and squash the rest of the histogram. Let's remove any `Fare` values above 267.5 and plot the graphs again.
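
The notebook doesn't say how the 267.5 cutoff was chosen. As one possible alternative (a different technique, not the author's method), a cutoff can be derived from the data with Tukey's upper fence, `Q3 + 1.5 * IQR`. The fares below are made up, and this rule would generally give a different threshold than 267.5 on the real column.

```python
import pandas as pd

# Made-up fares with two extreme values, standing in for train_df["Fare"].
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 53.1, 71.28, 512.33, 512.33], name = "Fare")

# Hypothetical rule: treat anything above the upper Tukey fence as an outlier.
q1, q3 = fares.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

trimmed = fares[fares <= upper_fence]
print(f"upper fence = {upper_fence:.2f}, kept {len(trimmed)} of {len(fares)} fares")
```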

In [164]:
# Removing unwanted values.
train_df = train_df[train_df["Fare"] <= 267.5]

# Making a subplot.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"], shared_yaxes = True)

# Age Distribution
ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence)
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

# Fare Distribution
fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence)
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fig.add_traces(ages.data, rows = 1, cols = 1)
fig.add_traces(fares.data, rows = 1, cols = 2)
fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

From the graphs above we can infer that most of the people onboard the ship were about 30 years old and didn't spend much on the ticket fare.

### 3) Discrete Features (Pclass, Sex, and Embarked columns)

Now let's check the distribution of `Pclass`, `Sex`, and `Embarked`.

In [165]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 3, subplot_titles = [f"Distribution for {column}" for column in ["Pclass", "Sex", "Embarked"]], shared_yaxes = True)

# Plotting graph for Pclass
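# rename_axis keeps the category column named "index" regardless of the pandas version.
pcounts = train_df["Pclass"].value_counts().rename_axis("index").reset_index(name = "P_counts")
pcounts["index"] = pcounts["index"].astype(str)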

bar_obj1 = go.Bar(
    x = pcounts["index"], y = pcounts["P_counts"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = pcounts["P_counts"], textposition = "outside")
fig.add_traces(bar_obj1, rows = 1, cols = 1)

# Plotting graph for Sex
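# rename_axis avoids a name clash between the category column and the counts column.
sex_counts = train_df["Sex"].value_counts().rename_axis("index").reset_index(name = "Sex")
sex_counts["index"] = sex_counts["index"].astype(str)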

bar_obj2 = go.Bar(
    x = sex_counts["index"], y = sex_counts["Sex"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = sex_counts["Sex"], textposition = "outside")

fig.add_traces(bar_obj2, rows = 1, cols = 2)

# Plotting graph for Embarked
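# rename_axis avoids a name clash between the category column and the counts column.
embarked_counts = train_df["Embarked"].value_counts().rename_axis("index").reset_index(name = "Embarked")
embarked_counts["index"] = embarked_counts["index"].astype(str)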

bar_obj3 = go.Bar(
    x = embarked_counts["index"], y = embarked_counts["Embarked"], 
    marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), width = 0.4,
    text = embarked_counts["Embarked"], textposition = "outside")
fig.add_traces(bar_obj3, rows = 1, cols = 3)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Discrete Variables", showlegend = False)
fig.update_layout({f"yaxis{i}": dict(range=[0,750]) for i in range(1, 4)})
fig.show()

From the above graphs we can infer that most passengers on the ship:
1. Had a 3$^{rd}$ class ticket.
2. Were male.
3. Boarded at Southampton (Embarked = "S").

### 4) Number of relatives (SibSp and Parch columns)

Now let's check the distribution of the number of relatives each passenger had onboard.

In [166]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{column} Distribution" for column in ["SibSp", "Parch"]], shared_yaxes = True, x_title = "Number of Relations")

# SibSp distribution.
sibsp = go.Histogram(x = train_df["SibSp"], marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), hovertext = train_df["SibSp"].value_counts())
fig.add_traces(sibsp, rows = 1, cols = 1)

# Parch distribution.
parch = go.Histogram(x = train_df["Parch"], marker = dict(color = color_sequence, line = dict(width = 1, color = border_color)), hovertext = train_df["Parch"].value_counts())
fig.add_traces(parch, rows = 1, cols = 2)

fig.update_layout(plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Relations Distribution")
fig.show()

From the above graphs we can infer that most of the people onboard didn't have a relative on the ship.
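
The same observation can be expressed as a single number by adding the two columns and counting the passengers whose total is zero. A minimal sketch on made-up rows; `FamilySize` and `IsAlone` are hypothetical helper columns that the notebook does not create.

```python
import pandas as pd

# Made-up rows standing in for train_df[["SibSp", "Parch"]].
relatives = pd.DataFrame({
    "SibSp": [1, 0, 0, 1, 0, 0, 3, 0],
    "Parch": [0, 0, 0, 2, 0, 1, 1, 0],
})

# Total relatives onboard, and a flag for passengers travelling with none.
relatives["FamilySize"] = relatives["SibSp"] + relatives["Parch"]
relatives["IsAlone"] = relatives["FamilySize"] == 0

# The mean of a boolean column is the share of True values.
alone_share = relatives["IsAlone"].mean()
print(f"{alone_share:.0%} of these passengers travelled without relatives")
```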

---

## Multivariate Analysis

Now that we know the distribution of each variable, let's check which of these factors affected a person's survival.

### 1) Correlation Heatmap

Let's plot the correlation heatmap for all the columns of the dataframe.

In [177]:
# Encoding the values in Embarked.
from sklearn.preprocessing import LabelEncoder

# Work on a copy so the encoding doesn't overwrite Embarked in train_df.
temp_df = train_df.copy()
le = LabelEncoder()
temp_df["Embarked"] = le.fit_transform(temp_df["Embarked"])

In [187]:
# Correlation matrix for the numeric columns (Sex is still text, so it is skipped).
corr_values = temp_df.corr(numeric_only = True)
mask = np.triu(np.ones_like(corr_values, dtype = bool))
corr = round(corr_values, 2).mask(mask)

corr_matrix = px.imshow(corr, text_auto = True)
corr_matrix.update_layout(height = 700, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, xaxis_showgrid = False, yaxis_showgrid = False)
corr_matrix.show()

In [183]:
# Seeing which values got encoded as what.
encoded_embarked_dict = dict(zip(le.classes_, np.sort(temp_df["Embarked"].unique())))
encoded_embarked_dict

{'C': 0, 'Q': 1, 'S': 2}
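
The mapping above follows alphabetical order because the encoder sorts the labels before numbering them. The same mapping can be reproduced with a pandas categorical, which also sorts its categories. This is a sketch on made-up values, not the notebook's encoding step; `ports` and the other names are illustrative.

```python
import pandas as pd

# Made-up port labels standing in for train_df["Embarked"].
ports = pd.Series(["S", "C", "S", "Q", "S"], name = "Embarked")

# Categories are sorted lexically, so the integer codes follow alphabetical order.
ports_cat = ports.astype("category")
encoded = ports_cat.cat.codes
mapping = {label: code for code, label in enumerate(ports_cat.cat.categories)}
print(mapping)

# Decoding: map the integer codes back to the port labels.
decoded = encoded.map(dict(enumerate(ports_cat.cat.categories)))
```

Keep in mind that these integers are only labels: the order C < Q < S has no real meaning, so a correlation involving the encoded `Embarked` column should be read with caution.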

From the above graph we can see that `Pclass` and `Fare` have the strongest correlation of all the column pairs. A correlation coefficient of -0.6 means that the higher a passenger's `Pclass` value, the lower the fare they tended to pay. In other words, passengers from higher economic backgrounds (Pclass = 1) paid the highest fares, while passengers from lower economic backgrounds (Pclass = 3) paid the lowest.

The correlation coefficient of 0.26 between the `Fare` and `Survived` columns suggests that passengers who paid higher fares had higher survival rates.

The correlation coefficient of -0.33 between the `Pclass` and `Survived` columns suggests that passengers with a higher `Pclass` value (lower economic background) had lower chances of survival.

The correlation coefficient of 0.42 between the `SibSp` and `Parch` columns suggests that passengers who had siblings or a spouse onboard were more likely to have a parent or child onboard as well, and vice versa.

The correlation coefficient of -0.34 between `Age` and `Pclass` suggests that passengers from a higher economic background tended to be older, while passengers from a lower economic background tended to be younger.
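
To make the sign of these coefficients concrete, the sketch below computes Pearson's r by hand on made-up values and checks it against `DataFrame.corr()`. It also repeats the masking step used for the heatmap. The numbers are invented for illustration and are not the Titanic values.

```python
import numpy as np
import pandas as pd

# Made-up values: fares fall as the class number rises.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 30.0, 20.0, 8.0, 7.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

corr = toy.corr()

# Pearson's r: sum of co-deviations divided by the root of the product of squared deviations.
dx = toy["Pclass"] - toy["Pclass"].mean()
dy = toy["Fare"] - toy["Fare"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(f"manual r = {r_manual:.3f}, pandas r = {corr.loc['Pclass', 'Fare']:.3f}")

# Hide the diagonal and the upper triangle so each pair appears only once.
mask = np.triu(np.ones(corr.shape, dtype = bool))
lower_only = corr.mask(mask)
print(lower_only)
```

A negative r here means that larger `Pclass` values go with smaller `Fare` values, which is the pattern described above.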

### 2) Distribution based on Survival (Age and Fare)

Let's first check the difference in distributions of `Age` and `Fare` columns for people who survived and for people who didn't.

In [184]:
# Making a subplot.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{name} Distribution" for name in ["Age", "Fare"]])

# Plotting the KDE plot for Age.
age_data = [train_df.groupby("Survived")["Age"].get_group(group) for group in train_df["Survived"].unique()]
age_kde = ff.create_distplot(
    hist_data = age_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["lime", "orange"]
    )
fig.add_traces(age_kde.data, rows = 1, cols = 1)

# Plotting the KDE plot for Fare.
fare_data = [train_df.groupby("Survived")["Fare"].get_group(group) for group in train_df["Survived"].unique()]
fare_kde = ff.create_distplot(
    hist_data = fare_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["red", "turquoise"]
    )
fig.add_traces(fare_kde.data, rows = 1, cols = 2)

fig.update_layout(showlegend = False, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution based on Survival")
fig.update_layout({f"xaxis{i}": dict(showgrid = False) for i in range(1,3)})
fig.show()

#### **i) Analysis on Age.**
From the above graph we can see that the trends in `Age` are roughly the same for both groups.

#### **ii) Analysis on Fare.**
The trend in `Fare`, however, tells us that the people who paid a higher price for their ticket had a higher chance of survival. One plausible explanation is that people who paid a higher price were provided much safer cabins and were given priority during the evacuation, as they might have been influential people. This also suggests that the people who paid higher ticket fares might belong to a higher economic class. So if we find that people belonging to a higher economic class had a higher survival rate, our hypothesis of **"priority during evacuation"** would gain more support.
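
One way to check this kind of trend numerically is to split the fares into equal-sized bands and compare the survival rate in each band. The sketch below uses made-up fares and outcomes, so its output illustrates the method rather than the Titanic result; `FareBand` is a hypothetical helper column.

```python
import pandas as pd

# Made-up fares and outcomes standing in for train_df[["Fare", "Survived"]].
toy = pd.DataFrame({
    "Fare":     [7.0, 7.5, 8.0, 9.0, 13.0, 15.0, 26.0, 30.0, 52.0, 71.0, 80.0, 120.0],
    "Survived": [0,   0,   0,   1,   0,    1,    1,    0,    1,    1,    0,    1],
})

# Three fare bands with (roughly) the same number of passengers in each.
toy["FareBand"] = pd.qcut(toy["Fare"], q = 3, labels = ["low", "mid", "high"])

# The mean of a 0/1 column is the survival rate within each band.
survival_by_band = toy.groupby("FareBand", observed = True)["Survived"].mean()
print(survival_by_band)
```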

### 3) Distribution based on Survival (Pclass, Sex, and Embarked)

Now let's check whether a passenger's `Pclass`, `Sex`, or `Embarked` value could have affected their survival rate.

In [185]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 3, subplot_titles = [f"Distribution for {column}" for column in ["Pclass", "Sex", "Embarked"]], shared_yaxes = True)

# Grouped distribution for Pclass
survivor_by_pclass = px.histogram(data_frame = train_df, x = "Pclass", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_pclass.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_pclass.data, rows = 1, cols = 1)

# Grouped distribution for Sex
survivor_by_sex = px.histogram(data_frame = train_df, x = "Sex", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_sex.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_sex.data, rows = 1, cols = 2)

# Grouped distribution for Embarked
survivor_by_embarked = px.histogram(data_frame = train_df, x = "Embarked", color = "Survived", barmode = "group", text_auto = True, color_discrete_sequence = color_sequence)
survivor_by_embarked.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_embarked.data, rows = 1, cols = 3)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Variables grouped by survivors")
fig.update_layout({f"yaxis{i}":dict(range = [0,510]) for i in range(1, 4)})
fig.show()

From the above graph we can see that most of the passengers who didn't survive:

1. Were from a lower economic background.
2. Were male.
3. Boarded the ship at Southampton (Embarked = "S").

#### **i) Analysis on Pclass.**
Most of the people belonging to a lower economic background (Pclass = 3) did not survive the disaster, while most of the people belonging to a higher economic background (Pclass = 1) did. This supports our hypothesis of **priority during evacuation**. One possible reason for this trend is the influence these passengers had in society: people from a higher economic background may have been big business owners, politicians, or government officials.

#### **ii) Analysis on Sex.**

Most of the survivors were women (Sex = female), while most of the non-survivors were men (Sex = male). This might be because of the unofficial societal rule of "women and children first" during a disaster.

#### **iii) Analysis on Embarked.**

Most of the people who boarded the ship at Southampton (Embarked = "S") did not survive the disaster, while most of the people who boarded the ship at Cherbourg (Embarked = "C") did.
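
Raw counts depend on how many passengers are in each group, so comparing survival rates within each group is often a fairer check of the observations above. A minimal sketch using a row-normalised crosstab on made-up data; the same call should work for `Pclass` and `Embarked`.

```python
import pandas as pd

# Made-up rows standing in for train_df[["Sex", "Survived"]].
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0,      0,      0,      1,      1,        1,        0,        1],
})

# Row-normalised crosstab: each row sums to 1, so column 1 is the survival rate.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize = "index")
print(rates)
```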

In [186]:
# Making a subplot
fig = make_subplots(rows = 1, cols = 2, subplot_titles = [f"{column} Distribution" for column in ["SibSp", "Parch"]], shared_yaxes = True, x_title = "Number of Relations")

# Plotting the KDE plot for SibSp.
sibsp_data = [train_df.groupby("Survived")["SibSp"].get_group(group) for group in train_df["Survived"].unique()]
sibsp_kde = ff.create_distplot(
    hist_data = sibsp_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["lime", "red"]
    )
fig.add_traces(sibsp_kde.data, rows = 1, cols = 1)

# Plotting the KDE plot for Parch.
parch_data = [train_df.groupby("Survived")["Parch"].get_group(group) for group in train_df["Survived"].unique()]
parch_kde = ff.create_distplot(
    hist_data = parch_data, group_labels = ["Didn't Survive", "Survived"], 
    show_rug = False, show_hist = False, colors = ["orange", "turquoise"]
    )
fig.add_traces(parch_kde.data, rows = 1, cols = 2)

fig.update_layout(plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color, showlegend = False, title = "Relations Distribution")
fig.update_layout({f"xaxis{i}": dict(showgrid = False) for i in range(1,3)})
fig.show()