<a href="https://colab.research.google.com/github/slyofzero/Kaggle-Notebooks/blob/main/Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading the data

In [None]:
# !mkdir ~/.kaggle
# !cp /content/drive/MyDrive/C.S/Kaggle/kaggle.json ~/.kaggle/kaggle.json
# !kaggle competitions download -c titanic
# !unzip /content/titanic.zip

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 27.3MB/s]
Archive:  /content/titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


---

# EDA

In [None]:
# Importing all the necessary modules.
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
# Loading the files.
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Now that we have our data loaded, let's check the types of columns we have here and which of them have null values.

In [None]:
# Checking the datatype in each column.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
# Checking for null values.
round((train_df.isna().sum() / train_df.shape[0]) * 100, 2)

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64

From the above output we can see that the `Age` column has 19.87% null values, the `Cabin` column has 77.1% null values and `Embarked` column has 0.22% of null values.
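
These percentages can be cross-checked against the non-null counts printed by `train_df.info()`. The short sketch below redoes the arithmetic by hand; the dictionary simply copies the counts from that output, and the variable names are illustrative.

```python
# Cross-check the null percentages using the non-null counts from train_df.info().
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}

null_pct = {
    column: round((total_rows - non_null) / total_rows * 100, 2)
    for column, non_null in non_null_counts.items()
}

for column, pct in null_pct.items():
    print(f"{column}: {pct}% null")
```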

Because more than 30% of the `Cabin` values are null, we can drop the column. For the `Age` column we'll replace the null values with the median, and for the `Embarked` column we'll replace them with the mode (the mode rather than the median, because the data in that column is categorical).

In [None]:
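# Dropping the columns we won't use (identifiers and the mostly-null Cabin), then filling null values.
train_df = train_df.drop(columns = ["PassengerId", "Name", "Cabin", "Ticket"])
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())
train_df["Embarked"] = train_df["Embarked"].fillna(train_df["Embarked"].mode()[0])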
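
The 30% rule mentioned above can also be applied programmatically. The following is a minimal sketch on a toy frame, not the notebook's actual data; the helper name `drop_sparse_columns` and the `max_null_fraction` parameter are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_null_fraction=0.30):
    """Return a copy of df without columns whose null fraction exceeds the threshold."""
    null_fraction = df.isna().mean()  # fraction of nulls per column
    keep = null_fraction[null_fraction <= max_null_fraction].index
    return df[keep].copy()

# Toy data (made up for illustration): "Cabin" is 75% null, "Age" and "Embarked" are 25% null.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

cleaned = drop_sparse_columns(toy)
# Same imputation strategy as in the notebook: median for numeric, mode for categorical.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
print(cleaned)
```

Keeping the threshold as a parameter makes it easy to see how the set of surviving columns changes if a stricter or looser rule is preferred.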

In [None]:
# Checking if the data got cleaned.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


Our target column, `Survived`, is categorical: it only takes the values 0 and 1. To check its distribution we can plot the count of each value as a bar graph.

In [None]:
# Plotting the counts of the values in the Survived column as a bar graph.
bgcolor = "#0a0a33"
color_sequence = ["#3562e8", "#eb3462", "#a7c8d1", "#ccc504"]
font_color = "white"
border_color = "black"
labels = ["Didn't Survive", "Survived"]

survivor = px.histogram(
    data_frame = train_df, x = "Survived", color = "Survived", color_discrete_sequence = color_sequence,
    title = "Survivor Data"
    )

survivor.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
survivor.update_layout(
    height = 650, plot_bgcolor = bgcolor, paper_bgcolor = bgcolor, font_color = font_color,
    bargap = 0.6, xaxis = dict(tickmode = "array", tickvals = [0, 1], ticktext = labels))

survivor.show()

From the graph above, we can see that most of the people on the ship did not survive: 549 of the 891 passengers in the training data (about 62%) died, while 342 (about 38%) survived. Let's look at the factors that may have contributed to a person's chance of survival.
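
The split can also be expressed as percentages. The sketch below rebuilds a stand-in `Survived` column from the two counts quoted above rather than reading `train_df`, so it runs on its own; `summary` is just an illustrative name.

```python
import pandas as pd

# Rebuild a stand-in Survived column from the counts quoted above (549 zeros, 342 ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

counts = survived.value_counts().sort_index()
rates = (survived.value_counts(normalize=True).sort_index() * 100).round(2)

summary = pd.DataFrame({"count": counts, "percent": rates})
summary.index = ["Didn't Survive", "Survived"]
print(summary)
```

On the real data, `train_df["Survived"].value_counts(normalize=True)` should give the same proportions.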

From the dataframe information we got earlier, we can see that `Age` and `Fare` are continuous variables (both are `float64`), so let's plot histograms to check their distributions.

In [None]:
# Plotting histograms for Age and Fare.
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"])

ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence, title = "Age Distribution")
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence, title = "Fare Distribution")
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fig.add_traces(ages.data, rows = 1, cols = 1)
fig.add_traces(fares.data, rows = 1, cols = 2)
fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

The age distribution looks roughly normal, but the fare distribution has a few outliers around 500 that stretch the x-axis and squash the rest of the histogram. Let's remove any `Fare` values above 267.5 and plot the graphs again.
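
Before dropping rows it is worth checking how many a cut-off removes. The sketch below does this on a handful of made-up fare values (not the real column); `filter_max` is a hypothetical helper name, and 267.5 is the threshold used in this notebook.

```python
import pandas as pd

def filter_max(df, column, max_value):
    """Keep rows where df[column] <= max_value and report how many were dropped."""
    kept = df[df[column] <= max_value]
    return kept, len(df) - len(kept)

# Made-up fares for illustration; two of them sit far above the rest.
toy = pd.DataFrame({"Fare": [7.25, 8.05, 53.1, 71.28, 263.0, 500.0, 510.0]})

filtered, n_dropped = filter_max(toy, "Fare", 267.5)
print(f"Dropped {n_dropped} of {len(toy)} rows; max fare is now {filtered['Fare'].max()}")
```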

In [None]:
# Removing the Fare outliers, then plotting the histograms for Age and Fare again.
train_df = train_df[train_df["Fare"] <= 267.5]

fig = make_subplots(rows = 1, cols = 2, subplot_titles = ["Age Distribution", "Fare Distribution"])

ages = px.histogram(data_frame = train_df, x = "Age", color_discrete_sequence = color_sequence, title = "Age Distribution")
ages.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fares = px.histogram(data_frame = train_df, x = "Fare", color_discrete_sequence = color_sequence, title = "Fare Distribution")
fares.update_traces(marker = dict(line = dict(width = 1, color = border_color)))

fig.add_traces(ages.data, rows = 1, cols = 1)
fig.add_traces(fares.data, rows = 1, cols = 2)
fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Continuous Variables")
fig.show()

From the graphs above we can infer that most of the people onboard the ship were about 30 years old and didn't spend much on the ticket fare.

Now let's check the distribution of `Pclass`, `Sex`, `SibSp`, and `Parch`.

In [None]:
fig = make_subplots(rows = 2, cols = 2, subplot_titles = [f"Distribution for {column}" for column in ["Pclass", "Sex", "SibSp", "Parch"]], shared_yaxes = True)

# Plotting graph for Pclass
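# rename_axis keeps the label column named "index" regardless of the pandas version.
pcounts = train_df["Pclass"].value_counts().rename_axis("index").reset_index(name = "P_counts")
pcounts["index"] = pcounts["index"].astype(str)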

bar_fig1 = go.Figure()
bar_obj1 = go.Bar(x = pcounts["index"], y = pcounts["P_counts"], marker = dict(color = color_sequence), text = pcounts["P_counts"], textposition = "outside")
bar_fig1.add_trace(bar_obj1)
bar_fig1.update_traces(width = 0.4, marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(bar_fig1.data, rows = 1, cols = 1)

# Plotting graph for Sex
sex_counts = train_df["Sex"].value_counts().rename_axis("index").reset_index(name = "Sex")
sex_counts["index"] = sex_counts["index"].astype(str)

bar_fig2 = go.Figure()
bar_obj2 = go.Bar(x = sex_counts["index"], y = sex_counts["Sex"], marker = dict(color = color_sequence), text = sex_counts["Sex"], textposition = "outside")
bar_fig2.add_trace(bar_obj2)
bar_fig2.update_traces(width = 0.4, marker = dict(line = dict(width = 1, color = border_color)))

fig.add_traces(bar_fig2.data, rows = 1, cols = 2)

# Plotting graph for SibSp
sibsp_counts = train_df["SibSp"].value_counts().rename_axis("index").reset_index(name = "SibSp")
sibsp_counts["index"] = sibsp_counts["index"].astype(str)

bar_fig3 = go.Figure()
bar_obj3 = go.Bar(x = sibsp_counts["index"], y = sibsp_counts["SibSp"], marker = dict(color = color_sequence), text = sibsp_counts["SibSp"], textposition = "outside")
bar_fig3.add_trace(bar_obj3)
bar_fig3.update_traces(width = 0.4)

fig.add_traces(bar_fig3.data, rows = 2, cols = 1)

# Plotting graph for Parch
parch_counts = train_df["Parch"].value_counts().rename_axis("index").reset_index(name = "Parch")
parch_counts["index"] = parch_counts["index"].astype(str)

bar_fig4 = go.Figure()
bar_obj4 = go.Bar(x = parch_counts["index"], y = parch_counts["Parch"], marker = dict(color = color_sequence), text = parch_counts["Parch"], textposition = "outside")
bar_fig4.add_trace(bar_obj4)
bar_fig4.update_traces(width = 0.4)

fig.add_traces(bar_fig4.data, rows = 2, cols = 2)

fig.update_layout(height = 700, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Discrete Variables", showlegend = False)
# Applying the same y-axis range to every subplot.
fig.update_yaxes(range = [0, 750])
fig.show()
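
The same counts can be read as plain tables, which is handy for checking the bar heights. This is a minimal sketch on toy rows (the values are invented; only the column names come from the dataset), and `count_table` is a hypothetical helper. `rename_axis` is used so the label column has a predictable name across pandas versions.

```python
import pandas as pd

# Toy rows for illustration only; column names match the Titanic training data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

def count_table(df, column):
    """Return a two-column frame: the category label and how often it occurs."""
    return (
        df[column]
        .value_counts()
        .rename_axis(column)
        .reset_index(name="count")
    )

for column in ["Pclass", "Sex"]:
    print(count_table(toy, column), end="\n\n")
```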

In [None]:
fig = make_subplots(rows = 1, cols = 2, shared_yaxes = True)

pclass = px.histogram(data_frame = train_df, x = train_df["Pclass"].astype(str), color = "Pclass")
fig.add_traces(pclass.data, rows = 1, cols = 1)

sex = px.histogram(data_frame = train_df, x = train_df["Sex"].astype(str), color = "Sex")
fig.add_traces(sex.data, rows = 1, cols = 2)

# This figure has only two subplots, so set the range on the existing y-axes rather than on yaxis1 to yaxis4.
fig.update_yaxes(range = [0, 800])
fig.show()

In [None]:
fig = make_subplots(rows = 2, cols = 2)

# Pclass counts
hist1 = go.Figure(
    data = go.Histogram(x = train_df["Pclass"].astype(str), marker_color = color_sequence))
hist1.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(hist1.data, rows = 1, cols = 1)

# Sex counts
hist2 = go.Figure(
    data = go.Histogram(x = train_df["Sex"].astype(str), marker_color = color_sequence))
hist2.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(hist2.data, rows = 1, cols = 2)

# SibSp counts
hist3 = go.Figure(
    data = go.Histogram(x = train_df["SibSp"].astype(str), marker_color = color_sequence))
hist3.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(hist3.data, rows = 2, cols = 1)

# Parch counts
hist4 = go.Figure(
    data = go.Histogram(x = train_df["Parch"].astype(str), marker_color = color_sequence))
hist4.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(hist4.data, rows = 2, cols = 2)

fig.update_layout(height = 700, paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, title = "Distribution of Discrete Variables", showlegend = False)
fig.update_layout({'yaxis'+str(i+1): dict(range=[0,850]) for i in range(4)})
fig.show()

From the above graphs we can infer that most passengers on the ship:
1. Had a 3$^{rd}$ class ticket.
2. Were male.
3. Had no relatives onboard.

Now let's check which of these factors affected a person's survival.
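
One simple numeric check, alongside the plots, is the survival rate per category. Because `Survived` is coded as 0/1, its mean within a group is the share of that group that survived. The sketch below uses invented toy rows rather than `train_df`, and `survival_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from train_df.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 1],
})

def survival_rate_by(df, column):
    """Mean of the 0/1 Survived column per category, i.e. the share that survived."""
    return df.groupby(column)["Survived"].mean()

by_sex = survival_rate_by(toy, "Sex")
by_class = survival_rate_by(toy, "Pclass")
print(by_sex)
print(by_class)
```

The same call on the real frame, for example `survival_rate_by(train_df, "Sex")`, would give the per-group rates for the actual passengers.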

In [None]:
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [None]:
fig = make_subplots(rows = 2, cols = 2)

survivor_by_sex = px.histogram(data_frame = train_df, x = "Sex", color = "Survived", barmode = "group", text_auto = True)
survivor_by_sex.update_traces(marker = dict(line = dict(width = 1, color = border_color)))
fig.add_traces(survivor_by_sex.data)

fig.update_layout(paper_bgcolor = bgcolor, plot_bgcolor = bgcolor, font_color = font_color, height = 750)