# Introduction

![](https://i.ibb.co/hB0CV9W/r-MS-Estonia-model.jpg)
sourced from https://www.kaggle.com/christianlillelund/passenger-list-for-the-estonia-ferry-disaster

**This dataset is based on a real event: the sinking of the Estonian ferry MS Estonia in 1994, widely considered one of the worst maritime disasters of the postwar era.**

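**About Dataset**
1. PassengerId: unique identifier for each person on board
2. Country
3. Firstname
4. Lastname
5. Sex: gender of the passenger (M = Male, F = Female)
6. Age
7. Category: the type of passenger (C = Crew, P = Passenger)
8. Survived: 0 = died, 1 = survived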

# Importing Libraries

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Loading Dataset

In [None]:
df = pd.read_csv('../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv')
df.shape

In [None]:
df.head(5)

# Let's start with EDA
**First, let's use .describe() to look at the summary statistics of the numeric columns**

In [None]:
df.describe()

**Let's check the missing values now**

In [None]:
df.isnull().sum().sum()

**Great! The dataset has no missing values, which makes our work easier**

# EDA and Pandas Profiling
**Let's explore the dataset with ProfileReport from pandas_profiling**

In [None]:
ProfileReport(df)

**ProfileReport gives a concise overview of the dataset's features, including correlations and basic statistics, all in one line of code!**

# Correlation

In [None]:
color = plt.cm.plasma
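# Correlate numeric columns only; the name, country, sex and category columns are strings
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap=color)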

**The numeric columns show little linear correlation with each other. Note that a low correlation coefficient only rules out a strong linear relationship, not every kind of dependence.**
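
As a minimal sketch of what the heatmap is built from, the correlation matrix can be computed on the numeric columns alone. The `toy` frame below is made-up data that only mimics the dataset's column layout; it is not taken from the real passenger list.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's column layout (an assumption)
toy = pd.DataFrame({
    "Sex": ["M", "F", "F", "M", "F", "M"],
    "Age": [62, 22, 21, 53, 55, 71],
    "Category": ["P", "C", "C", "C", "P", "P"],
    "Survived": [0, 0, 0, 0, 0, 1],
})

# Keep only numeric columns so that corr() never sees string values
numeric = toy.select_dtypes(include="number")

# Pearson correlation (the pandas default) between every pair of numeric columns
corr = numeric.corr()
print(corr.round(2))
```

The string columns are excluded here rather than encoded, so the matrix says nothing about how `Sex` or `Category` relate to survival.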

In [None]:
df.Age.max()

# Data Visualizations

# How many Males and Females onboard?

In [None]:
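# Take the labels from value_counts() so each label always matches its count
sex_counts = df.Sex.value_counts()
labels = sex_counts.index.map({'M': 'Male', 'F': 'Female'})
values = sex_counts.values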

fig = go.Figure(data=[go.Pie(labels=labels, textinfo='label+percent', values=values)])
fig.show()

**So men and women are represented in roughly equal numbers here**

# Which Countries Were the Passengers From?

In [None]:
labels = df.Country.value_counts().index
values = df.Country.value_counts().values

fig = go.Figure(data=[go.Pie(labels=labels, textinfo='label+percent', values=values)])
fig.show()

# Survived

In [None]:
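# Take the labels from value_counts() so each label always matches its count
survived_counts = df.Survived.value_counts()
labels = survived_counts.index.map({0: 'Dead', 1: 'Survived'})
values = survived_counts.values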

fig = go.Figure(data=[go.Pie(labels=labels, textinfo='label+percent', values=values)])
fig.show()

# Who Survived More?

In [None]:
# This plot shows how many men and women there are in each category (crew / passenger)
sns.countplot(x='Category', hue='Sex', data=df).set_title('Gender distribution by category')

In [None]:
male_survived = df['Sex'][(df['Sex']=='M') & (df['Survived']==1)].count()
female_survived = df['Sex'][(df['Sex']=='F') & (df['Survived']==1)].count()

male_all = df['Sex'][df['Sex']=='M'].count()
female_all =df['Sex'][df['Sex']=='F'].count()

perc_male = male_survived/male_all
perc_female = female_survived/female_all

print('Percentage of males on board who survived: {:0.2f}%'.format(perc_male*100))
print('Percentage of females on board who survived: {:0.2f}%'.format(perc_female*100))
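
The same per-gender survival percentages can be obtained more compactly with a `groupby`, because the mean of a 0/1 column is the fraction of ones. The sketch below uses a small made-up frame (`toy`), not the real data.

```python
import pandas as pd

# Made-up rows for illustration only (an assumption, not the real passenger list)
toy = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 Survived column per group, scaled to a percentage
survival_rate = toy.groupby("Sex")["Survived"].mean() * 100
print(survival_rate.round(2))
```

On the notebook's own `df`, the equivalent call would be `df.groupby('Sex')['Survived'].mean() * 100`.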

**Survivors among crew and passengers**

In [None]:
sns.countplot(x='Category', hue='Survived', data=df).set_title('Passenger and Crew Survived Distribution')

**0 means dead, 1 means survived**
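
Raw counts are hard to compare when the groups differ in size, so it can help to look at survival rates per category as well. A minimal sketch with made-up rows (the `toy` frame is an assumption):

```python
import pandas as pd

# Made-up rows for illustration only (an assumption, not the real passenger list)
toy = pd.DataFrame({
    "Category": ["P", "P", "P", "P", "C", "C"],
    "Survived": [0, 0, 0, 1, 0, 1],
})

# normalize="index" turns each row of counts into fractions that sum to 1
rates = pd.crosstab(toy["Category"], toy["Survived"], normalize="index")
print(rates)
```

Each row of `rates` then reads as "share that died, share that survived" for that category.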

# Let's do some Feature Engineering here

In [None]:
df.drop(['Country', 'Firstname', 'Lastname'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.drop(['Category', 'Sex'],axis=1, inplace=True)
df.head()
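
The cell above drops `Category` and `Sex`, so the models further down do not see them. If one wanted to keep these columns as features instead, one possible approach (an alternative sketch, not what this notebook does) is to map the two-valued strings to 0/1. The `toy` frame and the chosen 0/1 assignment are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only (an assumption, not the real passenger list)
toy = pd.DataFrame({
    "Sex": ["M", "F", "F", "M"],
    "Age": [62, 22, 21, 53],
    "Category": ["P", "C", "C", "P"],
    "Survived": [0, 0, 1, 0],
})

encoded = toy.copy()
# Binary columns can be mapped directly; which value gets 0 or 1 is arbitrary
encoded["Sex"] = encoded["Sex"].map({"M": 0, "F": 1})
encoded["Category"] = encoded["Category"].map({"P": 0, "C": 1})
print(encoded)
```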

In [None]:
df.isnull().sum()

In [None]:
# PassengerId is only an identifier, so it is excluded from the features
x = df.drop(['Survived', 'PassengerId'], axis=1)
y = df['Survived']

# Preprocessing 

In [None]:
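# Standardize the features (PassengerId is an identifier, not a feature).
# The result is assigned to x so that the train/test split below uses the scaled values.
sc = StandardScaler()
x = sc.fit_transform(df.drop(['Survived', 'PassengerId'], axis=1))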

# Model Development

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 5)
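
A common variation on this step, sketched below under stated assumptions, is to stratify the split on the target so both parts keep the same class ratio, and to fit the scaler on the training part only so no test-set statistics leak into preprocessing. The arrays `X` and `y` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only: one "age" feature and an imbalanced 0/1 target
rng = np.random.default_rng(0)
X = rng.integers(1, 90, size=(40, 1)).astype(float)
y = np.array([0] * 32 + [1] * 8)

# stratify=y keeps the 80/20 class ratio in both the training and the test part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Tree-based models such as the ones used below are largely insensitive to feature scaling, so the scaling step mainly matters if a different model family is tried later.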

# Random Forest Classifier

In [None]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
preds = rfc.predict(x_test)
score = rfc.score(x_test,y_test)

In [None]:
score
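
An accuracy score is easier to judge next to a baseline. If survivors are a small minority, a model that always predicts "dead" already scores high. The sketch below shows this with scikit-learn's `DummyClassifier` on made-up labels (the 80/20 ratio is an assumption, not the dataset's real ratio).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels for illustration only; the features are irrelevant here
X = np.zeros((10, 1))
y = np.array([0] * 8 + [1] * 2)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

A real model is only informative if its accuracy clearly exceeds this kind of baseline on the same test set.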

In [None]:
preds[:10]

In [None]:
ground_truth = y_test[:10]
ground_truth

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# scikit-learn expects the true labels first, then the predictions
cm = confusion_matrix(y_test, preds)
print('The Confusion Matrix : \n', cm)

In [None]:
sns.heatmap(cm, annot = True, cmap='coolwarm')

In [None]:
# True labels first, then the predictions
cf = classification_report(y_test, preds)
print('The Report : \n', cf)
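
The argument order matters for these metrics: `confusion_matrix` and `classification_report` both take the true labels first and the predictions second. Rows of the confusion matrix then correspond to true classes and columns to predicted classes; swapping the arguments transposes the matrix. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Swapping the arguments gives the transposed matrix
swapped = confusion_matrix(y_pred, y_true)
print(swapped)
```

With swapped arguments, precision and recall in the classification report trade places as well, which is easy to miss when only accuracy is checked.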

# XGB

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(x_train,y_train)

In [None]:
xgb_preds = xgb.predict(x_test)
xgb_score = xgb.score(x_test,y_test)
print('The Accuracy :',xgb_score)

In [None]:
# Use the XGBoost predictions here (not the random forest ones), true labels first
cm = confusion_matrix(y_test, xgb_preds)
print('The Confusion Matrix : \n', cm)
sns.heatmap(cm, annot=True, cmap=color)

In [None]:
# True labels first, then the predictions
cf = classification_report(y_test, xgb_preds)
print('The Report : \n', cf)