# Introduction

On September 28, 1994, 852 people died in one of the worst maritime disasters of the century when the Estonia, a large car-and-passenger ferry, sank in the Baltic Sea. It is often described as one of the deadliest ship sinkings, second only to the Titanic.

# The Dataset 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv("/kaggle/input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv")

In [None]:
df.head(),df.shape

In [None]:
df.isnull().sum()

In [None]:
# set_index returns a new DataFrame, so assign it back to keep the change
df = df.set_index('PassengerId')
df.head()

We have about 7 columns, of which the last 4 are of utmost importance for building this model.
* Sex: M or F
* Age: age of the passenger
* Category: P = Passenger, C = Crew
* Survived: 0 = No, 1 = Yes

# Data Pre-Processing

We'll convert the categorical variables Sex and Category into 0s and 1s using LabelEncoder so we can use them when building the model.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

df['Sex'] = label_encoder.fit_transform(df['Sex']) #Male is 1
df['Category'] = label_encoder.fit_transform(df['Category']) #P is 1
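LabelEncoder assigns integer codes in sorted order of the original labels, which is why M becomes 1 and P becomes 1 above. A minimal sketch, using made-up values rather than the real passenger list, of how the mapping could be checked through the encoder's `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the Sex column (not the real data).
sex = ["M", "F", "F", "M"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the original labels, ordered by their integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'F': 0, 'M': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

Using a separate encoder per column would also keep each mapping available later through `inverse_transform`.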

In [None]:
def age_group(age):
  # Map an age in years to a coarse age group (upper bounds are inclusive).
  if age <= 1:
    return 'infant'
  elif age <= 4:
    return 'toddler'
  elif age <= 12:
    return 'child'
  elif age <= 19:
    return 'teenager'
  elif age <= 55:
    return 'adult'
  else:
    return 'senior'

df['age_group'] = df['Age'].map(age_group)

I have created a new column from the passengers' ages that assigns each passenger to an age group.
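The same binning could also be expressed with `pd.cut`. The sketch below uses the cut-offs from the function above on a handful of illustrative ages. The label strings are placeholders, ordered from youngest to oldest, and the `-1` lower edge is an assumption so that age 0 falls into the first bin.

```python
import pandas as pd

ages = pd.Series([0, 3, 10, 17, 40, 70])  # illustrative ages, not the real data

# Bin edges are right-inclusive: (-1, 1], (1, 4], (4, 12], (12, 19], (19, 55], (55, inf)
bins = [-1, 1, 4, 12, 19, 55, float("inf")]
labels = ["infant", "toddler", "child", "teenager", "adult", "senior"]

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```

`pd.cut` returns an ordered categorical, so the groups keep their youngest-to-oldest order when plotted or sorted.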

# Exploratory Data Analysis

Let's take a closer look at our variables to see how much they actually affect survival.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# numeric_only skips text columns such as names and age_group
sns.heatmap(data=df.corr(numeric_only=True), annot=True)

In [None]:
df['Sex'].value_counts()

In [None]:
sns.barplot(x='Sex',y='Survived', data =df)

As we can see, the sex of the passenger is an important factor: the survival rate of males (1) is higher than that of females (0).
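By default, the bar heights in `sns.barplot` are the mean of `Survived` within each group, so the same numbers can be read directly with a `groupby`. A small sketch on made-up rows (the values below are invented for illustration and do not reflect the real passenger list):

```python
import pandas as pd

# Invented rows: Sex is already encoded (0 = F, 1 = M), Survived is 0/1.
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 1, 1, 1, 1, 1],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

Applied to the real `df`, the same call with `"Category"` in place of `"Sex"` would give the survival rate for crew and passengers.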

In [None]:
df['Category'].value_counts()

In [None]:
sns.barplot(x='Category',y='Survived', data =df)

Looking at the Category column, the crew, despite being fewer in number, had a higher probability of survival than regular passengers. So the passenger's category is also an important factor.

In [None]:
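# Age distribution of non-survivors vs. survivors (fill replaces the deprecated shade argument)
ax = sns.kdeplot(df.loc[df['Survived'] == 0, 'Age'], fill=True, label='Not Survived')
sns.kdeplot(df.loc[df['Survived'] == 1, 'Age'], fill=True, label='Survived', ax=ax)
ax.legend()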

Finally, Age is also an important variable to consider.

# Model Building

We'll be using a Random Forest model to predict survival. I have tried the same model in 2 different ways:
1. Using the Age column
2. Using the age_group column we created earlier.

Let's build the models and compare the results.

**1. Using the Age Column**

In [None]:

X = df.loc[:, ["Sex", "Age", "Category"]]
y = df["Survived"]  # 1-D target; avoids scikit-learn's column-vector warning

In [None]:
from sklearn.model_selection import train_test_split
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.2, random_state=32, stratify=y)

Hyperparameter Tuning using RandomizedSearchCV 

Reference : https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [None]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
max_features = ['sqrt', 'log2']

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

min_samples_split = [2, 5, 10]

min_samples_leaf = [1, 2, 4]

bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()


rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(train_X, train_y)

In [None]:
rf_random.best_params_
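Instead of copying the best parameters by hand, the fitted search object also exposes the refitted model as `best_estimator_`. The sketch below shows the idea on synthetic data from `make_classification`. The grid is deliberately tiny so it runs quickly, and none of the names or values here come from the Estonia dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the three features used above (assumption, not real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=3, n_informative=2,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

# Small illustrative grid; a real search would use a wider one.
small_grid = {"n_estimators": [20, 50],
              "max_depth": [3, 5, None],
              "min_samples_leaf": [1, 2, 4]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=small_grid,
                            n_iter=5, cv=3, random_state=0)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already retrained
# on the full training split using best_params_.
best_rf = search.best_estimator_
test_accuracy = best_rf.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

This keeps the final model in sync with the search result even when the search is rerun with a different seed or grid.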

In [None]:
# Random forest configured with the tuned hyperparameters
rf = RandomForestClassifier(n_estimators=800,
                            min_samples_split=10,
                            min_samples_leaf=4,
                            max_features="sqrt",
                            max_depth=50,
                            bootstrap=True)
                            
rf.fit(train_X, train_y)

preds2  = rf.predict(test_X)

**Validation**

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score
print("Accuracy: ",round((accuracy_score(test_y,preds2)*100),4),"%")

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
conf_mat = confusion_matrix(test_y, preds2)

In [None]:
sns.heatmap(conf_mat, annot=True, fmt='g') #plotting confusion matrix
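Accuracy alone can hide how well each class is predicted, so the confusion matrix is worth reading alongside per-class precision and recall. A minimal sketch with invented labels (not the notebook's actual predictions) showing how `classification_report`, which is imported above, could be used:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration: 0 = Not Survived, 1 = Survived.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall for the Survived class: share of actual survivors the model found.
recall_survived = tp / (tp + fn)

print(cm)
print(classification_report(y_true, y_pred,
                            target_names=["Not Survived", "Survived"]))
```

In this toy case the accuracy is 60%, yet only one of the four survivors is identified, which the recall column of the report makes visible.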

**2. Using Age Group**

Now let's perform the same process, this time using the age_group column.

In [None]:
df_copy = df.copy()

Before we use the age_group column, we must convert it into a numeric format so that it can be used in our model. We'll again use the LabelEncoder to do this.

In [None]:
df_copy['age_group'] = label_encoder.fit_transform(df['age_group']) 

In [None]:
X2 =df_copy.loc[:,["Sex","age_group","Category"]]
y2 =df_copy.loc[:,["Survived"]]

In [None]:
from sklearn.model_selection import train_test_split
train_X2,test_X2,train_y2,test_y2 = train_test_split(X2,y2, test_size=0.2, random_state=32, stratify=y2)

In [None]:
rf = RandomForestClassifier(random_state = 2)
rf.fit(train_X2, train_y2)

preds3 = rf.predict(test_X2)

**Validation**

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy: ",round((accuracy_score(test_y2,preds3)*100),4),"%")


In [None]:
conf_mat2 = confusion_matrix(test_y2, preds3)

In [None]:
sns.heatmap(conf_mat2, annot=True, fmt='g') 

Thank you for reading my notebook. If you have any suggestions or improvements do let me know.
Cheers!