# Predicting Estonia Disaster Survival using Machine Learning

## First, let's import the data and explore our problem definition

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# est_dis is short for "Estonia disaster"; rename it to df if you find that clearer
est_dis = pd.read_csv("../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv")
est_dis

### By exploring the data, we can see that this is a **binary classification** problem

**What is binary classification?**

Binary classification assigns each observation to one of two classes based on its features. Here the two classes are survived (1) and did not survive (0).
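
To make the idea concrete, here is a minimal sketch of one common way a binary classifier reaches a decision: it turns a score into a probability and compares it with a threshold. The `sigmoid` and `classify` helpers and the example scores are made up for illustration and are not part of this notebook's pipeline.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, threshold=0.5):
    """Return 1 if the probability reaches the threshold, otherwise 0."""
    return 1 if sigmoid(score) >= threshold else 0

# Hypothetical scores, for illustration only
scores = [-2.0, -0.1, 0.0, 1.5]
labels = [classify(s) for s in scores]
print(labels)
```

Raising the threshold makes the classifier more reluctant to predict class 1, which is why `classify(0.0, threshold=0.6)` returns 0 while the default threshold returns 1.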

# Second, some EDA
EDA stands for Exploratory Data Analysis.

In [None]:
#Checking top 5 rows of our data
est_dis.head()

In [None]:
# Count how many passengers survived (1) and did not survive (0)
est_dis.Survived.value_counts()

In [None]:
# Check each column for missing values
est_dis.isnull().sum()

**Our data has no missing values**

In [None]:
# Let's find the survival percentage
surv_percnt = est_dis.Survived.value_counts()[1] / len(est_dis) * 100
print(f'Percentage of survived passengers: {surv_percnt:.2f}%')
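
Because `Survived` is coded as 0 and 1, the survival rate is also just the mean of the column. The sketch below checks that on a tiny made-up frame (`toy` is a stand-in for `est_dis`, and its values are invented for illustration).

```python
import pandas as pd

# Toy stand-in for est_dis; these values are made up for illustration
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 0, 1, 0, 0]})

# The mean of a 0/1 column is the share of 1s
rate_from_mean = toy["Survived"].mean() * 100

# normalize=True returns shares instead of raw counts
rate_from_counts = toy["Survived"].value_counts(normalize=True)[1] * 100

print(f"{rate_from_mean:.2f}%")
```

Both expressions give the same number, so either can be used on the real data.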

**Now, let's find out the total number of passengers and crew members**


In [None]:
est_dis.Category.value_counts()

**P** stands for Passenger

**C** stands for Crew Members

In [None]:
# Now let's check the total number of males and females
est_dis.Sex.value_counts()

## Visualize our data

**Checking survival rate by sex**

In [None]:
pd.crosstab(est_dis.Sex, est_dis.Survived)

In [None]:
survivedBySex = est_dis.groupby('Sex')['Survived'].mean()
survivedBySex

In [None]:
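# Plot the survival rate by sex
%matplotlib inline
# Newer matplotlib releases renamed the seaborn styles with a 'seaborn-v0_8-' prefix
style = 'seaborn-whitegrid'
if style not in plt.style.available:
    style = 'seaborn-v0_8-whitegrid'
plt.style.use(style)
fig, ax = plt.subplots(figsize=(10,6))
survivedBySex.plot.bar(ax=ax)  # draw on the axes created above
ax.set(xlabel='Sex',
      ylabel='Survival rate',
      title='Survival rate by Sex');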

**Checking survival rate by age**

In [None]:
survivedByAge = est_dis.groupby('Age')['Survived'].mean()
survivedByAge

In [None]:
# The table above is hard to read, but plotting it may reveal a pattern
fig, ax = plt.subplots(figsize=(10,6))
survivedByAge.plot.bar(ax=ax)  # draw on the axes created above
ax.set(xlabel='Age',
      ylabel='Survival rate',
      title='Survival rate by Age');
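
With one bar per individual age, each bar can rest on only a handful of people. One option is to group ages into bins before averaging. The sketch below uses `pd.cut` on a made-up frame; the bin edges and the `AgeGroup` column name are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for est_dis; these values are made up for illustration
toy = pd.DataFrame({
    "Age": [5, 17, 22, 35, 41, 58, 63, 79],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 0],
})

# Hypothetical bin edges; intervals are right-closed, e.g. (0, 20]
bins = [0, 20, 40, 60, 100]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins)

# observed=True keeps only the bins that actually contain rows
survived_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(survived_by_group)
```

The same two lines (`pd.cut` followed by `groupby`) could be applied to `est_dis`, and the result plotted with `.plot.bar()` as in the cells above.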

**Let's look at survival rate by category**


In [None]:
survivedByCategory = est_dis.groupby('Category')['Survived'].mean()
survivedByCategory

In [None]:
# Let's plot the data above
fig, ax = plt.subplots(figsize=(10,6))
survivedByCategory.plot.bar(ax=ax)  # draw on the axes created above
ax.set(xlabel='Category',
      ylabel='Survival rate',
      title='Survival rate by Category');

**Plotting survival rate by country**

In [None]:
survivedByCountry = est_dis.groupby('Country')['Survived'].mean()
survivedByCountry

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
survivedByCountry.plot.bar(ax=ax)  # draw on the axes created above
ax.set(xlabel='Country',
      ylabel='Survival rate',
      title='Survival rate by Country');
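
A per-country mean says nothing about how many people it is based on, so a country with a single passenger can show a rate of 0 or 1. A minimal sketch of reporting the group size next to the rate follows; the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for est_dis; these values are made up for illustration
toy = pd.DataFrame({
    "Country": ["Sweden", "Sweden", "Sweden", "Estonia", "Estonia", "Latvia"],
    "Survived": [0, 1, 0, 0, 1, 1],
})

# Named aggregation: one column for the rate, one for the group size
summary = (toy.groupby("Country")["Survived"]
              .agg(rate="mean", n="count")
              .sort_values("n", ascending=False))
print(summary)
```

Here the single-row group shows a rate of 1.0, which illustrates why the `n` column is worth reading alongside the bar chart.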

# Third, fitting our data to a model
**We have done enough EDA, so let's move on to modelling**

Since this is a classification problem, we will first compare the scores of KNN (k-nearest neighbors), RandomForestClassifier, and LogisticRegression.

We will then proceed with whichever model scores best and tune its hyperparameters.

In [None]:
# Let's drop the PassengerId, Firstname, and Lastname columns
est_dis.drop(['PassengerId','Firstname','Lastname'], axis=1, inplace=True)

In [None]:
est_dis.head()

In [None]:
est_dis.info()

The `Country`, `Sex`, and `Category` columns hold strings rather than integers (`Survived` is already coded as 0/1), so we need to convert them to integers before moving forward

In [None]:
# Use LabelEncoder to convert every string (object) column to integer codes
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for item in est_dis.select_dtypes(include='object').columns:
    # le is refit on each column, so afterwards it only holds the last column's mapping
    est_dis[item] = le.fit_transform(est_dis[item])
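
To see what the encoder does to a single column, here is a small sketch on made-up country names. `le_demo` is a separate, hypothetical encoder used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

original = ["Sweden", "Estonia", "Latvia", "Estonia"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(original)

# classes_ holds the sorted unique labels; a label's code is its position there
print(list(le_demo.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

The codes follow alphabetical order, so they imply an ordering (Estonia < Latvia < Sweden) that has no real meaning. Models such as logistic regression and KNN treat that order as numeric, so one-hot encoding (for example with `pd.get_dummies`) is a common alternative for unordered columns like `Country`.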

In [None]:
est_dis.info()

##### Now that all columns are integers, let's split our data into training and test sets

In [None]:
#splitting data into X and y
X = est_dis.drop('Survived', axis=1)
y = est_dis['Survived']

In [None]:
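from sklearn.model_selection import train_test_split
# random_state fixes the shuffle so the split is reproducible (40 matches the seed used below)
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                test_size=0.2,
                                                random_state=40)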
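
When one class is much rarer than the other, a plain random split can leave the test set with a different class balance than the training set. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data; the arrays and the 80/20 class balance are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positives (made up for illustration)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify keeps the class proportions the same in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```

On the notebook's data this would correspond to passing `stratify=y` in the split above.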

In [None]:
#importing all the models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Put all the models in a dictionary
models = {"LogisticRegression":LogisticRegression(),
         "KNeighborsClassifier": KNeighborsClassifier(),
         "RandomForestClassifier":RandomForestClassifier()}

In [None]:
# Create a function that fits each model and evaluates its score
def fit_score(models,X_train,X_test,y_train,y_test):
    np.random.seed(40) # so our results are reproducible
    evaluate = {} # this empty dictionary will hold each model's score
    for name, model in models.items():
        model.fit(X_train,y_train) # fit the model on the training data
        evaluate[name]= model.score(X_test,y_test) # mean accuracy on the test data
    return evaluate

In [None]:
evaluate= fit_score(models=models,
                   X_train=X_train,
                   X_test=X_test,
                   y_train=y_train,
                   y_test=y_test)
evaluate
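
Scores from a single train/test split can shift by a few points with a different split, so small gaps between models are hard to interpret. One way to get a steadier estimate is k-fold cross-validation. The sketch below runs `cross_val_score` on synthetic data from `make_classification`; the data and its parameters are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (made up for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=40)

# Five accuracy scores, one per held-out fold
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

Comparing the mean and spread of the fold scores for each model gives a fairer ranking than one split alone.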

**Since `LogisticRegression` gives a slightly better result than the other models, we will tune its hyperparameters and try to improve the model**

You can also tune the hyperparameters of the other models for better results, but here I'm going with `LogisticRegression`

# Fourth, tuning the hyperparameters of `LogisticRegression` and evaluating the `accuracy` score

In [None]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for LogisticRegression: 20 values of C, log-spaced from 1e-4 to 1e4
param_grid = {"C": np.logspace(-4,4,20),
               "solver":["liblinear"]}
np.random.seed(55)
grid_log_reg = GridSearchCV(LogisticRegression(),
                           param_grid=param_grid,
                           cv=5,  # 5-fold cross-validation on the training set
                           verbose=True)
grid_log_reg.fit(X_train,y_train)

In [None]:
grid_log_reg.best_params_

In [None]:
grid_log_reg.score(X_test,y_test)
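
Besides the test score, a fitted `GridSearchCV` exposes the cross-validated score of every candidate, which helps judge whether the chosen `C` really stands out. The sketch below uses synthetic data and a smaller grid than the one above; both are assumptions made to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (made up for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=55)

# Smaller grid than in the notebook, just to keep the sketch fast
grid = {"C": np.logspace(-4, 4, 5), "solver": ["liblinear"]}
search = GridSearchCV(LogisticRegression(), param_grid=grid, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread across the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score"]]
print(search.best_params_, search.best_score_)
print(results.sort_values("mean_test_score", ascending=False))
```

If several values of `C` have nearly the same `mean_test_score`, the difference between them is probably within noise.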

**There is a slight improvement after tuning the hyperparameters**