# **Titanic - Machine Learning from Disaster**

> <center><img src="https://awsimages.detik.net.id/community/media/visual/2019/08/29/b82a0315-099c-4252-9af8-bd3831756de5.jpeg?w=700&q=80" width="1400px"></center>

<h1 style='color:white;background-color:black' > Table of Contents </h1>

* [Introduction](#introduction)
* [Data Acquisition](#data_acquisition)
* [Exploratory Data Analysis (EDA)](#eda)
    - [Distribution](#distribution)
    - [Correlation](#correlation)
    - [Check Missing Value](#missing_value)
    - [Outlier Detection](#outlier)
* [Data Cleaning and Preprocessing](#cleaning)
    - [Check Duplicate Data](#duplicate)
    - [Drop Unwanted Data](#drop)
    - [Completing a Numerical Continuous Feature](#continuous_feature)
    - [Creating New Features](#new_feature)
    - [Convert Categorical Features](#convert_feature)
* [Building the ANN](#ann)
    - [Add Layers](#add_layers)
    - [Train the Model](#train_model)
    - [Evaluation](#evaluation)
* [Submission](#submission)

<a id="introduction"></a>
## 1. Introduction

<div align='left'><font size="3" color="#000000"> The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
</font></div>

* **Goal:**
<div align='left'><font size="3" color="#000000"> Build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, and socio-economic class).
</font></div>

<a id="data_acquisition"></a>
## 2. Data Acquisition

### Import Library

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datatable as dt
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline

# Tensorflow
import tensorflow as tf
from tensorflow.keras import callbacks
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

# Scaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Split
from sklearn.model_selection import train_test_split

# Scoring
# plot_confusion_matrix is unused here and no longer exists in recent scikit-learn releases
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

# SMOTE
from imblearn.over_sampling import SMOTE

# Removes warning
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load Data

In [None]:
# Using datatable for faster loading

train_df = dt.fread(r'/kaggle/input/titanic/train.csv').to_pandas()
test_df = dt.fread(r'/kaggle/input/titanic/test.csv').to_pandas()

print("Data is loaded")

<a id="eda"></a>
## 3. Exploratory Data Analysis (EDA)

### Variable Notes
##### **Pclass:**
<div align='left'><font size="3" color="#000000">A proxy for socio-economic status (SES).
</font></div>

* <div align='left'><font size="3" color="#000000"> 1st = Upper
</font></div>
* <div align='left'><font size="3" color="#000000"> 2nd = Middle
</font></div>
* <div align='left'><font size="3" color="#000000"> 3rd = Lower
</font></div>


##### **SibSp:**
<div align='left'><font size="3" color="#000000"> The dataset defines sibling and spouse relations as follows:
</font></div>

* <div align='left'><font size="3" color="#000000"> Sibling = brother, sister, stepbrother, stepsister
</font></div>
* <div align='left'><font size="3" color="#000000"> Spouse = husband, wife (mistresses and fiancés were ignored)
</font></div>

##### **Parch:**
<div align='left'><font size="3" color="#000000"> The dataset defines parent and child relations as follows:
</font></div>

* <div align='left'><font size="3" color="#000000"> Parent = mother, father
</font></div>
* <div align='left'><font size="3" color="#000000"> Child = daughter, son, stepdaughter, stepson
</font></div>
* <div align='left'><font size="3" color="#000000"> Some children travelled only with a nanny, therefore parch=0 for them.
</font></div>

In [None]:
print("Train:", train_df.shape[0], "passengers and", train_df.shape[1], "features")
print("Test:", test_df.shape[0], "passengers and", test_df.shape[1], "features")

In [None]:
train_df.head()

In [None]:
train_df.describe()

In [None]:
train_df

In [None]:
train_df.info()

In [None]:
test_df.info()

<a id="distribution"></a>
### 3.1 Distribution

In [None]:
sns.countplot(x='Survived',data=train_df)

In [None]:
sns.countplot(x='Sex',data=train_df)

In [None]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age')

In [None]:
g = sns.FacetGrid(data=train_df,col='Sex')
g.map(plt.hist,'Age')

In [None]:
plt.figure(1, figsize=(15, 8))
for i, x in enumerate(['Pclass', 'Age','SibSp','Parch','Fare','Embarked']):
    plt.subplot(2, 3, i+1)
    plt.tight_layout()
    sns.histplot(train_df[x])
    plt.title('{}'.format(x))
plt.show()

In [None]:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
# FacetGrid's `size` argument was renamed to `height`
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

In [None]:
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
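# FacetGrid's `size` argument was renamed to `height`
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
# A fixed `order` keeps the bars in the same position in every facet
grid.map(sns.barplot, 'Sex', 'Fare', order=['male', 'female'], alpha=.5, ci=None)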
grid.add_legend()

<a id="correlation"></a>
### 3.2 Correlation

In [None]:
# Restrict to numeric columns; recent pandas raises on text columns such as Name
corr = train_df.corr(numeric_only=True)
plt.subplots(figsize=(8,7))
sns.heatmap(corr, vmax=0.9, cmap='coolwarm', square=True)

In [None]:
train_df.corr(numeric_only=True)['Survived'].sort_values()

<a id="missing_value"></a>
### 3.3 Check Missing Value

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

<a id="outlier"></a>
### 3.4 Outlier Detection

#### Fare (There are Outliers)

In [None]:
sns.boxplot(x=train_df["Fare"])
plt.show()

In [None]:
Q1 = train_df["Fare"].quantile(0.25)
Q3 = train_df["Fare"].quantile(0.75)
IQR = Q3-Q1
lower_range = Q1 -(1.5 * IQR)
upper_range = Q3 +(1.5 * IQR)

print("Score for lower range:", lower_range)
print("Score for upper range:", upper_range)

In [None]:
train_df.loc[(train_df["Fare"]>upper_range),:]

#### Age (There are Outliers)

In [None]:
sns.boxplot(x=train_df["Age"])
plt.show()

In [None]:
Q1 = train_df["Age"].quantile(0.25)
Q3 = train_df["Age"].quantile(0.75)
IQR = Q3-Q1
lower_range = Q1 -(1.5 * IQR)
upper_range = Q3 +(1.5 * IQR)

print("Score for lower range:", lower_range)
print("Score for upper range:", upper_range)

In [None]:
train_df.loc[(train_df["Age"]>upper_range),:]
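
The Fare and Age cells above repeat the same interquartile-range (IQR) rule. As a minimal sketch, the rule can be wrapped in one helper; `iqr_bounds` and the toy fares below are illustrative names and values, not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy fares for illustration only (not taken from the dataset)
toy_fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])
lower, upper = iqr_bounds(toy_fares)

# Keep only the values that fall outside the fences
outliers = toy_fares[(toy_fares < lower) | (toy_fares > upper)]
print(lower, upper)
print(outliers.tolist())
```

In the notebook, the same helper could be called as `iqr_bounds(train_df["Fare"])` and `iqr_bounds(train_df["Age"])`.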

<a id="cleaning"></a>
## 4. Data Cleaning and Preprocessing

<a id="duplicate"></a>
### 4.1 Check Duplicate Data

In [None]:
train_df.shape

In [None]:
train_df = train_df.drop_duplicates()

In [None]:
train_df.shape
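
Comparing the shape before and after `drop_duplicates()` shows whether any rows were removed. A small sketch on a toy frame (the column values are made up) shows how `duplicated().sum()` reports the count directly:

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only)
toy = pd.DataFrame({"Name": ["A", "B", "A"], "Age": [22, 38, 22]})

# duplicated() flags every repeat of an earlier row
n_dupes = int(toy.duplicated().sum())

deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, deduped.shape)
```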

<a id="drop"></a>
### 4.2 Drop Unwanted Data

In [None]:
print('Before deletion for train data: ' + str(train_df.shape))
print('Before deletion for test data: ' + str(test_df.shape))

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print('\nAfter deletion for train data: ' + str(train_df.shape))
print('After deletion for test data: ' + str(test_df.shape))


<a id="continuous_feature"></a>
### 4.3 Completing a Numerical Continuous Feature

#### 4.3.1 Create AgeBands and Replace Value of Age Feature

In [None]:
train_df['AgeBands'] = pd.qcut(train_df['Age'], 5)
train_df['AgeBands'].unique()

In [None]:
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 19, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 19) & (dataset['Age'] <= 25), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 25) & (dataset['Age'] <= 31.8), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 31.8) & (dataset['Age'] <= 41), 'Age'] = 3
    dataset.loc[dataset['Age'] > 41, 'Age'] = 4
    
train_df.head()

In [None]:
train_df['Age'].value_counts()
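
The chain of `.loc` assignments above can also be written as a single `pd.cut` call. This is a sketch on toy ages, assuming the same right-closed bin edges (19, 25, 31.8, 41); it is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Same thresholds as the manual binning; intervals are right-closed, e.g. (19, 25]
age_edges = [-np.inf, 19, 25, 31.8, 41, np.inf]

# Toy ages for illustration, including one missing value
toy_ages = pd.Series([4.0, 19.0, 22.0, 25.0, 31.8, 35.0, 41.0, 60.0, np.nan])

# labels=False returns the integer bin index; missing ages stay missing
age_codes = pd.cut(toy_ages, bins=age_edges, labels=False)
print(age_codes.tolist())
```

Missing ages remain `NaN` after binning, so the mode imputation in the following cells would still apply.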

In [None]:
# Check missing value in Age feature from train data
train_df['Age'].isnull().sum()

In [None]:
# check the most frequent value in Age feature
freq_age_train = train_df['Age'].dropna().mode()[0]
freq_age_test = test_df['Age'].dropna().mode()[0]

In [None]:
train_df['Age'] = train_df['Age'].fillna(freq_age_train)
test_df['Age'] = test_df['Age'].fillna(freq_age_test)

In [None]:
train_df = train_df.drop(['AgeBands'], axis=1)
combine = [train_df, test_df]
train_df.head()

#### 4.3.2 Create FareBands and Replace Value of Fare Feature

In [None]:
# Impute missing value in test data for column 'Fare' with its median
# Assign the result back; inplace=True on a column selection is unreliable in recent pandas
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
test_df.head()

In [None]:
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
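
The fare thresholds used in the next cell look like quartile edges of the kind `pd.qcut` computes. As a hedged sketch on toy fares (not the real data), `retbins=True` returns those edges so they do not have to be copied by hand:

```python
import pandas as pd

# Toy fares for illustration only
toy_fares = pd.Series([7.25, 7.9, 8.05, 13.0, 14.5, 26.0, 31.0, 71.3])

# retbins=True also returns the array of quantile edges (5 edges for 4 bands)
bands, edges = pd.qcut(toy_fares, 4, retbins=True)
band_counts = bands.value_counts().sort_index().tolist()

print(edges)
print(band_counts)
```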

In [None]:
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
train_df.head(10)

<a id="new_feature"></a>
### 4.4 Creating New Features

#### 4.4.1 Creating 'Title' Feature

In [None]:
for dataset in combine:
    # Raw string so that \. is passed to the regex engine unchanged
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
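
The pattern used above captures the first run of letters that follows a space and ends with a period. A minimal sketch with the standard `re` module shows what it picks out; the helper name `extract_title` and the third sample string are illustrative assumptions.

```python
import re

# Same pattern as in the notebook: a space, letters, then a literal period
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first title-like token in a name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "No title here",  # made-up string with no match
]
titles = [extract_title(name) for name in sample_names]
print(titles)
```

Names without a match yield a missing value, which is why the later mapping step fills unmapped titles with 0.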

In [None]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                                 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()

In [None]:
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

#### 4.4.2 Creating 'IsAlone' Feature

In [None]:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
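
The two-step assignment above (initialize to 0, then overwrite where `FamilySize == 1`) is equivalent to a single boolean expression. A sketch on toy values:

```python
import pandas as pd

# Toy sibling/spouse and parent/child counts (illustrative only)
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# The passenger counts as one member of their own family
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# True/False cast to 1/0 in one step
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy["IsAlone"].tolist())
```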

In [None]:
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

train_df.head()

#### 4.4.3 Create 'Age*Class' Feature

In [None]:
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

<a id="convert_feature"></a>
### 4.5 Convert Categorical Features

#### 4.5.1 Encode Sex Feature

In [None]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()

#### 4.5.2 Encode Embarked Feature

In [None]:
train_df['Embarked'].value_counts()

In [None]:
# Fill missing value with the most frequent value
freq_embarked = train_df['Embarked'].dropna().mode()[0]
freq_embarked

In [None]:
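# Assign the result back instead of using inplace=True on a column selection;
# fillna also covers gaps that were loaded as NaN rather than as empty strings
train_df['Embarked'] = train_df['Embarked'].replace('', freq_embarked).fillna(freq_embarked)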

In [None]:
train_df['Embarked'].value_counts()

In [None]:
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
train_df.head()
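
`Series.map` with a dictionary returns `NaN` for any value that is not a key, and `astype(int)` then fails on those entries. The sketch below, on toy data, counts unmapped values before casting; falling back to code 0 (port S) is an illustrative assumption.

```python
import pandas as pd

embarked_codes = {"S": 0, "C": 1, "Q": 2}

# Toy ports including an empty string and a missing value
toy_ports = pd.Series(["S", "C", "", "Q", None])

# Values without a dictionary key become NaN
mapped = toy_ports.map(embarked_codes)
n_unmapped = int(mapped.isna().sum())

# Fill the gaps before casting, otherwise astype(int) raises on NaN
filled = mapped.fillna(embarked_codes["S"]).astype(int)
print(n_unmapped, filled.tolist())
```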

#### 4.5.3 Encode Survived Feature

In [None]:
train_df["Survived"] = train_df["Survived"].astype(int)
train_df.head()

<a id="ann"></a>
## 5. Building the ANN

In [None]:
X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#y_train.value_counts()

In [None]:
# Handle imbalance class using oversampling minority class with SMOTE method
#os = SMOTE(sampling_strategy='minority',random_state = 1,k_neighbors=5)
#train_smote_X,train_smote_Y = os.fit_resample(X_train,y_train)
#X_train = pd.DataFrame(data = train_smote_X, columns = X_train.columns)
#y_train = pd.DataFrame(data = train_smote_Y)

In [None]:
#y_train.value_counts()

In [None]:
#scaling the data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
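
The scaler is fitted on the training split only and then reused on the held-out split, so no statistics leak from the validation data. A NumPy sketch of that idea on random toy data (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=30.0, scale=10.0, size=(100, 2))
toy_valid = rng.normal(loc=30.0, scale=10.0, size=(20, 2))

# Statistics come from the training split only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same mean and std are applied to both splits
train_scaled = (toy_train - mean) / std
valid_scaled = (toy_valid - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The validation columns will generally not have exactly zero mean or unit variance, which is expected.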

<a id="add_layers"></a>
### 5.1 Add Layers

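* The artificial neural network built below consists of an input layer, two hidden layers of 100 units each (swish activation, each followed by 50% dropout), and a single-unit sigmoid output layer.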

> <center><img src="https://elogeel.files.wordpress.com/2010/05/050510_1627_multilayerp1.png" width="500px"></center>

In [None]:
model = Sequential()

In [None]:
# First hidden layer
model.add(Dense(100, activation='swish'))
model.add(Dropout(0.5))

In [None]:
# Second hidden layer
model.add(Dense(100, activation='swish'))
model.add(Dropout(0.5))

In [None]:
# Output layer
model.add(Dense(1, activation='sigmoid'))
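
As a rough size check, the number of trainable parameters can be worked out by hand. The sketch assumes 8 input features, which is the column count left after the preprocessing steps above (Pclass, Sex, Age, Fare, Embarked, Title, IsAlone, Age*Class); dropout layers add no parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 8  # assumption: feature count after preprocessing
layer_units = [100, 100, 1]  # two hidden layers and the output layer

counts = []
fan_in = n_features
for units in layer_units:
    counts.append(dense_params(fan_in, units))
    fan_in = units  # this layer's output feeds the next layer

total_params = sum(counts)
print(counts, total_params)
```

Once the model has been built on data, `model.summary()` should report per-layer counts that can be compared against these figures.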

<a id="train_model"></a>
### 5.2 Train the Model

In [None]:
# Use lowercase 'accuracy': Keras then selects binary accuracy for the sigmoid output.
# The string 'Accuracy' can resolve to the exact-match Accuracy metric,
# which is not meaningful for predicted probabilities.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
earlystopping = callbacks.EarlyStopping(monitor='val_loss',
                                        mode='min',
                                        verbose=1,
                                        patience=70)

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    batch_size=32, epochs=200, callbacks=[earlystopping])

In [None]:
# Summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
# Summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
print('Max val_acc achieved: %.2f%%' % (max(history.history['val_accuracy']) * 100))
print('Max acc achieved: %.2f%%' % (max(history.history['accuracy']) * 100))

In [None]:
print('Final val_acc achieved: %.2f%%' % (history.history['val_accuracy'][-1] * 100))
print('Final acc achieved: %.2f%%' % (history.history['accuracy'][-1] * 100))

In [None]:
val_accuracy = np.mean(history.history['val_accuracy'])
print("\n%s: %.2f%%" % ('Mean of validation accuracy', val_accuracy * 100))
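
The `EarlyStopping` callback used in this section halts training once the monitored validation loss has failed to improve for `patience` consecutive epochs. The function below is a simplified pure-Python sketch of that patience rule, not Keras's actual implementation; `early_stop_epoch` and the toy losses are illustrative.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # new best: reset the counter
            wait = 0
        else:
            wait += 1  # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

toy_losses = [1.0, 0.9, 0.95, 0.96, 0.97]
stopped_at = early_stop_epoch(toy_losses, patience=2)
never_stopped = early_stop_epoch(toy_losses, patience=10)
print(stopped_at, never_stopped)
```

With `patience=70` and at most 200 epochs, the callback only triggers if the validation loss stalls for a long stretch.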

<a id="evaluation"></a>
### 5.3 Evaluation

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred = (y_pred > 0.5)

In [None]:
# scikit-learn metrics take (y_true, y_pred) in that order
ann_acc = round(accuracy_score(y_test, y_pred) * 100, 2)
print('Model Accuracy:',ann_acc,'%')

In [None]:
ann_cm = confusion_matrix(y_test, y_pred)
cmap1 = sns.diverging_palette(275, 150, s=40, l=65, n=6)
plt.subplots(figsize=(10, 6))
# Divide by the total so each cell shows its share of all validation samples
sns.heatmap(ann_cm / np.sum(ann_cm), cmap=cmap1, annot=True, annot_kws={'size': 15})

In [None]:
# True labels first; swapping the arguments would exchange precision and recall
print(classification_report(y_test, y_pred))
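
scikit-learn's metric functions expect the true labels first and the predictions second. Accuracy is unaffected by the order, but precision and recall trade places when the arguments are swapped. A pure-Python sketch on toy labels (the helper `precision_recall` is illustrative):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: three true positives in the data, only one predicted
toy_true = [1, 1, 1, 0]
toy_pred = [1, 0, 0, 0]

correct_order = precision_recall(toy_true, toy_pred)
swapped_order = precision_recall(toy_pred, toy_true)
print(correct_order, swapped_order)
```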

<a id="submission"></a>
## 6. Submission

In [None]:
test_model = test_df.drop(['PassengerId'], axis=1)

In [None]:
test_model = scaler.transform(test_model)

In [None]:
pred = model.predict(test_model)

In [None]:
pred = (pred > 0.5)

In [None]:
# Flatten the (n, 1) prediction array to one label per passenger
pred = pred.astype(int).ravel()

In [None]:
sub = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")

In [None]:
sub['Survived'] = pred

In [None]:
sub.head(20)

In [None]:
sub.to_csv("submission.csv",index=False)

<a id="reference"></a>
## 7. Reference

#### **Source and special thanks to:**
* [https://www.kaggle.com/mostafaalaa123/simple-solution-for-titanic/notebook#ML-Models](https://www.kaggle.com/mostafaalaa123/simple-solution-for-titanic/notebook#ML-Models)
* [https://www.kaggle.com/startupsci/titanic-data-science-solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions)