# Titanic: Machine Learning from Disaster

## Table of Contents

1. [Import Libraries](#import-libraries)
1. [Get the Data](#get-the-data)
    1. [Take a quick look at data structure](#quick-look)
    1. [Create a test set](#create-test)
1. [Discover and Visualize the Data to Gain Insights (EDA)](#eda)
    1. [Discover and Visualize](#eda-main)
    1. [Looking for Correlations](#cor)
    1. [EDA Results](#eda-res)
1. [Feature Engineering](#feature)
1. [Machine Learning](#ml)
1. [Fine-Tune the Model](#evaluate)

<a name='import-libraries'></a>
# Import libraries

This notebook follows the end-to-end machine learning workflow from the book [Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) (Chapter 2 or Appendix B).

The first step of that workflow (framing the problem) is skipped because this is not a business problem.

Now we can import all the libraries needed for this problem. Doing this as the first step helps you plan the steps that follow.

In [None]:
# Essential libraries
import pandas as pd
import numpy as np
import time

# Data Viz libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Consistent plots
from pylab import rcParams

rcParams['figure.figsize'] = 20,5
rcParams['xtick.labelsize'] = 9
rcParams['ytick.labelsize'] = 9
rcParams['axes.labelsize'] = 10

# Feature engineering libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.compose import ColumnTransformer

# Classifiers
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Evaluation libraries
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [None]:
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")
df = pd.concat([train,test])

<a name='get-the-data'></a>
# Get the Data

<a name='quick-look'></a>
## Take a Quick Look at the Data Structure

In [None]:
df.head()

In [None]:
df.info()

We see that there are null values in the 'Age', 'Cabin', and 'Embarked' columns; we will take care of them later. Also, the training set has only 891 instances, which makes this a fairly small dataset. Now we can investigate the categorical attributes to make sure the data is sound.

In [None]:
df.Survived.value_counts()

* The target values seem fairly evenly distributed, so there is no need for dummy values.
* Now let's look at the summary of the numerical data.

In [None]:
df.describe()

* Another quick way to get a feel for the type of data you are dealing with is to plot a histogram of each numerical attribute.
* We may need to transform the 'Fare' attribute or cut it into bins.
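As a rough illustration of why a transform might help, the sketch below compares the skewness of a heavily right-skewed column before and after a `log1p` transform. The fare values are hypothetical sample numbers chosen for illustration, not read from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical fares for illustration only (one very large value, many small ones)
fares = pd.Series(
    [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 512.33], name="Fare"
)

# Sample skewness of the raw values and of the log-transformed values
raw_skew = fares.skew()
log_skew = np.log1p(fares).skew()

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A smaller skew after the transform suggests the column is closer to symmetric; binning, as done later in this notebook, is another way to tame the long tail.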

<a name='create-test'></a>
## Create a test set

We need to split the labeled `train` data into a training set and a test set to prevent data snooping.

In [None]:
eda, df_test = train_test_split(train, test_size=0.25, random_state=42)

eda.head()
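One optional variation, not used in the cell above, is a stratified split, which keeps the share of survivors the same in both parts. The sketch below uses a small made-up frame (`toy`) rather than the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame for illustration: 40 rows, 16 of them with Survived == 1
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 3] * 8,
    "Survived": [1, 1, 0, 0, 0] * 8,
})

# stratify= keeps the class proportions of Survived equal in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["Survived"]
)

print(toy_train["Survived"].mean(), toy_test["Survived"].mean())
```

With a dataset as small as this one, stratifying can make the held-out score a little less dependent on the luck of the split.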

<a name='eda'></a>
# Discover and Visualize the Data to Gain Insights (EDA)

So far we have only taken a quick glance at the data to get a general understanding of
the kind of data we are manipulating. Now the goal is to go into a little more depth.

<a name='eda-main'></a>
## Discover and Visualize the Data

In [None]:
eda.columns

Now we are going to analyze each attribute for its:
* Name
* Type
* % of missing values
* Noisiness and type of noise
* Usefulness for the task
* Type of distribution

We can easily say `PassengerId`, `Name` and `Ticket` attributes are not useful for the task. Thus, we can skip analyzing these columns.
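The first few items of that checklist (name, type, share of missing values) can be tabulated in one pass. This is a minimal sketch on a made-up frame; the column names mirror the Titanic data but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# One row per attribute: its dtype, percentage of missing values, distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_pct": toy.isnull().mean() * 100,
    "n_unique": toy.nunique(),
})

print(summary)
```

Running the same three expressions on `eda` would give the equivalent table for the real training split.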

In [None]:
# Easy to use function to plot each categorical data
def catplot(col):
    f, axes = plt.subplots(1, 3, sharex=True)
    sns.stripplot(
        data = eda,
        x = col,
        y = 'Age',
        hue = 'Survived',
        jitter = True,
        ax  = axes[0]
    )
    sns.countplot(
        data = eda,
        x = col,
        hue = 'Survived',
        ax  = axes[1]
    )
    sns.violinplot(
        data=eda,
        x=col,
        y='Age',
        hue='Survived',
        ax=axes[2]
    )
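    print('Name:', col)
    # Report the column's dtype (type(col) would always be str)
    print('Type:', eda[col].dtype)
    # Report a percentage rather than a raw count
    print('% of missing values:', round(df[col].isnull().mean() * 100, 2))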

* `Pclass` attr,

In [None]:
catplot('Pclass')

* `Sex` attr,

In [None]:
catplot('Sex')

* `SibSp` attr,

In [None]:
catplot('SibSp')

* `Parch` attr,

In [None]:
catplot('Parch')

* `Embarked` attr,

In [None]:
catplot('Embarked')

It can be seen that third class has the highest death rate, whereas first class has the lowest.

<a name='cor'></a>
## Looking for Correlations

In [None]:
attributes = ["Age", "Fare",'Survived']

sns.pairplot(eda[attributes],hue='Survived')

No clear correlation is visible between these attributes.
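A visual impression like this can be cross-checked numerically with a correlation matrix. The sketch below uses invented values, so its numbers say nothing about the real data; it only shows the call.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()

print(corr.round(2))
```

On the notebook's data the equivalent call would be `eda[attributes].corr()`; values close to zero would support the reading of the pair plot.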

<a name='eda-res'></a>
## EDA Results
* `PassengerId`, `Name` and `Ticket` are not useful for exploring the data. They should be removed.
* There is a strong relationship between passenger class (`Pclass`) and whether the passenger survived.
* As with `Pclass`, the `Sex` attribute is correlated with a passenger's chance of survival.
* `SibSp` and `Parch` can be combined into the number of family members aboard.
* There are 2 missing values for `Embarked` and many more for `Age`; they can be filled in during [Feature Engineering](#feature).
* There are also a lot of missing values in the `Cabin` attribute; however, this attribute does not help predict whether a passenger survived. Thus, it can be dropped from `df`.
* `Fare` is not well distributed as a continuous variable; cutting it into bins would help.
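The class and sex relationships listed above can be quantified as survival rates per group. The sketch below runs on a handful of invented rows, so the printed rates are illustrative only.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "male", "female", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 target within each group is that group's survival rate
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

Applying the same two `groupby` calls to `eda` gives the real rates behind the count plots.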


<a name ='feature'></a>
# Feature Engineering
It's time to prepare the data for your Machine Learning algorithms. Instead of doing this manually, you should write functions for this purpose, for several good reasons.
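One way to package preparation steps as reusable objects is scikit-learn's `Pipeline` and `ColumnTransformer`, both imported at the top of this notebook. The sketch below is an assumption about how that could look, not the approach the following cells take: the column lists and the `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups, chosen for illustration
num_cols = ["Age", "Fare"]
cat_cols = ["Sex", "Embarked"]

# Numeric columns: fill gaps with the median, then standardize
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

# Made-up rows with a few missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
```

The benefit of this design is that the exact same fitted transformations can later be applied to new data with `preprocess.transform(...)`.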


* Simple imputation for `Embarked` and `Fare`

In [None]:
df_fill = df.copy()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_fill['Fare'] = df_fill['Fare'].fillna(df_fill['Fare'].median())
df_fill['Embarked'] = df_fill['Embarked'].fillna(df_fill['Embarked'].mode().iloc[0])
df_fill.head()

* Set index as `PassengerId`

In [None]:
# Set index as passengerId
df_index = df_fill.set_index('PassengerId')
df_index.head()

* Impute `Age` according to title of each passenger

In [None]:
# Extract the title (e.g. Mr, Mrs, Miss) from Name
df_split_name = df_index.copy()
df_split_name.insert(1,'Title',df_split_name['Name'].str.extract(r'([A-Za-z]+)\.', expand=True)[0])

# Replacing rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
df_split_name.replace({'Title': mapping}, inplace=True)

# Fill missing ages with the median age of passengers sharing the same title
median_ages = df_split_name.groupby('Title').Age.median()
for title, median_age in median_ages.items():
    df_split_name.loc[ (df_split_name.Age.isnull()) & (df_split_name.Title==title),'Age'] = median_age
df_split_name
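The same idea (fill each missing age with the median age of passengers sharing a title) can also be written without a loop, using `groupby(...).transform`. This is a sketch on a few rows; the names follow the dataset's format, but the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Names in the dataset's "Surname, Title. Given names" format; ages are illustrative
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Moran, Mr. James",
        "Sage, Miss. Stella Anna",
    ],
    "Age": [22.0, 26.0, 35.0, np.nan, np.nan],
})

# The title is the run of letters immediately before the first period
toy["Title"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# transform("median") returns, for every row, the median age of that row's title group
toy["Age"] = toy["Age"].fillna(toy.groupby("Title")["Age"].transform("median"))

print(toy[["Title", "Age"]])
```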

* Cut `Fare` and `Age` into bins

In [None]:
df_cut = df_split_name.copy()

# Cut Fare into 5 quantile-based bins
df_cut['FareBin'] = pd.qcut(df_cut.Fare, 5)

# Encode each bin as an integer code
label = LabelEncoder()
df_cut['FareBin_Code'] = label.fit_transform(df_cut['FareBin'])

# Drop the intermediate interval column
df_cut.drop(['FareBin'], axis=1, inplace=True)

In [None]:
# Cut Age into 5 quantile-based bins
df_cut['AgeBin'] = pd.qcut(df_cut.Age, 5)

# Encode each bin as an integer code
label = LabelEncoder()
df_cut['AgeBin_code'] = label.fit_transform(df_cut['AgeBin'])

# Drop the raw and intermediate columns
df_cut.drop(['Age','AgeBin'], axis=1, inplace=True)

df_cut.head()
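For reference, `pd.qcut` can return integer bin codes directly through `labels=False`, which would make the separate encoding step unnecessary. The sketch below is an alternative, not what the cells above do, and the fare values are hypothetical.

```python
import pandas as pd

# Ten hypothetical, distinct fares in ascending order
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 21.08, 26.55, 51.86, 53.10, 71.28])

# labels=False returns the bin index (0 = lowest fifth, 4 = highest fifth)
codes = pd.qcut(fares, 5, labels=False)

print(codes.tolist())
```

Because the bins are quantile-based, each of the five codes covers roughly the same number of passengers.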

* `SibSp` and `Parch` attributes can be combined as number of family member aboard.

In [None]:
# Combine two attributes
df_comb = df_cut.copy()
df_comb['Family_members_aboard'] = df_comb['SibSp'] + df_comb['Parch']
df_comb.drop(['SibSp','Parch'], axis=1, inplace=True)
df_comb.head()

* Try to extract family information

In [None]:
df_extr_family = df_comb.copy()
df_extr_family.insert(2,'Surname',df_extr_family['Name'].str.extract(r'([A-Za-z]+),', expand=True)[0])

DEFAULT_SURVIVAL_VALUE = 0.5
df_extr_family['Family_Survival'] = DEFAULT_SURVIVAL_VALUE
df_extr_family.reset_index(inplace=True)
for surname, sur_group in df_extr_family[df_extr_family['Family_members_aboard'] > 0].groupby(['Surname','Fare']):
    for ind, row in sur_group.iterrows():
        smax = sur_group.drop(ind).Survived.max()
        smin = sur_group.drop(ind).Survived.min()
        passID = row['PassengerId']
        if smax == 1:
            df_extr_family.loc[df_extr_family['PassengerId'] == passID, 'Family_Survival'] = 1
        elif smin == 0:
            df_extr_family.loc[df_extr_family['PassengerId'] == passID, 'Family_Survival'] = 0
df_extr_family[df_extr_family.Family_Survival != 0.5]

* Encode `Embarked` and `Sex` columns

In [None]:
df_enc = df_extr_family.copy()

# Encode Embarked
label = LabelEncoder()
df_enc['Embarked_code'] = label.fit_transform(df_enc['Embarked'])

# Encode Sex
label = LabelEncoder()
df_enc['Sex_code'] = label.fit_transform(df_enc['Sex'])

df_enc.drop(['Sex', 'Embarked'], axis=1, inplace=True)
df_enc.head()

* Drop the unnecessary columns: `Title`, `Surname`, `Name`, `Ticket`, `Cabin` and `Fare`

In [None]:
def drop_cols(cols):
    return df_enc.drop(cols, axis=1)

attr_to_drop = ['Title', 'Surname', 'Name', 'Ticket', 'Cabin', 'Fare']
df_prepared = drop_cols(attr_to_drop)
df_prepared.set_index('PassengerId',inplace=True)
train_ready = df_prepared[:891]
submission = df_prepared[891:]
train_ready

<a name ='ml'></a>
# Machine Learning 

In [None]:
X = train_ready.drop('Survived', axis=1)
y = train_ready['Survived']

X_submission = submission.drop('Survived', axis=1)

* Scale the data and submission attributes with `StandardScaler`

In [None]:
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
X_submission = std_scaler.transform(X_submission)
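A possible refinement, not part of this notebook's flow, is to put the scaler inside a pipeline so that each cross-validation fold fits it only on its own training part. The sketch below uses synthetic data (`X_toy`, `y_toy`) and `KNeighborsClassifier` purely as an example estimator.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 100 rows, 4 features, a simple binary target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# The scaler is re-fitted on the training part of every fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="roc_auc")

print(scores.mean())
```

With standardization the difference is usually small, but the pattern keeps any statistics of the validation fold out of the fitted transformer.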

Candidate classifiers to compare with cross-validation:

* `SGDClassifier`, `LogisticRegression`
* `LinearDiscriminantAnalysis`
* `KNeighborsClassifier`
* `GaussianNB`
* `DecisionTreeClassifier`, `RandomForestClassifier`
* `SVC`

In [None]:

def compare_clf(classifiers):
    rows = []
    for clf in classifiers:
        start = time.time()
        score_arr = cross_val_score(clf,X,y,cv=5,scoring='roc_auc')
        end = time.time()
        for i, score in enumerate(score_arr):
            score_dict = {
                'fold':i+1,
                'Classifier':clf.__class__.__name__,
                'Score':score,
                'Time (sec)':end-start
            }
            rows.append(score_dict)
    return pd.DataFrame(rows)
            
classifiers = [
    SGDClassifier(),
    LogisticRegression(),
    LinearDiscriminantAnalysis(),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(),
    KNeighborsClassifier()
]

In [None]:
# Drop the fold counter first, then aggregate score and timing per classifier
compare_clf(classifiers).drop(columns='fold').groupby('Classifier').agg(['mean','median','std']).sort_values(('Score','mean'),ascending=False)

* Linear Discriminant Analysis Selected

<a name ='evaluate'></a>
# Fine-Tune the Model

In [None]:
n_neighbors = [6,7,8,9,10,11,12,14,16,18,20,22]
algorithm = ['auto']
weights = ['uniform', 'distance']
leaf_size = list(range(1,50,5))
hyperparams = {'algorithm': algorithm, 'weights': weights, 'leaf_size': leaf_size, 
               'n_neighbors': n_neighbors}

gd=GridSearchCV(estimator = KNeighborsClassifier(), param_grid = hyperparams, verbose=True, 
                cv=5, scoring = "roc_auc")
gd.fit(X, y)
print(gd.best_score_)
print(gd.best_estimator_)

In [None]:
gd.best_estimator_.fit(X, y)
y_pred = gd.best_estimator_.predict(X_submission)

In [None]:
submit=pd.DataFrame(data=y_pred, index=submission.index, columns=['Survived'], dtype='int')
submit.to_csv('submission.csv')