# Titanic Survival Analysis and Prediction (Top 10%, Beginner Friendly)

This notebook analyzes the Titanic disaster and builds models to predict passenger survival.  
* [**Author**](https://www.linkedin.com/in/chi-wang-22a337207/)
* [**Dataset**](https://www.kaggle.com/c/titanic/data)

## Key Findings
* The label (`Survived`) is imbalanced: survivors make up a little over one third (about 38%) of the training passengers.
* Passengers with Cabin information are more likely to survive.
* Females are more likely to survive than males.
* Passengers with one sibling or spouse aboard seem to have the highest survival rate; beyond that, the more siblings/spouses, the lower the survival rate.
* The higher the ticket class, the more likely the passenger is to survive.
* The more expensive the ticket, the more likely the passenger is to survive (consistent with the ticket class).

## Tips
* Divide the project into stages and save an intermediate version of the data at each stage. This helps when re-examining earlier steps.
* Columns with too many missing values (> 40%) are usually useless as they stand. However, the fact that a value is missing can itself be informative, e.g. Cabin.
* Use a highly correlated feature as the group key for group imputation.
* In a classification setting it often helps to bin numerical features. The binning strategy (number of bins, range, step) matters and is best chosen with the help of visualization.
* Take advantage of all the given data (train and test), e.g. for imputation and modelling.
* Feature engineering is very important, more so than hyper-parameter tuning.
* Overfitting is common, especially on relatively small datasets. Complex models do not always outperform simple ones.
* Do not trust 100% accuracy scores on the leaderboard; they are obtained by cheating.
* Do not struggle too much over leaderboard performance. As long as you do everything right, the result should be acceptable (top 10% to 20%).
* Don't spend too much time on small improvements (balance!). The important part is the process and method of analysis and modelling.
* Learn from the best: top-voted and most-commented kernels.

## Issues
* Feature scaling was not applied for the models that need it (e.g. KNN, logistic regression). Tree-based models do not need it.
* No hyper-parameter tuning was applied.
* The visualizations could be more varied.
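
The first two issues could be addressed roughly as follows. This is a minimal sketch on synthetic data, not something this notebook ran: `X_demo`, `y_demo`, the step names and the parameter grid are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 6 numeric features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid (assumption); pipeline parameters are named "<step>__<param>".
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    knn_pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into the training folds during the search.
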

## References (Thanks for the inspiration)
* https://www.kaggle.com/javiervallejos/titanic-simple-decision-tree-model-score-top-3/notebook
* https://www.kaggle.com/giorgosfoukarakis/titanic-from-eda-to-the-power-of-ensembles-top4

# Table of Contents
1. [Data Overview](#1)
    * [1. Load Data](#1.1)
    * [2. Data Type](#1.2)
    * [3. Statistical View](#1.3)
2. [Data Preprocessing](#2)
    * [1. Extract Potential Information](#2.1)
        * [Title](#2.1.1)
    * [2. Drop irrelevant columns](#2.2)
    * [3. Missing Value Detection](#2.3)
    * [4. Data Imputation](#2.4)
        * [1. Missing Value Exploration](#2.4.1)
            * [Age](#2.4.1.1)
            * [Fare](#2.4.1.2)
            * [Embarked](#2.4.1.3)
            * [Cabin](#2.4.1.4)
        * [2. Median imputation](#2.4.2)
        * [3. Majority value imputation](#2.4.3)
3. [Data Analysis](#3)
    * [1. What is the distribution of survival? ](#3.1)
    * [2. What is the distribution of Sex on survival? ](#3.2)
    * [3. What is the distribution of Pclass on survival? ](#3.3)
    * [4. What is the distribution of SibSp on survival? ](#3.4)
    * [5. What is the distribution of Parch on survival? ](#3.5)
    * [6. What is the distribution of Embarked on survival? ](#3.6)
    * [7. What is the distribution of Age on survival? ](#3.7)
    * [8. What is the distribution of Fare on survival? ](#3.8)
4. [Feature Engineering](#4)
    * [1. Create new features ](#4.1)
        * [Family Size](#4.1.1)
    * [2. One-Hot Encoding ](#4.2)
    * [3. Label Encoding ](#4.3)
5. [Modelling](#5)
    * [1. Train Test Split ](#5.1)
    * [2. Train and Validation ](#5.2)
        * [1. Logistic regression ](#5.2.1)
        * [2. k-nearest neighbors ](#5.2.2)
        * [3. Support Vector Machine ](#5.2.3)
        * [4. Decision Tree ](#5.2.4)
        * [5. Random Forest ](#5.2.5)
        * [6. Gradient boosting ](#5.2.6)
        * [7. XGBoost ](#5.2.7)
        * [8. CatBoost ](#5.2.8)
        * [9. LightGBM ](#5.2.9)
    * [3. Model Comparison ](#5.3)
6. [Prediction](#6)
    * [1. Drop irrelevant columns](#6.1)
    * [2. Feature Engineering](#6.2)
    * [3. Make Prediction](#6.3)
    * [4. Save the Prediction to CSV file](#6.4)

<a id="1"></a>
# 1. Data Overview

In [1]:
# Import packages

## Basic data processing
import numpy as np
import pandas as pd

## Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

## Modelling
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score, StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

## Settings
pd.set_option('display.max_columns', 500) # Able to display more columns.

<a id="1.1"></a>
## 1.1. Load Data

In [1]:
# Load the dataset
train_df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")
data_df = pd.concat([train_df, test_df])
data_df.info() # show entries, dtypes, memory usage.

In [1]:
# Have a look
data_df.head(5)

In [1]:
# Check the shape of train,test and the whole dataset
train_df.shape, test_df.shape, data_df.shape

<a id="1.2"></a>
## 1.2. Data Type

> [NOIR](https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/): Nominal, Ordinal, Interval, Ratio.   
Specify the measurement level of each variable as a basis for the statistical analysis that follows.  

| Variable    | Type    | Description |
| :---        | :----:  | :----:      |
| PassengerId | Nominal | The unique ID of the passenger |
| Name        | Nominal | Name |
| Survived    | Nominal | Whether the passenger survived; 0 = No, 1 = Yes |
| Pclass      | Ordinal | Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex         | Nominal | Sex |
| Age         | Ratio   | Age in years |
| SibSp       | Ordinal | Number of siblings / spouses aboard the Titanic |
| Parch       | Ordinal | Number of parents / children aboard the Titanic |
| Ticket      | Nominal | Ticket number |
| Fare        | Ratio   | Passenger fare |
| Cabin       | Nominal | Cabin number |
| Embarked    | Nominal | Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton |


<a id="1.3"></a>
## 1.3. Statistical View 

In [1]:
# Basic statistics for the Ordinal, Interval and Ratio data.
OIR_columns =  ['Pclass','Age', 'Fare', 'SibSp', 'Parch']
data_df[OIR_columns].describe()

1. Most passengers are young (see the 75th percentile of Age).
2. Some tickets are very expensive (> 500), and some fares look odd (min = 0).
3. Most passengers travel with few relatives (see the 75th percentile of SibSp and Parch).

In [1]:
# Basic statistics for the Nominal data
data_df.loc[:, ~data_df.columns.isin(OIR_columns)].astype("object").describe() # All the Nominal data can be treated as "object" type for simplicity.

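1. There are 3 ports of embarkation.
2. Cabin information is available for only a small share of passengers, and some passengers share the same cabin.
3. Some passengers share a ticket: there are 1309 rows but only 929 unique ticket numbers, so 1309 - 929 = 380 rows repeat a ticket that already appears elsewhere.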
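
The difference between "rows minus unique values" and "passengers travelling on a shared ticket" is easy to mix up, so here is a small sketch of both counts. The ticket numbers are hypothetical and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ticket numbers: "A1" is shared by two passengers, "C3" by three.
tickets = pd.Series(["A1", "A1", "B2", "C3", "C3", "C3"])

n_rows = len(tickets)
n_unique = tickets.nunique()

# Rows minus unique values: how many rows repeat a ticket already seen
# (the same kind of count as 1309 - 929).
repeat_rows = n_rows - n_unique

# Passengers travelling on any shared ticket, counting every holder of such a ticket.
on_shared_ticket = int(tickets.duplicated(keep=False).sum())

print(repeat_rows, on_shared_ticket)
```

The second number is always at least as large as the first, because it also counts the first holder of each shared ticket.
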

<a id="2"></a>
# 2. Data Preprocessing

In [1]:
def survival_rate(data, feature, graph_type):
    '''
    Description: Calculate and visualize the survival rate for the chosen feature.
    Args:
        data: The dataset
        feature: The chosen feature
        graph_type: The graph type, e.g. "bar" or "point"
    Return: None
    '''
    # Calculate the survival rate per category (mean() ignores rows with a missing Survived value)
    print(data[[feature, 'Survived']].groupby([feature], as_index=False).mean().sort_values(by='Survived'))
    # Visualization
    sns.catplot(x=feature, y="Survived", data=data, kind=graph_type, height=6, aspect=1)\
       .set_ylabels("Survival Rate")\
       .ax.set_title(f"Survival Rate on {feature}", fontsize = 20)

<a id="2.1"></a>
## 2.1. Extract Potential Information

<a id="2.1.1"></a>
### Title

In [1]:
# Extract title from Name
data_df['title'] = data_df['Name'].map(lambda x:x.split(',')[1].split('.')[0].strip())
data_df['title'].value_counts()
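
The extraction rule above relies on names following the layout "Surname, Title. Given names". As a sketch, the same rule can be written with a vectorized regular expression. The sample names below are hypothetical, and `titles_split` and `titles_regex` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Major)",
    "Smith, the Countess. of (Ann Example)",
    "Poe, Master. Tom",
])

# Same rule as the lambda above: text after the first comma, up to the first period.
titles_split = names.map(lambda x: x.split(',')[1].split('.')[0].strip())

# Vectorized equivalent: capture everything between the comma and the first period.
titles_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles_regex.tolist())
```

Both variants agree on these samples; a name without a comma or period would break the lambda with an `IndexError`, while `str.extract` would return `NaN` instead.
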

In [1]:
# Aggregate rare titles
title_dic={
    'Mr':'Mr',
    'Miss':'Miss',
    'Mrs':'Mrs',
    'Master':'Master',
    'Dr':'Other',
    'Rev':'Other',
    'Mlle':'Miss',
    'Col': 'Other',
    'Major':'Other',
    'Sir':'Mr',
    'Mme':'Miss',
    'Jonkheer':'Other',
    'Lady':'Miss',
    'Capt':'Other',
    'Don':'Mr',
    'Dona':'Mrs',
    'Ms':'Miss',
    'the Countess':'Other'
}

In [1]:
# Use the defined rules to set categories for title
data_df['title'] = data_df['title'].map(title_dic)
survival_rate(data_df, 'title', 'bar')

<a id="2.2"></a>
## 2.2. Drop irrelevant columns

In [1]:
# Irrelevant columns:
#   PassengerId: an identifier, useless for analysis and modelling.
#   Name: the title has already been extracted.
#   Ticket: the ticket number seems useless here.
irrelevant_columns = ['PassengerId', 'Name', 'Ticket']
data_preprocessed_df = data_df.drop(irrelevant_columns, axis=1)

<a id="2.3"></a>
## 2.3. Missing Value Detection

In [1]:
# Replace the empty data with NaN
data_preprocessed_df.replace("", float("NaN"), inplace=True)
data_preprocessed_df.replace(" ", float("NaN"), inplace=True)

# Compute the percentage of missing values (NaN/None) in each column, then store the result in a pandas DataFrame.
count_missing_value = data_preprocessed_df.isna().sum() / data_preprocessed_df.shape[0] * 100
count_missing_value_df = pd.DataFrame(count_missing_value.sort_values(ascending=False), columns=['Missing%'])

In [1]:
# Visualize the percentage(>0) of Missing value in each column.
missing_value_df = count_missing_value_df[count_missing_value_df['Missing%'] > 0]

plt.figure(figsize=(5, 8)) # Set the figure size
missing_value_graph = sns.barplot(x = missing_value_df.index, y = "Missing%", data=missing_value_df, orient="v")
missing_value_graph.set_title("Percentage Missing Value", fontsize = 20)
missing_value_graph.set_xlabel("Features")
for p in missing_value_graph.patches:
    # Show the value on top of each bar
    missing_value_graph.annotate(round(p.get_height(), 2), (p.get_x()+0.25, p.get_height()))

These percentages cover the whole dataset (train + test), so `Survived` also shows up as missing for the test rows.

<a id="2.4"></a>
## 2.4. Data Imputation
> Choose an imputation technique that represents the central tendency of the data well.

<a id="2.4.1"></a>
### 2.4.1. Missing Value Exploration

<a id="2.4.1.1"></a>
### Age

In [1]:
# Visualize the distribution of Age
age_fig = go.Figure()
age_fig.add_trace(go.Box(
                        y=data_preprocessed_df["Age"],
                        name='Age',
                        boxmean=True))

age_fig.update_layout(
                height=600, 
                width=800,
                title={
                'text': "The Distribution of Age",
                'font': {'size': 24},
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'},
                yaxis_title='Age',
                )

age_fig.show()

The overall mean and median of Age do not differ much.

<a id="2.4.1.2"></a>
### Fare

In [1]:
# Visualize the distribution of Fare
fare_fig = go.Figure()
fare_fig.add_trace(go.Box(
                        y=data_preprocessed_df["Fare"],
                        name='Fare',
                        boxmean=True))

fare_fig.update_layout(
                height=600, 
                width=800,
                title={
                'text': "The Distribution of Fare",
                'font': {'size': 24},
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'},
                yaxis_title='Fare',
                )

fare_fig.show()

The mean and median of Fare differ considerably, because the distribution is strongly right-skewed. However, only one Fare value is missing, so the choice should hardly affect the modelling process.

<a id="2.4.1.3"></a>
### Embarked

In [1]:
# Visualize the distribution of Embarked
embarked_fig = px.histogram(data_preprocessed_df, x="Embarked")
embarked_fig.update_layout(
                height=600, 
                width=800,
                title={
                'text': "The count of Embarked",
                'font': {'size': 24},
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'},
                )

embarked_fig.show()

S is the most frequent value of Embarked.  
**We can therefore simply use median imputation for *Age* and *Fare* and majority-value imputation for *Embarked*.**

<a id="2.4.1.4"></a>
### Cabin

In [1]:
# Compare the survival rate of passengers with and without Cabin information (True = Cabin is missing)
data_preprocessed_df.groupby(data_preprocessed_df['Cabin'].isnull())['Survived'].mean()

It seems that passengers with Cabin information are more likely to survive.

In [1]:
# Turn Cabin into an indicator: 1 if the passenger has Cabin information, 0 otherwise.
data_preprocessed_df['Cabin_indicator'] = np.where(data_preprocessed_df['Cabin'].isnull(), 0, 1)
data_preprocessed_df.drop('Cabin', axis=1, inplace=True)

In [1]:
data_preprocessed_df

<a id="2.4.2"></a>
### 2.4.2. Median imputation 

In [1]:
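# Find the columns most correlated with Age (numeric columns only; recent pandas versions raise an error if string columns such as Sex are passed to corr())
plt.figure(figsize=(12,9))
sns.heatmap(data_preprocessed_df.select_dtypes(include='number').corr(), cmap="coolwarm", annot = True, fmt='.3f').set_title('Pearson Correlation', fontsize=22)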

In [1]:
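data_preprocessed_median_df = data_preprocessed_df.copy()
# Alternative (not used): overall median
#data_preprocessed_median_df['Age'] = data_preprocessed_df['Age'].fillna(data_preprocessed_df['Age'].median())
# Group imputation for Age: fill each missing Age with the median Age of the passenger's (Pclass, Sex) group.
# Assigning the result back avoids relying on inplace=True on a column selection, which recent pandas versions warn about.
group_median_age = data_preprocessed_median_df.groupby(['Pclass','Sex'])['Age'].transform("median")
data_preprocessed_median_df['Age'] = data_preprocessed_median_df['Age'].fillna(group_median_age)
# Plain median imputation for the single missing Fare
data_preprocessed_median_df['Fare'] = data_preprocessed_df['Fare'].fillna(data_preprocessed_df['Fare'].median())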
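
To see why the group key matters, here is a small sketch comparing overall-median and group-median imputation. The toy frame is an assumption made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two groups with clearly different typical ages.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Overall median: ignores the group structure.
overall = toy["Age"].fillna(toy["Age"].median())

# Group median: transform() returns one value per row, aligned with the original index.
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")
grouped = toy["Age"].fillna(group_median)

print(overall.tolist())
print(grouped.tolist())
```

With the overall median both gaps receive 32.0, while the group median fills them with 45.0 and 22.0, which is closer to the other members of each group.
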

<a id="2.4.3"></a>
### 2.4.3. Majority value imputation 

In [1]:
data_preprocessed_median_df['Embarked'] = data_preprocessed_median_df['Embarked'].fillna(data_preprocessed_median_df['Embarked'].mode()[0])
# Check the total number of remaining missing values
print(f'Missing value:\n Median imputation: {sum(data_preprocessed_median_df.isna().sum())}')

The remaining 418 missing values are the unknown `Survived` labels of the 418 test rows, so the features themselves no longer contain any missing values.

In [1]:
# Cut Age and Fare into equal-width bins. The labels are alphabetical so that the later encoding preserves their order.
data_preprocessed_median_df['AgeBin'] = pd.cut(data_preprocessed_median_df['Age'].astype(int), 5, labels=['a', 'b', 'c', 'd','e'])
data_preprocessed_median_df['FareBin'] = pd.cut(data_preprocessed_median_df['Fare'].astype(int), 4, labels=['a', 'b', 'c', 'd'])

# Women and Children indicator
#data_preprocessed_median_df['WomChi'] = ((data_preprocessed_median_df.AgeBin == 'a') | (data_preprocessed_median_df.Sex == 'female'))
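
The binning strategy matters most for skewed features. The sketch below contrasts equal-width bins (`pd.cut`, as used above) with equal-frequency bins (`pd.qcut`, which this notebook does not use). The values are a toy assumption that only loosely mimics a fare distribution.

```python
import pandas as pd

# Toy right-skewed values (assumption): many small values, a few very large ones.
fare = pd.Series([7, 8, 8, 10, 13, 15, 26, 30, 80, 512])

# Equal-width bins: the range 7..512 is split into 4 intervals of equal length.
equal_width = pd.cut(fare, 4, labels=["a", "b", "c", "d"])

# Equal-frequency bins: each bin receives roughly the same number of rows.
equal_freq = pd.qcut(fare, 4, labels=["a", "b", "c", "d"])

print(equal_width.value_counts().sort_index().tolist())
print(equal_freq.value_counts().sort_index().tolist())
```

With equal-width bins almost every value lands in bin `a`, so the binned feature carries little information; quantile bins spread the rows evenly. Which one works better for a model is an empirical question.
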

In [1]:
# Create a new dataset that only includes the training rows (the first len(train_df) = 891 rows of the combined data).
data_best_df = data_preprocessed_median_df.iloc[:len(train_df)].copy()

<a id="3"></a>
# 3. Data Analysis
* Sex, Pclass, SibSp, Parch, Embarked - Pie
* Age, Fare - Boxplot
* Survival Rate - Barplot/Lineplot

<a id="3.1"></a>
## 3.1. What is the distribution of survival?

In [1]:
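# Count the passengers per Survived value (0/1) and store the counts in a DataFrame column named "Survived".
# to_frame(name=...) fixes the column name explicitly; recent pandas versions would otherwise name it "count".
survival_counts = data_best_df["Survived"].value_counts()
survival_counts_df = survival_counts.to_frame(name="Survived")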

In [1]:
# Visualize the distribution of the survival
survival_fig = make_subplots(
    rows=1, cols=2, 
    specs=[[{"type": "xy"}, {"type": "domain"}]])

survival_fig.add_trace(go.Bar(x=survival_counts_df.index, 
                              y=survival_counts_df["Survived"],
                              text=survival_counts_df["Survived"],
                              textposition='outside',
                              showlegend=False),
                              1, 1)

survival_fig.add_trace(go.Pie(labels=survival_counts_df.index, 
                     values=survival_counts_df["Survived"],
                     showlegend=True),
                     1, 2)

survival_fig.update_layout(
                  height=600, 
                  width=1000,
                  title={
                  'text': "The distribution of Survival",
                  'font': {'size': 24},
                  'y':0.95,
                  'x':0.5,
                  'xanchor': 'center',
                  'yanchor': 'top'},
                  xaxis1_title = 'Survived', 
                  yaxis1_title = 'Counts',
                  legend_title_text="Survived"
                 )
survival_fig.update_xaxes(type='category')
survival_fig.show()

* The label (`Survived`) is imbalanced: survivors make up a little over one third (about 38%) of the training passengers.
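
An imbalanced label also sets the baseline every model has to beat: always predicting the majority class. The sketch below assumes the class counts of the 891 training rows are 549 non-survivors and 342 survivors; treat these as an assumption if your copy of the data differs.

```python
import pandas as pd

# Assumed class counts for the training set: 549 x 0 (died), 342 x 1 (survived).
labels = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Share of each class, then the accuracy of always predicting the majority class.
class_share = labels.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(majority_class, round(baseline_accuracy, 3))
```

Any accuracy reported later is only meaningful relative to this baseline of roughly 0.62.
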

<a id="3.2"></a>
## 3.2. What is the distribution of Sex on survival?

In [1]:
# Visualize the categorical data by Pie chart
def categorical_VIS(data, feature):
    '''
    Description: Show the distribution of a categorical feature in two pie charts, one for survivors and one for non-survivors.
    Args:
        data: The dataset to be visualized
        feature: The chosen feature
    Return: None
    '''
    # Count the values of the chosen feature among survivors and non-survivors.
    # The counts are used as Series, so the code does not depend on how value_counts() names its result.
    survived_counts = data.loc[data["Survived"] == 1, feature].value_counts()
    not_survived_counts = data.loc[data["Survived"] == 0, feature].value_counts()

    # Visualization
    survival_fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=("Survived", "Non-Survived"),
        specs=[[{"type": "domain"}, {"type": "domain"}]])
    survival_fig.add_trace(go.Pie(labels=survived_counts.index,
                                  values=survived_counts.values,
                                  showlegend=True),
                           1, 1)
    survival_fig.add_trace(go.Pie(labels=not_survived_counts.index,
                                  values=not_survived_counts.values,
                                  showlegend=True),
                           1, 2)
    survival_fig.update_layout(
        height=600,
        width=1000,
        title={
            'text': "The Distribution of " + feature + " on Survival",
            'font': {'size': 24},
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'},
        legend_title_text=feature
    )
    survival_fig.show()

In [1]:
categorical_VIS(data_best_df, "Sex")

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'Sex', 'bar')

* Females are more likely to survive than males.

<a id="3.3"></a>
## 3.3. What is the distribution of Pclass on survival?

In [1]:
categorical_VIS(data_best_df, "Pclass")

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'Pclass', 'bar')

* The higher the ticket class, the more likely the passenger is to survive.

<a id="3.4"></a>
## 3.4. What is the distribution of SibSp on survival?

In [1]:
categorical_VIS(data_best_df, "SibSp")

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'SibSp', 'point')

Passengers with one sibling or spouse aboard seem to have the highest survival rate.

<a id="3.5"></a>
## 3.5. What is the distribution of Parch on survival?

In [1]:
categorical_VIS(data_best_df, "Parch")

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'Parch', 'point')

It seems that passengers with few parents or children aboard are more likely to survive.

<a id="3.6"></a>
## 3.6. What is the distribution of Embarked on survival?

In [1]:
categorical_VIS(data_best_df, "Embarked")

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'Embarked', 'bar')

It seems that passengers who embarked at C (Cherbourg) have a relatively high survival rate.

<a id="3.7"></a>
## 3.7. What is the distribution of Age on survival?

In [1]:
age_fig = px.box(data_best_df, x="Survived", y="Age")
age_fig.update_layout(
                  height=600, 
                  width=1000,
                  title={
                  'text': "The Distribution of Age on Survival",
                  'font': {'size': 24},
                  'y':0.95,
                  'x':0.5,
                  'xanchor': 'center',
                  'yanchor': 'top'},
                  legend_title_text="Survived"
                 )

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'AgeBin', 'point')

<a id="3.8"></a>
## 3.8. What is the distribution of Fare on survival?

In [1]:
fare_fig = px.box(data_best_df, x="Survived", y="Fare")
fare_fig.update_layout(
                  height=600, 
                  width=1000,
                  title={
                  'text': "The Distribution of Fare on Survival",
                  'font': {'size': 24},
                  'y':0.95,
                  'x':0.5,
                  'xanchor': 'center',
                  'yanchor': 'top'},
                  legend_title_text="Survived"
                 )

In [1]:
# Calculate the survival rate
survival_rate(data_best_df, 'FareBin', 'point')

* The more expensive the ticket, the more likely the passenger is to survive.

<a id="4"></a>
# 4. Feature Engineering

<a id="4.1"></a>
## 4.1. Create new features

<a id="4.1.1"></a>
### Family Size

In [1]:
# Combine the SibSp and Parch to Family Size
data_best_df['Family_size'] = data_best_df['SibSp'] + data_best_df['Parch'] + 1
survival_rate(data_best_df, 'Family_size', 'point')

In [1]:
# Binning the family_size
data_best_df.loc[data_best_df['Family_size'] == 1, 'Family_size'] = 0 # Alone
data_best_df.loc[(data_best_df['Family_size'] > 1) & (data_best_df['Family_size'] <= 4), 'Family_size'] = 1  # Small Family 
data_best_df.loc[(data_best_df['Family_size'] > 4) & (data_best_df['Family_size'] <= 6), 'Family_size'] = 2  # Medium Family
data_best_df.loc[data_best_df['Family_size']  > 6, 'Family_size'] = 3 # Large Family
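
The four `.loc` assignments above only work because they run in ascending order, so no freshly assigned code is hit by a later condition. As a sketch, the same thresholds can be expressed in one step with `pd.cut`. The sample sizes and the name `family_bin` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical family sizes (SibSp + Parch + 1), covering every bin boundary.
family_size = pd.Series([1, 2, 4, 5, 6, 7, 11])

# Right-closed intervals: (0, 1] -> 0 (alone), (1, 4] -> 1 (small),
# (4, 6] -> 2 (medium), (6, inf) -> 3 (large).
family_bin = pd.cut(
    family_size,
    bins=[0, 1, 4, 6, float("inf")],
    labels=[0, 1, 2, 3],
).astype(int)

print(family_bin.tolist())
```

Because the bin edges live in a single list, changing a threshold cannot accidentally re-bin rows that were already encoded.
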

In [1]:
# Feature SibSp and Parch are replaced by Family_size
data_best_df.drop(['SibSp', 'Parch'], axis=1, inplace=True)

In [1]:
# Feature Age and Fare are replaced by AgeBin and FareBin
data_best_df.drop(['Age', 'Fare'], axis=1, inplace=True)

<a id="4.2"></a>
## 4.2. One-Hot Encoding

In [1]:
# Select features that are suitable for One Hot Encoding
onehot_features = ["Embarked","title"]
onehot_df = pd.get_dummies(data_best_df[onehot_features])
data_best_df.drop(onehot_features, axis=1, inplace=True)
data_best_df = pd.concat([data_best_df, onehot_df], axis=1)

<a id="4.3"></a>
## 4.3. Label Encoding

In [1]:
# Select features that are suitable for Label Encoding
data_best_df["Sex"]  = LabelEncoder().fit_transform(data_best_df["Sex"])
data_best_df["AgeBin"]  = LabelEncoder().fit_transform(data_best_df["AgeBin"])
data_best_df["FareBin"]  = LabelEncoder().fit_transform(data_best_df["FareBin"])

In [1]:
data_best_df

<a id="5"></a>
# 5. Modelling

<a id="5.1"></a>
## 5.1. Train Test Split

In [1]:
# Train/Test Split
X = data_best_df.drop(["Survived"], axis=1)
Y = data_best_df.Survived.astype('int8')
x_train, x_test, y_train, y_test = train_test_split(X, Y,test_size = 0.25, random_state=0, stratify=Y)

<a id="5.2"></a>
## 5.2. Train and Validation

<a id="5.2.1"></a>
### 5.2.1 Logistic regression

In [1]:
# Logistic regression
# https://stackoverflow.com/questions/65682019/attributeerror-str-object-has-no-attribute-decode-in-fitting-logistic-regre
lr = LogisticRegression(penalty = 'l2',solver = 'liblinear')
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
lr_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
lr_cv_avg = cross_val_score(lr, X, Y, cv=lr_cv, scoring="accuracy").mean()
lr_cv_avg

<a id="5.2.2"></a>
### 5.2.2 k-nearest neighbors

In [1]:
# KNN
kNN = KNeighborsClassifier()
kNN.fit(x_train, y_train)
predictions = kNN.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
kNN_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
kNN_cv_avg = cross_val_score(kNN, X, Y, cv=kNN_cv, scoring="accuracy").mean()
kNN_cv_avg

<a id="5.2.3"></a>
### 5.2.3 Support Vector Machine

In [1]:
# SVM
svm = SVC(probability=True)
svm.fit(x_train, y_train)
predictions = svm.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
svm_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
svm_cv_avg = cross_val_score(svm, X, Y, cv=svm_cv, scoring="accuracy").mean()
svm_cv_avg

<a id="5.2.4"></a>
### 5.2.4 Decision Tree

In [1]:
# Decision tree
dt = DecisionTreeClassifier(random_state=0)
dt.fit(x_train, y_train)
predictions = dt.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
dt_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
dt_cv_avg = cross_val_score(dt, X, Y, cv=dt_cv, scoring="accuracy").mean()
dt_cv_avg

<a id="5.2.5"></a>
### 5.2.5 Random Forest

In [1]:
# Random Forest
rf = RandomForestClassifier(random_state=0)
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
rf_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rf_cv_avg = cross_val_score(rf, X, Y, cv=rf_cv, scoring="accuracy").mean()
rf_cv_avg

<a id="5.2.6"></a>
### 5.2.6 Gradient boosting

In [1]:
# Gradient boosting
gbt = GradientBoostingClassifier(random_state=0)
gbt.fit(x_train, y_train)
predictions = gbt.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
gbt_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
gbt_cv_avg = cross_val_score(gbt, X, Y, cv=gbt_cv, scoring="accuracy").mean()
gbt_cv_avg

<a id="5.2.7"></a>
### 5.2.7 XGBoost

In [1]:
xgbc = XGBClassifier(random_state=0, use_label_encoder=False, eval_metric='error')
xgbc.fit(x_train, y_train)
predictions = xgbc.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
xgbc_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
xgbc_cv_avg = cross_val_score(xgbc, X, Y, cv=xgbc_cv, scoring="accuracy").mean()
xgbc_cv_avg

<a id="5.2.8"></a>
### 5.2.8 CatBoost

In [1]:
catbc = CatBoostClassifier(random_state=0, eval_metric='Accuracy', verbose=False)
catbc.fit(x_train, y_train)
predictions = catbc.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
catbc_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
catbc_cv_avg = cross_val_score(catbc, X, Y, cv=catbc_cv, scoring="accuracy").mean()
catbc_cv_avg

<a id="5.2.9"></a>
### 5.2.9 LightGBM

In [1]:
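# Silence LightGBM through the constructor; recent LightGBM versions no longer accept `verbose` in fit().
# No eval_metric is passed: without an eval_set it has no effect on training.
lgbc = LGBMClassifier(random_state=0, verbose=-1)
lgbc.fit(x_train, y_train)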
predictions = lgbc.predict(x_test)
print(classification_report(y_test, predictions))

In [1]:
# Cross Validation
lgbc_cv = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
lgbc_cv_avg = cross_val_score(lgbc, X, Y, cv=lgbc_cv, scoring="accuracy", verbose=False).mean()
lgbc_cv_avg

<a id="5.3"></a>
## 5.3. Model Comparison

In [1]:
# Collect all the model performance
model_comparison = pd.DataFrame(data = [lr_cv_avg, kNN_cv_avg, svm_cv_avg, dt_cv_avg, rf_cv_avg, gbt_cv_avg, xgbc_cv_avg, catbc_cv_avg, lgbc_cv_avg], 
                                index = ["lr", "kNN", "SVM", "DT", "RF", "GBT", "XGBoost", "CatBoost", "LGBoost"],
                                columns=['Accuracy'])\
                      .sort_values(by = "Accuracy", ascending=False)

model_comparison

It seems that the top 5 models are: lr, SVM, GBT, CatBoost, XGBoost
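
The nine cross-validation cells above all follow the same pattern, so the comparison could also be produced in one loop. The sketch below uses synthetic data and only two of the models; `X_demo`, `y_demo`, `models` and `comparison` are names assumed for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "lr": LogisticRegression(penalty="l2", solver="liblinear"),
    "DT": DecisionTreeClassifier(random_state=0),
}

# One splitter shared by all models, so every model is scored on identical splits.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Same layout as the model_comparison table: one row per model, sorted by accuracy.
comparison = pd.Series(scores, name="Accuracy").sort_values(ascending=False).to_frame()
print(comparison)
```

Adding a model then only means adding one entry to the dictionary.
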

<a id="6"></a>
# 6. Prediction

In [1]:
# Get the test dataset (all rows after the len(train_df) = 891 training rows)
testing_df = data_preprocessed_median_df.iloc[len(train_df):].copy()
testing_df.info() # show entries, dtypes, memory usage.

In [1]:
# Have a look
testing_df.head()

<a id="6.1"></a>
## 6.1. Drop irrelevant columns

In [1]:
# Drop Irrelevant columns
test_preprocessed_df = testing_df.drop('Survived', axis=1)

# Feature Age and Fare are replaced by AgeBin and FareBin
test_preprocessed_df.drop(["Age", "Fare"],axis=1, inplace=True)

# Check Missing value
test_preprocessed_df.isna().sum()

<a id="6.2"></a>
## 6.2. Feature Engineering

In [1]:
# Combine the SibSp and Parch to Family Size
test_preprocessed_df['Family_size'] = test_preprocessed_df['SibSp'] + test_preprocessed_df['Parch'] + 1
test_preprocessed_df.drop(['SibSp', 'Parch'], axis=1, inplace=True)

In [1]:
# Binning the family_size
test_preprocessed_df.loc[test_preprocessed_df['Family_size'] == 1, 'Family_size'] = 0 # Alone
test_preprocessed_df.loc[(test_preprocessed_df['Family_size'] > 1) & (test_preprocessed_df['Family_size'] <= 4), 'Family_size'] = 1  # Small Family 
test_preprocessed_df.loc[(test_preprocessed_df['Family_size'] > 4) & (test_preprocessed_df['Family_size'] <= 6), 'Family_size'] = 2  # Medium Family
test_preprocessed_df.loc[test_preprocessed_df['Family_size']  > 6, 'Family_size'] = 3 # Large Family

In [1]:
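# One-Hot Encoding
onehot_df = pd.get_dummies(test_preprocessed_df[onehot_features])
test_preprocessed_df.drop(onehot_features, axis=1, inplace=True)
test_preprocessed_df = pd.concat([test_preprocessed_df, onehot_df], axis=1)
# Align with the training columns (same set, same order); a dummy column absent from the test data is filled with 0
test_preprocessed_df = test_preprocessed_df.reindex(columns=X.columns, fill_value=0)

# Label Encoding
# Fixed mappings are used instead of fitting new LabelEncoders on the test data,
# so the codes cannot drift from the training codes if a category is absent from the test rows.
test_preprocessed_df["Sex"]  = test_preprocessed_df["Sex"].map({"female": 0, "male": 1})
test_preprocessed_df["AgeBin"]  = test_preprocessed_df["AgeBin"].cat.codes
test_preprocessed_df["FareBin"]  = test_preprocessed_df["FareBin"].cat.codes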

# Check the data after Feature Engineering
test_preprocessed_df.info()
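
Encoding the training and test data in separate steps only works if both contain the same categories. The sketch below shows what happens when they do not, and one way to guard against it. The frames are hypothetical, and `reindex` is a safeguard suggested here, not something the cells above are known to need on this dataset.

```python
import pandas as pd

# Hypothetical train/test frames; the test frame happens to lack the "Q" port.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["C", "S"]})

train_encoded = pd.get_dummies(train, columns=["Embarked"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["Embarked"], dtype=int)

# Align the test columns to the training layout; categories unseen in test become all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

# Ordinal bins: category codes follow the declared category order,
# so "c" maps to 2 even if "b" never occurs in this particular frame.
age_bin = pd.Series(pd.Categorical(["a", "c"], categories=["a", "b", "c"]))

print(list(test_encoded.columns))
print(list(test_aligned.columns))
print(age_bin.cat.codes.tolist())
```

A `LabelEncoder` fitted on `["a", "c"]` alone would instead map `"c"` to 1, silently disagreeing with an encoder fitted on data that contains all three bins.
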

In [1]:
test_preprocessed_df

<a id="6.3"></a>
## 6.3. Make Prediction

In [1]:
# Make Prediction by the top 5 classifiers
voting_clas = VotingClassifier(estimators=[('SVM', svm), ('Logistic_Reg', lr), ('CatBoost', catbc), ('GBt',gbt), ('XGBoost',xgbc)], voting='soft', n_jobs=-1)
votingC = voting_clas.fit(X, Y)

results = votingC.predict(test_preprocessed_df)
results_df = pd.DataFrame(results, columns=['Survived'])
predictions_df = pd.concat([test_df['PassengerId'], results_df], axis=1)
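
With `voting='soft'`, the ensemble averages the class probabilities of its members and predicts the class with the highest average, which is why every member must provide `predict_proba` (hence `SVC(probability=True)` earlier). The sketch below reproduces that rule with NumPy on hypothetical probabilities and compares it with a majority ("hard") vote.

```python
import numpy as np

# Hypothetical class probabilities [P(died), P(survived)] from three classifiers
# for two passengers; shape: (classifiers, passengers, classes).
probas = np.array([
    [[0.90, 0.10], [0.30, 0.70]],  # classifier 1
    [[0.45, 0.55], [0.20, 0.80]],  # classifier 2
    [[0.45, 0.55], [0.55, 0.45]],  # classifier 3
])

# Soft voting: average the probabilities over classifiers, then take the most probable class.
mean_proba = probas.mean(axis=0)
soft_vote = mean_proba.argmax(axis=1)

# Hard voting for comparison: each classifier casts one vote for its own most probable class.
hard_votes = probas.argmax(axis=2)
hard_vote = (hard_votes.mean(axis=0) > 0.5).astype(int)

print(mean_proba)
print(soft_vote, hard_vote)
```

For the first passenger the two rules disagree: two classifiers lean slightly towards survival, but the one confident classifier pulls the average towards non-survival, so soft voting predicts 0 while hard voting predicts 1.
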

<a id="6.4"></a>
## 6.4. Save the Prediction to CSV file

In [1]:
# Save predictions to .csv for project submission
predictions_df.to_csv('submission.csv', index=False)

# Thanks for reading, have a good day ~