# Define the Problem

For this project, the problem statement is handed to us on a golden platter: develop an algorithm to predict the survival outcome of passengers on the Titanic.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. <br>
- Tools
    - It's a classic **binary classification** problem.
- Data
    - The dataset used for this competition is synthetic but based on a real dataset (in this case, the actual [Titanic data](https://www.kaggle.com/c/titanic/data)!) and generated using a CTGAN. The statistical properties of this dataset are very similar to the original Titanic dataset, but there's no way to "cheat" by using public labels for predictions. How well does your model perform on truly private test labels?

Good luck and have fun!

# Data Load In

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data_raw = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
data_val = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
data1 = data_raw.copy(deep=True)
data_cleaner = [data1, data_val]

Target = ['Survived']

In [None]:
for dataset in data_cleaner:
    print(dataset.info())
    print(dataset.describe(include='all'))

# Data Preprocessing

## Fill NA

In [None]:
for dataset in data_cleaner:
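    # Assign the filled column back; chained `inplace=True` on a column may not update the frame in recent pandas
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())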
    # dataset['Title'] = dataset.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    dataset['Family_members'] = dataset.Parch + dataset.SibSp
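
To confirm that the fills did what was intended, it can help to count missing values per column before and after. A minimal sketch on a toy frame (the values are made up for illustration, and `missing_report` is a hypothetical helper, not part of this notebook):

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return count and share of missing values per column, largest first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({'n_missing': counts, 'share': counts / len(df)})
    return report.sort_values('n_missing', ascending=False)

# Toy data standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Embarked': ['S', 'C', None, 'S'],
})

print(missing_report(toy))

# Fill numeric columns with the median and Embarked with 'S', by plain assignment
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
toy['Fare'] = toy['Fare'].fillna(toy['Fare'].median())
toy['Embarked'] = toy['Embarked'].fillna('S')

print(missing_report(toy))
```

The same helper could be called on `data1` and `data_val` to check that `Age`, `Fare`, and `Embarked` no longer contain missing values.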

In [None]:
data1.sample(5)

## A Bit of Visualization

Check the distributions before choosing bins.

In [None]:
data1.drop(['Name', 'PassengerId', 'Ticket', 'SibSp', 'Parch'], axis=1, inplace=True)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for i in data1.columns:
    # Pass the column by keyword; recent seaborn versions no longer accept it positionally
    sns.countplot(x=data1[i])
    plt.show()

## Cabin

In [None]:
# data1.groupby(data1['Cabin'].isnull()).mean()
data1.groupby(data1['Cabin'].isnull())['Survived'].mean()

In [None]:
for dataset in data_cleaner:
    dataset['Cabin_Allotted'] = np.where(dataset.Cabin.isnull(), 0, 1)
    dataset.drop('Cabin', axis=1, inplace=True)

In [None]:
data1.sample(5)
# data1['Title'].value_counts()

## Encoder

It all comes down to how you explain each feature's (X) effect on the final prediction (Y).<br>
Take the feature `Fare` as an example.<br>
Do you think a 1-dollar increase in `Fare` would have any significant impact on a passenger's survival rate?<br>
Probably not. But what if we raise the increment to, say, 30 dollars? That is harder to tell.<br>
Again, it depends on how you interpret the data. After many attempts, I got better results in most of them by converting `Fare` into categories (binning).
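
To make the binning idea concrete, here is a small sketch on made-up fares (not competition data). `pd.qcut` splits `Fare` into four quantile bins, and the survival rate is then compared per bin instead of per dollar:

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    'Fare':     [5, 7, 8, 10, 15, 20, 30, 50, 80, 100, 150, 250],
    'Survived': [0, 0, 0,  1,  0,  1,  0,  1,  1,   1,   0,   1],
})

# Four quantile-based bins with roughly equal numbers of passengers in each
toy['FareBin'] = pd.qcut(toy['Fare'], 4)

# Survival rate per bin: a coarse signal instead of a dollar-by-dollar one
rate_per_bin = toy.groupby('FareBin', observed=True)['Survived'].mean()
print(rate_per_bin)
```

Whether three, four, or more bins works best is an empirical question; the number of bins here is only an example.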

In [None]:
from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()

for dataset in data_cleaner:
    dataset['Sex_labeled'] = lb.fit_transform(dataset.Sex)
    
    dataset['AgeBin'] = pd.qcut(dataset.Age, 3)
    dataset['Age_labeled'] = lb.fit_transform(dataset['AgeBin'])

    dataset['FareBin'] = pd.qcut(dataset.Fare, 4)
    dataset['Fare_labeled'] = lb.fit_transform(dataset['FareBin'])

    dataset['Embarked_labeled'] = lb.fit_transform(dataset.Embarked)
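
One thing to be aware of in the loop above: `pd.qcut` is called separately on the training and test frames, so the bin edges (and hence the meaning of each label) can differ between the two. A possible alternative, sketched here on made-up numbers and not what this notebook does, is to learn the edges on the training data with `retbins=True` and reuse them on the test data with `pd.cut`:

```python
import numpy as np
import pandas as pd

# Illustrative values only
train_fare = pd.Series([5, 7, 8, 10, 15, 20, 30, 50, 80, 100, 150, 250], dtype=float)
test_fare = pd.Series([6, 12, 40, 300], dtype=float)

# Learn quartile edges on the training data only; labels=False returns integer codes
train_codes, edges = pd.qcut(train_fare, 4, labels=False, retbins=True)

# Widen the outer edges so test values outside the training range still get a bin
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the same edges to the test data
test_codes = pd.cut(test_fare, bins=edges, labels=False)

print(edges)
print(test_codes.tolist())
```

With shared edges, label `0` means the same fare range in both frames, which is what a model trained on one and applied to the other implicitly assumes.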

In [None]:
data1.sample(5)

In [None]:
print(data1['Age_labeled'].value_counts())
print(data1['Fare_labeled'].value_counts())

In [None]:
data1_X = [
    'Pclass', 
    'Family_members', 
    'Cabin_Allotted', 
    'Sex_labeled', 
    'Age_labeled', 
    'Fare_labeled', 
    'Embarked_labeled'
]

for i in data1_X:
    # x and y must be passed by keyword in recent seaborn versions
    sns.lineplot(x=i, y='Survived', data=data1)
    plt.show()

In [None]:
data1[data1_X].info()

# Model Selection

A simple way to compare different models on the same dataset.

In [None]:
from sklearn import ensemble, tree, neighbors
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn import model_selection

MLA = [
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    neighbors.KNeighborsClassifier(), 

    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(), 

    XGBClassifier(objective='binary:logistic', eval_metric='logloss'),
    LGBMClassifier()
]

cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=1)

MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD', 'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

row_index = 0
for alg in MLA:
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())

    cv_results = model_selection.cross_validate(alg, data1[data1_X], data1[Target].values.reshape(-1,), cv=cv_split, return_train_score=True)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean() 
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3

    row_index += 1

# MLA_compare.sort_values(by=['MLA Test Accuracy Mean'], ascending=False, inplace=True)
MLA_compare.sort_values(by=['MLA Test Accuracy Mean'], ascending=False)
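
When reading a table like this, a trivial baseline gives the accuracies a reference point. The sketch below is an addition, not part of the original comparison: it runs scikit-learn's `DummyClassifier` (always predicting the most frequent class) and a shallow decision tree through the same kind of `ShuffleSplit` cross-validation, on synthetic data from `make_classification` rather than the competition files.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for data1[data1_X] and data1[Target]
X, y = make_classification(n_samples=500, n_features=7, random_state=1)

cv_split = ShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=1)

results = {}
for alg in [DummyClassifier(strategy='most_frequent'),
            DecisionTreeClassifier(max_depth=3, random_state=1)]:
    cv_results = cross_validate(alg, X, y, cv=cv_split, return_train_score=True)
    # Mean accuracy on the held-out 20% across the 10 splits
    results[alg.__class__.__name__] = cv_results['test_score'].mean()

print(results)
```

A model whose mean test accuracy is not clearly above the dummy baseline has learned little beyond the class balance.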

# Next: Hyperparameters

Why is hyperparameter tuning important? <br>
Hyperparameters are crucial as they control the overall behaviour of a machine learning model. The ultimate goal is to find an optimal combination of hyperparameters that minimizes a predefined loss function to give better results. To improve your ML skills, see [Titanic Top 11%| Starter II: Hyperparameter Tuning](https://www.kaggle.com/chienhsianghung/titanic-top-11-starter-ii-hyperparameter-tuning).<br>

Further discussion: [The Best Hyperparameters](https://www.kaggle.com/c/tabular-playground-series-apr-2021/discussion/231152)

## Tuning (Not Yet Finished)

Clearly, LightGBM scored highest among all models when using `default hyperparameters`.<br>
In the starter Titanic challenge, using RandomForest with a *grid search*, I improved my score by 0.02 (from 0.76 to 0.78).<br>
Now I'm trying LightGBM in this competition, based on my [Models Selection](https://www.kaggle.com/chienhsianghung/tps-apr-starter-pack-all-models?scriptVersionId=58964521&cellId=23) result, but the `grid search` did not go well (it failed within a 9-hour session) because this dataset is far larger than the starter Titanic one. It's 12GB in total.

~~So, I applied my previous [grid searching result](https://www.kaggle.com/chienhsianghung/titanic-top-11-starter-ii-hyperparameter-tuning?scriptVersionId=58924477&cellId=25), which was tuned on the small Titanic data. It did not improve the score as much as it did for RandomForest, though.<br>
I still want to run a `hyperparameter search` to see how far LightGBM can go, because, obviously, this is neither a proper nor a practical approach when we encounter a massive dataset.~~

2021 Apr. 8th update___<br>
I tried reducing the grid size, but the search still took about 27 thousand seconds and returned an improvement *no better than RF's*.
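
A quick way to see why the search is slow is to count the fits. The sketch below uses scikit-learn's `ParameterGrid` on the reduced grid used in the next cell, together with the 10 splits of `cv_split`. The number of fits is exact; how long each fit takes depends on the data and hardware.

```python
from sklearn.model_selection import ParameterGrid

# Same reduced grid as in the tuning cell
param = {
    'n_estimators': [1000, 1500, 2000],
    'max_depth': [4, 8, 11],
    'num_leaves': [15, 31, 58, 63],
    'subsample': [0.6, 0.708, 0.8],
    'colsample_bytree': [0.613, 0.8, 1.0],
}
n_splits = 10  # matches cv_split = ShuffleSplit(n_splits=10, ...)

n_candidates = len(ParameterGrid(param))   # 3 * 3 * 4 * 3 * 3
n_fits = n_candidates * n_splits           # one fit per candidate per split

print(n_candidates, n_fits)  # 324 3240
```

Each of those fits trains at least 1000 trees, and `GridSearchCV` refits the best candidate once more on the full data by default, so even the reduced grid is a substantial amount of work.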

In [None]:
param = [{
    'n_estimators': [1000, 1500, 2000], # [1000, 1500, 2000, 2500]
    'max_depth':  [4, 8, 11], # [4, 5, 8, 11, -1]
    'num_leaves': [15, 31, 58, 63], # [15, 31, 58, 63, 127]
    'subsample': [0.6, 0.708, 0.8], # [0.6, 0.7, 0.708, 0.8, 1.0]
    'colsample_bytree': [0.613, 0.8, 1.0], # [0.6, 0.613, 0.7, 0.8, 1.0]
    # 'learning_rate' : [0.01, 0.02, 0.03]
}]

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html
best_search = model_selection.GridSearchCV(estimator=LGBMClassifier(), param_grid=param, cv=cv_split, scoring='roc_auc')
best_search.fit(data1[data1_X], data1[Target].values.reshape(-1,))

# best_param = best_search.best_params_
# clf.set_params(**best_param)
clf = best_search.best_estimator_

print(f"Best Params:\n{str(best_search.best_params_)}")
best_search.best_score_ # Compare this with LB
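
If the full grid is too expensive, one commonly used alternative is a randomized search, which samples a fixed number of candidates instead of trying all of them. This is a different technique from the grid search above and is shown as a sketch only: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `LGBMClassifier` so that it runs without the competition files, and the parameter ranges are illustrative assumptions, not tuned values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

# Synthetic stand-in for data1[data1_X] and data1[Target]
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

# Distributions to sample from (assumed ranges, for illustration)
param_dist = {
    'n_estimators': randint(20, 60),   # integers in [20, 60)
    'max_depth': randint(2, 5),        # integers in [2, 5)
    'subsample': uniform(0.6, 0.4),    # floats in [0.6, 1.0]
}

cv_split = ShuffleSplit(n_splits=3, test_size=0.2, random_state=1)

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,              # 5 sampled candidates, so 5 * 3 = 15 cross-validation fits
    cv=cv_split,
    scoring='roc_auc',
    random_state=1,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The cost is controlled directly by `n_iter`, which makes it easier to fit a search into a fixed session limit than with an exhaustive grid.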

# Predicting

In [None]:
# After several submissions, RandomForestClassifier gave my highest score
# model = ensemble.RandomForestClassifier(**{'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 50, 'random_state': 1})

# model = LGBMClassifier(**{'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 1000, 'num_leaves': 15, 'subsample': 0.6})
# model.fit(data1[data1_X], data1[Target].values.reshape(-1, ))
# predictions = model.predict(data_val[data1_X])
predictions = clf.predict(data_val[data1_X])

output = pd.DataFrame({'PassengerId': data_val.PassengerId, 'Survived': predictions})
output.to_csv('./my_submission_RandomForestClassifier_tunned_F4.csv', index=False)
print("Your submission was successfully saved!")

## Feature Importance

In [None]:
# https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html
from lightgbm import plot_importance

plot_importance(clf)

# References
* [A Data Science Framework: To Achieve 99% Accuracy](https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy)
* [MY FIRST KAGGLE WORK TITANIC](https://www.kaggle.com/saptarshisit/my-first-kaggle-work-titanic)