<h1 style="text-align:center">Model Training</h1>

<h4 style="text-align:center"><a href = 'https://github.com/virchan' style='text-align:center'>https://github.com/virchan</a></h4> 

<h2>Abstract</h2>

In the previous notebook, `demo_titanic_data_cleaning.ipynb`, we performed data cleaning on the dataset provided by the Kaggle competition "Titanic - Machine Learning from Disaster". Now, we proceed to train machine learning models on the refined dataset to evaluate their performance in the subsequent `demo_model_evaluation.ipynb` file.

<h2>Introduction</h2>

This notebook continues the `demo_titanic_data_cleaning.ipynb` file and focuses on model training. Our goal is to predict the survival of Titanic passengers by training nine machine learning models on a dataset obtained from the Kaggle competition __Titanic - Machine Learning from Disaster ([Kaggle link](https://www.kaggle.com/competitions/titanic/))__.

This document serves as a supplementary resource to the general report provided in the `README.md` file. It aims to provide technical details on the machine learning models used and presents a well-organized workflow of the model training stage. The outcome of this stage is the predictions made by each model, which will be further analyzed and evaluated in the `demo_model_evaluation.ipynb` file.

<h2>Initiating the Models</h2>

To begin our analysis, we import the necessary Python libraries.

In [2]:
from titanic_ml_classes.titanic_machine_learning_models import titanic_machine_learning_models as titanic_models

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

Among these imports is the `titanic_machine_learning_models` class, imported here under the alias `titanic_models`, which is designed to streamline our workflow. An instance of it, `Titanic_ML`, organizes our machine learning models in a Python dictionary, accessible via the `.models_` attribute.

In [3]:
Titanic_ML: titanic_models = titanic_models()

In [4]:
for name in Titanic_ML.models_.keys():
    print(name)

dummy
tree
forest
support_vector
neural_network
logistic
gaussian_NB
bernoulli_NB
adaboost


The dictionary comprises a diverse range of models, including:

* Two probabilistic models: Gaussian and Bernoulli Naive Bayes Classifiers
* Two ensemble methods: Random Forest Classifier and AdaBoost Classifier
* One base model: Decision Tree Classifier
* One linear model: Logistic Regression
* One deep learning model: Feedforward Neural Network
* One support vector machine: Support Vector Classifier
* One baseline model: Dummy Classifier

Including the Dummy Classifier is important because it provides a baseline for model comparison: a classifier that cannot outperform the Dummy Classifier has learned nothing useful from the features. Logistic Regression serves as the simplest non-trivial model for classification tasks, but linear models may not capture complex data structures related to human survivability. We therefore add the Neural Network and Support Vector Machine models, which can represent non-linear decision boundaries.

We also note that the features in our training set are categorical. In practical model training, it is often more convenient to convert each categorical feature into multiple binary features, treating them as a set of yes-or-no questions. This representation aligns naturally with the Decision Tree Classifier, the Bernoulli Naive Bayes Classifier, and, arguably, the Random Forest Classifier. Finally, we include the Gaussian Naive Bayes Classifier and the AdaBoost Classifier to diversify our range of models.
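As a small illustration of the "yes-or-no questions" idea, the sketch below one-hot encodes a toy dataframe with `pandas.get_dummies`. The toy values are invented for the example, and this is not necessarily how the notebook's own `transform_data` step encodes the features.

```python
import pandas as pd

# Toy passengers with two categorical features (values are illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
})

# Each (feature, category) pair becomes one binary column,
# e.g. "Pclass_3" answers "is this passenger in third class?"
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one active column per original feature, which is what makes the encoded data read like a list of binary answers.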

For detailed information on each model, please refer to the table below:

<table style = "width:90%">
    
  <tr>
    <th style="text-align: center">Name</th>
    <th style="text-align: center">Notation</th>
    <th style="text-align: center">Type</th>
    <th style="text-align: center">Parameters</th>
    <th style="text-align: center">Documentation</th>
    <th style="text-align: center">Note</th>
  </tr>
    
  <tr>
    <td style="text-align: right">Dummy Classifier</td>
    <td style="text-align: center"><code>dummy</code></td>
    <td style="text-align: center">Baseline Model</td>
    <td style="text-align: center">Default</td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html'>link</a></td>
    <td style="text-align: left">Always returns the most frequent label</td>
  </tr>
    
  <tr>
    <td style="text-align: right">Decision Tree Classifier</td>
    <td style="text-align: center"><code>tree</code></td>
    <td style="text-align: center">Base Model</td>
    <td style="text-align: center">Default</td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html'>link</a></td>
    <td style="text-align: left"></td>
  </tr>

  <tr>
    <td style="text-align: right">Random Forest Classifier</td>
    <td style="text-align: center"><code>forest</code></td>
    <td style="text-align: center">Ensemble Method</td>
      <td style="text-align: center"><code>n_estimators = 201</code></td>
      <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html'>link</a></td>
      <td style="text-align: left"></td>
  </tr>
    
  <tr>
    <td style="text-align: right">Support Vector Classifier</td>
    <td style="text-align: center"><code>support_vector</code></td>
    <td style="text-align: center">Support Vector Machine</td>
      <td style="text-align: center">
          <p><code>C = 1000</code>,</p>
          <p><code>gamma = 0.01</code>,</p>
          <p><code>kernel = 'rbf'</code>,</p>
          <p><code>probability = True</code></p>
      </td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html'>link</a></td>
    <td style="text-align: left"></td>
  </tr>
    
  <tr>
    <td style="text-align: right">Neural Network</td>
    <td style="text-align: center"><code>neural_network</code></td>
    <td style="text-align: center">Deep Learning Method</td>
    <td style="text-align: center">
        <p>Five <code>Dense</code> layers with <code>units = 26, 13, 7, 4, 2</code>.</p>
        <p>The last layer is activated by <code>softmax</code> and the rest are by <code>relu</code>.</p>
        <p>Compiled with <code>loss = 'binary_crossentropy'</code> and <code>optimizer = 'adam'</code>.</p>
      </td>
    <td style="text-align: center"><a href = 'https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense'>link</a></td>
    <td style="text-align: left">Feedforward neural network</td>
  </tr>
    
  <tr>
    <td style="text-align: right">Logistic Regression</td>
    <td style="text-align: center"><code>logistic</code></td>
    <td style="text-align: center">Linear Model</td>
      <td style="text-align: center"><code>solver = 'liblinear'</code></td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>link</a></td>
    <td style="text-align: left"></td>
  </tr>
    
  <tr>
    <td style="text-align: right">Gaussian Naive Bayes Classifier</td>
    <td style="text-align: center"><code>gaussian_NB</code></td>
    <td style="text-align: center">Probabilistic Model</td>
      <td style="text-align: center"><code>var_smoothing = 0.1</code></td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html'>link</a></td>
    <td style="text-align: left"></td>
  </tr>
    
  <tr>
    <td style="text-align: right">Bernoulli Naive Bayes Classifier</td>
    <td style="text-align: center"><code>bernoulli_NB</code></td>
    <td style="text-align: center">Probabilistic Model</td>
      <td style="text-align: center"><code>var_smoothing = 0.1</code></td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB'>link</a></td>
    <td style="text-align: left"></td>
  </tr>
    
  <tr>
    <td style="text-align: right">AdaBoost Classifier</td>
    <td style="text-align: center"><code>adaboost</code></td>
    <td style="text-align: center">Ensemble Method</td>
      <td style="text-align: center"><code>n_estimators = 61</code></td>
    <td style="text-align: center"><a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html'>link</a></td>
    <td style="text-align: left"></td>
  </tr>
</table>
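For orientation, the scikit-learn rows of the table could be written as constructor calls along the following lines. This is a sketch, not the contents of `titanic_machine_learning_models`: the dictionary name `sketch_models` is hypothetical, and the neural network is omitted because it is built with a different library. The table lists `var_smoothing` for `bernoulli_NB`, but scikit-learn's `BernoulliNB` uses `alpha` as its smoothing parameter, so the sketch leaves that model at its defaults rather than guess the intended value.

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dictionary mirroring the "Notation" and "Parameters" columns
sketch_models = {
    "dummy": DummyClassifier(),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(n_estimators=201),
    "support_vector": SVC(C=1000, gamma=0.01, kernel="rbf", probability=True),
    "logistic": LogisticRegression(solver="liblinear"),
    "gaussian_NB": GaussianNB(var_smoothing=0.1),
    # Left at defaults: BernoulliNB has no var_smoothing argument (assumption, see above)
    "bernoulli_NB": BernoulliNB(),
    "adaboost": AdaBoostClassifier(n_estimators=61),
}

for name, model in sketch_models.items():
    print(f"{name}: {type(model).__name__}")
```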

<h2>Training the Models</h2>

Next, we load the dataset into the `data_` dataframe and divide it into training and testing sets in a 7:3 ratio. The source file, `csv/train_cleaned.csv`, is a refined version of the original `csv/train.csv` file provided by Kaggle. We selected specific features from the dataset for training: the `Pclass`, `Sex`, `Cabin`, `age_group`, and `group_size` columns. The rationale behind this selection and the details of our data cleaning process can be found in the `demo_titanic_data_cleaning.ipynb` file.

In [5]:
# load the data
data_: pd.DataFrame = titanic_models.transform_data(pd.read_csv('csv/train_cleaned.csv'))

# Features for model training
X: pd.DataFrame = data_.drop(['PassengerId', 'Survived'], 
                             axis = 1 
                            )

# Labels for model training
y: pd.DataFrame = data_[['PassengerId', 'Survived']]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Store the passenger's id for each sample
id_train: pd.Series = y_train['PassengerId']
id_test: pd.Series = y_test['PassengerId']

# Keep only the labels, flattened to 1-D arrays
y_train: np.ndarray = y_train.drop('PassengerId', axis = 1).values.ravel()
y_test: np.ndarray = y_test.drop('PassengerId', axis = 1).values.ravel()

As shown below, the training set consists of 623 samples, while the testing set contains 268 samples.

In [6]:
print(f"Training set size = {X_train.shape}")
print(f"Testing set size = {X_test.shape}")    

Training set size = (623, 25)
Testing set size = (268, 25)
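The split in this notebook is not seeded, so the rows assigned to each set change from run to run even though the sizes stay at 623 and 268. For readers who want the same split every time, one option (not used in this notebook) is to pass `random_state` to `train_test_split`. The sketch below demonstrates this on a synthetic frame with 891 rows; the column `feature` and the seed values are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned training file
n_samples = 891
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "PassengerId": np.arange(1, n_samples + 1),
    "feature": rng.integers(0, 2, size=n_samples),
    "Survived": rng.integers(0, 2, size=n_samples),
})

X_toy = toy[["feature"]]
y_toy = toy[["PassengerId", "Survived"]]

# Fixing random_state makes the 7:3 split identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
_, _, y_tr_again, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))
print(y_tr["PassengerId"].equals(y_tr_again["PassengerId"]))
```

Carrying `PassengerId` inside the label frame, as the notebook does, keeps each prediction traceable to a passenger after the rows are shuffled.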


The `.fit()` method of the `Titanic_ML` instance emulates similar methods in the `sklearn` and `tensorflow` libraries. Calling the `.fit()` method trains all the models.

In [7]:
# Train the ML models
Titanic_ML.fit(X_train, y_train)

Training dummy...
Model dummy is ready!

Training tree...
Model tree is ready!

Training forest...
Model forest is ready!

Training support_vector...
Model support_vector is ready!

Training neural_network...
Model neural_network is ready!

Training logistic...
Model logistic is ready!

Training gaussian_NB...
Model gaussian_NB is ready!

Training bernoulli_NB...
Model bernoulli_NB is ready!

Training adaboost...
Model adaboost is ready!

All models are ready.
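To make the pattern concrete, here is a minimal sketch of how a wrapper with this behaviour might be organized. The class name `ModelCollectionSketch` is hypothetical, it holds only two scikit-learn models, and it runs on synthetic data; the actual implementation in `titanic_machine_learning_models` may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression


class ModelCollectionSketch:
    """Hypothetical stand-in for the notebook's wrapper class."""

    def __init__(self):
        # Models are stored by name, as in the .models_ attribute
        self.models_ = {
            "dummy": DummyClassifier(),
            "logistic": LogisticRegression(solver="liblinear"),
        }

    def fit(self, X, y):
        # Train every model on the same data, one after another
        for name, model in self.models_.items():
            print(f"Training {name}...")
            model.fit(X, y)
            print(f"Model {name} is ready!\n")
        print("All models are ready.")
        return self

    def predict(self, X):
        # One column of predicted labels per model
        return pd.DataFrame({name: m.predict(X) for name, m in self.models_.items()})

    def predict_proba(self, X):
        # One column per model holding the probability of the positive class
        return pd.DataFrame(
            {name: m.predict_proba(X)[:, 1] for name, m in self.models_.items()}
        )


# Synthetic binary classification data, used only to exercise the sketch
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_clusters_per_class=1, random_state=0
)

collection = ModelCollectionSketch().fit(X_demo, y_demo)
predictions = collection.predict(X_demo)
probabilities = collection.predict_proba(X_demo)
print(predictions.head())
```

Returning one dataframe column per model keeps the later bookkeeping simple, since the true labels and passenger ids can be attached as extra columns.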


We can examine model parameters, such as the coefficients of the `logistic` model, and view the summary of the `neural_network` model, as displayed below.

In [8]:
# Check some of the models

print("Coefficients of the Logistic Regression")
print(Titanic_ML.models_['logistic'].coef_)
print("\n")
print("Neural Network Summary")
# .summary() prints the table itself and returns None, so it is not wrapped in print()
Titanic_ML.models_['neural_network'].summary()

Coefficients of the Logistic Regression
[[-0.40361233 -1.34358455 -2.40555844 -0.21863331 -0.47624875  0.21383911
  -0.11445664  0.49324893  0.66921742 -0.07984428 -0.63432954  1.53655829
   0.20999235  0.55806973  0.62198737 -0.02018114 -0.33324737  0.15586519
   0.42124079  0.34369067 -1.15015701 -1.04390872 -0.32121673 -0.27755068
  -0.74975699]]


Neural Network Summary
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 26)                676       
                                                                 
 dense_1 (Dense)             (None, 13)                351       
                                                                 
 dense_2 (Dense)             (None, 7)                 98        
                                                                 
 dense_3 (Dense)             (None, 4)                 32        
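The parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies this to the 25 input features and the five layer widths listed in the model table; the variable names are chosen for the example.

```python
# 25 input features, followed by the five Dense layer widths from the model table
widths = [25, 26, 13, 7, 4, 2]

param_counts = []
for n_in, n_out in zip(widths, widths[1:]):
    # weights (n_in * n_out) plus one bias per unit (n_out)
    param_counts.append(n_in * n_out + n_out)

for index, count in enumerate(param_counts):
    print(f"Dense layer {index}: {count} parameters")

print(f"Total: {sum(param_counts)} parameters")
```

The first four values agree with the `Param #` column shown in the summary.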
         

<h2>Generating Predictions and Class Probabilities</h2>

Once the models are trained, we generate predictions and class probabilities on the testing set using the `.predict()` and `.predict_proba()` methods, respectively. These methods mirror the functionality of `sklearn` and `tensorflow`.

In [9]:
# Generate predictions and probabilities for each model

data: list = [(X_train, y_train, id_train, 'train'), (X_test, y_test, id_test, 'test')]
    
for X, y, id_, name in data:
    
    # Generate predictions for each model
    predictions_df: pd.DataFrame = Titanic_ML.predict(X)
    predictions_df['y_true'] = y
    predictions_df['PassengerId'] = id_.values
    predictions_df = predictions_df[['PassengerId', 'y_true'] + [model for model in Titanic_ML.models_.keys()]]
    
    # Save the predictions
    predictions_df.to_csv(f'csv/data/{name}_predictions.csv', index = False)
    
    print(f'Predictions on {name}ing set is saved!')
    
    # Generate probabilities for each model
    probabilities_df: pd.DataFrame = Titanic_ML.predict_proba(X)
    probabilities_df['PassengerId'] = id_.values
    probabilities_df = probabilities_df[['PassengerId'] + [model for model in Titanic_ML.models_.keys()]]
    
    # Save the probabilities
    probabilities_df.to_csv(f'csv/data/{name}_survival_rate.csv', index = False)
    
    print(f'Probabilities on {name}ing set is saved!')

print('All predictions and probabilities are saved!')

Generating predictions for each model...
Predicting with dummy...
Predicting with tree...
Predicting with forest...
Predicting with support_vector...
Predicting with neural_network...
Predicting with logistic...
Predicting with gaussian_NB...
Predicting with bernoulli_NB...
Predicting with adaboost...
Predictions on training set is saved!
Generating probabilities for each model...
Computing dummy survival probabilities...
Computing tree survival probabilities...
Computing forest survival probabilities...
Computing support_vector survival probabilities...
Computing neural_network survival probabilities...
Computing logistic survival probabilities...
Computing gaussian_NB survival probabilities...
Computing bernoulli_NB survival probabilities...
Computing adaboost survival probabilities...
All probabilities are computed!
Probabilities on training set is saved!
Generating predictions for each model...
Predicting with dummy...
Predicting with tree...
Predicting with forest...
Predicting wi

Finally, we save the predictions into the `csv/data/train_predictions.csv` and `csv/data/test_predictions.csv` files, and the probabilities into the `csv/data/train_survival_rate.csv` and `csv/data/test_survival_rate.csv` files, marking the completion of our model training stage. The next step is to evaluate the performance of each model. Please refer to the `demo_model_evaluation.ipynb` file for more information.