# TF Decision Forests on Spaceship Titanic Dataset

In this notebook, the main aim is to build a tree-based model that predicts whether passengers will be **transported** to the alternate dimension, using *TensorFlow Decision Forests*, an ML library open-sourced by Google. It was released just a year ago.

As the name suggests, this library uses a traditional machine learning algorithm, the **Decision Tree**, as its building block. The decision tree models most widely used to win Kaggle competitions are based on ensembling techniques: **Random Forests**, which is a bagging model, and **Gradient Boosted Decision Trees**, which is a boosting model. Both of these models are available in this library.

![GradientBoosting](https://miro.medium.com/max/1400/1*Rn-u1k5_8O4Vk7HQrPiX6w.png)

Advantages are as follows:
- For tabular data, Decision Forests often match or outperform Deep Learning methods.
- Preprocessing steps such as one-hot encoding, normalization and handling of missing values are not required for basic ML models, as the library supports them natively. This saves a considerable amount of time.
- It is as easy to use as a scikit-learn model. Beginners can therefore start developing models quickly and can also explain decision forest models easily.
- It differs from the sklearn library in that these models are more optimized internally, so extensive hyperparameter tuning is usually not needed.
- For advanced users, the library provides easy interfacing for combining Neural Network models and Decision Trees.

## 0. Getting tools ready

In [None]:
!pip3 install -q tensorflow_decision_forests
!pip3 install -q klib

In [None]:
# Data Handling and Manipulation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import klib

# Machine Learning Library
import tensorflow_decision_forests as tfdf

# To split the data into train and validation sets
from sklearn.model_selection import train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. Loading the Dataset

In [None]:
train_df = pd.read_csv('../input/spaceship-titanic/train.csv')
test_df = pd.read_csv('../input/spaceship-titanic/test.csv')
submission_id = test_df.PassengerId

In [None]:
train_df.sample(5, random_state=42)

In [None]:
train_df.info()

## 2. Exploratory Data Analysis

### 2.1 Categorical Data

In [None]:
klib.cat_plot(train_df.drop(['CryoSleep', 'Transported', 'VIP'], axis=1), top=5, bottom=5)

### 2.2 Numerical Data

In [None]:
klib.dist_plot(train_df)

### 2.3 Visualization of Missing Values

In [None]:
klib.missingval_plot(train_df)

### 2.4 Correlation Matrix Plots for All Features and for the Target Feature

In [None]:
klib.corr_plot(train_df)

In [None]:
klib.corr_plot(train_df, target='Transported')

## 3. Perform Train and Validation Dataset Split

In [None]:
train_df_1, valid_df = train_test_split(train_df, test_size=.2, random_state=42, stratify=train_df['Transported'])

## 4. Feature Engineering

The existing features may contain hidden information that our models cannot use directly. We can therefore derive more meaningful features that have a stronger impact on the target feature than their parent features.

<b>Note:</b> As categorical and missing values are handled well by TensorFlow Decision Forests, we will not be handling missing values.

### 4.1 Creating New Features - `Group` and `Num_members` 

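Based on the competition's data description, `PassengerId` has the form `gggg_pp`: the first 4 characters identify the group a passenger travels with, and the 2 characters after the underscore give the passenger's number within that group. Note that this is a position within the group rather than the group size, although the code below stores it in a column named `Num_members`.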

In [None]:
train_df_1 =  pd.concat([train_df_1, train_df_1['PassengerId'].str.split('_', expand=True)], axis=1)
train_df_1.rename({0: 'Group', 1: 'Num_members'}, axis=1, inplace=True)
train_df_1['Num_members'] = train_df_1['Num_members'].astype(int)

In [None]:
train_df_1['Num_members'].unique()

In [None]:
valid_df =  pd.concat([valid_df, valid_df['PassengerId'].str.split('_', expand=True)], axis=1)
valid_df.rename({0: 'Group', 1: 'Num_members'}, axis=1, inplace=True)
valid_df['Num_members'] = valid_df['Num_members'].astype(int)

In [None]:
test_df =  pd.concat([test_df, test_df['PassengerId'].str.split('_', expand=True)], axis=1)
test_df.rename({0: 'Group', 1: 'Num_members'}, axis=1, inplace=True)
test_df['Num_members'] = test_df['Num_members'].astype(int)
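To make the ID format concrete, here is a minimal sketch on a few made-up IDs (not rows from the dataset). It also derives a hypothetical `Group_size` column, which is not part of this notebook, to show how the number of passengers per group could be counted, since the part after the underscore is only a position within the group.

```python
import pandas as pd

# Made-up IDs in the gggg_pp format; not taken from the competition files.
toy = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"]})

# Same split as in the cells above: group code and position within the group.
parts = toy["PassengerId"].str.split("_", expand=True)
toy["Group"] = parts[0]
toy["Num_members"] = parts[1].astype(int)

# Hypothetical extra feature: how many passengers share each group code.
toy["Group_size"] = toy.groupby("Group")["PassengerId"].transform("count")

print(toy)
```

If such a count were computed on the real data, it would only see the rows of whichever split it is applied to, so groups divided between the train and validation sets would be undercounted.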

In [None]:
train_df_1.sample(3, random_state=42)

In [None]:
valid_df.sample(3, random_state=42)

In [None]:
test_df.sample(3, random_state=42)

We no longer need the `PassengerId` feature.

In [None]:
train_df_1.drop(['PassengerId'], axis=1, inplace=True)
valid_df.drop(['PassengerId'], axis=1, inplace=True)
test_df.drop(['PassengerId'], axis=1, inplace=True)

### 4.2 Creating New Features - `Deck`, `Cabin_num` and `Side`

In [None]:
train_df_1 = pd.concat([train_df_1, train_df_1['Cabin'].str.split('/', expand=True)], axis=1)
valid_df = pd.concat([valid_df, valid_df['Cabin'].str.split('/', expand=True)], axis=1)
test_df = pd.concat([test_df, test_df['Cabin'].str.split('/', expand=True)], axis=1)

In [None]:
cabin_mapper = {0: 'Deck', 1: 'Cabin_num', 2: 'Side'}
train_df_1.rename(cabin_mapper, axis=1, inplace=True)
valid_df.rename(cabin_mapper, axis=1, inplace=True)
test_df.rename(cabin_mapper, axis=1, inplace=True)
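A small sketch of what this split produces, using made-up cabin strings in the `deck/num/side` form and one missing value. The numeric conversion at the end is an optional step added here for illustration; it is not applied in this notebook, where `Cabin_num` stays a string column.

```python
import numpy as np
import pandas as pd

# Made-up cabins; the last passenger has no cabin recorded.
cabins = pd.Series(["B/0/P", "F/12/S", np.nan], name="Cabin")

# expand=True yields one column per piece; a missing cabin gives a row of NaN.
parts = cabins.str.split("/", expand=True)
parts = parts.rename({0: "Deck", 1: "Cabin_num", 2: "Side"}, axis=1)

# Optional: treat the cabin number as numeric instead of as a string.
parts["Cabin_num"] = pd.to_numeric(parts["Cabin_num"])

print(parts)
```

Whether the cabin number works better as a numeric or a categorical feature is something to check on the validation set rather than assume.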

In [None]:
train_df_1.sample(3, random_state=42)

In [None]:
valid_df.sample(3, random_state=42)

In [None]:
test_df.sample(3, random_state=41)

The `Cabin` feature is no longer required, so we will remove it.

In [None]:
train_df_1.drop('Cabin', axis=1, inplace=True)
valid_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)

More feature engineering can be done, but we will leave that for other learners to explore.

Boolean values are not supported by the TensorFlow library. Therefore, we need to convert them into numerical values: 0 for False and 1 for True.

In [None]:
transported = {True: 1, False: 0}
train_df_1['Transported'] = train_df_1['Transported'].map(transported)
valid_df['Transported'] = valid_df['Transported'].map(transported)

In [None]:
cryosleep = {True: 1, False: 0}
train_df_1['CryoSleep'] = train_df_1['CryoSleep'].map(cryosleep)
valid_df['CryoSleep'] = valid_df['CryoSleep'].map(cryosleep)
test_df['CryoSleep'] = test_df['CryoSleep'].map(cryosleep)

In [None]:
is_vip = {True: 1, False: 0}
train_df_1['VIP'] = train_df_1['VIP'].map(is_vip)
valid_df['VIP'] = valid_df['VIP'].map(is_vip)
test_df['VIP'] = test_df['VIP'].map(is_vip)
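One detail worth checking, shown here on a made-up column: `map` only translates the keys it is given, so missing entries stay missing after the conversion. That is consistent with the earlier decision to leave missing values to the library.

```python
import numpy as np
import pandas as pd

# Made-up boolean column with one missing entry, similar to CryoSleep or VIP.
flags = pd.Series([True, False, np.nan, True], dtype=object)

# True/False become 1/0; NaN has no key in the dict, so it stays NaN.
mapped = flags.map({True: 1, False: 0})

print(mapped.tolist())
print(mapped.dtype)
```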

In [None]:
train_df_1.sample(3, random_state=42)

In [None]:
valid_df.sample(3, random_state=42)

In [None]:
test_df.sample(3, random_state=41)

Now, the dataframes are ready to be converted into TensorFlow datasets.

In [None]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df_1, label="Transported")
validation_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label="Transported")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df)

## 5. Training Model using TensorFlow Decision Forests

### 5.1 Using RandomForestModel 

In [None]:
# Specify the model.
model_1 = tfdf.keras.RandomForestModel()

# Train the model.
model_1.fit(x=train_ds)

# Evaluate the model
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

For a base model, 80% is a good score. Let's perform hyperparameter tuning and evaluate the accuracy.

In [None]:
model_2 = tfdf.keras.RandomForestModel(
    num_trees=300,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=11,
    split_axis="SPARSE_OBLIQUE",
    categorical_algorithm="RANDOM",
)

# Train the model.
model_2.fit(x=train_ds)

# Evaluate the model
model_2.compile(metrics=["accuracy"])
evaluation = model_2.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")
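If several manual combinations are to be compared, the candidates can be generated rather than typed one by one. This is only a sketch: the value lists below are illustrative choices, not tuned recommendations, and the training step is left as a comment because it follows the same fit-and-evaluate pattern as above.

```python
from itertools import product

# Illustrative candidate values (assumptions, not recommendations).
search_space = {
    "num_trees": [100, 300],
    "max_depth": [11, 16],
    "split_axis": ["AXIS_ALIGNED", "SPARSE_OBLIQUE"],
}

# Build one keyword-argument dict per combination.
names = list(search_space)
candidates = [dict(zip(names, values)) for values in product(*search_space.values())]

for params in candidates:
    # Each dict could be passed on as tfdf.keras.RandomForestModel(**params),
    # then fitted on train_ds and evaluated on validation_ds as above.
    print(params)
```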

Manually entering values and changing the default parameters to find the best accuracy gets cumbersome, though. Google's developers have already come up with good hyperparameter combinations that give better results than the defaults; these have been indexed and are available as hyperparameter templates.

In [None]:
# The hyper-parameter templates of the Random Forest model.
print(tfdf.keras.RandomForestModel.predefined_hyperparameters())

In [None]:
# Specify the model.
model_3 = tfdf.keras.RandomForestModel(hyperparameter_template="better_default")

# Train the model.
model_3.fit(x=train_ds)

# Evaluate the model
model_3.compile(metrics=["accuracy"])
evaluation = model_3.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

In [None]:
# Specify the model.
model_4 = tfdf.keras.RandomForestModel(hyperparameter_template="benchmark_rank1")

# Train the model.
model_4.fit(x=train_ds)

# Evaluate the model
model_4.compile(metrics=["accuracy"])
evaluation = model_4.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

### 5.2 Using GradientBoostedTreesModel

In [None]:
# Specify the model.
model_5 = tfdf.keras.GradientBoostedTreesModel()

# Train the model.
model_5.fit(x=train_ds)

# Evaluate the model
model_5.compile(metrics=["accuracy"])
evaluation = model_5.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

Let's perform hyperparameter tuning and evaluate the accuracy.

In [None]:
# The hyper-parameter templates of the Gradient Boosted Trees model.
print(tfdf.keras.GradientBoostedTreesModel.predefined_hyperparameters())

In [None]:
# Specify the model.
model_6 = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="better_default")

# Train the model.
model_6.fit(x=train_ds)

# Evaluate the model
model_6.compile(metrics=["accuracy"])
evaluation = model_6.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

In [None]:
# Specify the model.
model_7 = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")

# Train the model.
model_7.fit(x=train_ds)

# Evaluate the model
model_7.compile(metrics=["accuracy"])
evaluation = model_7.evaluate(validation_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

## 6. Conclusions and Recommendation

Just by adjusting a few parameters, we saw an improvement in accuracy for the Random Forest model, but for the Gradient Boosted Trees model it was the other way round. The model can be improved further with more feature engineering and more rounds of hyperparameter tuning. As this competition is evaluated on accuracy alone, we have not evaluated the model on other metrics such as:
- Receiver Operating Characteristic (ROC) Curve
- Precision, Recall and F1 or F-Beta score
- Confusion Matrix

Accuracy alone does not capture how good a model is, so the metrics above should be evaluated as well. I leave this to the readers of this notebook.
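As a starting point for that exercise, here is a minimal sketch of how the confusion matrix counts, precision, recall and F1 relate to each other. The labels and predictions are made up for illustration and are not results from any model in this notebook; the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and predictions (1 = transported, 0 = not transported).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("confusion counts (tp, fp, fn, tn):", (tp, fp, fn, tn))
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On the real data, `y_true` would come from `valid_df['Transported']` and `y_pred` from thresholded validation predictions.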

## 7. Resources

- Official Tensorflow Tutorial: https://www.tensorflow.org/decision_forests/tutorials/beginner_colab
- Klib for EDA: https://klib.readthedocs.io/en/latest/

## 8. Submission

In [None]:
def proba_output(num):
    # Map a predicted probability to a boolean label using a .4999 cut-off.
    return bool(num >= .4999)

As model 5 yielded the highest validation accuracy of all the models, we will use it for the submission.

In [None]:
y_preds = model_5.predict(test_ds).reshape(test_df.shape[0],)
output = list(map(proba_output, y_preds))
submission_df = pd.DataFrame({'PassengerId': submission_id, 'Transported': output})
submission_df.to_csv('submission.csv', index=False)
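The thresholding step can also be written as a single vectorised comparison. A minimal sketch, assuming the model returns one probability per row in an array of shape `(n, 1)`; the probabilities below are made up and stand in for the real predictions.

```python
import numpy as np

# Made-up probabilities standing in for the output of model.predict(test_ds).
y_proba = np.array([[0.12], [0.55], [0.87], [0.30]])

# Flatten to one dimension and compare against the notebook's cut-off.
threshold = 0.4999
labels = y_proba.ravel() >= threshold

print(labels.tolist())
```

The resulting boolean array could be used directly as the `Transported` column of the submission dataframe.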