# [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests): Classification example
> "*TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models. The library is a collection of Keras models and supports classification, regression and ranking.*"

This notebook is heavily based on the official tutorial ["*Build, train and evaluate models with TensorFlow Decision Forests*"](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).

First we shall install the `tensorflow_decision_forests` package.

## Forked Notebook
[Classification using TensorFlow Decision Forests](https://www.kaggle.com/carlmcbrideellis/classification-using-tensorflow-decision-forests) written by [Carl McBride Ellis](https://www.kaggle.com/carlmcbrideellis)

## Notebook Aim
Extend the analysis performed with TensorFlow Decision Forests to understand which elements of the model can be tuned.
***
The initial aim is to review the [minimal](https://github.com/tensorflow/decision-forests/blob/main/examples/minimal.py) baseline model.

In [None]:
!pip3 install -q tensorflow_decision_forests

In [None]:
# Import packages and modules
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow_decision_forests as tfdf

from sklearn.model_selection import train_test_split

In [None]:
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

In [None]:
# Read in the data
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv',index_col=0)
test  = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv', index_col=0)

train.head()

In [None]:
train.shape

In [None]:
# Understand the variable types
train.dtypes.value_counts()

In [None]:
# Understand if there are any missing values present
train.isnull().sum()

In [None]:
# The model did not work well with missing numerical values, so they are imputed before training.
# The initial adjustment (setting them to 0) boosted the score of the model.
# Let's also try an alternative:
# 1st option - replace with zero value
# 2nd option - replace with mean value
def replace_missing(df, method='zero'):
    """Fill missing numerical values in place with 0 ('zero') or the column mean ('mean')."""
    for col in df.columns:
        if df[col].dtype not in [str, object]:
            fill_value = 0 if method == 'zero' else df[col].mean()
            df[col] = df[col].fillna(fill_value)

# First option: fill with zero (the default method)
# replace_missing(train)
# replace_missing(test)
# Second option: fill with the column mean (currently used)
train.fillna(value=train.mean(), inplace=True)
test.fillna(value=test.mean(), inplace=True)
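
A common alternative to filling each frame with its own column means is to compute the means on the training frame only and reuse them for the test frame, so that both sets are transformed identically. The following is a minimal sketch of that idea on toy data; the helper name `impute_with_train_means` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_with_train_means(train_df, test_df, label=None):
    """Fill NaNs in both frames with column means computed on train_df only."""
    feature_cols = [c for c in train_df.columns if c != label]
    means = train_df[feature_cols].mean()
    train_out = train_df.copy()
    test_out = test_df.copy()
    train_out[feature_cols] = train_out[feature_cols].fillna(means)
    test_out[feature_cols] = test_out[feature_cols].fillna(means)
    return train_out, test_out

# Toy frames standing in for the competition data (hypothetical values).
toy_train = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [10.0, 20.0, np.nan],
    "claim": [0, 1, 0],
})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, 40.0]})

filled_train, filled_test = impute_with_train_means(toy_train, toy_test, label="claim")
print(filled_test)
```

The inputs are copied rather than modified in place, so the original frames remain available for comparing imputation strategies.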

In [None]:
# Check for missing
train.isnull().sum()
# test.isnull().sum()

In [None]:
train.head(5)

In [None]:
# Use only 25% of the training data in this example - original method
train_data      = train.sample(frac=0.25, random_state=42)
# Validate on 5% of the remaining rows, so the two sets do not overlap
validation_data = train.drop(train_data.index).sample(frac=0.05, random_state=42)

In [None]:
# Split the dataset into a training and a testing dataset.
# def split_dataset(dataset, test_ratio=0.30):
#     """Splits a panda dataframe in two."""
#     test_indices = np.random.rand(len(dataset)) < test_ratio
#     return dataset[~test_indices], dataset[test_indices]


# train_ds_pd, val_ds_pd = split_dataset(train)
# print("{} examples in training, {} examples for testing.".format(
#     len(train_ds_pd), len(val_ds_pd)))
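
Since `train_test_split` is already imported, another way to build the hold-out set is a stratified split, which keeps the proportion of each `claim` value the same in both parts. This is only a sketch on a toy frame: the column `f1`, the 150/50 class counts and the 30% ratio are illustrative assumptions, not values from the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `train` (hypothetical values, imbalanced label).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "f1": rng.normal(size=200),
    "claim": np.repeat([0, 1], [150, 50]),
})

# Hold out 30% of the rows, preserving the label proportions in both parts.
train_part, val_part = train_test_split(
    toy, test_size=0.30, stratify=toy["claim"], random_state=42
)

print(len(train_part), len(val_part))
print(train_part["claim"].mean(), val_part["claim"].mean())
```

Fixing `random_state` makes the split reproducible, which helps when comparing models trained with different settings.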

In [None]:
train_data['claim'].value_counts().to_frame().T

In [None]:
validation_data['claim'].value_counts().to_frame().T

In [None]:
# Convert the dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_data, label="claim"
)                                          
val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    validation_data, label="claim"
)                                 
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test)

Let's try out the [`tfdf.keras.RandomForestModel`](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel).
The previous notebook used the [`tfdf.keras.GradientBoostedTreesModel`](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel). Let's try hyperparameter tuning later.

***
Another available model: [`tfdf.keras.CartModel`](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)

In [None]:
%%time

# Train a Random Forest model.
model = tfdf.keras.RandomForestModel()

# Add evaluation metrics
model.compile(
    metrics=["accuracy"]
)
model.fit(x=train_ds)

# # Train a Gradient Boosted Trees model.
# model = tfdf.keras.GradientBoostedTreesModel(num_trees=1500)
# model.fit(train_ds)

In [None]:
# Evaluate the model
evaluate = model.evaluate(val_ds, return_dict=True)
print()

for name, value in evaluate.items():
    print(f"{name}: {value:.4f}")

In [None]:
# Model Summary
model.summary()

In [None]:
# Model features
model.make_inspector().features()

In [None]:
# Feature importance
model.make_inspector().variable_importances()

In [None]:
# Model self evaluation
model.make_inspector().evaluation()

In [None]:
logs = model.make_inspector().training_logs()

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")
plt.show()

Calculate the ROC AUC score on our hold-out validation dataset.

In [None]:
from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class for each validation row
predictions = model.predict(val_ds).ravel()
y_true      = validation_data["claim"]

ROC_AUC = roc_auc_score(y_true, predictions)
print("The ROC AUC score is %.5f" % ROC_AUC )
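
To make the metric concrete: ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks this interpretation against `roc_auc_score` on toy values. The helper `pairwise_auc`, the toy labels and the toy scores are assumptions made for illustration, and the `(n, 1)` score shape mimics what a binary classifier's `predict` is assumed to return here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float).ravel()  # accepts (n, 1) input
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical labels and predicted probabilities, shaped (n, 1).
toy_labels = np.array([0, 0, 1, 1, 0, 1])
toy_scores = np.array([[0.1], [0.4], [0.35], [0.8], [0.2], [0.7]])

manual_auc = pairwise_auc(toy_labels, toy_scores)
sklearn_auc = roc_auc_score(toy_labels, toy_scores.ravel())
print(f"pairwise: {manual_auc:.5f}, sklearn: {sklearn_auc:.5f}")
```

The pairwise version needs memory proportional to the number of positive-negative pairs, so it is only meant for small illustrations; `roc_auc_score` remains the practical choice on the full validation set.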

Now write out the submission file (`submission_mean_miss.csv`).

In [None]:
sample          = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')
sample['claim'] = model.predict(test_ds).ravel()  # flatten the (n, 1) predictions to one value per row
sample.to_csv('submission_mean_miss.csv',index=False)

### Future work
Perform hyperparameter tuning on the Random Forest and the other Decision Forest models.
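
One simple way to organise such a search is to enumerate a small grid of constructor arguments and train one model per combination. The sketch below only builds the grid: `num_trees`, `max_depth` and `min_examples` are arguments of `tfdf.keras.RandomForestModel`, while the candidate values and the helper `expand_grid` are illustrative assumptions rather than tuned recommendations.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned recommendations.
search_space = {
    "num_trees": [100, 300],
    "max_depth": [8, 16],
    "min_examples": [5, 20],
}

def expand_grid(space):
    """Return one dict of keyword arguments per combination in the grid."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]

candidates = expand_grid(search_space)
print(len(candidates), "candidate settings")
print(candidates[0])

# Each candidate could then be trained and scored on the validation set, e.g.
#   model = tfdf.keras.RandomForestModel(**params)
#   model.fit(train_ds)
# keeping the setting with the best validation ROC AUC.
```

Training time grows with the number of combinations, so a grid this small is a starting point; the 25% training sample used earlier keeps each run manageable.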

In [None]:
# List the available learning algorithms that the model could be re-trained with
tfdf.keras.get_all_models()

## Related reading
* [Introducing TensorFlow Decision Forests](https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html)
* [TensorFlow Decision Forests](https://github.com/tensorflow/decision-forests) GitHub
* [Yggdrasil Decision Forests](https://github.com/google/yggdrasil-decision-forests) GitHub

**Related Kaggle notebooks**

* ["*Decision Forest for dummies*"](https://www.kaggle.com/kritidoneria/decision-forest-for-dummies) written by [KritiDoneria](https://www.kaggle.com/kritidoneria) and [Laurent Pourchot](https://www.kaggle.com/pourchot)
* ["*Decision Forest fed by Neural Network*"](https://www.kaggle.com/pourchot/decision-forest-fed-by-neural-network) written by [Laurent Pourchot](https://www.kaggle.com/pourchot)