## TensorFlow Decision Forests

TensorFlow Decision Forests is a recent addition to the TensorFlow family for working with tabular data. Because it is exposed through the Keras API, tree-based models become easier to integrate with neural networks.

Decision forests are a family of machine learning algorithms whose quality and speed are competitive with (and often better than) neural networks, especially when you’re working with tabular data. They’re built from many decision trees, which makes them easy to use and understand, and you can take advantage of the many interpretability tools and techniques that already exist.

- The library provides a slew of state-of-the-art decision forest training and serving algorithms, such as random forests, gradient-boosted trees, CART, (Lambda)MART, DART, Extra Trees, greedy global growth, oblique trees, one-side-sampling, categorical-set learning, random categorical learning, out-of-bag evaluation and feature importance, and structural feature importance.

- This library can serve as a bridge to the rich TensorFlow ecosystem by making it easier for you to integrate tree-based models with various TensorFlow tools, libraries, and platforms such as TFX.

For more information, see https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html
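
To make the idea of "many decision trees" concrete, here is a minimal, dependency-free sketch of a random-forest style majority vote. The three hand-written trees, the feature names and the thresholds are all made up for illustration; this is not how the library builds or stores its trees, and gradient-boosted trees combine their outputs by summing rather than voting.

```python
from collections import Counter

# Three hand-written "trees" (decision stumps) over a toy record.
# Feature names and thresholds are hypothetical.
def tree_a(row):
    return "Class_1" if row["feature_0"] < 5 else "Class_2"

def tree_b(row):
    return "Class_1" if row["feature_1"] < 2 else "Class_2"

def tree_c(row):
    return "Class_2" if row["feature_0"] + row["feature_1"] > 8 else "Class_1"

def forest_predict(trees, row):
    """Return the class predicted by the most trees (majority vote)."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
row = {"feature_0": 3, "feature_1": 4}
prediction = forest_predict(trees, row)
print(prediction)
```

Each tree on its own is easy to read, which is where much of the interpretability of forests comes from.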

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams['axes.titlesize'] = 16
sns.set_style('whitegrid')  # 'seaborn-whitegrid' was renamed in newer matplotlib releases
sns.set_palette('Set3')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from time import time, strftime, gmtime
start = time()
import datetime
print(str(datetime.datetime.now()))

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/train.csv')
print(train.shape)
train.head()

In [None]:
test = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/test.csv')
print(test.shape)
test.head()

In [None]:
sub = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/sample_submission.csv')
print(sub.shape)
sub.head()

In [None]:
ax = sns.countplot(data = train, x = 'target')
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

In [None]:
train.info(), test.info()

In [None]:
train.describe().T

In [None]:
test.describe().T

__Install TensorFlow Decision Forests__

In [None]:
!pip install tensorflow_decision_forests -q

In [None]:
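# Work on a 10% sample to keep training time short; a fixed seed makes the sample reproducible
train = train.sample(frac = 0.1, random_state = 42).reset_index(drop = True)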
train.shape

In [None]:
train_df, valid_df = train_test_split(train, test_size = 0.2, shuffle = True, stratify = train['target'].values, 
                                      random_state = 42)
train_df.shape, valid_df.shape
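
The `stratify` argument used above keeps the class proportions of `target` roughly equal in both parts of the split. The following is a rough, dependency-free sketch of that idea, not scikit-learn's actual implementation; the helper name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Toy, imbalanced labels (made up): 80 rows of Class_1 and 20 rows of Class_2.
labels = ["Class_1"] * 80 + ["Class_2"] * 20
train_idx, valid_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in valid_idx))
```

Both parts keep the 80/20 class ratio, which matters when some classes are rare.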

__Models available in TensorFlow Decision Forests__

In [None]:
import tensorflow_decision_forests as tfd

tfd.keras.get_all_models()

__Convert the pandas DataFrame to a TensorFlow dataset__

In [None]:
train_tf = tfd.keras.pd_dataframe_to_tf_dataset(train_df, label = 'target')
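
The `target` column holds strings, while the model works with integer class ids. As far as I understand, `pd_dataframe_to_tf_dataset` handles this for classification by mapping each string label to its position in the sorted list of unique values; treat that as an assumption and check it against the library documentation. The sketch below only mimics that mapping in plain Python; `encode_labels` and the toy labels are hypothetical.

```python
def encode_labels(labels):
    """Map string labels to integer ids using the sorted list of unique values."""
    vocabulary = sorted(set(labels))
    index = {name: i for i, name in enumerate(vocabulary)}
    return [index[name] for name in labels], vocabulary

# Toy labels in the style of the target column (values made up).
labels = ["Class_6", "Class_2", "Class_9", "Class_2"]
encoded, vocabulary = encode_labels(labels)
print(encoded)
print(vocabulary)
```

The sort is lexicographic, so a label such as "Class_10" would come before "Class_2". That matters later when prediction columns are matched to class names.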

In [None]:
model = tfd.keras.GradientBoostedTreesModel(
    num_trees = 500,
    growing_strategy = "BEST_FIRST_GLOBAL",
    max_depth = 8,
    split_axis = "SPARSE_OBLIQUE",
    categorical_algorithm = "RANDOM",
    )

The full list of hyperparameters can be displayed with the command below.

In [None]:
?tfd.keras.GradientBoostedTreesModel

In [None]:
%%time
#Train the model
model.fit(x = train_tf)

In [None]:
model.summary()

__Evaluation__

In [None]:
valid_tf = tfd.keras.pd_dataframe_to_tf_dataset(valid_df, label = 'target')

model.compile(metrics = ["accuracy"])
ev = model.evaluate(valid_tf)

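- The first entry that model.evaluate returns is the Keras loss. TensorFlow Decision Forests does not train through a Keras loss, so this value is typically reported as 0 and should not be read as a cross-entropy (the target here is multi-class in any case, not binary).
- The second entry is the evaluation metric supplied when compiling the model (accuracy).

In [None]:
print(f"Keras loss (not used by TF-DF): {ev[0]}")
print(f"Accuracy: {ev[1]}")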

#Save model
model.save('./tps_model')
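
If a multi-class log loss on the validation split is wanted independently of what `model.evaluate` reports, it can be computed directly from predicted class probabilities. This is a minimal sketch on made-up numbers; the helper names are hypothetical, and it assumes integer labels that index the probability columns in the same class order the model uses.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

def accuracy(y_true, probabilities):
    """Share of rows whose highest-probability class equals the true class."""
    hits = sum(
        1 for label, row in zip(y_true, probabilities)
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return hits / len(y_true)

# Toy example with three classes (values made up).
y_true = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
loss_value = multiclass_log_loss(y_true, probabilities)
accuracy_value = accuracy(y_true, probabilities)
print(loss_value, accuracy_value)
```

In the notebook, the probabilities would come from something like `model.predict(valid_tf)`, with the validation labels encoded to matching integer ids first.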

### Training Logs Plot

In [None]:
inspector = model.make_inspector()

In [None]:
print('Model Meta-data:\n')
print(f"Model type: {inspector.model_type()}")
print(f"Number of Trees: {inspector.num_trees()}")
print(f"Input features: {inspector.features()}")

In [None]:
train_logs = inspector.training_logs()

plt.plot([log.num_trees for log in train_logs], [log.evaluation.accuracy for log in train_logs])
plt.xlabel("Number of Trees")
plt.ylabel("Validation accuracy")
plt.show()

plt.plot([log.num_trees for log in train_logs], [log.evaluation.loss for log in train_logs])
plt.xlabel("Number of Trees")
plt.ylabel("Validation log loss")
plt.show()
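
Besides plotting, the same logs can be scanned for the number of trees at which the logged loss is lowest. The sketch below uses stand-in objects that mirror only the attributes accessed above (`num_trees`, `evaluation.accuracy`, `evaluation.loss`); the stand-in types and all numbers are made up, not actual training output.

```python
from collections import namedtuple

# Stand-ins for the training log entries; the values are hypothetical.
Evaluation = namedtuple("Evaluation", ["accuracy", "loss"])
LogEntry = namedtuple("LogEntry", ["num_trees", "evaluation"])

toy_logs = [
    LogEntry(1, Evaluation(accuracy=0.30, loss=2.05)),
    LogEntry(50, Evaluation(accuracy=0.35, loss=1.80)),
    LogEntry(100, Evaluation(accuracy=0.36, loss=1.76)),
    LogEntry(150, Evaluation(accuracy=0.36, loss=1.78)),
]

def best_entry(logs):
    """Return the log entry with the lowest loss."""
    return min(logs, key=lambda log: log.evaluation.loss)

best = best_entry(toy_logs)
print(best.num_trees, best.evaluation.loss)
```

With the real logs, `best_entry(train_logs)` should work the same way, assuming every entry carries a loss value.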

### Feature Importance

In [None]:
print("Available variable importances:")
for importance in inspector.variable_importances().keys():
    print(importance)
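
The cell above only lists the names of the importance types. To look at the scores, each entry of the dictionary can be ranked. The sketch below assumes each value is a list of (feature, score) pairs, where the feature is either a plain name or an object with a `name` attribute; the helper `top_features`, the feature names and the scores are made up for illustration.

```python
def top_features(importances, key, k=3):
    """Return the k highest-scoring (feature name, score) pairs for one importance type."""
    pairs = [
        (getattr(feature, "name", feature), score)
        for feature, score in importances[key]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-in for inspector.variable_importances(); values are hypothetical.
toy_importances = {
    "NUM_AS_ROOT": [("feature_3", 12.0), ("feature_7", 40.0), ("feature_1", 5.0)],
}

top = top_features(toy_importances, "NUM_AS_ROOT", k=2)
for name, score in top:
    print(f"{name}: {score}")
```

Which keys are available depends on the model, so use one of the names printed by the cell above.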

### Plot Model

In [None]:
with open('plot_model.html', 'w') as f: 
    f.write(tfd.model_plotter.plot_model(model))

In [None]:
from IPython.display import IFrame

IFrame(src = './plot_model.html', width = 800, height = 500)

### Prediction

In [None]:
test_tf = tfd.keras.pd_dataframe_to_tf_dataset(test)
preds = model.predict(test_tf)
preds

In [None]:
# Columns are addressed by position here, so use .iloc (.loc expects column labels)
sub.iloc[:, 1:] = preds
sub.to_csv('./submission.csv', index = False)
sub.head()
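
Before uploading, it can be worth checking that every prediction row looks like a probability distribution. This is a small sanity-check sketch on made-up rows; the helper name and the tolerance are arbitrary choices, not part of the competition rules or of the library.

```python
import math

def is_probability_row(row, tol=1e-6):
    """True if every value lies in [0, 1] and the row sums to 1 within tol."""
    in_range = all(0.0 <= p <= 1.0 for p in row)
    return in_range and math.isclose(sum(row), 1.0, abs_tol=tol)

# Toy rows (made up): the second one does not sum to 1.
rows = [
    [0.2, 0.5, 0.3],
    [0.6, 0.6, 0.1],
]
bad_rows = [i for i, row in enumerate(rows) if not is_probability_row(row)]
print(bad_rows)
```

In the notebook, the same check could be run over the rows of `preds`, possibly with a looser tolerance for 32-bit floats.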

In [None]:
finish = time()
print(strftime("%H:%M:%S", gmtime(finish - start)))