## TensorFlow Decision Forests

TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models. The library is a collection of Keras models and supports classification, regression and ranking.

TF-DF is a wrapper around the Yggdrasil Decision Forest C++ libraries. Models trained with TF-DF are compatible with Yggdrasil Decision Forests' models, and vice versa.

In this notebook we compare TensorFlow Decision Forests models with Scikit-learn (sklearn), LightGBM, CatBoost and XGBoost classifier models. All models are run with their default parameters, and their accuracy on the test set is measured for comparison.
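
Accuracy is the fraction of test rows whose predicted label matches the true label. A minimal sketch of that computation follows; the `accuracy` helper and the toy labels are illustrative assumptions, not part of the notebook or the dataset. The notebook itself uses sklearn's `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
toy_acc = accuracy(toy_true, toy_pred)
print(f"Toy accuracy: {toy_acc:.2f}")
```

Accuracy treats both classes equally, so on an imbalanced target it is best read alongside the class proportions.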

In [None]:
!pip install -q tensorflow_decision_forests

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os
import random
import warnings


def seed_everything(seed):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)


warnings.filterwarnings("ignore")
seed_everything(42)

In [None]:
import tensorflow_decision_forests as tfdf

In [None]:
tfdf.keras.get_all_models()

## PIMA Indians Diabetes Dataset

The dataset consists of several medical predictor variables and one target variable, Outcome.

The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [None]:
# Reading the dataset

df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

In [None]:
df.head()

In [None]:
# Splitting the dataset

train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["Outcome"], random_state=42
)
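
The `stratify` argument above keeps the share of each `Outcome` class roughly the same in the train and test sets. The sketch below illustrates the idea in plain Python. The `stratified_split` helper and the toy labels are assumptions made for illustration, and this is not sklearn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = random.Random(seed)

    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 20 negatives and 10 positives
toy_labels = [0] * 20 + [1] * 10
toy_train, toy_test = stratified_split(toy_labels, test_size=0.3, seed=42)

train_pos_rate = sum(toy_labels[i] for i in toy_train) / len(toy_train)
test_pos_rate = sum(toy_labels[i] for i in toy_test) / len(toy_test)
print(f"positive rate: train={train_pos_rate:.3f}, test={test_pos_rate:.3f}")
```

Without stratification, a random split of a small dataset can leave the test set with a noticeably different class balance, which makes accuracy figures harder to compare.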

In [None]:
# Convert the dataset into a TensorFlow dataset

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="Outcome")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="Outcome")

## TF-DF Random Forest Model

In [None]:
# Train a Random Forest model

model_rf = tfdf.keras.RandomForestModel()
model_rf.fit(train_ds)

In [None]:
# Summary of the model structure

model_rf.summary()

In [None]:
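# Evaluate the model: predict() returns the probability of Outcome == 1,
# which is thresholded at 0.5 to get class labels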

preds_rf = np.where(model_rf.predict(test_ds) < 0.5, 0, 1).ravel()

acc_rf = accuracy_score(test_df["Outcome"].values, preds_rf)

print(f"Test set accuracy of Random Forest model is {acc_rf:.6f}")

## TF-DF Gradient Boosted Trees Model

In [None]:
# Train a Gradient Boosted Trees Model

model_gbt = tfdf.keras.GradientBoostedTreesModel()
model_gbt.fit(train_ds)

In [None]:
# Summary of the model structure

model_gbt.summary()

In [None]:
# Evaluate the model

preds_gbt = np.where(model_gbt.predict(test_ds) < 0.5, 0, 1).ravel()

acc_gbt = accuracy_score(test_df["Outcome"].values, preds_gbt)

print(f"Test set accuracy of Gradient Boosted Trees model is {acc_gbt:.6f}")

## TF-DF CART Model

In [None]:
# Train a CART Model

model_cart = tfdf.keras.CartModel()
model_cart.fit(train_ds)

In [None]:
# Summary of the model structure

model_cart.summary()

In [None]:
# Evaluate the model

preds_cart = np.where(model_cart.predict(test_ds) < 0.5, 0, 1).ravel()

acc_cart = accuracy_score(test_df["Outcome"].values, preds_cart)

print(f"Test set accuracy of CART model is {acc_cart:.6f}")

## Scikit-learn Models

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_rf_sk = RandomForestClassifier()

model_rf_sk.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values)

acc_rf_sk = accuracy_score(
    test_df["Outcome"].values, model_rf_sk.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of sklearn's Random Forest Classifier model is {acc_rf_sk:.6f}"
)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model_gbt_sk = GradientBoostingClassifier()

model_gbt_sk.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values)

acc_gbt_sk = accuracy_score(
    test_df["Outcome"].values, model_gbt_sk.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of sklearn's Gradient Boosting Classifier model is {acc_gbt_sk:.6f}"
)

In [None]:
from sklearn.tree import DecisionTreeClassifier

model_dtc_sk = DecisionTreeClassifier()

model_dtc_sk.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values)

acc_dtc_sk = accuracy_score(
    test_df["Outcome"].values, model_dtc_sk.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of sklearn's Decision Tree Classifier model is {acc_dtc_sk:.6f}"
)

## LightGBM Classifier model

In [None]:
from lightgbm import LGBMClassifier

model_lgb = LGBMClassifier()

model_lgb.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values)

acc_lgb = accuracy_score(
    test_df["Outcome"].values, model_lgb.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of LightGBM's Classifier model is {acc_lgb:.6f}"
)

## CatBoost Classifier model

In [None]:
from catboost import CatBoostClassifier

model_cb = CatBoostClassifier()

model_cb.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values, silent=True)

acc_cb = accuracy_score(
    test_df["Outcome"].values, model_cb.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of CatBoost's Classifier model is {acc_cb:.6f}"
)

## XGBoost Classifier model

In [None]:
from xgboost import XGBClassifier

model_xgb = XGBClassifier()

model_xgb.fit(train_df.drop(["Outcome"], axis=1), train_df["Outcome"].values, verbose=0)

acc_xgb = accuracy_score(
    test_df["Outcome"].values, model_xgb.predict(test_df.drop(["Outcome"], axis=1))
)

print(
    f"Test set accuracy of XGBoost's Classifier model is {acc_xgb:.6f}"
)

## Accuracy Comparison

In [None]:
models = pd.DataFrame(
    {
        "Model": [
            "TF-DF Random Forest",
            "TF-DF Gradient Boosted Trees",
            "TF-DF CART",
            "Sklearn Random Forest",
            "Sklearn Gradient Boosted Trees",
            "Sklearn Decision Tree",
            "LightGBM Classifier",
            "CatBoost Classifier",
            "XGBoost Classifier",
        ],
        "Score": [
            acc_rf,
            acc_gbt,
            acc_cart,
            acc_rf_sk,
            acc_gbt_sk,
            acc_dtc_sk,
            acc_lgb,
            acc_cb,
            acc_xgb,
        ],
    }
)

models.sort_values(by="Score", ascending=False).reset_index(drop=True)

## Further reading

- [TensorFlow Decision Forests (TF-DF) on GitHub](https://github.com/tensorflow/decision-forests)
- [TensorFlow Decision Forests tutorials](https://www.tensorflow.org/decision_forests/tutorials)