# What is a Tree?

In its simplest form, a decision tree can be viewed as a series of nested if/else statements. Each row of the data is passed through these checks on its features, and the path it follows determines which category the row is assigned to.
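To make the if/else picture concrete, here is a minimal hand-written sketch of a single tree. The function name, the feature names and the thresholds are invented for illustration; a real tree learns its splits from the data.

```python
# A hand-written "tree": each nested if/else is one split on a feature.
# Feature names and thresholds are hypothetical, chosen only for illustration.
def toy_tree_predict(row: dict) -> int:
    """Return 1 (stroke) or 0 (no stroke) for one row of data."""
    if row["age"] >= 60:
        if row["hypertension"] == 1:
            return 1
        return 0
    else:
        if row["bmi"] >= 30 and row["hypertension"] == 1:
            return 1
        return 0


rows = [
    {"age": 72, "bmi": 28.0, "hypertension": 1},
    {"age": 35, "bmi": 22.5, "hypertension": 0},
]
predictions = [toy_tree_predict(r) for r in rows]
print(predictions)
```

Each row follows exactly one path from the first check to a `return`, which plays the role of a leaf.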

# TensorFlow Decision Forests

Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often favorable to) neural networks, especially when you’re working with tabular data. They’re built from many decision trees, which makes them easy to use and understand, and you can take advantage of a plethora of interpretability tools and techniques that already exist today. TensorFlow Decision Forests (TF-DF) is a library for training, serving and interpreting these models.

- The TensorFlow Decision Forests library provides a slew of state-of-the-art Decision Forest training and serving algorithms such as random forests, gradient-boosted trees, CART, (Lambda)MART, DART, Extra Trees, greedy global growth, oblique trees, one-side-sampling, categorical-set learning, random categorical learning, out-of-bag evaluation and feature importance, and structural feature importance.

- This library can serve as a bridge to the rich TensorFlow ecosystem by making it easier for you to integrate tree-based models with various TensorFlow tools, libraries, and platforms such as TFX.
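The idea that a forest is built from many decision trees can be sketched in a few lines. This is a simplified illustration, not how TF-DF is implemented: the three "trees" are hypothetical one-split rules, and the forest combines them by majority vote (real libraries may average class probabilities instead).

```python
from collections import Counter


# Three hypothetical "trees", each a single rule on one feature.
def tree_a(row):
    return 1 if row["age"] >= 65 else 0


def tree_b(row):
    return 1 if row["bmi"] >= 30 else 0


def tree_c(row):
    return 1 if row["hypertension"] == 1 else 0


def forest_predict(row, trees):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]
row = {"age": 70, "bmi": 24.0, "hypertension": 1}
print(forest_predict(row, trees))
```

Two of the three trees vote for class 1 on this row, so the forest predicts class 1 even though one tree disagrees.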

For more information, see https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html

### To demonstrate TF Decision Forests, we use the Stroke Prediction Dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams['axes.titlesize'] = 16
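# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in matplotlib 3.6
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)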
sns.set_palette('Set3')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
df['bmi'].isna().sum()

There are 201 null values in the 'bmi' feature.

Usually, if we were to use sklearn's RandomForest, we would have to impute the NaNs and convert the categorical features to numerical ones before proceeding to fit the model (scaling is also applied in the sklearn comparison later in this notebook, although tree-based models do not strictly need it). In TF-DF we can fit the model straight away, as demonstrated below.

#### First, install the tensorflow_decision_forests package

In [None]:
!pip install tensorflow_decision_forests -q

##### We split the dataset into training and validation sets

In [None]:
train_df, valid_df = train_test_split(df, test_size = 0.2, shuffle = True, random_state = 42)
train_df.shape, valid_df.shape

##### The first step is to convert the pandas DataFrame into a TensorFlow dataset that TF-DF can consume, as below

In [None]:
import tensorflow_decision_forests as tfd

train_tf = tfd.keras.pd_dataframe_to_tf_dataset(train_df, label = 'stroke')

##### Below are the available models in TensorFlow Decision Forests

In [None]:
tfd.keras.get_all_models()

In [None]:
#We first demo using RandomForest
#Define the required model
model = tfd.keras.RandomForestModel()

#Train the model
model.fit(x = train_tf)

In [None]:
model.summary()

##### Evaluate the model using the validation data

In [None]:
valid_tf = tfd.keras.pd_dataframe_to_tf_dataset(valid_df, label = 'stroke')

model.compile(metrics = ["accuracy"])
ev = model.evaluate(valid_tf)

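- The first entry that model.evaluate returns is the loss. TF-DF models are not trained against a Keras loss, so this value is typically 0 rather than a binary cross-entropy
- The second entry is the evaluation metric we supplied when compiling the model (accuracy)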
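For reference, accuracy is simply the fraction of rows whose predicted class matches the label. The sketch below computes it by hand for a binary problem; the labels, the predicted probabilities and the 0.5 threshold are made-up illustrative values, not output from the model above.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(y == y_hat) for y, y_hat in zip(labels, predictions))
    return correct / len(labels)


# Hypothetical labels and predicted probabilities of class 1.
labels = [0, 0, 1, 0, 1]
probabilities = [0.1, 0.4, 0.8, 0.6, 0.3]
print(accuracy(labels, probabilities))
```

Three of the five rows are classified correctly here, giving an accuracy of 0.6.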

In [None]:
# TF-DF models are not trained against a Keras loss, so ev[0] is usually 0
print(f"Loss: {ev[0]}")
print(f"Accuracy: {ev[1]}")

# Save the model
model.save('./stroke_model')

### Training Logs Plot

In [None]:
logs = model.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Out-of-bag accuracy")
plt.show()
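The y-axis above is labelled "out-of-bag" because each tree in a random forest is typically trained on a bootstrap sample of the rows, and the rows a tree never saw can be used to evaluate it. The following is a minimal sketch of that sampling step only; `bootstrap_split` is a hypothetical helper, not a TF-DF function.

```python
import random


def bootstrap_split(n_rows, seed):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(n_rows=1000, seed=0)
print(f"{len(out_of_bag) / 1000:.2f} of the rows are out-of-bag for this tree")
```

On average roughly a third of the rows end up out-of-bag for any given tree, which is why a random forest can report a validation-like metric without a separate hold-out set.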

## Feature Importance

In [None]:
inspector = model.make_inspector()
print("Available variable importances:")
for importance in inspector.variable_importances().keys():
    print(importance)

In [None]:
# Number of trees in which each feature is used as the root split.
inspector.variable_importances()["NUM_AS_ROOT"]
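As an illustration of what a structural importance such as NUM_AS_ROOT measures, the sketch below counts how often each feature appears at the root of a tree. The list of root features is made up for the example and does not come from the trained model.

```python
from collections import Counter

# Hypothetical forest: only the feature tested at each tree's root is recorded.
root_features = ["age", "bmi", "age", "hypertension", "age", "bmi"]

num_as_root = Counter(root_features)
for feature, count in num_as_root.most_common():
    print(feature, count)
```

A feature that is frequently chosen for the very first split tends to separate the classes well on its own, which is why this count is used as an importance score.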

### Model Explainability

In [None]:
with open('./plot_model.html', 'w') as f:
    f.write(tfd.model_plotter.plot_model(model))

In [None]:
from IPython.display import IFrame

IFrame('./plot_model.html', width = 900, height = 700)

- The plotted tree (`plot_model` shows the first tree of the forest by default) starts with the split bmi >= 26.75 and then branches on hypertension and age to decide which class a row belongs to
- If age >= 72.5 and bmi >= 31, the corresponding node contains a higher share of class 1 than the other nodes

#### We now use the GradientBoostedTreesModel on the same dataset with some parameter tuning

In [None]:
model_gb = tfd.keras.GradientBoostedTreesModel(
    num_trees = 300,
    growing_strategy = "BEST_FIRST_GLOBAL",
    max_depth = 12,
    split_axis = "SPARSE_OBLIQUE",
    )

model_gb.fit(train_tf)
model_gb.compile(metrics = ["accuracy"])
ev = model_gb.evaluate(valid_tf)

print(f"Loss: {ev[0]}")
print(f"Accuracy: {ev[1]}")
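Unlike a random forest, gradient boosting trains its trees sequentially, each one correcting the errors of the ensemble so far. The sketch below shows the idea in its simplest setting and is not TF-DF's implementation: it assumes one numeric feature, regression with squared error (where the negative gradient is just the residual), and depth-1 trees ("stumps"). All names and data are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    # Skip the smallest x so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def boost(xs, ys, num_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_mean if x < threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]
base, boosted = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, boosted))
```

The second printed error is lower than the first because every added stump removes part of the remaining residual; `num_trees` and the learning rate control how far this process goes.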

In [None]:
model_gb.make_inspector().variable_importances()

### Let us check sklearn RandomForest for comparison

In [None]:
df.head()

In [None]:
X = df.drop(['id', 'stroke'], axis = 1)
y = df['stroke'].copy()

In [None]:
num_cols = [c for c in X.columns if X[c].dtype in ['int64', 'float64']]
cat_cols = [c for c in X.columns if c not in num_cols]
num_cols, cat_cols

##### NaN Imputation

In [None]:
for c in num_cols:
    X[c] = X[c].fillna(X[c].mean())

In [None]:
#Scaling

from sklearn.preprocessing import StandardScaler

std = StandardScaler()

X[num_cols] = std.fit_transform(X[num_cols])
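What the scaler does to each numeric column can be sketched by hand: subtract the column mean and divide by the population standard deviation. The `standardize` helper and the sample values are illustrative, and the sketch assumes the column is not constant (otherwise the standard deviation would be zero).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(x - mean) / std for x in values]


ages = [20.0, 40.0, 60.0]  # made-up values
scaled = standardize(ages)
print(scaled)
```

After the transformation the column has mean 0 and standard deviation 1, while the ordering of the values is unchanged.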

##### Label Encoding Categorical Features

In [None]:
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()

for c in cat_cols:
    lbl.fit(X[c])
    X[c] = lbl.transform(X[c])
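Label encoding replaces each distinct category with an integer. A minimal pure-Python sketch of the idea follows; `label_encode` is a hypothetical helper and the category values are examples, not read from the dataset.

```python
def label_encode(values):
    """Map each distinct value to an integer index in sorted order."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


encoded, mapping = label_encode(["Male", "Female", "Female", "Other"])
print(encoded, mapping)
```

The integers carry no meaning beyond identity: the order is only alphabetical, so the model sees an artificial ordering between categories.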

In [None]:
Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size = 0.2, random_state = 42)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
clf.fit(Xtrain, ytrain)
preds = clf.predict(Xvalid)

from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(yvalid, preds)}")

In [None]:
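# Use X.columns (the features the classifier was trained on); df.columns also
# contains 'id' and 'stroke', which would misalign names and importances.
for name, importance in zip(X.columns, clf.feature_importances_):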
    print(name, '-->', importance)