# Advanced Colab for TensorFlow Decision Forests

In this colab, you will learn how to inspect and create the structure of a model directly. We assume you are familiar with the concepts introduced in the
[beginner](beginner_colab.ipynb) and [intermediate](intermediate_colab.ipynb)
colabs.

In this colab, we will:

1.  Train a Random Forest model and access its structure programmatically.

1.  Create a Random Forest model by hand and use it as a classical model.

In [None]:
# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

## Importing the libraries

In [None]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math
import collections

## Train a simple Random Forest (same as Beginner colab)

We train a Random Forest like in the [beginner colab](beginner_colab.ipynb):

In [None]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load a dataset into a Pandas Dataframe.
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Show the first three examples.
print(dataset_df.head(3))

# Convert the pandas dataframe into a tf dataset.
dataset_tf = tfdf.keras.pd_dataframe_to_tf_dataset(dataset_df, label="species")

# Train the Random Forest
model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
model.fit(x=dataset_tf)

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007


Note the `compute_oob_variable_importances=True`
hyper-parameter in the model constructor. This option computes the Out-of-bag (OOB)
variable importance during training. This is a popular
[permutation variable importance](https://christophm.github.io/interpretable-ml-book/feature-importance.html) for Random Forest models.

Computing the OOB variable importance does not impact the final model, but it slows down training on large datasets.
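
To make the idea of a permutation variable importance concrete, here is a minimal sketch in plain Python. It is not how TensorFlow Decision Forests computes the OOB importances internally: the toy model, the toy data, and the helper names `accuracy` and `permutation_importance` are hypothetical, and the sketch uses one fixed dataset rather than out-of-bag examples. It only illustrates the principle of shuffling one feature column and measuring how much the metric drops.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(rows)


def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling the values of one feature."""
    rng = random.Random(seed)
    shuffled_column = [row[feature] for row in rows]
    rng.shuffle(shuffled_column)
    # Copy every row, replacing only the shuffled feature.
    permuted_rows = [
        dict(row, **{feature: value})
        for row, value in zip(rows, shuffled_column)
    ]
    baseline = accuracy(predict, rows, labels)
    return baseline - accuracy(predict, permuted_rows, labels)


# Hypothetical toy data: the label depends only on "x"; "noise" is irrelevant.
rows = [{"x": i, "noise": (i * 7) % 5} for i in range(20)]
labels = [row["x"] >= 10 for row in rows]


def toy_model(row):
    """A stand-in for a trained model that only looks at "x"."""
    return row["x"] >= 10


importance_x = permutation_importance(toy_model, rows, labels, "x")
importance_noise = permutation_importance(toy_model, rows, labels, "noise")
print("importance of x:", importance_x)
print("importance of noise:", importance_noise)
```

A feature the model never reads, such as `noise` here, gets an importance of exactly zero, because shuffling it cannot change any prediction.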

We check the model summary:

In [None]:
%output_height 300px
model.summary()

Model: "random_forest_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (7):
	bill_depth_mm
	bill_length_mm
	body_mass_g
	flipper_length_mm
	island
	sex
	year

No weights

Variable Importance: MEAN_DECREASE_IN_PRAUC_3_VS_OTHERS:
    1.            "island"  0.002854 ################
    2.    "bill_length_mm"  0.001035 #######
    3.     "bill_depth_mm"  0.000707 #####
    4.       "body_mass_g"  0.000110 ###
    5.              "year"  0.000000 ##
    6.               "sex"  0.000000 ##
    7. "flipper_length_mm" -0.000539 

Variable Importance: MEAN_DECREASE_IN_AP_1_VS_OTHERS:
    1.    "bill_length_mm"  0.086531 ################
    2. "flipper_length_mm"  0.005352 
    3.            "island"  0.00

Note the multiple variable importances named `MEAN_DECREASE_IN_*`.

## Plotting the model

Next, we plot our model.

A Random Forest is a large model (this model has 300 trees and ~5k nodes; see the summary above). Therefore, we will only plot the first tree, and limit the nodes to depth 3.

In [None]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)

## Inspect the model structure

The model structure and meta-data are
available through the **inspector** created by `make_inspector()`.

**Note:** Depending on the learning algorithm and hyper-parameters, the
inspector will expose different specialized attributes. For example, the
`winner_take_all_inference` field is specific to Random Forest models.

In [None]:
inspector = model.make_inspector()

For our model, the available inspector fields are:

In [None]:
[field for field in dir(inspector) if not field.startswith("_")]

['MODEL_NAME',
 'dataspec',
 'evaluation',
 'extract_tree',
 'features',
 'iterate_on_nodes',
 'label',
 'label_classes',
 'model_type',
 'num_trees',
 'objective',
 'specialized_header',
 'task',
 'variable_importances',
 'winner_take_all_inference']
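
The `winner_take_all_inference` field listed above relates to how a Random Forest combines the outputs of its trees for classification. The sketch below is a plain-Python illustration, assuming that winner-take-all means each tree votes for its single most likely class, while the alternative averages the per-tree class probabilities. The function names and the numbers are made up for illustration and are not part of the library.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def average_probabilities(tree_outputs):
    """Averages the class distributions predicted by the trees."""
    num_trees = len(tree_outputs)
    num_classes = len(tree_outputs[0])
    return [
        sum(tree[c] for tree in tree_outputs) / num_trees
        for c in range(num_classes)
    ]


def winner_take_all(tree_outputs):
    """Each tree votes for its most likely class; returns vote fractions."""
    votes = [0] * len(tree_outputs[0])
    for tree in tree_outputs:
        votes[argmax(tree)] += 1
    return [v / len(tree_outputs) for v in votes]


# Hypothetical outputs of three trees over three classes.
tree_outputs = [[0.6, 0.4, 0.0], [0.6, 0.4, 0.0], [0.0, 1.0, 0.0]]

averaged_class = argmax(average_probabilities(tree_outputs))
voted_class = argmax(winner_take_all(tree_outputs))
print("averaging picks class", averaged_class)
print("winner-take-all picks class", voted_class)
```

In this toy case the two aggregation schemes disagree: two trees narrowly prefer class 0, but the third tree is fully confident in class 1.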

Remember to use `?` to display the built-in documentation :).

In [None]:
?inspector.model_type

Some of the model meta-data:

In [None]:
print("Model type:", inspector.model_type())
print("Number of trees:", inspector.num_trees())
print("Objective:", inspector.objective())
print("Input features:", inspector.features())

Model type: RANDOM_FOREST
Number of trees: 300
Objective: Classification(label=__LABEL, class=None, num_classes=5)
Input features: ["bill_depth_mm" (1; #0), "bill_length_mm" (1; #1), "body_mass_g" (1; #2), "flipper_length_mm" (1; #3), "island" (4; #4), "sex" (4; #5), "year" (1; #6)]


`evaluation()` returns the evaluation of the model computed during training. The dataset used for this evaluation depends on the algorithm. For example, it can be the validation dataset or the out-of-bag dataset.

**Note:** While computed during training, `evaluation()` is never an evaluation on the
training dataset.
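
To illustrate where an out-of-bag dataset comes from, here is a minimal sketch in plain Python. It assumes the usual Random Forest setup in which each tree is trained on a bootstrap sample (drawn with replacement) of the training examples; the examples a given tree never saw are its out-of-bag examples. The helper `bootstrap_split` is hypothetical and is not part of the library.

```python
import random


def bootstrap_split(num_examples, seed):
    """Returns (in_bag, out_of_bag) example indices for one tree."""
    rng = random.Random(seed)
    # Sample with replacement: some examples repeat, others are left out.
    in_bag = [rng.randrange(num_examples) for _ in range(num_examples)]
    out_of_bag = sorted(set(range(num_examples)) - set(in_bag))
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(num_examples=344, seed=0)
print("distinct in-bag examples:", len(set(in_bag)))
print("out-of-bag examples:", len(out_of_bag))
```

Because each example is out-of-bag for some of the trees, it can be evaluated using only the trees that did not train on it.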

In [None]:
inspector.evaluation()

Evaluation(num_examples=344, accuracy=0.9854651162790697, loss=None, rmse=None, ndcg=None, aucs=None)

The variable importances are:

In [None]:
print("Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)

Available variable importances:
	 MEAN_DECREASE_IN_AUC_2_VS_OTHERS
	 MEAN_DECREASE_IN_PRAUC_2_VS_OTHERS
	 MEAN_DECREASE_IN_PRAUC_3_VS_OTHERS
	 MEAN_DECREASE_IN_AUC_1_VS_OTHERS
	 MEAN_DECREASE_IN_AP_3_VS_OTHERS
	 MEAN_DECREASE_IN_AP_2_VS_OTHERS
	 MEAN_DECREASE_IN_ACCURACY
	 MEAN_DECREASE_IN_AP_1_VS_OTHERS
	 MEAN_DECREASE_IN_PRAUC_1_VS_OTHERS
	 MEAN_DECREASE_IN_AUC_3_VS_OTHERS
	 NUM_AS_ROOT


Different variable importances have different semantics. For example, a feature
with a **mean decrease in AUC** of `0.05` means that removing this feature from
the training dataset would reduce the AUC by 0.05, i.e. 5 percentage points.

In [None]:
# Mean decrease in AUC of the class 1 vs the others.
inspector.variable_importances()["MEAN_DECREASE_IN_AUC_1_VS_OTHERS"]

[("bill_length_mm" (1; #1), 0.0674170778508777),
 ("flipper_length_mm" (1; #3), 0.004591557017544323),
 ("island" (4; #4), 0.0036321271929831145),
 ("bill_depth_mm" (1; #0), 0.0016790021929830035),
 ("body_mass_g" (1; #2), 0.0005825109649123528),
 ("sex" (4; #5), 0.0002912554824563429),
 ("year" (1; #6), 5.139802631570767e-05)]
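
Names such as `MEAN_DECREASE_IN_AUC_1_VS_OTHERS` refer to a one-vs-others metric: one class is treated as positive and all the remaining classes as negative. The following is a minimal sketch of such an AUC in plain Python, not the library's implementation. The function `one_vs_others_auc` and the toy scores are hypothetical; the AUC is computed as the fraction of (positive, negative) pairs ranked correctly, counting ties as half.

```python
def one_vs_others_auc(scores, labels, positive_class):
    """AUC of `scores` for `positive_class` against all other classes."""
    positives = [s for s, l in zip(scores, labels) if l == positive_class]
    negatives = [s for s, l in zip(scores, labels) if l != positive_class]
    correctly_ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ranked += 1.0
            elif p == n:
                correctly_ranked += 0.5
    return correctly_ranked / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of class 1 and the true class labels.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 2, 3, 1]

auc = one_vs_others_auc(scores, labels, positive_class=1)
print("AUC of class 1 vs the others:", auc)
```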

Finally, we access the actual tree structure:

In [None]:
inspector.extract_tree(tree_idx=0)

Tree(NonLeafNode(condition=(bill_depth_mm >= 16.349998474121094; miss=True), pos_child=NonLeafNode(condition=(bill_length_mm >= 42.349998474121094; miss=True), pos_child=NonLeafNode(condition=(body_mass_g >= 4975.0; miss=False), pos_child=LeafNode(value=ProbabilityValue([0.0, 0.0, 1.0],n=10.0)), neg_child=NonLeafNode(condition=(island in ['Biscoe', 'Torgersen']; miss=True), pos_child=NonLeafNode(condition=(flipper_length_mm >= 198.95761108398438; miss=True), pos_child=LeafNode(value=ProbabilityValue([0.8333333333333334, 0.0, 0.16666666666666666],n=6.0)), neg_child=LeafNode(value=ProbabilityValue([1.0, 0.0, 0.0],n=14.0)), value=ProbabilityValue([0.95, 0.0, 0.05],n=20.0)), neg_child=NonLeafNode(condition=(bill_length_mm >= 44.650001525878906; miss=False), pos_child=LeafNode(value=ProbabilityValue([0.0, 1.0, 0.0],n=49.0)), neg_child=LeafNode(value=ProbabilityValue([0.3333333333333333, 0.6666666666666666, 0.0],n=9.0)), value=ProbabilityValue([0.05172413793103448, 0.9482758620689655, 0.0],n

Extracting a tree is not efficient. If speed is important, the model inspection can be done with the `iterate_on_nodes()` method instead. This method is a depth-first, pre-order traversal iterator over all the nodes of the model.

**Note:** `extract_tree()` is implemented using `iterate_on_nodes()`.
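
As a reminder of what a depth-first, pre-order traversal looks like, here is a minimal sketch on a toy tree in plain Python. The `Node` class and `iterate_pre_order` function are hypothetical stand-ins, not the library's classes, and visiting the negative child before the positive one is an arbitrary choice made for this sketch.

```python
from typing import Iterator, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    """A toy binary tree node; leaves have no children."""
    name: str
    pos_child: Optional["Node"] = None
    neg_child: Optional["Node"] = None


def iterate_pre_order(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yields (depth, node), visiting each node before its children."""
    yield depth, node
    for child in (node.neg_child, node.pos_child):
        if child is not None:
            yield from iterate_pre_order(child, depth + 1)


toy_tree = Node(
    "root",
    pos_child=Node("inner", pos_child=Node("leaf_a"), neg_child=Node("leaf_b")),
    neg_child=Node("leaf_c"))

visited = [node.name for _, node in iterate_pre_order(toy_tree)]
print(visited)
```

Since the traversal is a generator, nodes are produced one at a time and no intermediate tree object needs to be built.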

The following example computes how many times each feature is used (this is a
kind of structural variable importance):

In [None]:
# number_of_use[F] will be the number of nodes using feature F in their condition.
number_of_use = collections.defaultdict(lambda: 0)

# Iterate over all the nodes in a depth-first, pre-order traversal.
for node_iter in inspector.iterate_on_nodes():

  if not isinstance(node_iter.node, tfdf.py_tree.node.NonLeafNode):
    # Skip the leaf nodes
    continue

  # Iterate over all the features used in the condition.
  # By default, models are "axis-aligned" (i.e. not oblique): each node tests a single feature.
  for feature in node_iter.node.condition.features():
    number_of_use[feature] += 1

print("Number of condition nodes per features:")
for feature, count in number_of_use.items():
  print("\t", feature.name, ":", count)

Number of condition nodes per features:
	 bill_depth_mm : 491
	 flipper_length_mm : 436
	 bill_length_mm : 754
	 body_mass_g : 334
	 island : 332
	 sex : 31
	 year : 20


## Creating a model by hand

In this section we will create a small Random Forest model by hand. To make it
extra easy, the model will only contain one simple tree:

```
3 label classes: red, blue and green.
2 features: f1 (numerical) and f2 (string categorical)

f1>=1.5
    ├─(pos)─ f2 in ["cat","dog"]
    │         ├─(pos)─ value: [0.8, 0.1, 0.1]
    │         └─(neg)─ value: [0.1, 0.8, 0.1]
    └─(neg)─ value: [0.1, 0.1, 0.8]
```

In [None]:
# Create the model builder
builder = tfdf.builder.RandomForestBuilder(
    path="/tmp/manual_model",
    objective=tfdf.py_tree.objective.ClassificationObjective(
        label="color", classes=["red", "blue", "green"]))

Each tree is added one by one.

**Note:** The tree object (`tfdf.py_tree.tree.Tree`) is the same as the one returned by `extract_tree()` in the previous section.

In [None]:
# Some aliases
Tree = tfdf.py_tree.tree.Tree
SimpleColumnSpec = tfdf.py_tree.dataspec.SimpleColumnSpec
ColumnType = tfdf.py_tree.dataspec.ColumnType
# Nodes
NonLeafNode = tfdf.py_tree.node.NonLeafNode
LeafNode = tfdf.py_tree.node.LeafNode
# Conditions
NumericalHigherThanCondition = tfdf.py_tree.condition.NumericalHigherThanCondition
CategoricalIsInCondition = tfdf.py_tree.condition.CategoricalIsInCondition
# Leaf values
ProbabilityValue = tfdf.py_tree.value.ProbabilityValue

builder.add_tree(
    Tree(
        NonLeafNode(
            condition=NumericalHigherThanCondition(
                feature=SimpleColumnSpec(name="f1", type=ColumnType.NUMERICAL),
                threshold=1.5,
                missing_evaluation=False),
            pos_child=NonLeafNode(
                condition=CategoricalIsInCondition(
                    feature=SimpleColumnSpec(name="f2",type=ColumnType.CATEGORICAL),
                    mask=["cat", "dog"],
                    missing_evaluation=False),
                pos_child=LeafNode(value=ProbabilityValue(probability=[0.8, 0.1, 0.1], num_examples=10)),
                neg_child=LeafNode(value=ProbabilityValue(probability=[0.1, 0.8, 0.1], num_examples=20))),
            neg_child=LeafNode(value=ProbabilityValue(probability=[0.1, 0.1, 0.8], num_examples=30)))))

Finalize the model by closing the builder:

In [None]:
builder.close()

We can then open the model as a regular Keras model:

In [None]:
manual_model = tf.keras.models.load_model("/tmp/manual_model")

and make predictions:

In [None]:
examples = tf.data.Dataset.from_tensor_slices({
        "f1": [1.0, 2.0, 3.0],
        "f2": ["cat", "cat", "bird"]
    }).batch(2)

predictions = manual_model.predict(examples)

print("predictions:\n",predictions)

predictions:
 [[0.1 0.1 0.8]
 [0.8 0.1 0.1]
 [0.1 0.8 0.1]]
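
To see where these predictions come from, the hand-written tree can be traced in plain Python. This is only a sketch that mirrors the tree diagram above; the function `predict_one` is hypothetical, does not use the library, and ignores missing values.

```python
def predict_one(f1, f2):
    """Evaluates the hand-written tree on a single example."""
    if f1 >= 1.5:
        if f2 in ("cat", "dog"):
            return [0.8, 0.1, 0.1]  # Mostly "red".
        return [0.1, 0.8, 0.1]  # Mostly "blue".
    return [0.1, 0.1, 0.8]  # Mostly "green".


# The same three examples as in the dataset above.
toy_examples = [(1.0, "cat"), (2.0, "cat"), (3.0, "bird")]
traced = [predict_one(f1, f2) for f1, f2 in toy_examples]
for (f1, f2), probabilities in zip(toy_examples, traced):
    print(f1, f2, "->", probabilities)
```

Since the forest contains a single tree, the model output is simply the value of the leaf reached by each example.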


Access the structure:

**Note:** Because the model is serialized-and-deserialized, you need to use an alternative but equivalent form.

In [None]:
yggdrasil_model_path = manual_model.yggdrasil_model_path_tensor().numpy().decode("utf-8")
print("yggdrasil_model_path:",yggdrasil_model_path)

inspector = tfdf.inspector.make_inspector(yggdrasil_model_path)
print("Input features:", inspector.features())

yggdrasil_model_path: /tmp/manual_model/assets/
Input features: ["f1" (1; #1), "f2" (4; #2)]


And of course, you can plot the model :)

In [None]:
tfdf.model_plotter.plot_model_in_colab(manual_model)