Tips:

* After installing software, restart the Python runtime using *Runtime -> Restart*.

* Should you need to reset your environment to a clean state, you can use *Runtime -> Disconnect and delete runtime*.



# GDG Ukraine 2022: Intro to TensorFlow Decision Forests

Welcome! Today, you'll gain hands-on experience training decision forests with TensorFlow. Tree-based models, including random forests and gradient-boosted trees, are some of the most [popular](https://www.kaggle.com/kaggle-survey-2021) models used in [Kaggle](https://kaggle.com/) competitions, and are a valuable tool to become familiar with, in addition to neural networks.

This notebook contains a tutorial and a quick exercise to help you get started. You'll train a random forest on a tabular dataset that you load from a CSV file. This is a common pattern in practice. As an exercise, you'll train a gradient boosted tree.

Okay, let's get started!

# Random Forests

## 🌲🌳🌲🌳🌲🐿️🐻

Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform neural networks, or at least provide a strong baseline to compare against before you begin experimenting with them.

You will use TensorFlow to train each of these on a dataset you load from a CSV file. Roughly, your code will look as follows:

```
import tensorflow_decision_forests as tfdf
import pandas as pd
  
dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

model = tfdf.keras.RandomForestModel()
model.fit(tf_dataset)
  
model.summary()  # summary() prints the report itself and returns None
```

## Install software

There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.

Today, you'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library used to train large models at Google. The open-source release is currently in beta. 

If you use TF-DF in your work, we would love to hear about it. If you encounter bugs or friction not mentioned in the release notes, please email Josh.

In [None]:
!pip install tensorflow_decision_forests --quiet

## Import the library

You may see warnings about certain distributed training modes not being available during the beta. That's expected, and you can safely ignore them.

In [None]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd

In [None]:
print("TensorFlow v" + tf.__version__)
print("TensorFlow Decision Forests v" + tfdf.__version__)

## Download the penguins dataset

To start, you will work with a small tabular [dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) of about 300 penguins. You will predict the species of penguin (Adelie, Gentoo, or Chinstrap) based on numeric attributes like their flipper length, and categorical attributes like the name of the island they're found on.

In [None]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load a dataset into a Pandas Dataframe
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Display the first 3 examples
dataset_df.head(3)

## Prepare the dataset

This dataset contains a mix of numeric (*bill_depth_mm*) and categorical (*island*) features, as well as missing values. TF-DF supports all of these natively, and no preprocessing is required. This is one of the advantages of tree-based models, and why they're a great place to start.

You will have to slightly adjust the labels, though, to convert them into the integer format TF-DF expects. The label (species) is stored as a string, so let's convert it into an integer.


In [None]:
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)
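To make the mapping concrete, here is a minimal plain-Python sketch of what `map(classes.index)` does, and how to turn integer labels back into names. The species list below is hard-coded for illustration only; in the notebook, `classes` comes from `dataset_df[label].unique().tolist()`, so the order may differ.

```python
# Illustrative values only; the real order depends on the CSV.
species = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"]

# Unique values in order of first appearance (the order pandas' unique() uses).
classes = list(dict.fromkeys(species))

# Encode: each string becomes its position in `classes`.
encoded = [classes.index(name) for name in species]

# Decode: index back into `classes` to recover the original string.
decoded = [classes[i] for i in encoded]

print(classes)  # ['Adelie', 'Gentoo', 'Chinstrap']
print(encoded)  # [0, 1, 0, 2, 1]
```

Keeping `classes` around is useful later, since it lets you translate a predicted class index back into a species name.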

Next, split the dataset into training and testing:

In [None]:
import numpy as np

def split_dataset(dataset, test_ratio=0.30):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
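Because `split_dataset` draws from NumPy's global random state, the split (and therefore the accuracy you see later) changes on every run. If you want repeatable results, one option is to pass a seed, as in the sketch below. The name `split_dataset_seeded` and its `seed` parameter are assumptions added for illustration; they are not part of this notebook or of TF-DF.

```python
import numpy as np

def split_dataset_seeded(dataset, test_ratio=0.30, seed=42):
    """Split rows into (train, test) with a fixed seed for repeatability."""
    rng = np.random.default_rng(seed)
    # One uniform draw per row; rows whose draw is below test_ratio go to test.
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Works on anything that supports boolean-mask indexing,
# such as a NumPy array or a pandas DataFrame.
rows = np.arange(1000)
train_a, test_a = split_dataset_seeded(rows)
train_b, test_b = split_dataset_seeded(rows)
print(len(train_a), len(test_a))
```

Note that each row is assigned independently, so the test set is only approximately 30% of the data.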

There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). We've provided a single-line helper function that will do this for you:

```
tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='species')
```

`tf.data` is a high [performance](https://www.tensorflow.org/guide/data_performance) data loading library, which is helpful when training neural networks with accelerators like GPUs and TPUs. It is not necessary for tree-based models until you begin to do distributed training, but we'll use it today for practice.

Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is the most common bug new researchers encounter. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.

Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help.

In [None]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

## What models are available?

There are several tree-based models for you to choose from. To start, you'll work with a Random Forest. This is the most well-known of the Decision Forest training algorithms.

A Random Forest is a collection of decision trees, each trained independently on a random subset of the training dataset (sampled with replacement). The algorithm is popular because it is robust to overfitting and easy to use.
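As a rough illustration of "sampled with replacement", the plain-Python sketch below draws one bootstrap sample and counts how many examples were left out of it. Those left-out examples are what the out-of-bag (OOB) evaluation later in this notebook relies on. This is a toy simulation, not a description of how TF-DF implements sampling internally.

```python
import random

random.seed(0)
num_examples = 10_000

# A bootstrap sample: draw num_examples indices with replacement.
sample = [random.randrange(num_examples) for _ in range(num_examples)]

# Examples never drawn are "out of bag" for the tree trained on this sample.
in_bag = set(sample)
oob_fraction = 1 - len(in_bag) / num_examples

# In expectation this approaches 1/e (about 0.368) as the dataset grows.
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```

Since every tree sees a different sample, every tree also has its own set of unseen examples that can be used to estimate accuracy without a separate validation set.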

In [None]:
tfdf.keras.get_all_models()

Unlike neural networks, decision forests have relatively few (and easy to configure) hyperparameters with good defaults.

## How can I configure them?

TF-DF provides good defaults for you (e.g. the top ranking hyperparameters on our benchmarks, slightly modified to run in reasonable time). You will use these defaults below. If you would like to configure the learning algorithm, you will find many options you can explore to get the highest possible accuracy. 

Let's check out the help on the ```RandomForestModel``` to see the options.

In [None]:
help(tfdf.keras.RandomForestModel)

There are **many** hyperparameters you can explore to grow exactly the type of forest you like.


In [None]:
# You can list the predefined hyperparameter templates as follows
print(tfdf.keras.RandomForestModel.predefined_hyperparameters())

You can select a template and/or set parameters as follows:

```gbt = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1", num_trees=300)```


## Create a Random Forest 

Today, you will use the defaults. Let's create your model. 

In [None]:
rf = tfdf.keras.RandomForestModel()
rf.compile(metrics=["accuracy"]) # Optional, you can use this to include a list of eval metrics

## Train your model

This is a one-liner.

Note: you may see a warning about AutoGraph. You can safely ignore it; it will be fixed in the next release.

In [None]:
rf.fit(x=train_ds)

## Visualize your model
One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below.

In [None]:
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

## Evaluate the model on OOB data and the test dataset

Let's plot accuracy on the out-of-bag (OOB) evaluation dataset as a function of the number of trees in the forest. One of the nice features of this particular hyperparameter is that larger values are usually better, and come with little risk aside from slowing down training.


In [None]:
import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.show()
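One way to build intuition for why adding trees usually helps: if each tree were an independent voter that is right with probability above 0.5, the majority vote would become more accurate as voters are added. The sketch below computes this exactly for a two-class problem. Real trees in a forest are correlated, so treat this as an idealized picture and not as a model of TF-DF; the function name and the 0.6 per-voter accuracy are assumptions chosen for illustration.

```python
from math import comb

def majority_vote_accuracy(num_voters, p_correct):
    """Probability that more than half of independent voters are correct."""
    majority = num_voters // 2 + 1
    return sum(
        comb(num_voters, k) * p_correct**k * (1 - p_correct) ** (num_voters - k)
        for k in range(majority, num_voters + 1)
    )

# Odd voter counts avoid ties.
for n in (1, 11, 101, 301):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

The printed accuracy rises with the number of voters and then flattens out, which matches the typical shape of the OOB accuracy curve plotted above.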

You can also see some general stats on the OOB dataset:

In [None]:
inspector = rf.make_inspector()
inspector.evaluation()

Now, let's run an evaluation using the test data. Depending on the random split, your accuracy will likely be between 90% and 100%.

In [None]:
evaluation = rf.evaluate(x=test_ds, return_dict=True)

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")

## Variable importances

There are several ways to identify important features. Let's see the available options:

In [None]:
print("Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)

Let's display one of them:

In [None]:
inspector.variable_importances()["NUM_AS_ROOT"]

## Predict on a single example

Here's example code you can use to make predictions on a single example. Note that TensorFlow is optimized for batch prediction. The code below is mainly helpful for experimenting.

In [None]:
# Create your example as a dictionary
example = {"bill_depth_mm" : [0],
           "bill_length_mm" : [0],
           "body_mass_g" : [0],
           "flipper_length_mm" : [0],
           "island" : ["Torgersen"],
           "sex" : ["female"],
           "year" : [2007]}

# Convert the dictionary into a DataFrame
example_df = pd.DataFrame.from_dict(example)

# Convert the DataFrame into tf.data format
example_ds = tfdf.keras.pd_dataframe_to_tf_dataset(example_df)

# Call predict
rf.predict(example_ds)

## Predict on many examples

Following is code you can use to display predictions for each example in the test set. Note that similar code will be a bit different for neural networks, which typically use a different data structure inside tf.data to pack the features and labels.

In [None]:
# Make predictions on every example in the test set
predictions = rf.predict(test_ds)

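# Loop over the test set, and display the predicted value and label.
# next(iter(test_ds)) returns only the first batch; the test set here is small
# enough to fit in it. The loop variable is named `actual` so that the global
# `label` string defined earlier is not overwritten.
features, labels = next(iter(test_ds))
for pred, actual in zip(predictions, labels):
  print("Pred:", np.argmax(pred), "Actual:", actual.numpy())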
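For this three-class problem, `rf.predict` returns one row of class probabilities per example, which is why the loop above takes `np.argmax` of each row. The sketch below shows the same step on a hand-written array and maps the winning index back to a species name. The probabilities and the class order are made up for illustration; in the notebook, use the `classes` list built when the labels were converted.

```python
import numpy as np

# Hypothetical class order and model output (one row per example).
classes = ["Adelie", "Gentoo", "Chinstrap"]
predictions = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.15, 0.75],
    [0.20, 0.70, 0.10],
])

# argmax along axis 1 picks the most probable class for each row.
predicted_indices = predictions.argmax(axis=1)
predicted_names = [classes[i] for i in predicted_indices]

print(predicted_indices.tolist())  # [0, 2, 1]
print(predicted_names)             # ['Adelie', 'Chinstrap', 'Gentoo']
```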

# Exercise: Gradient Boosted Trees

In this exercise you will download the [census](https://archive.ics.uci.edu/ml/datasets/census+income) dataset. It contains ~40K examples with a mix of numeric and categorical attributes. You will train a gradient boosted tree, identify important features, and evaluate your model's accuracy.

We've provided a bunch of code you can use to explore the dataset, in case this is helpful to you in your future work. The code you need to write for this exercise is only a couple lines.

Notes:
- You can visualize this dataset in your browser using https://pair-code.github.io/facets/ 
- This dataset has fairness concerns, which you can learn about at https://www.tensorflow.org/responsible_ai

### Instructions

Complete the code cells below. See the comments for instructions. You can find a solution at the end.

### Download and explore the dataset

In [None]:
# Download the dataset
!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data -O /tmp/adult.csv

In [None]:
# Take a look at the CSV
!head /tmp/adult.csv

In [None]:
# The CSV is missing a header
cols = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'
]

df = pd.read_csv("/tmp/adult.csv", names=cols)

In [None]:
# Clean up

# Drop a meaningless column
df = df.drop(columns=['fnlwgt'])

# Convert the label into integer format
df["label"] = df["label"].apply(lambda x: 1 if x == ' >50K' else 0).values

# Shuffle the dataset
df = df.sample(frac=1).reset_index(drop=True)
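The comparison against `' >50K'` (with a leading space) is needed because the raw file separates values with a comma followed by a space. If you would rather not carry that space around in every categorical value, pandas can strip it at load time with `skipinitialspace=True`. The two-row CSV below is a made-up stand-in for the real file, with only three of its columns.

```python
import io
import pandas as pd

# A tiny stand-in for adult.csv, using the same ", " separator.
raw = "39, State-gov, <=50K\n52, Self-emp-not-inc, >50K\n"
cols = ["age", "workclass", "label"]

# Default parsing keeps the space that follows each comma.
df_default = pd.read_csv(io.StringIO(raw), names=cols)

# skipinitialspace=True drops whitespace right after the delimiter.
df_clean = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)

print(repr(df_default["label"][1]))  # ' >50K'
print(repr(df_clean["label"][1]))    # '>50K'
```

If you load the data this way, remember to compare against `'>50K'` without the leading space when converting the label.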

In [None]:
df.info(verbose=True, show_counts=True)

In [None]:
label_col = 'label'
categorical_columns = list(df.select_dtypes(include='object').columns)
# Exclude the label so that the column lists contain only features
numeric_columns = [c for c in df.columns
                   if c not in categorical_columns and c != label_col]

print('Categorical columns', categorical_columns)
print('Numeric columns', numeric_columns)

feature_columns = categorical_columns + numeric_columns

In [None]:
train_df, test_df = split_dataset(df)
print("{} examples in training, {} examples in testing.".format(
    len(train_df), len(test_df)))

In [None]:
train_df[numeric_columns].describe()

In [None]:
train_df[categorical_columns].nunique()

In [None]:
for col in categorical_columns:
  print(col, list(train_df[col].unique()))

What is the class balance?

In [None]:
train_df["label"].sum() / len(train_df)

In [None]:
test_df["label"].sum() / len(test_df)
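Class balance matters when you read the accuracy numbers later: a model that always predicts the majority class already scores `max(p, 1 - p)`, where `p` is the fraction of positive labels. The sketch below computes that baseline on a made-up label list. The 25% positive rate is an assumption for illustration, not a measurement of this dataset.

```python
# Hypothetical labels: 1 = income above 50K, 0 = otherwise.
labels = [1] * 25 + [0] * 75

positive_rate = sum(labels) / len(labels)

# Accuracy of always predicting the more common class.
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print(f"Positive rate: {positive_rate:.2f}")          # 0.25
print(f"Majority baseline: {baseline_accuracy:.2f}")  # 0.75
```

A trained model is only useful to the extent that it beats this baseline on the test set.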

In [None]:
print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)

Create tf.data.Datasets from the Pandas DataFrames, using the one-liner shown above.

In [None]:
# YOUR CODE HERE
# Add code to create a tf.data.Dataset for train and test from the DataFrames

# Example...
# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...
# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...

### Create and train your model

Create your model. You can write a one-liner for this, similar to the example above. Previously, you learned how to work with Random Forests. If you like, you can create a Random Forest again, or you can try creating a Gradient Boosted Tree. 

As a reminder, you can see which models are available by running `tfdf.keras.get_all_models()`, or visiting [tensorflow.org/decision_forests](https://www.tensorflow.org/decision_forests).

In [None]:
# YOUR CODE HERE
# Add code to create a gradient boosted tree
# Example ...
# gbt = tfdf.keras. ...
# gbt.compile(metrics=["accuracy"])

Train your model. You can write a one-liner for this, similar to the example above.

In [None]:
# YOUR CODE HERE
# Add code to train your model
# Example ...
# gbt.fit(...

### Evaluate your model

Uncomment these cells after completing the code above.

In [None]:
#gbt.summary()

In [None]:
#gbt.evaluate(test_ds)

In [None]:
# evaluation = gbt.evaluate(x=test_ds,return_dict=True)

# for name, value in evaluation.items():
#   print(f"{name}: {value:.4f}")

In [None]:
# inspector = gbt.make_inspector()
# inspector.evaluation()

In [None]:
# inspector.variable_importances()["NUM_AS_ROOT"]

In [None]:
# tfdf.model_plotter.plot_model_in_colab(gbt, tree_idx=0, max_depth=3)

### Solution


**Create tf.data.Datasets from the pd.DataFrame**

To do so, you can write:

```
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="label")
```

If you will be working with the same tf.data.Dataset multiple times, you can add `.cache()` at the end of those lines to keep it in memory.

**Create and train a GradientBoostedTrees model**

To do so, you can write:

```
gbt = tfdf.keras.GradientBoostedTreesModel()
gbt.compile(metrics=["accuracy"])
gbt.fit(train_ds)
```


### Next steps

**Try a larger dataset**

Thanks to our friends at Kaggle, you can find a tabular dataset with ~1.7M rows and starter code for TensorFlow Decision Forests [here](https://www.kaggle.com/code/paultimothymooney/getting-started-with-tensorflow-decision-forests/). This is great if you'd like to start running larger experiments.

**Hyperparameter tuning**

You can use [Keras Tuner](https://keras.io/keras_tuner/) for easy hyperparameter optimization.