A lot of the code here was inspired by Paul Mooney's [notebook on TF-DF](https://www.kaggle.com/code/paultimothymooney/getting-started-with-tensorflow-decision-forests). Make sure to check it out as well.

# Introduction

A learning algorithm trains a machine learning model on a training dataset. The parameters of a learning algorithm, called "hyper-parameters", control how the model is trained and affect its quality. Finding good hyper-parameters is therefore an important stage of modeling.

Automated tuning algorithms work by generating and evaluating a large number of hyper-parameter combinations. Each of those iterations is called a "trial". Evaluating a trial is expensive because it requires training a new model each time. At the end of the tuning, the hyper-parameter combination with the best evaluation is used.
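
The trial loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `SEARCH_SPACE`, `toy_evaluation` and `random_search` are hypothetical names, and the scoring formula is made up to stand in for "train a model and measure its validation error". It is not how Keras Tuner is implemented.

```python
import random

# Hypothetical search space: each hyper-parameter maps to its candidate values.
SEARCH_SPACE = {
    "max_depth": [4, 5, 6, 7],
    "shrinkage": [0.02, 0.05, 0.10, 0.15],
}


def toy_evaluation(params):
    """Stand-in for 'train a model and return its validation error'.

    A real trial would train a new model here, which is the expensive part.
    This made-up formula is lowest at max_depth=6 and shrinkage=0.10.
    """
    return abs(params["max_depth"] - 6) + 10 * abs(params["shrinkage"] - 0.10)


def random_search(search_space, evaluate, num_trials, seed=0):
    """Runs `num_trials` trials and returns (best_params, best_score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(num_trials):
        # One trial: sample one value per hyper-parameter, then evaluate it.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score < best_score:  # Lower is better, as with RMSE.
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(SEARCH_SPACE, toy_evaluation, num_trials=10)
print("Best hyper-parameters:", best_params, "score:", best_score)
```

Because every trial is independent, the cost of tuning grows linearly with the number of trials, which is why the trial budget matters in practice.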

To demonstrate automated hyper-parameter tuning in TF-DF, we'll work with the Tabular Playground Series Feb 2021 Kaggle dataset. It is a tabular dataset with 300,000 rows and 26 columns in the training set (a 93.66 MiB training CSV and a 58.85 MiB test CSV) and is well suited to regression problems.

We'll be predicting a continuous target from the feature columns in the data. The feature columns cat0 - cat9 are categorical, and the feature columns cont0 - cont13 are continuous.

By studying this tutorial you will learn how to quickly and automatically tune hyper-parameters of a GradientBoostedTrees model to perform a regression task using tabular data.

# Installing TensorFlow Decision Forests & Keras Tuner

In [None]:
# Install or upgrade TensorFlow Decision Forests (-qq keeps pip output quiet)
!pip install tensorflow_decision_forests -U -qq
# Install the Keras tuner
!pip install keras-tuner -U -qq

# Importing libraries

In [None]:
# Import Python packages
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import keras_tuner as kt
import tensorflow as tf
import tensorflow_decision_forests as tfdf
print("TensorFlow Decision Forests v" + tfdf.__version__)

# Helper functions

In [None]:
# Define helper functions for plotting training evaluation curves

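def plot_tfdf_model_training_curves(model):
    """Plots the RMSE recorded in the training logs against the number of trees."""
    # This function was adapted from the following tutorial:
    # https://www.tensorflow.org/decision_forests/tutorials/beginner_colab
    logs = model.make_inspector().training_logs()
    # A single plot is drawn, so no subplot grid is needed.
    plt.figure(figsize=(6, 4))
    # Plot RMSE vs number of trees
    plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
    plt.xlabel("Number of trees")
    # For gradient boosted trees, the logged evaluation is computed on the
    # validation data rather than out-of-bag.
    plt.ylabel("RMSE (validation)")
    plt.show()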

In [None]:
# print list of all data and files attached to this notebook
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the dataset and convert it to a tf.data.Dataset

The dataset contains a mix of numerical (cont0 - cont13) and categorical (cat0 - cat9) features. Unlike neural-network-based models, TF-DF supports all of these feature types natively, so no preprocessing such as one-hot encoding or normalization is needed. Because TF-DF sets the task to Classification by default, we explicitly change it to Regression.

In [None]:
def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas DataFrame in two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]
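
Note that `split_dataset` draws one random number per row, so the held-out fraction is only approximately `test_ratio` and changes from run to run. The sketch below mirrors that masking logic in plain Python on a list of row indices; `split_rows` and its `seed` argument are hypothetical additions for illustration, not part of the notebook's pipeline.

```python
import random


def split_rows(rows, test_ratio=0.30, seed=42):
    """Splits a list of rows in two using one random draw per row.

    Mirrors the masking logic of `split_dataset`, with a fixed seed so that
    the split is reproducible (the seed is an addition for this sketch).
    """
    rng = random.Random(seed)
    is_test = [rng.random() < test_ratio for _ in rows]
    kept = [row for row, test in zip(rows, is_test) if not test]
    held_out = [row for row, test in zip(rows, is_test) if test]
    return kept, held_out


rows = list(range(10_000))
train_rows, valid_rows = split_rows(rows)
# The sizes are close to 70% / 30% of the rows, but not exact.
print(len(train_rows), len(valid_rows))
```

With NumPy, a comparable effect could be obtained by seeding the random generator before building the mask, which makes tuning runs easier to compare.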

In [None]:
# load to pandas dataframe (for data exploration)
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/test.csv')

# load to tensorflow dataset (for model training)
train_tfds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="target", task=tfdf.keras.Task.REGRESSION)
test_tfds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, task=tfdf.keras.Task.REGRESSION)

# Exploratory data analysis

In [None]:
# print column names
print(train_df.columns)

In [None]:
# preview first few rows of data
train_df.head(10)

In [None]:
# print basic summary statistics
train_df.describe()

In [None]:
# check for missing values
sns.heatmap(train_df.isnull(), cbar=False)

# Automatic hyper-parameter tuning

Hyper-parameter tuning is done here with Keras Tuner. A `build_model(hp)` function defines the search space and returns a compiled model, and a tuner object holds the rest of the tuning configuration (search algorithm, number of trials and objective).

In [None]:
def build_model(hp):
  """Creates a model."""

  model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION,
      min_examples=hp.Choice("min_examples", [2, 5, 7, 10]),
      categorical_algorithm=hp.Choice("categorical_algorithm", ["CART", "RANDOM"]),
      max_depth=hp.Choice("max_depth", [4, 5, 6, 7]),
#       # The Keras tuner automatically converts boolean parameters to integers.
#       # Regression tasks currently do not support hessian optimization in TF-DF:
#       # https://github.com/tensorflow/decision-forests/issues/116
#       use_hessian_gain=bool(hp.Choice("use_hessian_gain", [True, False])),
      shrinkage=hp.Choice("shrinkage", [0.02, 0.05, 0.10, 0.15]),
      num_candidate_attributes_ratio=hp.Choice("num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0]),
  )

  # Track the RMSE; the tuner minimizes it as computed on the validation dataset.
  model.compile(metrics=[tf.keras.metrics.RootMeanSquaredError()])
  return model
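
To get a feel for the size of the search space defined in `build_model`, the sketch below enumerates every combination of the candidate values. The values are copied from the function above; the variable names are only illustrative.

```python
import itertools

# Candidate values copied from the `build_model` search space.
search_space = {
    "min_examples": [2, 5, 7, 10],
    "categorical_algorithm": ["CART", "RANDOM"],
    "max_depth": [4, 5, 6, 7],
    "shrinkage": [0.02, 0.05, 0.10, 0.15],
    "num_candidate_attributes_ratio": [0.2, 0.5, 0.9, 1.0],
}

# Every combination of one value per hyper-parameter.
all_combinations = list(itertools.product(*search_space.values()))
num_combinations = len(all_combinations)

max_trials = 10  # Same value as passed to the tuner in this notebook.
coverage = max_trials / num_combinations

print(f"{num_combinations} combinations, {max_trials} trials "
      f"cover at most {coverage:.1%} of the grid")
```

A random search with a small trial budget therefore explores only a small part of the grid; raising `max_trials` trades compute time for better coverage.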

In addition to a default set of hyper-parameters, TF-DF provides a list of predefined hyper-parameter templates that you can consider.

In [None]:
print(tfdf.keras.GradientBoostedTreesModel.predefined_hyperparameters())

Here, however, we won't rely on the default or predefined hyper-parameters; we tune them with a random search to get better results. As before, the task is explicitly set to Regression because TF-DF uses Classification by default.

In [None]:
keras_tuner = kt.RandomSearch(
    build_model,
    objective=kt.Objective("val_root_mean_squared_error", direction="min"),
    max_trials=10,
    overwrite=True,
    directory="/tmp/keras_tuning")

# Important: The tuning should not be done on the test dataset.

# Extract a validation dataset from the training dataset. The new training
# dataset is called the "sub-training-dataset".
sub_train_df, sub_valid_df = split_dataset(train_df)
sub_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_train_df, label="target", task=tfdf.keras.Task.REGRESSION)
sub_valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_valid_df, label="target", task=tfdf.keras.Task.REGRESSION)

# Tune the model
keras_tuner.search(sub_train_ds, validation_data=sub_valid_ds)

In [None]:
# Best hyper-parameters.
best_hyper_parameters = keras_tuner.get_best_hyperparameters()[0].values
print("Best hyper-parameters:", best_hyper_parameters)

In [None]:
# Retrain a model with the best hyper-parameters on the full training dataset
# # The Keras tuner automatically converts boolean parameters to integers.
# best_hyper_parameters["use_hessian_gain"] = bool(best_hyper_parameters["use_hessian_gain"])
best_model = tfdf.keras.GradientBoostedTreesModel(**best_hyper_parameters, task=tfdf.keras.Task.REGRESSION)
best_model.fit(train_tfds, verbose=2)

# Plot the model

We plot how the RMSE of the best model evolves as trees are added during training.

In [None]:
plot_tfdf_model_training_curves(best_model)

# Evaluate the model

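For this dataset, submissions are scored on the root mean squared error (RMSE), so we evaluate the model with that metric. Note that the evaluation below runs on the training dataset, so it gives an optimistic estimate of the score to expect on unseen data.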

In [None]:
best_model.compile(metrics=[tf.keras.metrics.RootMeanSquaredError()])

In [None]:
best_model.evaluate(train_tfds)
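
As a reminder of what the metric measures, here is a minimal sketch of the RMSE computation in plain Python. The targets and predictions are made up for illustration, and `root_mean_squared_error` is a hypothetical helper rather than the Keras metric used above, although both are meant to compute the same quantity.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, for illustration only.
y_true = [7.0, 8.0, 9.0, 6.0]
y_pred = [7.5, 8.0, 8.0, 6.5]
rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE: {rmse:.4f}")
```

Because the errors are squared before averaging, a single large error weighs more on the RMSE than several small ones.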

Variable importances generally indicate how much a variable contributes to the model's predictions or quality. The SUM_SCORE variable importance is the sum of the split scores of all the splits that use a given feature. The larger the value, the more important the feature.

In [None]:
best_model.summary()
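
To make the SUM_SCORE definition concrete, the sketch below sums split scores per feature on a handful of hypothetical splits. The `(feature, score)` pairs and the `sum_score_importance` helper are invented for illustration; the real values are computed by TF-DF from the trained trees and reported in the model summary.

```python
from collections import defaultdict

# Hypothetical (feature, split_score) pairs, one per split node across all
# trees of a model. Real values would come from the trained model.
splits = [
    ("cont3", 0.8),
    ("cat1", 0.5),
    ("cont3", 0.4),
    ("cont7", 0.1),
    ("cat1", 0.2),
]


def sum_score_importance(splits):
    """Sums the split scores per feature, sorted from most to least important."""
    totals = defaultdict(float)
    for feature, score in splits:
        totals[feature] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


importances = sum_score_importance(splits)
for feature, total in importances:
    print(f"{feature}: {total:.2f}")
```

A feature that is used in many splits, or in a few splits with high scores, ends up at the top of the ranking.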

# Make submission

In [None]:
sample_submission_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/sample_submission.csv')
# predict() returns an array of shape (num_examples, 1); flatten it to one value per row.
sample_submission_df['target'] = best_model.predict(test_tfds).ravel()
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()