# Gradient Boosted Decision Tree (GBDT)
Implement a Gradient Boosted Decision Tree (GBDT) with TensorFlow. This example is using the Boston Housing Value dataset as training samples. The example supports both Classification (2 classes: value > $23000 or not) and Regression (raw home value as target).

- Original Author: Aymeric Damien
- Author: Ron Li
- Project: https://github.com/rongpenl/TensorFlow-Examples/

## Boston Housing Dataset

**Link:** https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

**Description:**

The dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small in size with only 506 cases.

The data was originally published by Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.`

*For the full features list, please see the link above*

As Jan 13, 2021, upgraded the example with official Tensorflow tutorial [here](https://www.tensorflow.org/tutorials/estimator/boosted_trees_model_understanding).

In [1]:
# Ignore all GPUs (current TF GBDT does not support GPU).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "1"

import tensorflow as tf
import numpy as np
import copy

In [121]:
# Dataset parameters.
num_classes = 2 # Total classes: greater or equal to $23,000, or not (See notes below).
num_features = 13 # data features size.

# Training parameters.
max_steps = 2000
batch_size = 256
learning_rate = 1.0
l1_regul = 0.0
l2_regul = 0.1

# GBDT parameters.
num_batches_per_layer = 1000
num_trees = 10
max_depth = 4

In [124]:
# Prepare Boston Housing Dataset.
from tensorflow.keras.datasets import boston_housing
import pandas as pd
tf.random.set_seed(123)

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

y_train_binary = y_train >= 23.0
y_test_binary = y_test < 23.0

# Per the instruction of tensorflow v2, the input has to be a DataFrame.

x_train = pd.DataFrame(x_train, columns = ["x" + str(i) for i in range(x_train.shape[1])])
x_test = pd.DataFrame(x_test, columns = ["x" + str(i) for i in range(x_train.shape[1])])
y_train_binary = pd.DataFrame({"expensive": y_train_binary.astype(int)})
y_test_binary = pd.DataFrame({"expensive": y_test_binary.astype(int)})

In [114]:
# A legitmate input in input function.
x_train.to_dict(orient='list');

##  Create feature columns, input_fn, and the train the estimator

### Preprocess the data

GBDT Models from TF Estimator requires 'feature_column' data format.

Since all feature columns are numerical, let's create a `feature_columns` variable so keep track of them. Note that categorical variables are treated differently.

In [126]:
feature_columns = []
for i in range(x_train.shape[1]):
    feature_columns.append(tf.feature_column.numeric_column(key='x' + str(i), shape=(1,)))
len(feature_columns)

13

Build the input function.
This format is from tensorflow v2 now.
As you can see from the `help` example, `dictionary` structure is also preserved.

In [127]:
#help(tf.data.Dataset.from_tensor_slices);

In [128]:
#tf.data.Dataset.from_tensor_slices((x_train.to_dict(orient='list'), y_train)).batch(20).prefetch(1)

In [129]:
NUM_EXAMPLES = len(y_train)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
    if shuffle:
      dataset = dataset.shuffle(NUM_EXAMPLES)
    # For training, cycle thru dataset as many times as need (n_epochs=None).
    dataset = (dataset
      .repeat(n_epochs)
      .batch(NUM_EXAMPLES))
    return dataset
  return input_fn
train_input_fn = make_input_fn(x_train, y_train_binary)
test_input_fn = make_input_fn(x_test, y_test_binary, n_epochs = 1)
evaluate_train_input_fn = make_input_fn(x_train, y_train_binary, n_epochs = 1, shuffle = False)

### GBDT Classifier

In [133]:
params = {
  'n_trees': num_trees,
  'max_depth': max_depth,
  'n_batches_per_layer': num_batches_per_layer,
#   'center_bias': True,
    "feature_columns":feature_columns,
    "n_classes":num_classes,
    "learning_rate":learning_rate,
    "l1_regularization": l1_regul,
    "l2_regularization":l2_regul
}


In [134]:
%%capture
gbdt_classifier = tf.estimator.BoostedTreesClassifier(**params)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpv9co0ot5', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


Train.

In [135]:
gbdt_classifier.train(train_input_fn, max_steps=max_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpv9co0ot5/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.6931474, step = 0
INFO:tensorflow:loss = 0.6931474, step = 0 (1.021 sec)
INFO:tensorflow:loss = 0.6931474, step = 0 (0.715 sec)
INFO:tensorflow:loss = 0.6931474, step = 0 (0.678 sec)
INFO:tensorflow:loss = 0.6931474, step = 0 (0.676 sec)
INFO:tensorflow:loss = 0.6931474, step = 0 (0.662 sec)
INFO:tensorflow:loss = 0.6931474, step = 0 (0.658 sec)


KeyboardInterrupt: 

Evaluate train input.

In [123]:
gbdt_classifier.evaluate(evaluate_train_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-01-13T11:49:48Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmprmu69gni/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.80092s
INFO:tensorflow:Finished evaluation at 2021-01-13-11:49:49
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.6311881, accuracy_baseline = 0.63118815, auc = 0.5, auc_precision_recall = 0.6844059, average_loss = 0.69230825, global_step = 2000, label/mean = 0.36881188, loss = 0.69230825, precision = 0.0, prediction/mean = 0.4983915, recall = 0.0
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmprmu69gni/model.ckpt-2000


{'accuracy': 0.6311881,
 'accuracy_baseline': 0.63118815,
 'auc': 0.5,
 'auc_precision_recall': 0.6844059,
 'average_loss': 0.69230825,
 'label/mean': 0.36881188,
 'loss': 0.69230825,
 'precision': 0.0,
 'prediction/mean': 0.4983915,
 'recall': 0.0,
 'global_step': 2000}

In [109]:
gbdt_classifier.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-01-13T11:43:32Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpxkg2xk8a/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.77667s
INFO:tensorflow:Finished evaluation at 2021-01-13-11:43:33
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.44117647, accuracy_baseline = 0.5588235, auc = 0.5, auc_precision_recall = 0.7794118, average_loss = 0.69353086, global_step = 2000, label/mean = 0.5588235, loss = 0.69353086, precision = 0.0, prediction/mean = 0.49839133, recall = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmpxkg2xk8a/model.ckpt-2000


{'accuracy': 0.44117647,
 'accuracy_baseline': 0.5588235,
 'auc': 0.5,
 'auc_precision_recall': 0.7794118,
 'average_loss': 0.69353086,
 'label/mean': 0.5588235,
 'loss': 0.69353086,
 'precision': 0.0,
 'prediction/mean': 0.49839133,
 'recall': 0.0,
 'global_step': 2000}

### GBDT Regressor

In [20]:
# Build the input function.
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_train}, y=y_train,
    batch_size=batch_size, num_epochs=None, shuffle=True)
test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_test}, y=y_test,
    batch_size=batch_size, num_epochs=1, shuffle=False)
# GBDT Models from TF Estimator requires 'feature_column' data format.
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(num_features,))]

In [21]:
gbdt_regressor = tf.estimator.BoostedTreesRegressor(
    n_batches_per_layer=num_batches_per_layer,
    feature_columns=feature_columns, 
    learning_rate=learning_rate, 
    n_trees=num_trees,
    max_depth=max_depth,
    l1_regularization=l1_regul, 
    l2_regularization=l2_regul
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp859bmbqp', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [22]:
%%capture
gbdt_regressor.train(train_input_fn, max_steps=max_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp859bmbqp/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 596.60876, step = 0
INFO:tensorflow:loss = 607.56335, step = 0 (0.535 sec)
INFO:tensorflow:loss = 598.3287, step = 0 (0.233 sec)
INFO:tensorflow:loss = 600.50073, step = 0 (0.226 sec)
INFO:tensorflow:loss = 568.95374, step = 0 (0.254 sec)
INFO:tensorflow:loss = 575.83813, step = 0 (0.246 sec)
INFO:tensorflow:loss = 595.593, step = 0 (0.238 sec)
INFO:tensorflow:loss = 609.5604, step = 

In [23]:
gbdt_regressor.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-01-13T10:42:09Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp859bmbqp/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.49279s
INFO:tensorflow:Finished evaluation at 2021-01-13-10:42:09
INFO:tensorflow:Saving dict for global step 2000: average_loss = 29.68949, global_step = 2000, label/mean = 23.078432, loss = 29.68949, prediction/mean = 22.495186
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmp859bmbqp/model.ckpt-2000


{'average_loss': 29.68949,
 'label/mean': 23.078432,
 'loss': 29.68949,
 'prediction/mean': 22.495186,
 'global_step': 2000}