# TensorFlow Decision Forests (TF-DF) models

Author: __Thirada Tiamklang 14337188__

        AT3 - Data Product with Machine 
        
__Table of contents__
1. Load dataset
2. Train TensorFlow Decision Forests (TF-DF) models
    - 2.1 Random Forest
    - 2.2 Gradient Boosted Trees
3. Reference

## 1. Load Dataset

In [1]:
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")

In [2]:
import sys
print(sys.path)

['/Users/thiradatiamklang/Desktop/flight-streamlit-at3/flight-prediction/notebooks/TT_notebooks', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python310.zip', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/lib-dynload', '', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages']


In [3]:
import sys
# Make the project's src/ package importable from this notebook
sys.path.append('../../src')


In [4]:
from data.make_dataset import load_sets
# Load the pre-split train, validation and test sets from data/processed/
X_train, y_train, X_val, y_val, X_test, y_test = load_sets(path='../../data/processed/')

In [5]:
import pandas as pd
# Read a Feather file
df = pd.read_feather('../../data/processed/df_cleaned_select_cols.feather')

In [6]:
from data.make_dataset import pop_target
features, target = pop_target(df, 'totalFare')

In [7]:
y_train

array([-0.31391783, -0.50182474, -1.22936172, ..., -0.89209292,
        1.17970114, -0.04892092])

In [8]:
X_train

array([[ 1.07916650e+07, -9.95738027e-01,  1.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 8.53247500e+06, -7.88605694e-01,  1.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.91918400e+06, -5.74075777e-01,  1.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 1.18766450e+07, -4.60646166e-01,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.16629820e+07,  4.85011571e-01,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.58366700e+06,  1.26422368e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])

## 2. Train TensorFlow Decision Forests (TF-DF) models

In [11]:
import tensorflow_decision_forests as tfdf
import tensorflow as tf
from sklearn.metrics import mean_squared_error

# Convert data to tf.data.Datasets
batch_size = 100

# Assuming X_train, y_train, X_val, y_val, X_test, and y_test are NumPy arrays or tensors
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_dataset = train_dataset.batch(batch_size)

validation_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
validation_dataset = validation_dataset.batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))
test_dataset = test_dataset.batch(batch_size)
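
As a rough illustration of what `batch(batch_size)` does to the dataset sizes reported in the training logs, the sketch below counts the resulting batches in plain Python. The helper `count_batches` is hypothetical and not part of TensorFlow; it only mirrors the arithmetic, assuming the default behaviour in which the final partial batch is kept.

```python
def count_batches(num_examples: int, batch_size: int) -> tuple[int, int]:
    """Return (number of batches, size of the last batch), keeping the final partial batch."""
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    num_batches = (num_examples + batch_size - 1) // batch_size  # integer ceiling
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch

# Example counts taken from the training logs in this notebook.
train_batches = count_batches(8_111_999, 100)
val_batches = count_batches(2_704_000, 100)
print(train_batches)
print(val_batches)
```

TF-DF reads the whole dataset once before building its trees, so the batch size should mainly affect how the data is read rather than the trained model.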


### 2.1 Train Random Forest models

In [12]:
# Define the TensorFlow Decision Forest model
model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)

# Compile the model
model.compile(metrics=["mean_squared_error"])

# Fit the model
model.fit(train_dataset, epochs=1, validation_data=validation_dataset)

Use /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmpxovdtnoa as temporary training directory
Reading training dataset...
Training dataset read in 0:00:09.935749. Found 8111999 examples.
Reading validation dataset...
Num validation examples: tf.Tensor(2704000, shape=(), dtype=int32)
Validation dataset read in 0:00:03.033918. Found 2704000 examples.
Training model...


[INFO 23-11-05 17:15:41.5144 AEDT kernel.cc:1233] Loading model from path /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmpxovdtnoa/model/ with prefix 50bb42066cbe42f4
[INFO 23-11-05 17:15:50.2774 AEDT decision_forest.cc:660] Model loaded with 300 root(s), 8111404 node(s), and 9 input feature(s).
[INFO 23-11-05 17:15:50.2775 AEDT abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 23-11-05 17:15:50.2775 AEDT kernel.cc:1061] Use fast generic engine


Model trained in 0:25:41.002621
Compiling model...
Model compiled.


<keras.src.callbacks.History at 0x178b63df0>

In [16]:
# Evaluate the model on train set
results = model.evaluate(train_dataset)



In [15]:
# Evaluate the model on test set
results = model.evaluate(test_dataset)

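# Print evaluation results.
# evaluate() returns [loss, mean_squared_error]. TF-DF models are compiled
# without a Keras loss, so results[0] is reported as 0.0 and the MSE is at index 1.
mse = results[1]
print(f'Mean Squared Error: {mse}')

# Make predictions
predictions = model.predict(test_dataset)


__Note:__ The train and test MSE are nearly identical (approximately `0.3072`), which indicates little overfitting. The original run of this cell printed `Mean Squared Error: 0.0`, most likely because it read `results[0]` (the loss slot) rather than the metric.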
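
To make the train/test comparison concrete, here is a minimal sketch of how MSE is computed and how the gap between two scores could be quantified. The function names and toy numbers are illustrative assumptions, not values from this notebook; only `0.3072` comes from the note above. The printed `y_train` values suggest the target is standardized (an assumption), in which case the square root of the MSE can be read as an error in standard deviations of the fare.

```python
import math

def mse_score(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_gap(train_mse, test_mse):
    """Relative difference between test and train MSE; values near 0 suggest little overfitting."""
    return (test_mse - train_mse) / train_mse

# Toy values, for illustration only.
toy_mse = mse_score([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

# RMSE implied by the MSE quoted in the note (0.3072).
rmse = math.sqrt(0.3072)
print(f"toy MSE: {toy_mse:.4f}, RMSE for MSE=0.3072: {rmse:.4f}")
```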

#### Model Architecture

In [22]:
# Define custom model architecture and hyperparameters
model_rf1 = tfdf.keras.RandomForestModel(
    task=tfdf.keras.Task.REGRESSION,
    num_trees=100,
    max_depth=10,     
)

# Compile the model
model_rf1.compile(metrics=["mean_squared_error"])

# Train the model
model_rf1.fit(train_dataset, epochs=1, validation_data=validation_dataset)

Use /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmpc_884n_k as temporary training directory
Reading training dataset...
Training dataset read in 0:00:08.813471. Found 8111999 examples.
Reading validation dataset...
Num validation examples: tf.Tensor(2704000, shape=(), dtype=int32)
Validation dataset read in 0:00:02.932822. Found 2704000 examples.
Training model...
Model trained in 0:03:54.328372
Compiling model...


[INFO 23-11-05 18:12:55.3693 AEDT kernel.cc:1233] Loading model from path /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmpc_884n_k/model/ with prefix f4e04b1c040b4de2
[INFO 23-11-05 18:12:55.4977 AEDT decision_forest.cc:660] Model loaded with 100 root(s), 96978 node(s), and 9 input feature(s).
[INFO 23-11-05 18:12:55.4977 AEDT abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 23-11-05 18:12:55.4977 AEDT kernel.cc:1061] Use fast generic engine


Model compiled.


<keras.src.callbacks.History at 0x3c76833d0>

In [30]:
# Evaluate the model on train set
results1 = model_rf1.evaluate(train_dataset)



In [29]:
# Evaluate the model on test set
results1 = model_rf1.evaluate(test_dataset)



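__Note:__ The custom configuration (`num_trees=100`, `max_depth=10`) gives a worse (higher) MSE than the default model. The training logs make this plausible: the custom forest has 96,978 nodes across 100 trees, against 8,111,404 nodes across 300 trees for the default model, so it has far less capacity. In exchange, it trained in under 4 minutes instead of about 26.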
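
The node counts in the two training logs give a rough sense of why the smaller configuration may underfit. The sketch below only does arithmetic on the logged numbers. It assumes the TF-DF convention, as the documentation appears to describe it, that `max_depth=1` means a root-only tree, so a binary tree with `max_depth=d` has at most `2**d - 1` nodes. The helper names are hypothetical.

```python
def avg_nodes_per_tree(total_nodes: int, num_trees: int) -> float:
    """Average node count per tree in a forest."""
    return total_nodes / num_trees

def max_nodes_for_depth(max_depth: int) -> int:
    """Upper bound on nodes in a binary tree, assuming max_depth=1 is a root-only tree."""
    return 2 ** max_depth - 1

default_avg = avg_nodes_per_tree(8_111_404, 300)  # default model log
custom_avg = avg_nodes_per_tree(96_978, 100)      # num_trees=100, max_depth=10 log
depth_cap = max_nodes_for_depth(10)

print(f"default: {default_avg:.0f} nodes/tree")
print(f"custom:  {custom_avg:.0f} nodes/tree (cap {depth_cap})")
```

Under that assumption the custom trees sit close to the depth cap on average, which suggests that the depth limit, rather than the number of trees, is the tighter constraint.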

### 2.2 Gradient Boosted Trees

In [26]:
# Define the Gradient Boosted Trees model
model_gbdt = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION)
model_gbdt.compile(metrics=["mean_squared_error"])
# Fit the model
model_gbdt.fit(train_dataset)

Use /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmp3z5s4ykg as temporary training directory
Reading training dataset...




Training dataset read in 0:00:08.345027. Found 8111999 examples.
Training model...
Model trained in 0:17:41.473159
Compiling model...


[INFO 23-11-05 18:49:06.3711 AEDT kernel.cc:1233] Loading model from path /var/folders/sb/nxrzyd0n61192x17k7wcyr4w0000gn/T/tmp3z5s4ykg/model/ with prefix 15e13e94eeaa4a9f
[INFO 23-11-05 18:49:06.4063 AEDT abstract_model.cc:1344] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 23-11-05 18:49:06.4063 AEDT kernel.cc:1061] Use fast generic engine


Model compiled.


<keras.src.callbacks.History at 0x3c7fa1e40>

In [28]:
# Evaluate the Gradient Boosted Trees model on the training set
results_gbdt = model_gbdt.evaluate(train_dataset)



In [27]:
# Evaluate the Gradient Boosted Trees model on the test set
results_gbdt = model_gbdt.evaluate(test_dataset)



__Note:__ The Random Forest model achieved a slightly lower MSE than the Gradient Boosted Trees (GBT) model, so we select it as the best model and will use it to make predictions on new data in the Streamlit app.
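
The selection rule described in the note can be sketched as picking the model with the lowest MSE. The helper `select_best` is hypothetical, and the scores below are placeholders chosen for illustration only. They are not results from this notebook.

```python
def select_best(scores: dict[str, float]) -> str:
    """Return the name of the model with the lowest MSE."""
    if not scores:
        raise ValueError("scores must not be empty")
    return min(scores, key=scores.get)

# Placeholder MSE values for illustration only; NOT results from this notebook.
placeholder_scores = {"random_forest": 0.30, "gradient_boosted_trees": 0.32}
best = select_best(placeholder_scores)
print(best)
```

A common practice is to make this choice on the validation set and report the test score once for the chosen model, so that the test estimate stays unbiased.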

__Save the best model__

In [31]:
import joblib
# Save the default Random Forest model (`model`, trained in section 2.1)
joblib.dump(model, '../../models/tfdf.joblib')

['../../models/tfdf.joblib']

## 3. Reference

Anthony So. (2023). _36120_AdvMLA_Lab6_Exercise1_Solutions.ipynb_. Google Colab. https://colab.research.google.com/drive/1BD81KJ1ixR1Z-cE1GYJ3RFydkJzy3ZWY?authuser=1#scrollTo=Goi9jTI_B1KE

TensorFlow Decision Forests. (2023). _TensorFlow Decision Forests API documentation_. TensorFlow. https://www.tensorflow.org/decision_forests/api_docs/python/tfdf