<a href="https://www.kaggle.com/rsizem2/tps-09-21-tensorflow-decision-forests?scriptVersionId=84694619" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# TPS 9/21 - TensorFlow Decision Forests

In this notebook we use the relatively new [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests) library to get baselines for the [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel) and [Random Forest](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel) models. We mostly use default settings, with the following exceptions:

* We create a feature `nan_count`, the number of missing values in each row, based on [this discussion](https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/270206).
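* For the Gradient Boosted Trees model, the goal is a high value for `num_trees` combined with early stopping to avoid overfitting on each fold. Note that the training function in this notebook constructs the model with default arguments, so it relies on the library's default tree count and early-stopping behavior.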

**Note:** This library doesn't support many optimizations at this time, so the notebook takes a couple of hours to run.
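
As a minimal sketch of the `nan_count` feature, the toy frame below counts missing values per row with `isnull().sum(axis=1)`, the same call this notebook applies to the competition data. The column names `f1`, `f2`, `f3` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the column names f1..f3 are illustrative, not from the competition
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [np.nan, np.nan, 0.5],
    "f3": [2.0, 4.0, 6.0],
})

# Count the missing values in each row and store the count as a new feature
toy["nan_count"] = toy.isnull().sum(axis=1)
print(toy["nan_count"].tolist())
```

The count is computed before any imputation or downcasting, so it reflects the raw missingness pattern of each row.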

In [1]:
# Global variables for testing changes to this notebook quickly
TRAIN_SIZE = 300000
NUM_FOLDS = 6
RANDOM_SEED = 0

In [2]:
# Install Tensorflow Decision Forests
!pip3 install -q tensorflow_decision_forests



## Imports

In [3]:
# Essentials
import numpy as np
import pandas as pd
import pyarrow
import time
import gc

# Hide warnings
import os
import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

# Models and Evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import matplotlib.pyplot as plt

# set global seed for tensorflow
tf.random.set_seed(RANDOM_SEED)

## Load Data

In [4]:
%%time

# Load training data
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
train.drop('id', axis = 'columns', inplace = True)
train = train_test_split(
    train, 
    train_size = TRAIN_SIZE,
    random_state = RANDOM_SEED,
    stratify = train['claim'],
)[0]

# Load test data
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
test.drop('id', axis = 'columns', inplace = True)

# Create NaN feature: number of missing values in each row
train["nan_count"] = train.isnull().sum(axis=1)
test["nan_count"] = test.isnull().sum(axis=1)

# Downcast training data to the smallest numeric dtypes that fit
# (Series.iteritems() was removed in pandas 2.0; items() is the replacement)
for col, dtype in train.dtypes.items():
    if dtype.name.startswith('int'):
        train[col] = pd.to_numeric(train[col], downcast ='integer')
    elif dtype.name.startswith('float'):
        train[col] = pd.to_numeric(train[col], downcast ='float')

# Downcast test data the same way
for col, dtype in test.dtypes.items():
    if dtype.name.startswith('int'):
        test[col] = pd.to_numeric(test[col], downcast ='integer')
    elif dtype.name.startswith('float'):
        test[col] = pd.to_numeric(test[col], downcast ='float')

# Get relevant features, load sample submission
gc.collect()
features = [x for x in train.columns if x not in ['id','claim']]
submission = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')

CPU times: user 39.4 s, sys: 9.94 s, total: 49.3 s
Wall time: 59.8 s
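
To illustrate what the downcasting loops in the cell above do, here is a small sketch on a made-up two-column frame (the names `flag` and `value` are hypothetical). It assumes a pandas version where `Series.items()` is available and compares memory use before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the default 64-bit dtypes
toy = pd.DataFrame({
    "flag": np.array([0, 1, 1], dtype="int64"),
    "value": np.array([0.25, np.nan, 1.5], dtype="float64"),
})
before = toy.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that holds its values
for col, dtype in toy.dtypes.items():
    if dtype.name.startswith("int"):
        toy[col] = pd.to_numeric(toy[col], downcast="integer")
    elif dtype.name.startswith("float"):
        toy[col] = pd.to_numeric(toy[col], downcast="float")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes.to_dict(), before, after)
```

Float columns go no lower than `float32`, and missing values are preserved, so the `nan_count` feature is unaffected by this step.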


# Models

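Each function below trains one model type with stratified k-fold cross-validation and returns the mean validation AUC, the fold-averaged test predictions, the out-of-fold predictions, and the elapsed time.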

## 1. Gradient Boosted Trees

In [5]:
def score_gradient_boosting():

    # Vectors to store predictions/scores
    X_test = test[features].to_numpy()
    test_preds = np.zeros((test.shape[0],))
    oof_preds = np.zeros((train.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    start = time.time()
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["claim"])):
       
        # Training and Validation Sets
        X_train = train[features].iloc[train_idx].to_numpy()
        X_valid = train[features].iloc[valid_idx].to_numpy()
        y_train = train["claim"].iloc[train_idx].to_numpy()
        y_valid = train["claim"].iloc[valid_idx].to_numpy()
        
        # Define and train model
        model = tfdf.keras.GradientBoostedTreesModel(verbose = 1)
        model.compile(metrics=[tf.metrics.AUC()])
        model.fit(X_train, y_train, verbose = 1)
        
        # Get predictions
        valid_preds = model.predict(X_valid)[:,0]
        test_preds += model.predict(X_test)[:,0] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        print(f'Validation AUC: {round(scores[fold], 6)}.\n')
    
    end = time.time()
    print("Average AUC:", round(scores.mean(), 6))
    print("Worst AUC:", round(scores.min(), 6))
    

    return scores.mean(), test_preds, oof_preds, end-start

## 2. Random Forest

In [6]:
def score_random_forest():

    # Vectors to store predictions/scores
    X_test = test[features].to_numpy()
    test_preds = np.zeros((test.shape[0],))
    oof_preds = np.zeros((train.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    start = time.time()
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["claim"])):
       
        # Training and Validation Sets
        X_train = train[features].iloc[train_idx].to_numpy()
        X_valid = train[features].iloc[valid_idx].to_numpy()
        y_train = train["claim"].iloc[train_idx].to_numpy()
        y_valid = train["claim"].iloc[valid_idx].to_numpy()
        
        # Define and train model
        model = tfdf.keras.RandomForestModel(verbose = 1)
        model.compile(metrics=[tf.metrics.AUC()])
        model.fit(X_train, y_train, verbose = 1)
        
        # Get predictions
        valid_preds = model.predict(X_valid)[:,0]
        test_preds += model.predict(X_test)[:,0] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        print(f'Validation AUC: {round(scores[fold], 6)}.\n')
    
    end = time.time()
    print("Average AUC:", round(scores.mean(), 6))
    print("Worst AUC:", round(scores.min(), 6))

    return scores.mean(), test_preds, oof_preds, end-start
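
The two functions above differ only in the model they construct, so the shared out-of-fold pattern can be sketched as one helper that takes a model factory. This is an illustrative sketch, not the notebook's code: `cross_validate` and `make_model` are hypothetical names, the data is synthetic, and scikit-learn's `LogisticRegression` stands in for the TensorFlow Decision Forests models so the example runs quickly. A TF-DF model would use `model.predict(X)[:, 0]` where this sketch uses `predict_proba(X)[:, 1]`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cross_validate(make_model, X, y, X_test, num_folds=3, seed=0):
    """Return (mean AUC, fold-averaged test predictions, out-of-fold predictions)."""
    test_preds = np.zeros(X_test.shape[0])
    oof_preds = np.zeros(X.shape[0])
    scores = np.zeros(num_folds)
    skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        model = make_model()  # fresh, untrained model for every fold
        model.fit(X[train_idx], y[train_idx])
        # Each training row is predicted exactly once, by a model that never saw it
        valid_preds = model.predict_proba(X[valid_idx])[:, 1]
        oof_preds[valid_idx] = valid_preds
        # Test predictions are averaged over the fold models
        test_preds += model.predict_proba(X_test)[:, 1] / num_folds
        scores[fold] = roc_auc_score(y[valid_idx], valid_preds)
    return scores.mean(), test_preds, oof_preds


# Synthetic stand-in data: 240 rows for training, 60 rows as a "test" set
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=0)
X, y, X_test = X_all[:240], y_all[:240], X_all[240:]

mean_auc, test_preds, oof_preds = cross_validate(
    lambda: LogisticRegression(max_iter=1000), X, y, X_test
)
print(round(mean_auc, 4), test_preds.shape, oof_preds.shape)
```

Passing a factory rather than a model instance guarantees that no fitted state leaks between folds.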

# Training

## 1. Gradient Boosted Trees

In [7]:
gbdt_score, gbdt_preds, gbdt_oof, gbdt_time = score_gradient_boosting()

submission['claim'] = gbdt_preds
submission.to_csv('gbtree_submission.csv', index=False)

Use /tmp/tmpchzznjsn as temporary training directory
Starting reading the dataset
Dataset read in 0:00:18.603547
Training model
Model trained in 0:19:27.585488
Compiling model
Validation AUC: 0.812995.

Use /tmp/tmpjphj2ptg as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.778156
Training model
Model trained in 0:13:29.088239
Compiling model
Validation AUC: 0.807368.

Use /tmp/tmpce9o03xn as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.889591
Training model
Model trained in 0:15:35.797297
Compiling model
Validation AUC: 0.806764.

Use /tmp/tmprgjljkse as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.994699
Training model
Model trained in 0:12:23.342971
Compiling model
Validation AUC: 0.809504.

Use /tmp/tmpyvra1pqu as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.262989
Training model
Model trained in 0:18:04.284066
Compiling model
Validation A

## 2. Random Forest

In [8]:
rf_score, rf_preds, rf_oof, rf_time = score_random_forest()

submission['claim'] = rf_preds
submission.to_csv('randomforest_submission.csv', index=False)

Use /tmp/tmpuqvsn_d1 as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.997593
Training model
Model trained in 0:07:10.928583
Compiling model
Validation AUC: 0.805504.

Use /tmp/tmpxpp_hd5t as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.965623
Training model
Model trained in 0:07:09.252985
Compiling model
Validation AUC: 0.798638.

Use /tmp/tmpnigl86sg as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.306202
Training model
Model trained in 0:07:00.372862
Compiling model
Validation AUC: 0.79957.

Use /tmp/tmpwl5wk51x as temporary training directory
Starting reading the dataset
Dataset read in 0:00:14.329549
Training model
Model trained in 0:07:13.085498
Compiling model
Validation AUC: 0.801356.

Use /tmp/tmp3t7q5zv7 as temporary training directory
Starting reading the dataset
Dataset read in 0:00:12.662401
Training model
Model trained in 0:07:14.598818
Compiling model
Validation AU