<a href="https://www.kaggle.com/rsizem2/tps-09-21-tensorflow-decision-forests?scriptVersionId=84711232" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# TensorFlow Decision Forests

In this notebook we use the relatively new [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests) library to get baseline scores for its [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel) and [Random Forest](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel) models. We mostly use the default settings, with the following exceptions:

* We create a feature `nan_count`, the number of missing values in each row, based on [this discussion](https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/270206).
* For the Gradient Boosted Trees model, we set a high value for `num_trees` and enable early stopping to avoid overfitting on each fold.
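
As a small illustration of the first bullet, the `nan_count` feature can be sketched on a toy frame. The column names `f1` to `f3` and their values are made up for this example; only the `isnull().sum(axis=1)` idiom is taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns (not the competition data)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, np.nan],
    "f2": [0.5, 2.0, np.nan],
    "f3": [3.0, 4.0, 5.0],
})

# Count the missing values across each row
toy["nan_count"] = toy.isnull().sum(axis=1)

print(toy["nan_count"].tolist())  # [0, 1, 2]
```

Because the count is taken before any imputation, it keeps the information about how many values were missing in each row.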

**Note:** This library doesn't support many optimizations at this time, so the notebook takes a couple of hours to run.

In [1]:
# Global variables for testing changes to this notebook quickly
TRAIN_SIZE = 300000
NUM_FOLDS = 6
RANDOM_SEED = 0

In [2]:
# Install TensorFlow Decision Forests
!pip3 install -q tensorflow_decision_forests



## Imports

In [3]:
# Essentials
import numpy as np
import pandas as pd
import warnings
import time
import gc
import os

# Hide warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

# Model selection and evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score

# TensorFlow
import tensorflow as tf
import tensorflow_decision_forests as tfdf
tf.random.set_seed(RANDOM_SEED)

## Load Data

In [4]:
def load_data(num_samples):
    # Read the full training set, then keep a stratified subsample of num_samples rows
    train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
    train, _ = train_test_split(
        train, 
        train_size = num_samples,
        random_state = RANDOM_SEED,
        stratify = train['claim']
    )
    return train

def downcast(input_df):
    data = input_df.copy()
    for col, dtype in data.dtypes.items():
        if dtype.name.startswith('int'):
            data[col] = pd.to_numeric(data[col], downcast ='integer')
        elif dtype.name.startswith('float'):
            data[col] = pd.to_numeric(data[col], downcast ='float')
    return data

In [5]:
%%time

# Load subset of training data
train = load_data(TRAIN_SIZE)
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')

# Drop irrelevant columns
train.drop('id', axis = 'columns', inplace = True)
test.drop('id', axis = 'columns', inplace = True)

# Create the nan_count feature: number of missing values in each row
train["nan_count"] = train.isnull().sum(axis=1)
test["nan_count"] = test.isnull().sum(axis=1)

# Downcast training and test data
train = downcast(train)
test = downcast(test)
gc.collect()

# Get the feature columns (everything except the target)
features = [x for x in train.columns if x not in ['id','claim']]

CPU times: user 41 s, sys: 10.3 s, total: 51.3 s
Wall time: 1min 1s


# Models

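Each function below runs stratified k-fold cross-validation for one model type, prints the AUC for each fold, and returns the mean AUC, the fold-averaged test predictions, and the out-of-fold predictions on the training data.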
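
The cross-validation pattern shared by both functions can be sketched on its own. This is a minimal illustration under stated assumptions, not the notebook's code: the data is synthetic (from `make_classification`), `LogisticRegression` stands in for the TF-DF models so the loop runs quickly, and `NUM_FOLDS` is set to 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

NUM_FOLDS = 3  # smaller than the notebook's value, to keep the sketch fast

# Synthetic stand-in data (not the competition data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train_all, y_train_all = X[:240], y[:240]
X_test = X[240:]

test_preds = np.zeros(X_test.shape[0])
oof_preds = np.zeros(X_train_all.shape[0])
scores = np.zeros(NUM_FOLDS)

skf = StratifiedKFold(n_splits=NUM_FOLDS, shuffle=True, random_state=0)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_train_all, y_train_all)):
    # Stand-in model; the notebook trains a TF-DF model at this step
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_all[train_idx], y_train_all[train_idx])

    # Each training row is predicted once, by the model that did not see it
    valid_preds = model.predict_proba(X_train_all[valid_idx])[:, 1]
    oof_preds[valid_idx] = valid_preds

    # Test predictions are averaged over the folds
    test_preds += model.predict_proba(X_test)[:, 1] / NUM_FOLDS
    scores[fold] = roc_auc_score(y_train_all[valid_idx], valid_preds)

print("Average AUC:", round(scores.mean(), 6))
```

The out-of-fold predictions cover every training row exactly once, which makes them usable later for blending or stacking, while the test predictions are an equal-weight average of the per-fold models.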

## 1. Gradient Boosted Trees

In [6]:
def score_gradient_boosting():

    # Vectors to store predictions/scores
    X_test = test[features].to_numpy()
    test_preds = np.zeros((test.shape[0],))
    oof_preds = np.zeros((train.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["claim"])):
       
        # Training and Validation Sets
        start = time.time()
        X_train = train[features].iloc[train_idx].to_numpy()
        X_valid = train[features].iloc[valid_idx].to_numpy()
        y_train = train["claim"].iloc[train_idx].to_numpy()
        y_valid = train["claim"].iloc[valid_idx].to_numpy()
        
        # Define and train model
        model = tfdf.keras.GradientBoostedTreesModel(verbose = 0)
        model.compile(metrics=[tf.metrics.AUC()])
        model.fit(X_train, y_train, verbose = 0)
        
        # Get predictions
        valid_preds = model.predict(X_valid)[:,0]
        test_preds += model.predict(X_test)[:,0] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        end = time.time()
        print(f'Fold {fold} AUC: {round(scores[fold], 6)} in {round((end-start) / 60, 2)} minutes')
    
    print("\nAverage AUC:", round(scores.mean(), 6))
    print("Worst AUC:", round(scores.min(), 6))
    

    return scores.mean(), test_preds, oof_preds

## 2. Random Forest

In [7]:
def score_random_forest():

    # Vectors to store predictions/scores
    X_test = test[features].to_numpy()
    test_preds = np.zeros((test.shape[0],))
    oof_preds = np.zeros((train.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["claim"])):
       
        # Training and Validation Sets
        start = time.time()
        X_train = train[features].iloc[train_idx].to_numpy()
        X_valid = train[features].iloc[valid_idx].to_numpy()
        y_train = train["claim"].iloc[train_idx].to_numpy()
        y_valid = train["claim"].iloc[valid_idx].to_numpy()
        
        # Define and train model
        model = tfdf.keras.RandomForestModel(verbose = 0)
        model.compile(metrics=[tf.metrics.AUC()])
        model.fit(X_train, y_train, verbose = 0)
        
        # Get predictions
        valid_preds = model.predict(X_valid)[:,0]
        test_preds += model.predict(X_test)[:,0] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        end = time.time()
        print(f'Fold {fold} AUC: {round(scores[fold], 6)} in {round((end-start) / 60, 2)} minutes')
    
    print("\nAverage AUC:", round(scores.mean(), 6))
    print("Worst AUC:", round(scores.min(), 6))

    return scores.mean(), test_preds, oof_preds

# Training

## 1. Gradient Boosted Trees

In [8]:
gbdt_score, gbdt_preds, gbdt_oof = score_gradient_boosting()

submission['claim'] = gbdt_preds
submission.to_csv('gbtree_submission.csv', index=False)

Fold 0 AUC: 0.812995 in 20.56 minutes
Fold 1 AUC: 0.807368 in 15.15 minutes
Fold 2 AUC: 0.806764 in 17.15 minutes
Fold 3 AUC: 0.809504 in 13.19 minutes
Fold 4 AUC: 0.811193 in 18.95 minutes
Fold 5 AUC: 0.811461 in 15.18 minutes

Average AUC: 0.809881
Worst AUC: 0.806764


## 2. Random Forest

In [9]:
rf_score, rf_preds, rf_oof = score_random_forest()

submission['claim'] = rf_preds
submission.to_csv('randomforest_submission.csv', index=False)

Fold 0 AUC: 0.805504 in 10.08 minutes
Fold 1 AUC: 0.798638 in 10.25 minutes
Fold 2 AUC: 0.79957 in 10.18 minutes
Fold 3 AUC: 0.801356 in 10.3 minutes
Fold 4 AUC: 0.805246 in 10.04 minutes
Fold 5 AUC: 0.803748 in 10.14 minutes

Average AUC: 0.802344
Worst AUC: 0.798638
