In [None]:
import warnings
import time

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor

warnings.simplefilter(action='ignore', category=FutureWarning)
np.random.seed(314)
USING_GPU = True

## Introduction
We estimate that our submission would have placed near the **25th percentile** of the leaderboard. Unfortunately, a [bug](https://www.kaggle.com/c/bigquery-geotab-intersection-congestion/discussion/108640) in the test set invalidates direct comparisons to the leaderboard. As a reference point, we ran another Kaggler's kernel that had achieved an RMSE in the low 60s (25th percentile) on the original test set; on the new test set it achieved an RMSE in the low 90s. This notebook's submission also achieved an RMSE in the low 90s on the new test set, so we can be reasonably confident that our model would have placed similarly on the leaderboard.

## Read and Organize Data
We'll use the cleaned data files that we created in the *geotab-exploration* notebook.

In [None]:
submission = pd.read_csv('../input/bigquery-geotab-intersection-congestion/sample_submission.csv')
labeled = pd.read_csv('../input/geotabprocessed/train_processed.csv').set_index('RowId')
test = pd.read_csv('../input/geotabprocessed/test_processed.csv').set_index('RowId').loc[:1920334, :]

In [None]:
target_vars = ['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 
               'TotalTimeStopped_p80', 'DistanceToFirstStop_p20', 
               'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']
num_vars = ['dist_to_5pm', 'dist_to_8am', 'latitude_dist', 'longitude_dist']
bool_vars = ['Weekend', 'is_Atlanta', 'is_Boston', 'is_Chicago', 'is_Philadelphia']
cat_vars = [var for var in labeled.columns if var not in target_vars+num_vars+bool_vars]

labeled[cat_vars] = labeled[cat_vars].astype('category')
test[cat_vars] = test[cat_vars].astype('category')
cat_idxs = [i for i, var in enumerate(labeled.drop(columns=target_vars).columns) if var in cat_vars]

In [None]:
mask = np.random.rand(len(labeled)) < 0.9
train = labeled[mask]
val = labeled[~mask]

In [None]:
def get_X_y(df):
    X = df.drop(columns=target_vars)
    y = df[target_vars]
    return X, y

## Hyperparameter Tuning
The [CatBoost documentation](https://catboost.ai/docs/concepts/parameter-tuning.html#trees-number__overfitting-detection-settings) provides a guide to hyperparameter tuning. We'll use it and follow the widely accepted procedure for tuning boosted tree ensembles:
1. Start with a learning rate that is high enough to train quickly, but not so high as to significantly sacrifice performance. Usually a range of 0.05-0.3 is used here. I tried a few different ones and settled on 0.5.
2. Tune the tree-specific hyperparameters with early stopping via the overfitting detector.
3. Lower the learning rate (which will yield more iterations via the overfitting detector) until performance no longer improves or the increases in accuracy are too small to justify the increased training time. 

This is the common approach to gradient boosting tuning, except we do not need to tune the number of trees directly. CatBoost's overfitting detector is the boosting equivalent of early stopping in deep learning. It stops adding trees when the validation metric (here RMSE) has not improved for *early_stopping_rounds* iterations. This is helpful in hyperparameter tuning because the optimal number of iterations for a high variance hyperparameter value will be lower than for a low variance value. Early stopping therefore keeps the comparison valid: each hyperparameter value is scored at its own optimal number of iterations.
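
The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of patience-based early stopping, not CatBoost's actual implementation; the function name and the validation errors are made up for the example.

```python
def best_iteration_with_patience(val_errors, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation errors.

    Training stops once the error has failed to improve for `patience`
    consecutive iterations. Lower error is better.
    """
    best_err = float("inf")
    best_iter = -1
    for i, err in enumerate(val_errors):
        if err < best_err:
            # New best score: remember it and reset the patience window.
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here.
            return best_iter, i
    return best_iter, len(val_errors) - 1


# Hypothetical validation RMSE per iteration (made-up numbers).
errors = [70.0, 65.0, 63.0, 62.5, 62.7, 62.6, 62.9, 63.1]
best, stopped = best_iteration_with_patience(errors, patience=3)
print(best, stopped)
```

The model kept after stopping is the one from the best iteration, which is why configurations that overfit quickly are still scored at their best point.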

Because our dataset is large, I expect a high variance model to be optimal. If this is correct, then the optimal number of iterations and tree depth will be high and the L2 regularization will be low. For some hyperparameters, non-default values are only recommended when one needs to decrease the variance of the model. I will leave these hyperparameters at their default settings.
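
Step 3 of the tuning procedure (lowering the learning rate until the gain no longer justifies the training time) could be automated along the following lines. This is only a sketch: `fit_and_score` is a hypothetical callable that would train with early stopping at a given learning rate and return the validation RMSE, and the halving factor, thresholds, and scores below are illustrative assumptions, not values used in this notebook.

```python
def lower_learning_rate(fit_and_score, start_lr=0.1, factor=0.5,
                        min_gain=0.01, min_lr=1e-3):
    """Shrink the learning rate until validation RMSE stops improving enough.

    fit_and_score(lr) is a hypothetical callable that trains a model with
    early stopping at learning rate `lr` and returns the validation RMSE.
    """
    best_lr, best_rmse = start_lr, fit_and_score(start_lr)
    lr = start_lr * factor
    while lr >= min_lr:
        rmse = fit_and_score(lr)
        if best_rmse - rmse < min_gain:
            break  # gain too small to justify the longer training time
        best_lr, best_rmse = lr, rmse
        lr *= factor
    return best_lr, best_rmse


# Stand-in scorer with made-up numbers so the sketch runs without a model.
fake_scores = {0.1: 62.0, 0.05: 61.5, 0.025: 61.495, 0.0125: 61.49}
chosen_lr, chosen_rmse = lower_learning_rate(
    lambda lr: fake_scores[round(lr, 4)])
print(chosen_lr, chosen_rmse)
```

In practice each call to `fit_and_score` is a full training run, so a coarse factor such as halving keeps the number of runs small.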

In [None]:
X_train, y_train = get_X_y(train)
X_val, y_val = get_X_y(val)

In [None]:
params = {'cat_features': cat_idxs, 
          'eval_metric': 'RMSE',
          'random_seed':314, 
          'one_hot_max_size':24, 
          'boosting_type':'Plain', 
          'bootstrap_type':'Bayesian',
          'max_ctr_complexity':2, 
          'iterations':10**5, 
          'learning_rate': 0.1,
         }
if USING_GPU:
    params['task_type'] = 'GPU'
    params['border_count'] = 254

In [None]:
def search(params, param_name, param_list, tune_var):
    '''
    Returns a dictionary of tested hyperparameter values and their corresponding scores.
    '''
    scores = {}
    for value in param_list:
        # Work on a copy so the shared params dict is never modified
        trial_params = {**params, param_name: value}
        model = CatBoostRegressor(**trial_params).fit(
            X_train, y_train[tune_var], early_stopping_rounds=20,
            eval_set=(X_val, y_val[tune_var]), plot=True)
        pred = model.predict(X_val)
        pred[pred < 0] = 0  # the targets are non-negative
        # Validation RMSE; arguments are (y_true, y_pred)
        scores[value] = mean_squared_error(y_val[tune_var], pred)**0.5
    sns.lineplot(x=list(scores.keys()), y=list(scores.values()), marker='o').set(
        xlabel='Hyperparameter Value', ylabel='RMSE')
    return scores

The above function was used to fine-tune depth and l2_leaf_reg, with TotalTimeStopped_p80 as the target variable. For brevity and clarity, we do not include the fine-tuning runs here. As we expected, the optimal depth of 10 is relatively high: CatBoost's [documentation](https://catboost.ai/docs/concepts/parameter-tuning.html) recommends a setting between 6 and 10. The optimal l2_leaf_reg is in the lower-mid range; practitioners generally search values between 1 and 40.

## Generating Predictions

In [None]:
params['depth'] = 10
params['l2_leaf_reg'] = 18

In [None]:
def get_preds(var, params):
    catboost = CatBoostRegressor(**params).fit(
        X_train, y_train[var], eval_set=(X_val, y_val[var]), 
        early_stopping_rounds=50, plot=True, verbose=False)
    return catboost.predict(X_test)

In [None]:
X_labeled, y_labeled = get_X_y(labeled)
X_test, y_test = get_X_y(test)
preds = {}
for idx, var in enumerate(tqdm(target_vars)):
    preds[idx] = get_preds(var, params)

In [None]:
submission['Target'] = pd.DataFrame(preds).stack().to_frame().iloc[:,0].values
submission.loc[submission['Target'] < 0, 'Target'] = 0
submission.to_csv('my_submission.csv', index=False)

This submission would have been around the 25th percentile of the leaderboard. See the introduction for details.

## Visualizing Predictions

In [None]:
target_vars = ['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 
               'TotalTimeStopped_p80', 'DistanceToFirstStop_p20', 
               'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']
fig, ax = plt.subplots(nrows=6, ncols=2, figsize=(20,40))
bins = list(range(0, 200, 10))
for i, var in enumerate(target_vars):
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(y_labeled[var], bins=bins, ax=ax[i, 0]).set_title(var + ' Training Set Labels')
    sns.histplot(preds[i], bins=bins, ax=ax[i, 1]).set_title(var + ' Test Set Predictions')

Let's assume that the true labels of the test set and the labels of the training set share a somewhat similar distribution. We can see that our model predicts a reasonable distribution for all three percentiles of TotalTimeStopped. We cannot say the same for the DistanceToFirstStop percentiles. Our model fails to capture the fact that the distance is [Tweedie](https://en.wikipedia.org/wiki/Tweedie_distribution)-like, that is, bimodal with a spike at 0 and a right-skewed mode centered around 50-60. Because decision trees are nonlinear, a properly tuned decision tree ensemble such as CatBoost should be able to model this distribution effectively. Finding the source of this issue would be the first place to start in future work.

In future work we should try either of the following two approaches:
1. Fine-tuning one set of parameters for TotalTimeStopped and another for DistanceToFirstStop. The failure to predict a bimodal distribution for DistanceToFirstStop could be because we tuned the hyperparameters only against a TotalTimeStopped target.
2. A two-step approach where the first model is a zero vs. nonzero classifier and the second predicts the value for the examples classified as nonzero.
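
As a rough illustration of the second approach, the sketch below shows only the final combination step, assuming we already had two fitted models: a classifier that outputs the probability that a target is nonzero, and a regressor trained on the nonzero examples. The function name, the 0.5 threshold, and the sample numbers are hypothetical and were not part of this notebook's experiments.

```python
import numpy as np


def combine_two_stage(p_nonzero, magnitude, threshold=0.5):
    """Combine classifier and regressor outputs into one prediction.

    p_nonzero: predicted probability that the target is nonzero.
    magnitude: regressor prediction, used only where the classifier
               says the target is nonzero.
    """
    p_nonzero = np.asarray(p_nonzero, dtype=float)
    # Targets are non-negative, so clip any negative regressor output.
    magnitude = np.clip(np.asarray(magnitude, dtype=float), 0, None)
    return np.where(p_nonzero >= threshold, magnitude, 0.0)


# Made-up outputs from the two hypothetical models.
p = [0.1, 0.8, 0.5, 0.3]
m = [40.0, 55.0, -2.0, 60.0]
combined = combine_two_stage(p, m)
print(combined)
```

A hard threshold lets the combined model emit exact zeros, which is what would reproduce the spike at 0; multiplying the probability by the magnitude instead would give the expected value but smooth that spike away.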