Tabular Playground Series Jan2021

XGBoost VS XGBoost + Hyperopt

The January 2021 Tabular Playground Series competition is almost over. Thanks to everyone who has posted great notebooks full of good ideas. One question I often think about is, "How good is best?" It is said that time is money, so how much time (or money) should a person spend to make a result as good as reasonably possible? This is a difficult question to answer, but it should probably always be asked.

With that in mind, suppose your boss asks you to predict thousands of outcomes from some data circulating around the office (think of the data in the Tabular Playground Series competition). What do you do? Fortunately, your co-workers have been busy with the problem, and they tell you that XGBoost on the raw data should work fine. XGBoost sounds good to you, but how much extra work would it take to use Hyperopt to potentially improve the results? Is the extra effort worth it?

This notebook does not give definitive answers to the above questions. The idea is: why not give it a go and see for yourself? Here is what I tried, starting with notebook code generously posted at:
https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost
and
https://www.kaggle.com/marionhesse/hyperopt-xgboost-parameter-tuning
Thanks for the great posts!
As with the above notebooks, this notebook is released under the Apache 2.0 open source license. http://www.apache.org/licenses/LICENSE-2.0
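
For readers new to Hyperopt, the core idea of tuning can be sketched without any libraries: write a function that maps a set of parameters to a score, try many parameter sets, and keep the best one. The sketch below uses plain random search on a made-up objective. Hyperopt's `fmin` with `tpe.suggest` fills the same role but picks its candidates more cleverly. The names `toy_score` and `random_search`, and the toy objective itself, are illustrative assumptions and do not come from the notebooks linked above.

```python
import random

def toy_score(params):
    # Hypothetical stand-in for a validation RMSE. The minimum is placed
    # at learning_rate = 0.1 and max_depth = 8 purely for illustration.
    return (params['learning_rate'] - 0.1) ** 2 + 0.01 * (params['max_depth'] - 8) ** 2

def random_search(score_fn, n_evals, seed=0):
    # Plain random search (not TPE): sample, score, keep the best so far.
    rng = random.Random(seed)
    best_params, best_score = None, float('inf')
    for _ in range(n_evals):
        params = {'learning_rate': rng.uniform(0.01, 0.3),
                  'max_depth': rng.randint(5, 16)}
        current = score_fn(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

best_params, best_score = random_search(toy_score, n_evals=200)
print(best_params, best_score)
```

More evaluations can only keep or lower the best score found, which is why the number of evaluations is the main knob that trades run time for quality.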

Below is the run: a baseline XGBoost model first, then XGBoost tuned with Hyperopt.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


import matplotlib.pyplot as plt
import gc #garbage collection
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope
from tqdm import tqdm

path = '../input/tabular-playground-series-jan-2021/'
train = pd.read_csv(path+'train.csv')

features = [col for col in train.columns if 'cont' in col]
label = 'target'


# train/eval set split
X_train,X_valid, y_train,y_valid = train_test_split(train[features],train[label],test_size=0.2)

d_tr = xgb.DMatrix(X_train, y_train)
d_val = xgb.DMatrix(X_valid,y_valid)


params_base = {'objective': 'reg:squarederror',
               'tree_method': 'gpu_hist',
               'random_state': 0}
base_model = xgb.train(params = params_base,
                       dtrain = d_tr,
                       num_boost_round = 1500,
                       evals = [(d_val,'eval')],
                       early_stopping_rounds=30,
                       verbose_eval = 20)
y_pred_base = base_model.predict(d_val)
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn releases
base_score = np.sqrt(mean_squared_error(y_valid, y_pred_base))
print(base_score)


# Hold-out validation RMSE as the function to be minimized
# (a single train/validation split, not cross-validation)

def score(params):
    
    ps = {'learning_rate': params['learning_rate'],
         'max_depth': params['max_depth'], 
         'gamma': params['gamma'], 
         'min_child_weight': params['min_child_weight'], 
         'subsample': params['subsample'], 
         'colsample_bytree': params['colsample_bytree'], 
         'verbosity': 1, 
         'objective': 'reg:squarederror',
         'eval_metric': 'rmse', 
         'tree_method': 'gpu_hist', 
         'random_state': 27,
        }
    model = xgb.train(ps,d_tr, params['n_round'], [(d_val, 'eval')], early_stopping_rounds=10, verbose_eval = False)
    y_pred = model.predict(d_val)
    # RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))

    return rmse


# Define parameter space
param_space = {'learning_rate': hp.uniform('learning_rate', 0.01, 0.3), 
               'n_round': scope.int(hp.quniform('n_round', 200, 3000, 100)),
               'max_depth': scope.int(hp.quniform('max_depth', 5, 16, 1)), 
               'gamma': hp.uniform('gamma', 0, 10), 
               'min_child_weight': hp.uniform('min_child_weight', 0, 10),
               'subsample': hp.uniform('subsample', 0.1, 1), 
               'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1)
              }


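# Run optimiser with tpe, timing the whole search
# (a bare `%time` line times nothing, so the clock is read explicitly)
import time
search_start = time.perf_counter()
trials = Trials()

hopt = fmin(fn = score,
            space = param_space, 
            algo = tpe.suggest, 
            max_evals = 1200, 
            trials = trials, 
           )
print(f'Hyperopt search took {time.perf_counter() - search_start:.1f} s')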


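params_best = dict(hopt)  # copy, so hopt itself is left untouched
params_best['max_depth'] = int(hopt['max_depth'])  # fmin returns raw floats
n_rounds_best = int(hopt['n_round'])
del params_best['n_round']
print(params_best)
print(n_rounds_best)
# Best validation RMSE found by the search, for comparison with base_score
print(trials.best_trial['result']['loss'])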


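# Train with full dataset and best params, timing the fit explicitly
import time
fit_start = time.perf_counter()
params_best['tree_method'] = 'gpu_hist'
d = xgb.DMatrix(train[features], train[label])
xgb_final = xgb.train(params_best,d,n_rounds_best)
print(f'Final fit took {time.perf_counter() - fit_start:.1f} s')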


y_pred_final = xgb_final.predict(d)
score_final = np.sqrt(mean_squared_error(train[label], y_pred_final))
print(score_final) # sanity check only: RMSE on the training data itself, so it is optimistic


# Load test data
test = pd.read_csv(path + 'test.csv')
test.set_index('id',drop=True,inplace=True)
d_tst = xgb.DMatrix(test[features])


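# Predictions for test data: train one model per seed, then average them
models = []

for seed in range(0,10):
    params_best['seed'] = seed
    xgb_final = xgb.train(params_best,d,num_boost_round = n_rounds_best)
    models.append(xgb_final)

# Average over all ten models; predicting with xgb_final alone would use only the last seed
xgb_pred = np.mean([model.predict(d_tst) for model in models], axis=0)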


# Save test predictions to file
ids = test.index
output = pd.DataFrame({'id': ids,
                       'target': xgb_pred})
output.to_csv('submission.csv', index=False)
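
To put a number on the question of whether tuning is worth it, one option is to set the RMSE improvement against the extra run time. Here is a minimal sketch, assuming you have noted a baseline validation RMSE, a tuned validation RMSE, and the search time yourself. The function name `tuning_payoff` and the example numbers are hypothetical placeholders, not results from this notebook.

```python
def tuning_payoff(base_rmse, tuned_rmse, extra_seconds):
    # Absolute RMSE gain from tuning (positive means tuning helped)
    gain = base_rmse - tuned_rmse
    # The same gain as a percentage of the baseline RMSE
    rel_gain_pct = 100.0 * gain / base_rmse
    # RMSE gain per hour of extra compute spent on the search
    gain_per_hour = gain / (extra_seconds / 3600.0)
    return gain, rel_gain_pct, gain_per_hour

# Placeholder numbers for illustration only
gain, rel_gain_pct, gain_per_hour = tuning_payoff(
    base_rmse=0.70, tuned_rmse=0.69, extra_seconds=7200)
print(f'gain={gain:.4f}, relative={rel_gain_pct:.2f}%, per hour={gain_per_hour:.4f}')
```

Whether a given gain per hour is worth it depends on the setting: on a leaderboard a tiny gain can move many places, while in the office scenario above it may not change any decision.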



