# Finding Optimum XGBoost Parameters for Tabular For Homesite Competition
> Building on the work of the previous notebook, which referenced [Zach's notebook](https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0/blob/master/Tabular%20Notebooks/02_Ensembling.ipynb) to apply his techniques for permutation importance and ensemble learning to the Homesite Competition problem set, we will deep dive into optimizing the XGBoost parameters to see if an improved model can be generated, and what effect this has for our resulting ensemble model's predictions

- toc: true 
- badges: true
- comments: true
- categories: [kaggle, fastai]
- author: Nissan Dookeran
- image: images/chart-preview.png

## Introduction
Using the [code here] we will look at the optimal number of estimators, depth, and then combination of the two when using XGBoost to generate a model

Notes:
- Changed the categorize functions from [last notebook](https://redditech.github.io/team-fast-tabulous/kaggle/fastai/2021/06/27/Improving-Fastai-split-choices.html) to exclude any columns in y_names from being evaluated since these shouldn't be part of the model training as a parameter
- Added code to save all models to reuse in another notebook that submits to Kaggle for test evaluation

## Setup

>> Adding based on [tutorial](https://github.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb) for notebooks

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [27]:
!pip install -Uqq fastai

In [2]:
!pip install kaggle



In [3]:
from fastai.tabular.all import *

In [4]:
global gdrive #colab only code block
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
if not gdrive.exists(): drive.mount(str(gdrive.parent))
!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Mounted at /content/gdrive


In [5]:
from kaggle import api

In [6]:
path = Path.cwd()
path.ls()

(#3) [Path('/content/.config'),Path('/content/gdrive'),Path('/content/sample_data')]

> Only run the next three lines the first time if in a local repository. This will prevent large training data files and model files being checked into Github

In [None]:
!touch .gitignore

In [None]:
!echo "_data" > .gitignore

In [None]:
!mkdir _data

In [None]:
os.chdir('_data')
Path.cwd()

Path('/mnt/d/Code/GitHub/team-fast-tabulous/_notebooks/_data')

> Back to it

In [7]:
os.chdir(path/"gdrive/MyDrive/Kaggle/") # colab only code
Path.cwd()

Path('/content/gdrive/MyDrive/Kaggle')

In [8]:
path = Path.cwd()/"homesite_competition_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('homesite-quote-conversion', path=path)
file_extract(path/"homesite-quote-conversion.zip")
file_extract(path/"train.csv.zip")
file_extract(path/"test.csv.zip")
path.ls()

homesite-quote-conversion.zip: Skipping, found more recently modified local copy (use --force to force download)


(#6) [Path('homesite-quote-conversion.zip'),Path('sample_submission.csv.zip'),Path('test.csv.zip'),Path('train.csv.zip'),Path('train.csv'),Path('test.csv')]

Settings

In [9]:
test_size = 0.3
y_block=CategoryBlock()
# n_estimators = [50, 100, 150, 200]
# max_depth = [2, 4, 6, 8]
n_estimators_range = range(50,400,50)
max_depth_range = range(1, 11, 2)
random_seed =42
n_splits = 10 # 10 folds
scoring = "roc_auc"

In [10]:
from sklearn.metrics import roc_auc_score
# valid_score = roc_auc_score(to_np(targs), to_np(preds[:,1]))
# valid_score

## The [GridsearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) functions

Customising functions from [this page](https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/) to work with our dataset

In [11]:
import xgboost as xgb
import matplotlib
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot

This function will tune the n_estimators only

**Notes**
- Used https://scikit-learn.org/stable/modules/model_evaluation.html#scoring to find correct `scoring` string
- Reading more on [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to understand if I used the `n_splits` correctly, I chose 10 based on [this article](https://machinelearningmastery.com/k-fold-cross-validation/)

In [12]:
# XGBoost, Tune n_estimators

def xgboost_tune_estimators(X, y, n_estimators_range, n_splits, random_seed, scoring):
    matplotlib.use('Agg')
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier(tree_method='gpu_hist', gpu_id=0, verbosity=2)
    param_grid = dict(n_estimators=n_estimators_range)
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
    grid_search = GridSearchCV(model, param_grid, scoring=scoring, n_jobs=-1, cv=kfold, verbose=4)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(n_estimators_range, means, yerr=stds)
    pyplot.title("XGBoost n_estimators vs Log Loss")
    pyplot.xlabel('n_estimators')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('n_estimators.png')

This function will tune the max_depth value only

In [13]:
# XGBoost, Tune max_depth
def xgboost_tune_max_depth(X,y, max_depth_range, n_splits, random_seed, scoring):
    matplotlib.use('Agg')
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier(tree_method='gpu_hist', gpu_id=0, verbosity=2)
    print(max_depth_range)
    param_grid = dict(max_depth=max_depth_range)
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
    grid_search = GridSearchCV(model, param_grid, scoring=scoring, n_jobs=-1, cv=kfold, verbose=4)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(max_depth_range, means, yerr=stds)
    pyplot.title("XGBoost max_depth vs Log Loss")
    pyplot.xlabel('max_depth')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('max_depth.png')

This function will tune for both n_estimators and max_depth in combination (takes really long to run)

In [14]:
# XGBoost, Tune n_estimators and max_depth
def xgboost_tune_n_estimators_and_max_depth(X, y, n_estimators_range, max_depth_range, n_splits, random_seed, scoring):
    matplotlib.use('Agg')
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier(tree_method='gpu_hist', gpu_id=0, verbosity=2)
    print(max_depth_range)
    param_grid = dict(max_depth=max_depth_range, n_estimators=n_estimators_range)
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
    grid_search = GridSearchCV(model, param_grid, scoring=scoring, n_jobs=-1, cv=kfold, verbose=4)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot results
    scores = np.array(means).reshape(len(max_depth_range), len(n_estimators_range))
    for i, value in enumerate(max_depth_range):
      pyplot.plot(n_estimators_range, scores[i], label='depth: ' + str(value))
    pyplot.legend()
    pyplot.xlabel('n_estimators')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('n_estimators_vs_max_depth.png')

### My useful functions

In [15]:
def reassign_to_categorical(field, df, y_names, continuous, categorical, triage):
  if ((df[field].isna().sum()==0) and (field not in y_names)):
    field_categories = df[field].unique()
    df[field] = df[field].astype('category')
    df[field].cat.set_categories(field_categories, inplace=True)
    if field in continuous: continuous.remove(field)
    if field not in categorical: categorical.append(field)
  else:
    if field in continuous: continuous.remove(field)
    if field in categorical: categorical.remove(field)
    triage.append(field)

  return df, continuous, categorical, triage

In [16]:
def categorize( df, y_names, cont_names, cat_names, triage, category_threshold):
  for field in df.columns:
    if ((len(df[field].unique()) <= category_threshold) and (type(df[field].dtype) != pd.core.dtypes.dtypes.CategoricalDtype)):
      reassign_to_categorical(field, df, y_names, cont_names, cat_names, triage)
  return df, cont_names, cat_names, triage

In [17]:
def homesite_prep(df_train, df_test, y_names):
    df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='boolean')
    df_train = df_train.set_index('QuoteNumber')
    df_test = df_test.set_index('QuoteNumber')
    df_train['Original_Quote_Date'] = pd.to_datetime(df_train['Original_Quote_Date'])
    df_test['Original_Quote_Date'] = pd.to_datetime(df_test['Original_Quote_Date'])
    df_train = add_datepart(df_train, 'Original_Quote_Date')
    df_test = add_datepart(df_test, 'Original_Quote_Date')
    cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
    triage = L()
    df_train, cont_names, cat_names, triage = categorize(df_train, y_names, cont_names, cat_names, triage, 100)
    return df_train, df_test, cont_names, cat_names, triage

In [18]:
def find_y_columns(df_train, df_test):
    y_columns = df_train.columns.difference(df_test.columns)
    return y_columns

### Load the data

In [19]:
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head(2)

Unnamed: 0,QuoteNumber,Original_Quote_Date,QuoteConversion_Flag,Field6,Field7,Field8,Field9,Field10,Field11,Field12,CoverageField1A,CoverageField1B,CoverageField2A,CoverageField2B,CoverageField3A,CoverageField3B,CoverageField4A,CoverageField4B,CoverageField5A,CoverageField5B,CoverageField6A,CoverageField6B,CoverageField8,CoverageField9,CoverageField11A,CoverageField11B,SalesField1A,SalesField1B,SalesField2A,SalesField2B,SalesField3,SalesField4,SalesField5,SalesField6,SalesField7,SalesField8,SalesField9,SalesField10,SalesField11,SalesField12,...,GeographicField44A,GeographicField44B,GeographicField45A,GeographicField45B,GeographicField46A,GeographicField46B,GeographicField47A,GeographicField47B,GeographicField48A,GeographicField48B,GeographicField49A,GeographicField49B,GeographicField50A,GeographicField50B,GeographicField51A,GeographicField51B,GeographicField52A,GeographicField52B,GeographicField53A,GeographicField53B,GeographicField54A,GeographicField54B,GeographicField55A,GeographicField55B,GeographicField56A,GeographicField56B,GeographicField57A,GeographicField57B,GeographicField58A,GeographicField58B,GeographicField59A,GeographicField59B,GeographicField60A,GeographicField60B,GeographicField61A,GeographicField61B,GeographicField62A,GeographicField62B,GeographicField63,GeographicField64
0,1,2013-08-16,0,B,23,0.9403,0.0006,965,1.02,N,17,23,17,23,15,22,16,22,13,22,13,23,T,D,2,1,7,18,3,8,0,5,5,24,V,48649,0,0,0,0,...,8,4,20,22,10,8,6,5,15,13,19,18,16,14,21,23,21,23,16,11,22,24,7,14,-1,17,15,17,14,18,9,9,-1,8,-1,18,-1,10,N,CA
1,2,2014-04-22,0,F,7,1.0006,0.004,548,1.2433,N,6,8,6,8,5,7,5,8,13,22,13,23,T,E,5,9,5,14,6,18,1,5,5,11,P,26778,0,0,1,1,...,23,24,11,15,21,24,6,11,21,21,18,15,20,20,13,12,12,12,15,9,13,11,11,20,-1,9,18,21,8,7,10,10,-1,11,-1,17,-1,20,N,NJ


In [20]:
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head(2)

Unnamed: 0,QuoteNumber,Original_Quote_Date,Field6,Field7,Field8,Field9,Field10,Field11,Field12,CoverageField1A,CoverageField1B,CoverageField2A,CoverageField2B,CoverageField3A,CoverageField3B,CoverageField4A,CoverageField4B,CoverageField5A,CoverageField5B,CoverageField6A,CoverageField6B,CoverageField8,CoverageField9,CoverageField11A,CoverageField11B,SalesField1A,SalesField1B,SalesField2A,SalesField2B,SalesField3,SalesField4,SalesField5,SalesField6,SalesField7,SalesField8,SalesField9,SalesField10,SalesField11,SalesField12,SalesField13,...,GeographicField44A,GeographicField44B,GeographicField45A,GeographicField45B,GeographicField46A,GeographicField46B,GeographicField47A,GeographicField47B,GeographicField48A,GeographicField48B,GeographicField49A,GeographicField49B,GeographicField50A,GeographicField50B,GeographicField51A,GeographicField51B,GeographicField52A,GeographicField52B,GeographicField53A,GeographicField53B,GeographicField54A,GeographicField54B,GeographicField55A,GeographicField55B,GeographicField56A,GeographicField56B,GeographicField57A,GeographicField57B,GeographicField58A,GeographicField58B,GeographicField59A,GeographicField59B,GeographicField60A,GeographicField60B,GeographicField61A,GeographicField61B,GeographicField62A,GeographicField62B,GeographicField63,GeographicField64
0,3,2014-08-12,E,16,0.9364,0.0006,1487,1.3045,N,4,4,4,4,3,3,3,4,13,22,13,23,Y,K,13,22,6,16,9,21,0,5,5,11,P,67052,0,0,0,0,0,...,22,23,9,12,25,25,6,9,4,2,16,12,20,20,2,2,2,1,1,1,10,7,25,25,-1,19,19,22,12,15,1,1,-1,1,-1,20,-1,25,Y,IL
1,5,2013-09-07,F,11,0.9919,0.0038,564,1.1886,N,8,14,8,14,7,12,8,13,13,22,13,23,T,E,4,5,3,6,3,6,1,5,5,4,R,27288,1,0,0,0,0,...,23,24,12,21,23,25,7,11,16,14,13,6,17,15,7,5,7,5,13,7,14,14,7,14,-1,4,1,1,5,3,10,10,-1,5,-1,5,-1,21,N,NJ


In [21]:
y_names = find_y_columns(df_train, df_test)[0]
df_train, df_test, cont_names, cat_names, triage = homesite_prep(df_train, df_test, y_names)

In [22]:
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)

In [23]:
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names, 
                   cont_names=cont_names, y_names=y_names,splits=splits,
                  y_block=y_block)

In [24]:
%%timeit
xgboost_tune_estimators(to.xs, to.ys.values.ravel(), n_estimators_range, n_splits, random_seed, scoring)

Fitting 10 folds for each of 7 candidates, totalling 70 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  70 out of  70 | elapsed:  5.2min finished


Best: 0.962075 using {'n_estimators': 350}
0.949272 (0.000827) with: {'n_estimators': 50}
0.956429 (0.000913) with: {'n_estimators': 100}
0.958695 (0.000955) with: {'n_estimators': 150}
0.960027 (0.000919) with: {'n_estimators': 200}
0.960983 (0.000810) with: {'n_estimators': 250}
0.961640 (0.000799) with: {'n_estimators': 300}
0.962075 (0.000810) with: {'n_estimators': 350}


In [25]:
%%timeit
xgboost_tune_max_depth(to.xs,to.ys.values.ravel(), max_depth_range, n_splits, random_seed, scoring)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.68 µs
range(1, 11, 2)
Fitting 10 folds for each of 5 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   57.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.1min finished


Best: 0.962843 using {'max_depth': 9}
0.921006 (0.001579) with: {'max_depth': 1}
0.956429 (0.000913) with: {'max_depth': 3}
0.960726 (0.000844) with: {'max_depth': 5}
0.962370 (0.000775) with: {'max_depth': 7}
0.962843 (0.000782) with: {'max_depth': 9}


In [26]:
%%timeit
xgboost_tune_n_estimators_and_max_depth(to.xs, to.ys.values.ravel(), n_estimators_range, max_depth_range, n_splits, random_seed, scoring)

range(1, 11, 2)
Fitting 10 folds for each of 35 candidates, totalling 350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   51.2s
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 217 tasks      | elapsed: 15.2min
[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed: 40.4min finished


Best: 0.963909 using {'max_depth': 5, 'n_estimators': 350}
0.891124 (0.002553) with: {'max_depth': 1, 'n_estimators': 50}
0.921006 (0.001579) with: {'max_depth': 1, 'n_estimators': 100}
0.932049 (0.000957) with: {'max_depth': 1, 'n_estimators': 150}
0.937615 (0.000854) with: {'max_depth': 1, 'n_estimators': 200}
0.940647 (0.000889) with: {'max_depth': 1, 'n_estimators': 250}
0.943563 (0.000848) with: {'max_depth': 1, 'n_estimators': 300}
0.945651 (0.000866) with: {'max_depth': 1, 'n_estimators': 350}
0.949272 (0.000827) with: {'max_depth': 3, 'n_estimators': 50}
0.956429 (0.000913) with: {'max_depth': 3, 'n_estimators': 100}
0.958695 (0.000955) with: {'max_depth': 3, 'n_estimators': 150}
0.960027 (0.000919) with: {'max_depth': 3, 'n_estimators': 200}
0.960983 (0.000810) with: {'max_depth': 3, 'n_estimators': 250}
0.961640 (0.000799) with: {'max_depth': 3, 'n_estimators': 300}
0.962075 (0.000810) with: {'max_depth': 3, 'n_estimators': 350}
0.956793 (0.000763) with: {'max_depth': 5, 'n_e

As we can see the most optimal parameters using XGBoost would be to do `n_estimators=350` with `max_depth=5`. Let's run both and compare results side by side

In [28]:
n_estimators_original = 100
max_depth_original = 8
n_estimators_recommended = 350
max_depth_recommended = 5
learning_rate = 0.1
subsample = 0.5

In [29]:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_valid, y_valid = to.valid.xs, to.valid.ys.values.ravel()

In [40]:
model_original = xgb.XGBClassifier(n_estimators = n_estimators_original, max_depth=max_depth_original, learning_rate=learning_rate, subsample=subsample, 
                                   tree_method='gpu_hist', gpu_id=0, verbosity=2)
xgb_model_original = model_original.fit(X_train, y_train)
xgb_preds_original = xgb_model_original.predict_proba(X_valid)



In [43]:
model_recommended = xgb.XGBClassifier(n_estimators = n_estimators_recommended, max_depth=max_depth_recommended, learning_rate=learning_rate, subsample=subsample,
                                      tree_method='gpu_hist', gpu_id=0, verbosity=2)
xgb_model_recommended = model_recommended.fit(X_train, y_train)
xgb_preds_recommended = xgb_model_recommended.predict_proba(X_valid)

In [45]:
accuracy(tensor(xgb_preds_original), tensor(y_valid)), accuracy(tensor(xgb_preds_recommended), tensor(y_valid))

(TensorBase(0.9243), TensorBase(0.9244))

So as we can see, we get a slightly better performance when we tune the `n_estimators` and `max_depth` values based on the recommendation