In [48]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit

from xgboost import XGBRegressor

# A first XGBoost model

Here I train and tune an XGBoost model on the first set of features I've extracted. These features are very basic and don't involve any great insight, so I don't expect the results to be competitive. This notebook mainly serves as a baseline.

## 1. Load data

Loading the training and test data:


In [10]:
X = np.load("../data/first_features/X.npy")
y = np.load("../data/first_features/y.npy")
X_test =  np.load("../data/first_features/X_test.npy")

In [3]:
# sanity check - do the shapes line up? 
print(X.shape)
print(y.shape)
print(X_test.shape)

(1503424, 107)
(1503424,)
(508438, 107)


Now we can cast the feature matrices from `float64` to `float32` to save some memory.

In [59]:
X.nbytes  # ~1.2 GiB as float64

1286930944

In [61]:
X.astype(np.float32).nbytes  # ~0.6 GiB, half the float64 size

643465472

In [62]:
X = X.astype(np.float32)
X_test = X_test.astype(np.float32)
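As a quick check of the numbers above, the byte counts follow directly from the shape of `X` and the item size of each dtype. The sketch below reproduces them arithmetically, without allocating the full matrix, and shows the same halving on a small stand-in array (`small` is a hypothetical placeholder, not part of the original data).

```python
import numpy as np

n_rows, n_cols = 1503424, 107  # shape of X printed earlier

# nbytes is the element count times the bytes per element
bytes_f64 = n_rows * n_cols * np.dtype(np.float64).itemsize
bytes_f32 = n_rows * n_cols * np.dtype(np.float32).itemsize
print(bytes_f64, bytes_f32)

# The same relationship on a tiny stand-in array
small = np.zeros((4, 107))          # float64 by default
small32 = small.astype(np.float32)  # astype returns a converted copy
print(small.nbytes, small32.nbytes)
```

Note that `float32` keeps roughly 7 significant decimal digits, which is usually enough for tree-based models but is worth keeping in mind for features with a very wide range.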

## 2. Set up train/validation splits

These splits are used for hyperparameter tuning.

Instead of actually splitting the data `X` into separate training and validation sets, I'll just save the indices of the training and validation rows.

In [27]:
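np.random.seed(109)
validation_idx = np.full(y.shape[0], -1)  # -1 marks training rows
n_validation = int(round(.15 * validation_idx.shape[0]))
# pick 15% of the rows at random (without replacement) and mark them 0
validation_idx[np.random.choice(validation_idx.shape[0],
                                n_validation, replace=False)] = 0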

In [33]:
# now, we have an integer vector, where `-1` is training, and `0` is validation
print(np.sum(validation_idx == 0))
print(np.mean(validation_idx == 0))

225514
0.15000026605934189
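To make the index-vector idea concrete, here is a minimal sketch on a toy vector of 20 rows (the size `n` and the names `fold`, `train_rows`, and `val_rows` are hypothetical, chosen only for illustration). It shows how the `-1`/`0` labels can be turned back into row indices when a slice of the data is needed.

```python
import numpy as np

n = 20                            # hypothetical toy size, standing in for y.shape[0]
rng = np.random.RandomState(109)  # local RNG so the global seed is untouched

fold = np.full(n, -1)             # -1 marks training rows
n_val = int(round(0.15 * n))      # 15% of the rows go to validation
fold[rng.choice(n, n_val, replace=False)] = 0  # 0 marks validation rows

# Recover the row indices of each part from the label vector
train_rows = np.flatnonzero(fold == -1)
val_rows = np.flatnonzero(fold == 0)
print(len(train_rows), len(val_rows))
```

Keeping a single label vector avoids holding two extra copies of a large matrix in memory; the rows can be selected with `X[train_rows]` or `X[val_rows]` only when needed.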


## 3. Approximate model complexity

The first thing I'll do is get a feel for how complex the model should be. To do this, I'll define a grid over the parameters `n_estimators` and `max_depth` with very widely spaced values, to find the right order of magnitude.

In [79]:
grid1 = {"n_estimators" : [100, 500, 1500], 
        "max_depth" : [6, 18, 36]}
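A grid like this expands to the Cartesian product of its value lists. As a small sketch, scikit-learn's `ParameterGrid` can enumerate the combinations before any model is fitted, which is a cheap way to see how many fits a search will cost (the name `candidates` is just for illustration).

```python
from sklearn.model_selection import ParameterGrid

grid1 = {"n_estimators": [100, 500, 1500],
         "max_depth": [6, 18, 36]}

# Every combination of one value per parameter
candidates = list(ParameterGrid(grid1))
print(len(candidates))
for params in candidates:
    print(params)
```

The total number of fits is the number of candidates times the number of cross-validation folds.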

Now, create a predefined validation split to be passed to a `GridSearchCV` object.

In [80]:
validation_split = list(PredefinedSplit(validation_idx).split())
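For reference, here is a minimal sketch of what `PredefinedSplit` produces for a label vector that contains only `-1` and `0`. The six-element `test_fold` array is a hypothetical toy input; rows labelled `-1` are never placed in a test set, so the result is a single train/validation pair.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy label vector: -1 = always training, 0 = validation fold
test_fold = np.array([-1, 0, -1, -1, 0, -1])

ps = PredefinedSplit(test_fold)
splits = list(ps.split())       # list of (train_indices, test_indices) pairs
train_idx, val_idx = splits[0]  # only one fold exists here

print(ps.get_n_splits())
print(train_idx, val_idx)
```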

In [81]:
model1 = XGBRegressor(random_state=0)

In [82]:
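# Note: this run passes cv=2 (plain 2-fold CV), so `validation_split`
# defined above is not used here; the log below reflects cv=2.
gs1 = GridSearchCV(model1, grid1,
                   n_jobs=-1,
                   cv=2,
                   verbose=3)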

In [84]:
%time gs1.fit(X, y)

Fitting 2 folds for each of 9 candidates, totalling 18 fits
[CV] max_depth=6, n_estimators=100 ...................................
[CV] max_depth=6, n_estimators=100 ...................................
[CV] max_depth=6, n_estimators=500 ...................................
[CV] max_depth=6, n_estimators=500 ...................................
[CV]  max_depth=6, n_estimators=100, score=0.21739099778745896, total=19.3min
[CV] max_depth=6, n_estimators=1500 ..................................
[CV]  max_depth=6, n_estimators=100, score=0.21887805484771763, total=19.5min
[CV] max_depth=6, n_estimators=1500 ..................................
[CV]  max_depth=6, n_estimators=500, score=0.24357616017622352, total=89.2min
[CV] max_depth=18, n_estimators=100 ..................................
[CV]  max_depth=6, n_estimators=500, score=0.2458863049923381, total=89.8min
[CV] max_depth=18, n_estimators=100 ..................................
[CV]  max_depth=18, n_estimators=100, score=0.225466516065426

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 1054.1min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 1054.1min finished


CPU times: user 3h 26min 52s, sys: 35.7 s, total: 3h 27min 27s
Wall time: 20h 59min 58s


GridSearchCV(cv=2, error_score='raise',
       estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [100, 500, 1500], 'max_depth': [6, 18, 36]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [86]:
gs1.best_params_

{'max_depth': 6, 'n_estimators': 1500}

In [88]:
gs1.best_score_

0.25438806284040977
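The search above ran with `cv=2`, so each candidate was fitted twice. As a hedged sketch of how a predefined split could be passed instead, the example below uses synthetic data and scikit-learn's `DecisionTreeRegressor` as a fast stand-in for `XGBRegressor`. The data, the grid, and the names `X_toy`, `y_toy`, `fold`, and `gs` are all assumptions made for illustration, not part of the original run.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))                 # synthetic features
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=200)  # synthetic target

# -1 = training rows, 0 = validation rows (15% of the data)
fold = np.full(200, -1)
fold[rng.choice(200, 30, replace=False)] = 0

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),  # stand-in estimator, not XGBoost
    {"max_depth": [2, 4]},
    cv=PredefinedSplit(fold),               # a single fixed train/validation split
)
gs.fit(X_toy, y_toy)

print(gs.n_splits_)     # number of splits used per candidate
print(gs.best_params_)
print(gs.best_score_)   # with scoring=None, regressors are scored by R^2
```

With a single predefined split, each candidate is fitted once rather than once per fold, at the cost of a noisier score estimate. Since `scoring=None` falls back to the estimator's own `score` method, the `best_score_` values reported in this notebook are R² values.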