In [48]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit

from xgboost import XGBRegressor

# A first XGBoost model

Here I train and tune an XGBoost model on the first set of features I've extracted. These features are very basic and don't involve any great insight, so I don't expect the results to be competitive. This notebook mainly serves as a baseline.

## 1. Load data

Loading the training and test data:


In [10]:
X = np.load("../data/first_features/X.npy")
y = np.load("../data/first_features/y.npy")
X_test = np.load("../data/first_features/X_test.npy")

In [3]:
# sanity check - do the shapes line up? 
print(X.shape)
print(y.shape)
print(X_test.shape)

(1503424, 107)
(1503424,)
(508438, 107)


Now we can downcast these matrices from `float64` to `float32`, which halves their memory footprint.

In [59]:
X.nbytes  # about 1.29 GB as float64

1286930944

In [61]:
X.astype(np.float32).nbytes  # about 0.64 GB as float32

643465472

In [62]:
X = X.astype(np.float32)
X_test = X_test.astype(np.float32)
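
The saving follows directly from the element size: `float64` uses 8 bytes per value and `float32` uses 4. A minimal sketch on a small, hypothetical stand-in array (`demo` is not part of the original data) shows the arithmetic, along with the precision trade-off that comes with the smaller type:

```python
import numpy as np

# Hypothetical stand-in for X; the real matrix is 1503424 x 107.
demo = np.zeros((1000, 107), dtype=np.float64)

bytes_f64 = demo.nbytes                      # rows * cols * 8 bytes
bytes_f32 = demo.astype(np.float32).nbytes   # rows * cols * 4 bytes
print(bytes_f64, bytes_f32)

# float32 has a 24-bit significand, so integers above 2**24 are no longer
# all representable: 2**24 + 1 rounds to 2**24.
collapsed = np.float32(16777217.0) == np.float32(16777216.0)
print(collapsed)
```

For features with very large magnitudes it may be worth checking that the downcast does not merge distinct values.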

## 2. Set up train/validation splits

These splits are used for hyperparameter tuning.

Instead of actually splitting the data `X` into separate training and validation sets, I'll just save the indices of the training and validation rows.

In [27]:
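validation_idx = np.full(y.shape[0], -1)  # -1 marks a training row
np.random.seed(109)
n_val = int(round(.15 * validation_idx.shape[0]))
# pick 15% of the rows at random (without replacement) and mark them as validation (0)
validation_idx[np.random.choice(validation_idx.shape[0], n_val, replace=False)] = 0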

In [33]:
# now, we have an integer vector, where `-1` is training, and `0` is validation
print(np.sum(validation_idx == 0))
print(np.mean(validation_idx == 0))

225514
0.15000026605934189
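
The same construction can be sketched on a small scale. This version assumes NumPy's newer `Generator` API (`np.random.default_rng`) rather than the global seed used above, so it will not reproduce the same row selection; `n_rows` and `val_fraction` are hypothetical names chosen for illustration.

```python
import numpy as np

n_rows = 1000         # hypothetical size; the notebook uses y.shape[0]
val_fraction = 0.15

rng = np.random.default_rng(109)
fold = np.full(n_rows, -1)                    # -1 marks training rows
n_val = int(round(val_fraction * n_rows))
val_rows = rng.choice(n_rows, size=n_val, replace=False)
fold[val_rows] = 0                            # 0 marks validation rows

print((fold == 0).sum(), (fold == 0).mean())
```

Because the rows are drawn without replacement, exactly `n_val` entries end up marked as validation.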


## 3. Approximate model complexity

The first thing I'll do is get a feel for how complex the model should be. To do this, I'll define a grid over the parameters `n_estimators` and `max_depth` with very widely spaced values, to find the order of magnitude I should be using.

In [14]:
grid1 = {"n_estimators" : [500, 1500, 5000], 
        "max_depth" : [6, 18, 56]}

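To see how many models such a grid implies, the combinations can be enumerated with scikit-learn's `ParameterGrid`. This is only an illustrative sketch; the variable names `grid` and `candidates` are assumptions and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same coarse values as the grid above.
grid = {"n_estimators": [500, 1500, 5000],
        "max_depth": [6, 18, 56]}

# Every combination of the two parameters: 3 x 3 candidates.
candidates = list(ParameterGrid(grid))
print(len(candidates))
for params in candidates:
    print(params)
```

Each candidate is fitted once per cross-validation fold, so the number of fits grows quickly as values are added.
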
Now, create a predefined validation split to be passed to a `GridSearchCV` object.

In [34]:
validation_split = list(PredefinedSplit(validation_idx).split())
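
As a small illustration of what `PredefinedSplit` produces: entries equal to `-1` are never placed in a test set, and each distinct non-negative value defines one fold. The six-element `test_fold` vector below is a hypothetical toy example.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 = always training, 0 = validation fold
test_fold = np.array([-1, 0, -1, -1, 0, -1])

ps = PredefinedSplit(test_fold)
splits = list(ps.split())          # list of (train_indices, test_indices) pairs
train_idx, val_idx = splits[0]

print(ps.get_n_splits())
print(train_idx, val_idx)
```

With only the values `-1` and `0` present, there is a single train/validation pair rather than several folds.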

In [66]:
model1 = XGBRegressor(random_state=0)

In [77]:
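# note: cv=2 runs ordinary 2-fold cross-validation;
# the predefined validation_split is not used by this search
gs1 = GridSearchCV(model1, grid1,
           n_jobs=-1, 
           cv=2,
           verbose=3)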
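
An integer `cv` asks `GridSearchCV` for k-fold splits. To score candidates on a fixed validation set instead, a `PredefinedSplit` (or the list of index pairs it yields) can be passed as `cv`. The sketch below assumes synthetic data and uses `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor`; all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Mark 30 of the 200 rows as the single validation fold.
fold = np.full(200, -1)
fold[rng.choice(200, size=30, replace=False)] = 0

gs_demo = GridSearchCV(
    DecisionTreeRegressor(random_state=0),   # stand-in estimator
    {"max_depth": [2, 4]},
    cv=PredefinedSplit(fold),                # one fixed train/validation split
)
gs_demo.fit(X_demo, y_demo)
print(gs_demo.n_splits_)
```

With the default `refit=True`, the best candidate is refitted on all rows afterwards, validation rows included.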

Note to self: started at 11:15.

In [None]:
gs1.fit(X, y)

Fitting 2 folds for each of 9 candidates, totalling 18 fits
[CV] max_depth=6, n_estimators=500 ...................................
[CV] max_depth=6, n_estimators=500 ...................................
[CV] max_depth=6, n_estimators=1500 ..................................
[CV] max_depth=6, n_estimators=1500 ..................................
