<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Extreme Gradient Boosting with Box Score Features</h4>
    <h5 style="font-weight: bold; font-size: 24px;">Hyperparameter Tuning using Expanding Window</h5>
    <p style="font-size: 20px;">NBA API Seasons 2021-22 to 2023-24</p>
</div>

<a name="Models"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

[Inspect Expanding Training Window](#Inspect-Expanding-Training-Window)

[Functions](#Functions)

**[1. Target: Total Points (over / under)](#1.-Target:-Total-Points-(over-/-under))**
  
**[2. Target: Difference in Points (plus / minus)](#2.-Target:-Difference-in-Points-(plus-/-minus))**

**[3. Target: Game Winner (moneyline)](#3.-Target:-Game-Winner-(moneyline))**

# Setup

[Return to top](#Models)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline



# Data

[Return to top](#Models)

Data splits:

- Define NBA Season 2021-22 as the TRAINING set: regular season is 2021-10-19 to 2022-04-10. 
- Define NBA Season 2022-23 as the VALIDATION set: regular season is 2022-10-18 to 2023-04-09.
- Define NBA Season 2023-24 as the TESTING set: regular season is 2023-10-24 to 2024-04-14.
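
The split above amounts to a date lookup. The following is a minimal sketch of that idea, not the code inside `utl.load_and_scale_data`: the `SEASON_SPLITS` mapping and the `assign_split` helper are hypothetical names, and only the date ranges are taken from the list above.

```python
from datetime import date

# regular-season date ranges taken from the list above;
# the mapping and the helper are illustrative assumptions
SEASON_SPLITS = {
    "train": (date(2021, 10, 19), date(2022, 4, 10)),      # season 2021-22
    "validation": (date(2022, 10, 18), date(2023, 4, 9)),  # season 2022-23
    "test": (date(2023, 10, 24), date(2024, 4, 14)),       # season 2023-24
}

def assign_split(game_date):
    """Return the split name for a game date, or None outside every regular season."""
    for split_name, (start, end) in SEASON_SPLITS.items():
        if start <= game_date <= end:
            return split_name
    return None

print(assign_split(date(2021, 10, 23)))
print(assign_split(date(2022, 7, 1)))
```

Dates that fall between regular seasons (off-season or playoffs) map to `None` in this sketch, so such games would be dropped rather than silently assigned to a split.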

In [2]:
# load, filter (by time) and scale data
pts_scaled_df, pm_scaled_df, res_scaled_df, test_set_obs = utl.load_and_scale_data(
    file_path='../../data/processed/nba_team_matchups_rolling_box_scores_2022_2024_r05.csv',
    seasons_to_keep=['2021-22', '2022-23', '2023-24'], 
    training_season='2021-22',
    feature_prefix='ROLL_',
    scaler_type='minmax', 
    scale_target=False
)

Season 2021-22: 1186 games
Season 2022-23: 1181 games
Season 2023-24: 692 games
Total number of games across sampled seasons: 3059 games
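
With `scaler_type='minmax'` and `training_season='2021-22'`, one plausible reading is that the scaling bounds are learned on the training season only and then reused for later seasons, so that no information from validation or test games enters the features. The sketch below illustrates that idea on toy numbers. It is an assumption about what `utl.load_and_scale_data` does, and `fit_minmax` and `apply_minmax` are hypothetical helpers.

```python
def fit_minmax(train_values):
    """Learn min-max bounds from training data only."""
    return min(train_values), max(train_values)

def apply_minmax(values, bounds):
    """Scale values with previously learned bounds; a constant training column maps to 0.0."""
    lo, hi = bounds
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# toy rolling-points feature: training season vs. a later season
train_pts = [100.0, 110.0, 120.0]
later_pts = [105.0, 125.0]

bounds = fit_minmax(train_pts)
train_scaled = apply_minmax(train_pts, bounds)  # stays within [0, 1]
later_scaled = apply_minmax(later_pts, bounds)  # may fall outside [0, 1]
print(train_scaled, later_scaled)
```

Under this scheme, values slightly below 0 or above 1 in later seasons are expected and harmless for tree-based models.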


In [3]:
# define number of games in seasons
season_22_ngames = 1186
season_23_ngames = 1181

In [4]:
pts_scaled_df.head()

GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_FGM,ROLL_HOME_FGA,ROLL_HOME_FG_PCT,ROLL_HOME_FG3M,ROLL_HOME_FG3A,ROLL_HOME_FG3_PCT,ROLL_HOME_FTM,ROLL_HOME_FTA,ROLL_HOME_FT_PCT,ROLL_HOME_OREB,ROLL_HOME_DREB,ROLL_HOME_REB,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF,TOTAL_PTS
2021-10-23,0.745,0.522,0.296,0.753,0.758,0.58,0.731,0.805,0.878,0.535,0.571,0.292,0.478,0.612,1.0,1.0,0.6,0.661,0.577,0.586,0.202,0.704,0.526,0.176,1.0,0.336,0.285,0.603,0.25,0.369,0.291,0.5,0.28,0.5,0.391,0.5,185
2021-10-23,0.0,0.0,0.648,0.0,0.076,0.412,0.0,0.466,0.534,0.438,1.0,0.381,0.826,0.0,0.42,0.273,0.657,0.576,0.096,0.017,0.362,0.0,0.421,0.588,0.364,0.294,0.163,0.837,0.312,0.685,0.606,0.083,0.28,0.3,0.348,0.571,198
2021-10-23,0.691,0.652,0.507,0.758,0.455,0.454,0.466,0.593,0.534,0.72,0.286,0.602,0.609,0.561,0.058,0.364,0.257,0.661,0.635,0.586,0.176,0.728,0.263,0.265,0.396,0.672,0.772,0.469,0.125,0.685,0.488,0.708,0.36,0.2,0.174,0.643,239
2021-10-23,0.727,0.826,0.683,0.827,0.53,0.244,0.772,0.297,0.382,0.315,0.571,0.159,0.348,0.918,0.275,0.182,0.029,0.661,0.25,0.069,0.122,0.225,0.368,0.559,0.317,0.588,0.813,0.268,0.0,0.369,0.134,0.208,0.2,0.0,0.348,0.929,232
2021-10-24,0.745,0.783,0.577,0.848,0.833,0.58,0.82,0.254,0.229,0.56,0.357,0.779,0.826,0.765,0.565,0.818,0.543,0.322,1.0,0.897,1.0,0.362,0.842,1.0,0.559,0.504,0.569,0.446,0.625,0.73,0.843,0.833,0.76,0.9,0.478,0.786,204


In [5]:
pm_scaled_df.head()

GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_FGM,ROLL_HOME_FGA,ROLL_HOME_FG_PCT,ROLL_HOME_FG3M,ROLL_HOME_FG3A,ROLL_HOME_FG3_PCT,ROLL_HOME_FTM,ROLL_HOME_FTA,ROLL_HOME_FT_PCT,ROLL_HOME_OREB,ROLL_HOME_DREB,ROLL_HOME_REB,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF,PLUS_MINUS
2021-10-23,0.745,0.522,0.296,0.753,0.758,0.58,0.731,0.805,0.878,0.535,0.571,0.292,0.478,0.612,1.0,1.0,0.6,0.661,0.577,0.586,0.202,0.704,0.526,0.176,1.0,0.336,0.285,0.603,0.25,0.369,0.291,0.5,0.28,0.5,0.391,0.5,7.0
2021-10-23,0.0,0.0,0.648,0.0,0.076,0.412,0.0,0.466,0.534,0.438,1.0,0.381,0.826,0.0,0.42,0.273,0.657,0.576,0.096,0.017,0.362,0.0,0.421,0.588,0.364,0.294,0.163,0.837,0.312,0.685,0.606,0.083,0.28,0.3,0.348,0.571,-8.0
2021-10-23,0.691,0.652,0.507,0.758,0.455,0.454,0.466,0.593,0.534,0.72,0.286,0.602,0.609,0.561,0.058,0.364,0.257,0.661,0.635,0.586,0.176,0.728,0.263,0.265,0.396,0.672,0.772,0.469,0.125,0.685,0.488,0.708,0.36,0.2,0.174,0.643,29.0
2021-10-23,0.727,0.826,0.683,0.827,0.53,0.244,0.772,0.297,0.382,0.315,0.571,0.159,0.348,0.918,0.275,0.182,0.029,0.661,0.25,0.069,0.122,0.225,0.368,0.559,0.317,0.588,0.813,0.268,0.0,0.369,0.134,0.208,0.2,0.0,0.348,0.929,-10.0
2021-10-24,0.745,0.783,0.577,0.848,0.833,0.58,0.82,0.254,0.229,0.56,0.357,0.779,0.826,0.765,0.565,0.818,0.543,0.322,1.0,0.897,1.0,0.362,0.842,1.0,0.559,0.504,0.569,0.446,0.625,0.73,0.843,0.833,0.76,0.9,0.478,0.786,-10.0


In [6]:
res_scaled_df.head()

GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_FGM,ROLL_HOME_FGA,ROLL_HOME_FG_PCT,ROLL_HOME_FG3M,ROLL_HOME_FG3A,ROLL_HOME_FG3_PCT,ROLL_HOME_FTM,ROLL_HOME_FTA,ROLL_HOME_FT_PCT,ROLL_HOME_OREB,ROLL_HOME_DREB,ROLL_HOME_REB,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF,GAME_RESULT
2021-10-23,0.745,0.522,0.296,0.753,0.758,0.58,0.731,0.805,0.878,0.535,0.571,0.292,0.478,0.612,1.0,1.0,0.6,0.661,0.577,0.586,0.202,0.704,0.526,0.176,1.0,0.336,0.285,0.603,0.25,0.369,0.291,0.5,0.28,0.5,0.391,0.5,1
2021-10-23,0.0,0.0,0.648,0.0,0.076,0.412,0.0,0.466,0.534,0.438,1.0,0.381,0.826,0.0,0.42,0.273,0.657,0.576,0.096,0.017,0.362,0.0,0.421,0.588,0.364,0.294,0.163,0.837,0.312,0.685,0.606,0.083,0.28,0.3,0.348,0.571,0
2021-10-23,0.691,0.652,0.507,0.758,0.455,0.454,0.466,0.593,0.534,0.72,0.286,0.602,0.609,0.561,0.058,0.364,0.257,0.661,0.635,0.586,0.176,0.728,0.263,0.265,0.396,0.672,0.772,0.469,0.125,0.685,0.488,0.708,0.36,0.2,0.174,0.643,1
2021-10-23,0.727,0.826,0.683,0.827,0.53,0.244,0.772,0.297,0.382,0.315,0.571,0.159,0.348,0.918,0.275,0.182,0.029,0.661,0.25,0.069,0.122,0.225,0.368,0.559,0.317,0.588,0.813,0.268,0.0,0.369,0.134,0.208,0.2,0.0,0.348,0.929,0
2021-10-24,0.745,0.783,0.577,0.848,0.833,0.58,0.82,0.254,0.229,0.56,0.357,0.779,0.826,0.765,0.565,0.818,0.543,0.322,1.0,0.897,1.0,0.362,0.842,1.0,0.559,0.504,0.569,0.446,0.625,0.73,0.843,0.833,0.76,0.9,0.478,0.786,0


# Inspect Expanding Training Window

[Return to top](#Models)

In [7]:
# expanding window configuration
initial_train_size = 10  # starting size of the training set
test_size = 1            # leave-one-out (LOO) cross-validation
gap_size=0               # should there be a gap between train and test sets?
expansion_limit=None     # cap on how far the training set expands (None = use all data)

counter = 0
max_splits_to_show = 15

# show first few splits
for train_indices, test_indices in utl.expanding_window_ts_split(pts_scaled_df, initial_train_size, 
                                                                 test_size=test_size, gap_size=gap_size,
                                                                 expansion_limit=expansion_limit):
    print("TRAIN:", train_indices, "TEST:", test_indices)
    counter += 1
    if counter >= max_splits_to_show:
        break

TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10] TEST: [11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12] TEST: [13]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13] TEST: [14]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] TEST: [15]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] TEST: [16]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] TEST: [17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] TEST: [19]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] TEST: [20]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] TEST: [21]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21] TEST: [22]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22] TEST: [23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9
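
The splits printed above follow a simple pattern: the training indices always start at 0 and grow, while the test indices come right after them. The generator below is a minimal sketch of that pattern, not the implementation of `utl.expanding_window_ts_split`, which is not shown in this notebook. Two details are assumptions: the training window grows by `test_size` after each split, and `expansion_limit` is counted in splits (the real function may count observations instead).

```python
def expanding_window_split(n_samples, initial_train_size, test_size=1,
                           gap_size=0, expansion_limit=None):
    """Yield (train_indices, test_indices) pairs for an expanding training window."""
    train_end = initial_train_size
    n_splits = 0
    while train_end + gap_size + test_size <= n_samples:
        # assumption: the limit caps the number of splits produced
        if expansion_limit is not None and n_splits >= expansion_limit:
            break
        test_start = train_end + gap_size
        yield list(range(train_end)), list(range(test_start, test_start + test_size))
        train_end += test_size  # grow the training window by the fold just tested
        n_splits += 1

splits = list(expanding_window_split(n_samples=13, initial_train_size=10))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Because every training window ends before its test fold begins, no future game is ever used to predict an earlier one.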

# Functions

[Return to top](#Models)

In this setup, XGBoost seems to require the training function to be defined in the notebook rather than imported, at least when early stopping is used. The validation data also has to be passed directly to the `.fit` method through `eval_set`, which requires a slight modification of the corresponding function in the `utl` library.

In [8]:
def train_with_expanding_window(df, initial_train_size, test_size, target_col, model, gap_size=0,
                                expansion_limit=None, fitted_model=False, ensure_diversity=False):
    """
    Trains a given model using an expanding window approach on a specified DataFrame.

    Parameters:
    - df (pd.DataFrame): The DataFrame containing the features and target variable.
    - initial_train_size (int): The initial size of the training dataset.
    - test_size (int): The size of the test dataset for each split, typically 1 for LOO CV.
    - target_col (str): The name of the target column in `df`.
    - model (model object): The instantiated model to be trained, e.g., LinearRegression() or LogisticRegression().
    - gap_size (int): The gap size between training and test datasets. Default is 0.
    - expansion_limit (int, optional): The maximum number of times the training set is expanded by 1 observation during the expanding window process. This parameter controls the total number of train-test splits generated, indirectly determining the final size of the training set. If set, the training process will stop once this limit is reached, potentially leaving some data unused. If None, the training set will expand until all but the last observation are used for training.
    - fitted_model (bool): whether to return the fitted model instance.
    - ensure_diversity (bool, optional): For logistic regression, ensures the initial training data includes both classes. Default is False.

    Returns:
    - model_outputs (list): A list of model predictions or probabilities for the test sets across all splits.
    - y_true (list): A list of the actual target values corresponding to each prediction in `model_outputs`.

    This function iterates over the dataset using an expanding window to create training and test splits, 
    trains the specified `model` on each training split, and stores the model's predictions or probabilities.
    """
    import time
    
    start_time = time.time()

    # initialize storage for model outputs and true labels
    model_outputs = []  # store predictions or probabilities
    y_true = []

    for train_indices, test_indices in utl.expanding_window_ts_split(
        df, initial_train_size, test_size=test_size, gap_size=gap_size,
        expansion_limit=expansion_limit, ensure_diversity=ensure_diversity, 
        target_col=target_col if ensure_diversity else None):
        
        # get training and testing data for this window
        X_train = df.iloc[train_indices].drop(columns=target_col)
        y_train = df.iloc[train_indices][target_col]
        X_test = df.iloc[test_indices].drop(columns=target_col)
        y_test = df.iloc[test_indices][target_col]

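        # train the model; XGBoost early stopping needs an eval_set, and here the
        # current test fold is the one used to monitor it
        if isinstance(model, (XGBRegressor, XGBClassifier)):
            model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)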
        else:
            model.fit(X_train, y_train) # fallback
        
        # check if the model has the predict_proba method (i.e., likely a classifier)
        if hasattr(model, 'predict_proba'):
            # store predicted probabilities of the positive class
            proba = model.predict_proba(X_test)[:, 1]
            model_outputs.extend(list(proba.flatten()))  # ensure it's flattened
        elif hasattr(model, 'predict'):
            # predict for models that support predict (regressors and classifiers without predict_proba)
            predictions = model.predict(X_test)
            model_outputs.extend(list(predictions.flatten()))  # ensure it's flattened
        else:
            raise ValueError("Model does not support required prediction methods.")

        # store true labels for evaluation
        y_true.extend(list(y_test))

    end_time = time.time()
    print(f"Total time taken: {end_time - start_time:.2f} seconds")
    
    if fitted_model:
        return model, model_outputs, y_true
    else:
        return model_outputs, y_true

In [9]:
def train_models_over_grid(df, target_col, initial_train_size, test_size, gap_size,   
                           expansion_limit, model_class, constant_params, explore_params):
    """
    Trains models over a grid of hyperparameters.

    Parameters:
    - df (pd.DataFrame): The dataset to use for training.
    - target_col (str): The name of the target column.
    - initial_train_size (int): Starting size of the training set.
    - test_size (int): Size of the test dataset for each split (1 gives LOO cross-validation).
    - gap_size (int): The gap size between training and test datasets.
    - expansion_limit (int): Maximum number of new training observations in expansion.
    - model_class: The class of the model to instantiate.
    - constant_params (dict): Constant parameters for the model.
    - explore_params (dict): Parameters to explore with grid search.

    Returns:
    - dict: A dictionary containing the results for each run.
    """
    import itertools
    import time

    results = {}
    keys, values = zip(*explore_params.items())
    param_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

    start_time = time.time()

    for i, explore_param in enumerate(param_combinations):
        print('Parameters currently explored:', explore_param)
        
        # instantiate the model with combined parameters
        model = model_class(**constant_params, **explore_param)

        # train over expanding window
        model_outputs, y_true = train_with_expanding_window(
            df=df,
            initial_train_size=initial_train_size,
            test_size=test_size,
            gap_size=gap_size,
            expansion_limit=expansion_limit,
            target_col=target_col,
            model=model
        )
        
        # store outputs and true values in the results dictionary
        results[f"run_{i}"] = {
            "params": {**explore_param},
            "model_outputs": model_outputs,
            "y_true": y_true
        }

    end_time = time.time()
    print(f"Total time taken: {end_time - start_time:.2f} seconds")
    return results
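
To illustrate how `itertools.product` enumerates the grid inside `train_models_over_grid`, the sketch below expands the grid used for the total-points target in section 1 and looks up one combination by position. The dictionary is copied from that section; nothing is trained here.

```python
import itertools

# the grid explored for the total-points target in this notebook
explore_params = {
    "learning_rate": [1, 2, 3],
    "max_depth": [4, 6, 8],
    "alpha": [0.5, 1, 1.5],
    "lambda": [8, 10, 12],
    "gamma": [3, 5, 7],
}

keys, values = zip(*explore_params.items())
param_combinations = [dict(zip(keys, combo)) for combo in itertools.product(*values)]

print(len(param_combinations))  # number of models fitted over the expanding window
print(param_combinations[18])   # parameters stored under the key "run_18"
```

The last key varies fastest, which matches the progress messages: `gamma` cycles through its values before `lambda` moves on. Since the number of combinations is the product of the list lengths, adding one more value to each of five lists grows the run time quickly.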

<a name="1.-Target:-Total-Points-(over-/-under)"></a>
# 1. Target: Total Points (over / under)

[Return to top](#Models)

In [10]:
# configuration for expanding window
results = train_models_over_grid(
    model_class=XGBRegressor, # model class
    target_col='TOTAL_PTS', # target column name
    df=pts_scaled_df, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=20,  # number of games in each test fold (not leave-one-out)
    gap_size = 0,  # should there be a gap between train and test sets?
    expansion_limit=500, # maximum number of new training observations in expansion
    constant_params={
         'random_state': 599,
         'n_jobs': -1, 
         'objective': 'reg:squarederror',
         'eval_metric': 'rmse',
         'early_stopping_rounds': 20,
         'booster': 'gbtree',
         'n_estimators': 500   # tried: 100, 500, 1000
    },
    explore_params={  
        'learning_rate': [1, 2, 3],      # tried: 0.01, 0.1, 0.5, 1, 2
        'max_depth': [4, 6, 8],            # tried: 2, 4, 6
        'alpha': [0.5, 1, 1.5],            # tried: 1, 2
        'lambda': [8, 10, 12],              # tried: 1, 2, 5, 10
        'gamma': [3, 5, 7]                # tried: 1, 2, 5, 10
    }
)

Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 8, 'gamma': 3}
Total time taken: 2.37 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 8, 'gamma': 5}
Total time taken: 2.47 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 8, 'gamma': 7}
Total time taken: 2.39 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 10, 'gamma': 3}
Total time taken: 2.26 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 10, 'gamma': 5}
Total time taken: 2.11 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 10, 'gamma': 7}
Total time taken: 2.20 seconds
Parameters currently explored: {'learning_rate': 1, 'max_depth': 4, 'alpha': 0.5, 'lambda': 12, 'gamma': 3}
Total time taken: 2.34 seconds
Parameters currently explored:

In [11]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

index,run_id,alpha,average_rmse,gamma,lambda,learning_rate,max_depth,null_rmse
18,run_18,1.5,18.024,3,8,1,4,18.708
20,run_20,1.5,18.048,7,8,1,4,18.708
19,run_19,1.5,18.048,5,8,1,4,18.708
17,run_17,1.0,18.113,7,12,1,4,18.708
7,run_7,0.5,18.117,5,12,1,4,18.708
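
The `null_rmse` column is presumably a baseline against which `average_rmse` should be read. How `utl.compile_results_to_dataframe` computes it is not shown in this notebook, so the sketch below assumes the baseline predicts a constant equal to the mean of the true values. The predictions are made up, and `rmse` and `null_rmse` are hypothetical helpers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def null_rmse(y_true):
    """RMSE of a constant prediction equal to the mean of y_true (assumed baseline)."""
    mean_value = sum(y_true) / len(y_true)
    return rmse(y_true, [mean_value] * len(y_true))

# first four TOTAL_PTS values shown earlier, paired with made-up predictions
y_true = [185, 198, 239, 232]
y_pred = [190.0, 200.0, 230.0, 225.0]

model_error = rmse(y_true, y_pred)
baseline_error = null_rmse(y_true)
print(round(model_error, 3), round(baseline_error, 3))

# relative improvement of the best run in the table over its null RMSE
improvement = (18.708 - 18.024) / 18.708
print(f"{improvement:.1%}")
```

Read this way, the best total-points configuration improves on the baseline by only a few percent, which is worth keeping in mind when comparing hyperparameter settings that differ in the second decimal place.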


In [12]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/XGB_pts_best_params_boxscores.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)
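
`json.dump` calls the function passed as `default=` for every object it cannot serialize on its own, which is typically the case for NumPy scalars coming out of a DataFrame. The source of `utl.handle_non_serializable` is not shown here, so the following is only a sketch of what such a callback could look like. `to_serializable` and `FakeScalar` are hypothetical names, and `FakeScalar` stands in for a NumPy scalar so the example runs without third-party packages.

```python
import json

def to_serializable(obj):
    """Hypothetical `default=` callback: convert array or scalar wrappers to plain Python."""
    if hasattr(obj, "tolist"):  # e.g. NumPy arrays and NumPy scalars
        return obj.tolist()
    if hasattr(obj, "item"):    # other scalar wrappers exposing .item()
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

class FakeScalar:
    """Stand-in for a NumPy scalar (assumption made for this sketch only)."""
    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value

best_params_demo = {"learning_rate": FakeScalar(1), "max_depth": FakeScalar(4), "alpha": 1.5}
encoded = json.dumps(best_params_demo, default=to_serializable, indent=4)
print(encoded)
```

Raising `TypeError` for anything unrecognized keeps the standard `json` behavior, so unexpected objects fail loudly instead of being written as strings.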

<a name="2.-Target:-Difference-in-Points-(plus-/-minus)"></a>
# 2. Target: Difference in Points (plus / minus)

[Return to top](#Models)

In [13]:
# configuration for expanding window
results = train_models_over_grid(
    model_class=XGBRegressor, # model class
    target_col='PLUS_MINUS', # target column name
    df=pm_scaled_df, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=20,  # number of games in each test fold (not leave-one-out)
    gap_size = 0,  # should there be a gap between train and test sets?
    expansion_limit=500, # maximum number of new training observations in expansion
    constant_params={
        'random_state': 599,
        'n_jobs': -1,
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'early_stopping_rounds': 20,
        'booster': 'gbtree',
        'n_estimators': 500
    },
    explore_params={
        'learning_rate': [0.5, 1, 2],    # tried: 0.001, 0.01, 0.1, 0.5, 1.0
        'max_depth': [2, 4, 6],          # tried: 1, 2, 3, 4
        'alpha': [0.1, 0.5, 1],           # tried: 0.1, 1, 2
        'lambda': [3, 5, 7],             # tried: 0.1, 1, 2, 5, 10
        'gamma': [3, 5, 7]               # tried: 0.1, 1, 2, 5, 10
    }
)

Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 3, 'gamma': 3}
Total time taken: 1.71 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 3, 'gamma': 5}
Total time taken: 1.84 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 3, 'gamma': 7}
Total time taken: 2.11 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 5, 'gamma': 3}
Total time taken: 1.99 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 5, 'gamma': 5}
Total time taken: 2.05 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 5, 'gamma': 7}
Total time taken: 2.17 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 0.1, 'lambda': 7, 'gamma': 3}
Total time taken: 2.09 seconds
Parameters currently

In [14]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

index,run_id,alpha,average_rmse,gamma,lambda,learning_rate,max_depth,null_rmse
25,run_25,1.0,12.676,5,7,0.5,2,12.93
24,run_24,1.0,12.676,3,7,0.5,2,12.93
26,run_26,1.0,12.676,7,7,0.5,2,12.93
17,run_17,0.5,12.685,7,7,0.5,2,12.93
16,run_16,0.5,12.685,5,7,0.5,2,12.93


In [15]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/XGB_pm_best_params_boxscores.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)

<a name="3.-Target:-Game-Winner-(moneyline)"></a>
# 3. Target: Game Winner (moneyline)

[Return to top](#Models)

In [16]:
# configuration for expanding window
results = train_models_over_grid(
    model_class=XGBClassifier, # model class
    target_col='GAME_RESULT', # target column name
    df=res_scaled_df, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=20,  # number of games in each test fold (not leave-one-out)
    gap_size = 0,  # should there be a gap between train and test sets?
    expansion_limit=500, # maximum number of new training observations in expansion
    constant_params={
        'random_state': 599,
        'n_jobs': -1,
        'objective': 'binary:logistic',
        'eval_metric': 'error',
        'early_stopping_rounds': 20,
        'booster': 'gbtree',
        'n_estimators': 500
    },
    explore_params={
        'learning_rate': [0.5, 1, 2],    # tried: 0.001, 0.01, 0.1, 0.5, 1.0
        'max_depth': [2, 4, 6],             # tried: 1, 2, 3, 4
        'alpha': [1, 2],                    # tried: 0.1, 1, 2
        'lambda': [3, 5, 7],               # tried: 0.1, 1, 2, 5, 10
        'gamma': [3, 5, 7]                 # tried: 0.1, 1, 2, 5, 10
    }
)

Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 3, 'gamma': 3}
Total time taken: 2.24 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 3, 'gamma': 5}
Total time taken: 1.65 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 3, 'gamma': 7}
Total time taken: 1.90 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 5, 'gamma': 3}
Total time taken: 1.93 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 5, 'gamma': 5}
Total time taken: 1.92 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 5, 'gamma': 7}
Total time taken: 1.74 seconds
Parameters currently explored: {'learning_rate': 0.5, 'max_depth': 2, 'alpha': 1, 'lambda': 7, 'gamma': 3}
Total time taken: 1.94 seconds
Parameters currently explored: {'l

In [17]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_accuracy', ascending=False).head()

index,run_id,alpha,average_accuracy,average_f1_score,gamma,lambda,learning_rate,max_depth,overall_auc,pred_labels
114,run_114,1,0.708,0.771,3,7,2.0,2,0.633,"[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1,..."
111,run_111,1,0.702,0.763,3,5,2.0,2,0.649,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,..."
131,run_131,1,0.702,0.758,7,5,2.0,4,0.647,"[1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,..."
112,run_112,1,0.702,0.765,5,5,2.0,2,0.647,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,..."
127,run_127,1,0.696,0.755,5,3,2.0,4,0.643,"[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,..."


In [18]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_accuracy')

# save the dictionary to a file
with open('../../hyperparameters/XGB_res_best_params_boxscores.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)