<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Random Forest Models with All Features</h4>
    <h5 style="font-weight: bold; font-size: 24px;">Hyperparameter Tuning using Expanding Window</h5>
    <p style="font-size: 20px;">NBA API Seasons 2021-22 to 2023-24</p>
</div>

<a name="Models"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

[Inspect Expanding Training Window](#Inspect-Expanding-Training-Window)

**[1. Target: Total Points (over / under)](#1.-Target:-Total-Points-(over-/-under))**
  
**[2. Target: Difference in Points (plus / minus)](#2.-Target:-Difference-in-Points-(plus-/-minus))**

**[3. Target: Game Winner (moneyline)](#3.-Target:-Game-Winner-(moneyline))**

# Setup

[Return to top](#Models)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline



# Data

[Return to top](#Models)

Data splits:

- Define NBA Season 2021-22 as the TRAINING set: regular season is 2021-10-19 to 2022-04-10. 
- Define NBA Season 2022-23 as the VALIDATION set: regular season is 2022-10-18 to 2023-04-09.
- Define NBA Season 2023-24 as the TESTING set: regular season is 2023-10-24 to 2024-04-14.
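The date ranges above can be turned into a season and role label for each game. This is a minimal sketch, assuming every game has a date; `SEASON_RANGES` and `season_role` are hypothetical names used for illustration and are not part of the notebook's `utl` module.

```python
from datetime import date

# Regular-season date ranges and roles, copied from the list above.
SEASON_RANGES = {
    "2021-22": (date(2021, 10, 19), date(2022, 4, 10), "train"),
    "2022-23": (date(2022, 10, 18), date(2023, 4, 9), "validation"),
    "2023-24": (date(2023, 10, 24), date(2024, 4, 14), "test"),
}


def season_role(game_date):
    """Return (season, role) for a game date, or None if it falls outside all ranges."""
    for season, (start, end, role) in SEASON_RANGES.items():
        if start <= game_date <= end:
            return season, role
    return None


print(season_role(date(2021, 10, 23)))
```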

In [2]:
# load, filter (by time) and scale data
pts_all_features, pm_all_features, res_all_features, test_set_obs = utl.load_and_scale_data(
    input_data='../../data/processed/nba_team_matchups_all_rolling_stats_merged_2021_2024_r05.csv',
    seasons_to_keep=['2021-22', '2022-23', '2023-24'],
    training_season='2021-22',
    feature_prefixes=['ROLL_', 'ROLLDIFF_'],
    scaler_type='minmax', 
    scale_target=False
)

Season 2021-22: 1186 games
Season 2022-23: 1181 games
Season 2023-24: 691 games
Total number of games across sampled seasons: 3058 games
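The call above passes `training_season='2021-22'` together with `scaler_type='minmax'`, which suggests that the scaling bounds are learned from the training season and then applied to every season. The internals of `utl.load_and_scale_data` are not shown, so the following is only a plain-Python sketch of that idea; `fit_minmax` and `apply_minmax` are hypothetical helpers and the numbers are made up.

```python
def fit_minmax(train_values):
    """Learn the min and max of one feature from the training season only."""
    return min(train_values), max(train_values)


def apply_minmax(values, lo, hi):
    """Scale values with the training bounds; later seasons may land outside [0, 1]."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


train_pts = [100.0, 110.0, 120.0]  # illustrative values, not real data
later_pts = [105.0, 125.0]         # a later season, scaled with training bounds

lo, hi = fit_minmax(train_pts)
scaled_train = apply_minmax(train_pts, lo, hi)
scaled_later = apply_minmax(later_pts, lo, hi)
print(scaled_train, scaled_later)
```

Fitting the bounds on the training season alone keeps information from the validation and test seasons out of the features, at the cost of occasional scaled values below 0 or above 1 in later seasons.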


In [3]:
# define number of games in seasons
season_22_ngames = 1186
season_23_ngames = 1181
season_24_ngames = 691

In [4]:
pts_all_features.head()

GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_FGM,ROLL_HOME_FGA,ROLL_HOME_FG_PCT_x,ROLL_HOME_FG3M,ROLL_HOME_FG3A,ROLL_HOME_FG3_PCT,ROLL_HOME_FTM,ROLL_HOME_FTA,ROLL_HOME_FT_PCT,ROLL_HOME_OREB,ROLL_HOME_DREB,ROLL_HOME_REB,ROLL_HOME_AST_x,ROLL_HOME_STL,ROLL_HOME_BLK_x,ROLL_HOME_TOV,ROLL_HOME_PF_x,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT_x,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,...,ROLL_HOME_UFG_PCT,ROLL_HOME_FG_PCT_y,ROLL_HOME_DFGM,ROLL_HOME_DFGA,ROLL_HOME_DFG_PCT,ROLL_AWAY_DIST,ROLL_AWAY_ORBC,ROLL_AWAY_DRBC,ROLL_AWAY_RBC,ROLL_AWAY_TCHS,ROLL_AWAY_SAST,ROLL_AWAY_FTAST,ROLL_AWAY_PASS,ROLL_AWAY_AST_y,ROLL_AWAY_CFGM,ROLL_AWAY_CFGA,ROLL_AWAY_CFG_PCT,ROLL_AWAY_UFGM,ROLL_AWAY_UFGA,ROLL_AWAY_UFG_PCT,ROLL_AWAY_FG_PCT_y,ROLL_AWAY_DFGM,ROLL_AWAY_DFGA,ROLL_AWAY_DFG_PCT,TOTAL_PTS
2021-10-23,0.745,0.522,0.296,0.753,0.758,0.58,0.731,0.805,0.878,0.535,0.571,0.292,0.478,0.612,1.0,1.0,0.6,0.661,0.577,0.586,0.202,0.704,0.526,0.176,1.0,...,0.761,0.753,0.909,1.0,0.352,0.665,0.613,0.119,0.252,0.53,0.25,0.714,0.531,0.5,0.595,0.704,0.564,0.409,0.19,0.738,0.704,0.62,0.423,1.0,185
2021-10-23,0.0,0.0,0.648,0.0,0.076,0.412,0.0,0.466,0.534,0.438,1.0,0.381,0.826,0.0,0.42,0.273,0.657,0.576,0.096,0.017,0.362,0.0,0.421,0.588,0.364,...,0.0,0.0,0.909,0.655,0.762,0.619,0.387,0.489,0.409,0.579,0.125,0.179,0.517,0.083,0.0,0.458,0.0,0.364,0.453,0.301,0.0,0.185,0.269,0.453,198
2021-10-23,0.691,0.652,0.507,0.758,0.455,0.454,0.466,0.593,0.534,0.72,0.286,0.602,0.609,0.561,0.058,0.364,0.257,0.661,0.635,0.586,0.176,0.728,0.263,0.265,0.396,...,0.676,0.758,0.398,0.276,0.72,0.591,0.032,0.744,0.37,0.66,0.375,0.357,0.638,0.708,0.238,0.246,0.581,0.682,0.433,0.749,0.728,0.076,0.423,0.023,239
2021-10-23,0.727,0.826,0.683,0.827,0.53,0.244,0.772,0.297,0.382,0.315,0.571,0.159,0.348,0.918,0.275,0.182,0.029,0.661,0.25,0.069,0.122,0.225,0.368,0.559,0.317,...,0.712,0.827,0.455,0.586,0.291,0.715,0.452,0.46,0.409,0.46,0.375,0.179,0.403,0.208,0.357,0.282,0.735,0.136,0.372,0.075,0.225,0.511,0.346,0.965,232
2021-10-24,0.745,0.783,0.577,0.848,0.833,0.58,0.82,0.254,0.229,0.56,0.357,0.779,0.826,0.765,0.565,0.818,0.543,0.322,1.0,0.897,1.0,0.362,0.842,1.0,0.559,...,0.85,0.848,0.341,0.655,0.075,1.0,0.903,0.46,0.665,1.0,0.875,0.536,0.859,0.833,0.179,0.352,0.355,1.0,1.0,0.47,0.362,0.783,0.731,0.732,204


# Inspect Expanding Training Window

[Return to top](#Models)

In [5]:
# expanding window configuration
initial_train_size = 10  # starting size of the training set
test_size = 1            # leave-one-out (LOO) cross-validation
gap_size = 0             # gap between train and test sets (0 = none)
expansion_limit = None   # the limit on the test set observations (None = no limit)

counter = 0
max_splits_to_show = 15

# show first few splits
for train_indices, test_indices in utl.expanding_window_ts_split(pts_all_features, initial_train_size, 
                                                                 test_size=test_size, gap_size=gap_size,
                                                                 expansion_limit=expansion_limit):
    print("TRAIN:", train_indices, "TEST:", test_indices)
    counter += 1
    if counter >= max_splits_to_show:
        break

TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10] TEST: [11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12] TEST: [13]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13] TEST: [14]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] TEST: [15]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] TEST: [16]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] TEST: [17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] TEST: [19]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] TEST: [20]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] TEST: [21]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21] TEST: [22]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22] TEST: [23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9
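The splits printed above follow a simple pattern: the training set starts at `initial_train_size` observations and grows by `test_size` after every split, while the test set is the block that immediately follows. The source of `utl.expanding_window_ts_split` is not shown, so the generator below is a hypothetical stand-in that reproduces the pattern; it works on a sample count rather than a DataFrame and leaves out `expansion_limit`, whose exact semantics are not clear from the notebook.

```python
def expanding_window_split(n_samples, initial_train_size, test_size=1, gap_size=0):
    """Yield (train_indices, test_indices); the training set grows by test_size each step."""
    train_end = initial_train_size
    while train_end + gap_size + test_size <= n_samples:
        test_start = train_end + gap_size
        train_indices = list(range(train_end))
        test_indices = list(range(test_start, test_start + test_size))
        yield train_indices, test_indices
        train_end += test_size


splits = list(expanding_window_split(n_samples=13, initial_train_size=10))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Under this sketch, an initial training size of 1186 with a test size of 1181 on 3058 rows yields a single split: train on the 2021-22 games and evaluate on the 2022-23 games. Whether the notebook's helper behaves identically is an assumption.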

<a name="1.-Target:-Total-Points-(over-/-under)"></a>
# 1. Target: Total Points (over / under)

[Return to top](#Models)

In [6]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=RandomForestRegressor, # model class
    target_col='TOTAL_PTS', # target column name
    df=pts_all_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set (2021-22 season)
    test_size=season_23_ngames,  # size of the test window: the full 2022-23 validation season
    gap_size=0,  # gap between train and test sets (0 = none)
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'random_state': 599,
        'n_jobs': -1,
        'n_estimators': 500
    },
    explore_params={
        'max_depth': [18, 20, 22],           # tried: 10, 15, 18, 20, 22, 25
        'min_samples_split': [3, 4, 5],      # tried: 2, 4, 5, 6
        'min_samples_leaf': [2, 3],          # tried: 1, 2, 3
        'max_features': [0.3, 0.4],          # tried: 0.3, 0.4, 0.5
        'min_impurity_decrease': [0.1, 0.3]  # tried: 0.1, 0.3
    }
)

Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 0.3, 'min_impurity_decrease': 0.1}
Total time taken: 7.04 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 0.3, 'min_impurity_decrease': 0.3}
Total time taken: 7.06 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 0.4, 'min_impurity_decrease': 0.1}
Total time taken: 10.15 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 0.4, 'min_impurity_decrease': 0.3}
Total time taken: 9.79 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features': 0.3, 'min_impurity_decrease': 0.1}
Total time taken: 7.40 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features
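The log above steps through the grid with the last parameter changing fastest, which is the order produced by `itertools.product`. A minimal sketch of that enumeration follows, assuming `utl.train_models_over_grid` walks the Cartesian product of `explore_params` (its source is not shown); the `grid` and `names` variables are illustrative.

```python
from itertools import product

# Same search space as the cell above.
explore_params = {
    "max_depth": [18, 20, 22],
    "min_samples_split": [3, 4, 5],
    "min_samples_leaf": [2, 3],
    "max_features": [0.3, 0.4],
    "min_impurity_decrease": [0.1, 0.3],
}

names = list(explore_params)
# One dict per combination; the last key varies fastest.
grid = [dict(zip(names, values)) for values in product(*explore_params.values())]

print(len(grid))  # 3 * 3 * 2 * 2 * 2 combinations
print(grid[0])
```

The grid has 72 combinations, and each one triggers a full fit of a 500-tree forest, so adding a value to any list multiplies the total run time accordingly.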

In [7]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

Unnamed: 0,run_id,average_rmse,max_depth,max_features,min_impurity_decrease,min_samples_leaf,min_samples_split,n_estimators,n_jobs,null_rmse,random_state
66,run_66,19.378,22,0.4,0.1,2,5,500,-1,19.858,599
42,run_42,19.378,20,0.4,0.1,2,5,500,-1,19.858,599
22,run_22,19.38,18,0.4,0.1,3,5,500,-1,19.858,599
6,run_6,19.38,18,0.4,0.1,3,3,500,-1,19.858,599
14,run_14,19.38,18,0.4,0.1,3,4,500,-1,19.858,599
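The table reports `average_rmse` next to `null_rmse`. How `utl` computes the null value is not shown; a common choice, assumed here, is the RMSE of always predicting the training-set mean. The helper names below are hypothetical, and the sketch only illustrates the comparison.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def null_rmse(y_train, y_test):
    """RMSE of a constant prediction equal to the training-set mean (assumed baseline)."""
    mean_train = sum(y_train) / len(y_train)
    return rmse(y_test, [mean_train] * len(y_test))


def relative_improvement(model_rmse, baseline_rmse):
    """Fraction by which the model's RMSE undercuts the baseline's."""
    return (baseline_rmse - model_rmse) / baseline_rmse


gain = relative_improvement(19.378, 19.858)  # values from the top row above
print(f"{gain:.1%}")
```

For the top row, 19.378 against 19.858 is a reduction of roughly 2.4%, so the best forest is only slightly better than the baseline on the validation season.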


In [8]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/RF_pts_best_params_all_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)

<a name="2.-Target:-Difference-in-Points-(plus-/-minus)"></a>
# 2. Target: Difference in Points (plus / minus)

[Return to top](#Models)

In [9]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=RandomForestRegressor, # model class
    target_col='PLUS_MINUS', # target column name
    df=pm_all_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set (2021-22 season)
    test_size=season_23_ngames,  # size of the test window: the full 2022-23 validation season
    gap_size=0,  # gap between train and test sets (0 = none)
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'random_state': 599,
        'n_jobs': -1,
        'n_estimators': 500
    },
    explore_params={
        'max_depth': [18, 20, 22],           # tried: 10, 15, 18, 20, 22, 25
        'min_samples_split': [2, 4],         # tried: 2, 4, 6
        'min_samples_leaf': [1, 2],          # tried: 1, 2
        'max_features': [0.3, 0.5],          # tried: 0.3, 0.5
        'min_impurity_decrease': [0.1, 0.3]  # tried: 0.1, 0.3
    }
)

Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.3, 'min_impurity_decrease': 0.1}
Total time taken: 7.33 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.3, 'min_impurity_decrease': 0.3}
Total time taken: 7.43 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.5, 'min_impurity_decrease': 0.1}
Total time taken: 14.18 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.5, 'min_impurity_decrease': 0.3}
Total time taken: 12.54 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 0.3, 'min_impurity_decrease': 0.1}
Total time taken: 7.85 seconds
Parameters currently explored: {'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_feature

In [10]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

Unnamed: 0,run_id,average_rmse,max_depth,max_features,min_impurity_decrease,min_samples_leaf,min_samples_split,n_estimators,n_jobs,null_rmse,random_state
7,run_7,13.669,18,0.5,0.3,2,2,500,-1,14.254,599
15,run_15,13.669,18,0.5,0.3,2,4,500,-1,14.254,599
24,run_24,13.674,20,0.3,0.1,1,4,500,-1,14.254,599
23,run_23,13.674,20,0.5,0.3,2,2,500,-1,14.254,599
31,run_31,13.674,20,0.5,0.3,2,4,500,-1,14.254,599


In [11]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/RF_pm_best_params_all_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)

<a name="3.-Target:-Game-Winner-(moneyline)"></a>
# 3. Target: Game Winner (moneyline)

[Return to top](#Models)

In [12]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=RandomForestClassifier, # model class
    target_col='GAME_RESULT', # target column name
    df=res_all_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set (2021-22 season)
    test_size=season_23_ngames,  # size of the test window: the full 2022-23 validation season
    gap_size=0,  # gap between train and test sets (0 = none)
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'random_state': 599,
        'n_jobs': -1,
        'n_estimators': 500,
        'max_features': 'sqrt',
        'criterion': 'gini'
    },
    explore_params={
        'max_depth': [2, 4, 6, 8],            # tried: 6, 8, 10, 12, 15, 20, 25
        'min_samples_split': [2, 4],          # tried: 2, 4, 6
        'min_samples_leaf': [1, 2],           # tried: 1, 2
        'min_impurity_decrease': [0.1, 0.3]   # tried: 0.1, 0.3
    }
)

Parameters currently explored: {'max_depth': 2, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.1}
Total time taken: 1.27 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.3}
Total time taken: 1.19 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_split': 2, 'min_samples_leaf': 2, 'min_impurity_decrease': 0.1}
Total time taken: 1.12 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_split': 2, 'min_samples_leaf': 2, 'min_impurity_decrease': 0.3}
Total time taken: 1.16 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_split': 4, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.1}
Total time taken: 1.15 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_split': 4, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.3}
Total time taken: 1.14 seconds
Parameters currently explored: {'max_depth': 2, 'min_samples_spl

In [13]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_accuracy', ascending=False).head()

Unnamed: 0,run_id,average_accuracy,average_f1_score,baseline_accuracy,criterion,max_depth,max_features,min_impurity_decrease,min_samples_leaf,min_samples_split,n_estimators,n_jobs,overall_auc,pred_labels,random_state
0,run_0,0.573,0.728,0.573,gini,2,sqrt,0.1,1,2,500,-1,0.489,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",599
1,run_1,0.573,0.728,0.573,gini,2,sqrt,0.3,1,2,500,-1,0.489,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",599
30,run_30,0.573,0.728,0.573,gini,8,sqrt,0.1,2,4,500,-1,0.489,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",599
29,run_29,0.573,0.728,0.573,gini,8,sqrt,0.3,1,4,500,-1,0.489,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",599
28,run_28,0.573,0.728,0.573,gini,8,sqrt,0.1,1,4,500,-1,0.489,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",599
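Every row above has `average_accuracy` equal to `baseline_accuracy` (0.573) and `pred_labels` that begin with a long run of 1s, which is what a forest that never splits would produce. One possible explanation, offered as an assumption rather than a diagnosis: with the Gini criterion the impurity of a binary node is at most 0.5, so `min_impurity_decrease` values of 0.1 and 0.3 are very demanding thresholds. The sketch below applies the weighted impurity decrease formula from scikit-learn's documentation to a hypothetical root split, assuming the baseline accuracy equals the share of the majority class.

```python
def gini(p):
    """Gini impurity of a binary node whose positive-class share is p."""
    return 2 * p * (1 - p)


def impurity_decrease(n_total, n_node, n_left, n_right, g_node, g_left, g_right):
    """Weighted impurity decrease: N_t/N * (g - N_L/N_t * g_L - N_R/N_t * g_R)."""
    return (n_node / n_total) * (
        g_node - (n_left / n_node) * g_left - (n_right / n_node) * g_right
    )


root = gini(0.573)  # class share assumed equal to baseline_accuracy above

# Hypothetical root split into two equal halves with 65% and 49.6% positive labels.
decrease = impurity_decrease(1000, 1000, 500, 500, root, gini(0.65), gini(0.496))
print(round(root, 3), round(decrease, 4))
```

Even this fairly strong split lowers the impurity by only about 0.012, far below both thresholds in the grid. If that is what happened here, much smaller values of `min_impurity_decrease` (including 0) would be worth exploring.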


In [14]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_accuracy')

# save the dictionary to a file
with open('../../hyperparameters/RF_res_best_params_all_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)