<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">K-Nearest Neighbors with Selected Features</h4>
    <h5 style="font-weight: bold; font-size: 24px;">Hyperparameter Tuning using Expanding Window</h5>
    <p style="font-size: 20px;">NBA API Seasons 2021-22 to 2023-24</p>
</div>

<a name="Models"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

[Inspect Expanding Training Window](#Inspect-Expanding-Training-Window)

**[1. Target: Total Points (over / under)](#1.-Target:-Total-Points-(over-/-under))**
  
**[2. Target: Difference in Points (plus / minus)](#2.-Target:-Difference-in-Points-(plus-/-minus))**

**[3. Target: Game Winner (moneyline)](#3.-Target:-Game-Winner-(moneyline))**

# Setup

[Return to top](#Models)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline



# Data

[Return to top](#Models)

Data splits:

- Define NBA Season 2021-22 as the TRAINING set: regular season is 2021-10-19 to 2022-04-10. 
- Define NBA Season 2022-23 as the VALIDATION set: regular season is 2022-10-18 to 2023-04-09.
- Define NBA Season 2023-24 as the TESTING set: regular season is 2023-10-24 to 2024-04-14.
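
The date ranges above can be expressed as a small lookup. This is a minimal sketch for illustration, not the logic inside `utl.load_and_scale_data`; the names `SEASON_SPLITS` and `assign_split` are hypothetical.

```python
from datetime import date
from typing import Optional

# Regular-season date ranges as listed above (inclusive on both ends).
SEASON_SPLITS = {
    "train": (date(2021, 10, 19), date(2022, 4, 10)),      # season 2021-22
    "validation": (date(2022, 10, 18), date(2023, 4, 9)),  # season 2022-23
    "test": (date(2023, 10, 24), date(2024, 4, 14)),       # season 2023-24
}


def assign_split(game_date: date) -> Optional[str]:
    """Return the split a game belongs to, or None if it falls outside every range."""
    for name, (start, end) in SEASON_SPLITS.items():
        if start <= game_date <= end:
            return name
    return None


print(assign_split(date(2021, 10, 23)))  # train
```

Games played between regular seasons (for example, playoffs) map to `None` in this sketch and would be dropped.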

In [2]:
# load, filter (by time) and scale data
pts_all_features, pm_all_features, res_all_features, test_set_obs = utl.load_and_scale_data(
    input_data='../../data/processed/nba_team_matchups_all_rolling_stats_merged_2021_2024_r05.csv',
    seasons_to_keep=['2021-22', '2022-23', '2023-24'],
    training_season='2021-22',
    feature_prefixes=['ROLL_', 'ROLLDIFF_'],
    scaler_type='minmax', 
    scale_target=False
)

Season 2021-22: 1186 games
Season 2022-23: 1181 games
Season 2023-24: 691 games
Total number of games across sampled seasons: 3058 games


In [3]:
# define number of games in seasons
season_22_ngames = 1186
season_23_ngames = 1181
season_24_ngames = 691

In [4]:
# load the best features dictionaries back from the file
with open('../../data/selected_features/feature_set_01_filter_and_wrapper.json', 'r') as json_file:
    selected_features_filter_and_wrapper = json.load(json_file)

with open('../../data/selected_features/feature_set_02_embedded.json', 'r') as json_file:
    selected_features_embedded = json.load(json_file)

In [5]:
# subset the features
pts_sub_fw_features = pts_all_features[selected_features_filter_and_wrapper['TOTAL_PTS'] + ['TOTAL_PTS']]
pts_sub_e_features = pts_all_features[selected_features_embedded['TOTAL_PTS'] + ['TOTAL_PTS']]

pm_sub_fw_features = pm_all_features[selected_features_filter_and_wrapper['PLUS_MINUS'] + ['PLUS_MINUS']]
pm_sub_e_features = pm_all_features[selected_features_embedded['PLUS_MINUS'] + ['PLUS_MINUS']]

res_sub_fw_features = res_all_features[selected_features_filter_and_wrapper['GAME_RESULT'] + ['GAME_RESULT']]
res_sub_e_features = res_all_features[selected_features_embedded['GAME_RESULT'] + ['GAME_RESULT']]

In [6]:
pts_sub_fw_features.head()

| GAME_DATE | ROLL_HOME_FTM | ROLL_HOME_OPP_PTS_PAINT | ROLL_HOME_PTS_PAINT | ROLL_AWAY_PTS | ROLL_HOME_PTS_FB | ROLL_AWAY_OFF_LOOSE_BALLS_RECOVERED | ROLL_AWAY_DEF_BOXOUTS | ROLL_AWAY_estimatedPace | ROLL_AWAY_PTS_PAINT | ROLL_AWAY_DFG_PCT | ROLL_HOME_AST | ROLL_AWAY_assistToTurnover | ROLL_HOME_OPP_TOV_PCT | ROLL_HOME_estimatedDefensiveRating | TOTAL_PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-10-23 | 0.805 | 0.808 | 0.5 | 0.577 | 1.0 | 0.6 | 0.208 | 0.443 | 0.391 | 1.0 | 0.612 | 0.291 | 0.84 | 0.298 | 185 |
| 2021-10-23 | 0.466 | 0.758 | 0.25 | 0.096 | 0.581 | 0.3 | 0.307 | 0.522 | 0.0 | 0.453 | 0.0 | 0.161 | 0.762 | 0.176 | 198 |
| 2021-10-23 | 0.593 | 0.505 | 0.7 | 0.635 | 0.065 | 0.75 | 0.208 | 0.465 | 0.565 | 0.023 | 0.561 | 0.589 | 0.144 | 0.717 | 239 |
| 2021-10-23 | 0.297 | 0.606 | 0.7 | 0.25 | 0.839 | 0.3 | 0.109 | 0.691 | 0.261 | 0.965 | 0.918 | 0.218 | 0.519 | 0.328 | 232 |
| 2021-10-24 | 0.254 | 0.202 | 0.7 | 1.0 | 0.548 | 0.45 | 0.455 | 0.646 | 0.348 | 0.732 | 0.765 | 0.393 | 0.591 | 0.002 | 204 |


# Inspect Expanding Training Window

[Return to top](#Models)

In [7]:
# expanding window configuration
initial_train_size = 10  # starting size of the training set
test_size = 1            # leave-one-out (LOO) cross-validation
gap_size = 0             # should there be a gap between train and test sets?
expansion_limit = None   # the limit on the test set observations

counter = 0
max_splits_to_show = 15

# show first few splits
for train_indices, test_indices in utl.expanding_window_ts_split(pts_all_features, initial_train_size, 
                                                                 test_size=test_size, gap_size=gap_size,
                                                                 expansion_limit=expansion_limit):
    print("TRAIN:", train_indices, "TEST:", test_indices)
    counter += 1
    if counter >= max_splits_to_show:
        break

TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10] TEST: [11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12] TEST: [13]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13] TEST: [14]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] TEST: [15]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] TEST: [16]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] TEST: [17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] TEST: [19]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] TEST: [20]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] TEST: [21]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21] TEST: [22]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22] TEST: [23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9
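
The printed splits show the pattern: the training window always starts at index 0 and grows by `test_size` observations per step, while the test window sits immediately after it (plus an optional gap). The generator below is a minimal sketch of that pattern under stated assumptions. It is not the source of `utl.expanding_window_ts_split`; the name `expanding_window_split`, the rule that only full test windows are yielded, and the reading of `expansion_limit` as a cap on added training observations are all assumptions.

```python
from typing import Iterator, List, Optional, Tuple


def expanding_window_split(
    n_samples: int,
    initial_train_size: int,
    test_size: int = 1,
    gap_size: int = 0,
    expansion_limit: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for an expanding training window."""
    train_end = initial_train_size
    while True:
        added = train_end - initial_train_size
        # Assumption: expansion_limit caps how many observations may be added.
        if expansion_limit is not None and added > expansion_limit:
            break
        test_start = train_end + gap_size
        test_end = test_start + test_size
        # Assumption: stop once a full test window no longer fits.
        if test_end > n_samples:
            break
        yield list(range(train_end)), list(range(test_start, test_end))
        train_end += test_size  # the test window is absorbed into training


for train_idx, test_idx in expanding_window_split(13, initial_train_size=10):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Under these assumptions, 3058 rows with `initial_train_size=1186` and `test_size=1181` produce a single split: train on the first 1186 rows and evaluate on rows 1186 to 2366.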

<a name="1.-Target:-Total-Points-(over-/-under)"></a>
# 1. Target: Total Points (over / under)

[Return to top](#Models)

In [8]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=KNeighborsRegressor, # model class
    target_col='TOTAL_PTS', # target column name
    df=pts_sub_fw_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=season_23_ngames,  # size of the validation window (all 2022-23 games)
    gap_size=0,  # should there be a gap between train and test sets?
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'n_jobs': -1,
        'metric': 'minkowski'
    },
    explore_params={
        'n_neighbors': [50, 60, 70, 80, 90, 100], # tried: 50, 60, 70, 80, 90, 100
        'weights': ['uniform', 'distance'],      # tried: 'uniform', 'distance'
        'p': [1, 2]                              # tried: 1, 2
    }
)

Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 1}
Total time taken: 0.07 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 2}
Total time taken: 0.05 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 1}
Total time taken: 0.05 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 2}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 1}
Total time taken: 0.05 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 2}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 1}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 2}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 70, 'weights': 'uniform', 'p': 1}
Total time taken: 0
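
The log above walks the grid with `p` changing fastest, then `weights`, then `n_neighbors`. That order matches a Cartesian product over `explore_params` in dictionary order. The sketch below reproduces the enumeration with `itertools.product`; whether `utl.train_models_over_grid` builds its grid this way is an assumption, although the order agrees with the log and with the `run_id` values in the results tables.

```python
from itertools import product

explore_params = {
    "n_neighbors": [50, 60, 70, 80, 90, 100],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# One dict per combination; the last key ("p") varies fastest.
names = list(explore_params)
grid = [dict(zip(names, values)) for values in product(*explore_params.values())]

print(len(grid))  # 24
print(grid[7])    # {'n_neighbors': 60, 'weights': 'distance', 'p': 2}
```

Index 7 corresponds to `run_7`, so a zero-based position in this list can be read as the run number.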

In [9]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

| | run_id | average_rmse | metric | n_jobs | n_neighbors | null_rmse | p | weights |
|---|---|---|---|---|---|---|---|---|
| 7 | run_7 | 19.531 | minkowski | -1 | 60 | 19.858 | 2 | distance |
| 5 | run_5 | 19.535 | minkowski | -1 | 60 | 19.858 | 2 | uniform |
| 11 | run_11 | 19.544 | minkowski | -1 | 70 | 19.858 | 2 | distance |
| 15 | run_15 | 19.544 | minkowski | -1 | 80 | 19.858 | 2 | distance |
| 10 | run_10 | 19.548 | minkowski | -1 | 70 | 19.858 | 1 | distance |
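
To put `average_rmse` next to `null_rmse` in context, the sketch below computes RMSE and a null-model RMSE in plain Python. It assumes the null model predicts the training-set mean for every test game; the notebook does not show how `utl` derives `null_rmse`, so treat that definition and the helper names as hypothetical.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def null_rmse(y_train, y_test):
    """RMSE of a null model that always predicts the training mean (assumption)."""
    baseline = mean(y_train)
    return rmse(y_test, [baseline] * len(y_test))


# Values taken from the best row of the table above.
best_rmse, baseline_rmse = 19.531, 19.858
relative_gain = (baseline_rmse - best_rmse) / baseline_rmse
print(f"{relative_gain:.1%}")  # 1.6%
```

By this measure the best KNN configuration reduces validation RMSE by roughly 1.6% relative to the null value.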


In [10]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/KNN_pts_best_params_selected_fw_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)
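
The `default=` argument of `json.dump` is called for any object the encoder cannot serialize on its own, which commonly happens when parameter values come out of a DataFrame as NumPy scalars. The handler below is a hedged sketch of what such a fallback might look like; it is not the source of `utl.handle_non_serializable`, and `to_serializable` and `FakeNumpyInt` are hypothetical names (the latter is a stand-in so the example runs without NumPy).

```python
import json


def to_serializable(obj):
    """Fallback for json.dump(s): convert objects the encoder rejects."""
    # NumPy scalars expose .item(), which returns the matching Python scalar.
    if hasattr(obj, "item"):
        return obj.item()
    if isinstance(obj, set):
        return sorted(obj)
    return str(obj)  # last resort: store the string representation


class FakeNumpyInt:
    """Stand-in for a NumPy integer scalar (illustration only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


params = {"n_neighbors": FakeNumpyInt(60), "p": 2, "weights": "distance"}
print(json.dumps(params, default=to_serializable, indent=4))
```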

<a name="2.-Target:-Difference-in-Points-(plus-/-minus)"></a>
# 2. Target: Difference in Points (plus / minus)

[Return to top](#Models)

In [11]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=KNeighborsRegressor, # model class
    target_col='PLUS_MINUS', # target column name
    df=pm_sub_fw_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=season_23_ngames,  # size of the validation window (all 2022-23 games)
    gap_size=0,  # should there be a gap between train and test sets?
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'n_jobs': -1,
        'metric': 'minkowski'
    },
    explore_params={
        'n_neighbors': [50, 60, 70, 80, 90, 100],  # tried: 50, 60, 70, 80, 90, 100
        'weights': ['uniform', 'distance'],     # tried: 'uniform', 'distance'
        'p': [1, 2]                             # tried: 1, 2
    }
)

Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 1}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 1}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 1}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 1}
Total time taken: 0.04 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 70, 'weights': 'uniform', 'p': 1}
Total time taken: 0

In [12]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_rmse', ascending=True).head()

| | run_id | average_rmse | metric | n_jobs | n_neighbors | null_rmse | p | weights |
|---|---|---|---|---|---|---|---|---|
| 9 | run_9 | 13.826 | minkowski | -1 | 70 | 14.254 | 2 | uniform |
| 17 | run_17 | 13.831 | minkowski | -1 | 90 | 14.254 | 2 | uniform |
| 13 | run_13 | 13.831 | minkowski | -1 | 80 | 14.254 | 2 | uniform |
| 19 | run_19 | 13.834 | minkowski | -1 | 90 | 14.254 | 2 | distance |
| 11 | run_11 | 13.834 | minkowski | -1 | 70 | 14.254 | 2 | distance |
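
Selecting the best configuration amounts to taking the row with the lowest error (or the highest score for classification) and keeping only the hyperparameter columns. The sketch below shows that step on three rows copied from the table above. It is an illustration, not the source of `utl.get_best_params`; `pick_best_params` and its `higher_is_better` flag are hypothetical.

```python
def pick_best_params(
    rows,
    metric,
    higher_is_better=False,
    param_names=("metric", "n_jobs", "n_neighbors", "p", "weights"),
):
    """Return the hyperparameters of the row with the best metric value."""
    chooser = max if higher_is_better else min
    best = chooser(rows, key=lambda row: row[metric])
    return {name: best[name] for name in param_names}


rows = [
    {"run_id": "run_9", "average_rmse": 13.826, "metric": "minkowski",
     "n_jobs": -1, "n_neighbors": 70, "p": 2, "weights": "uniform"},
    {"run_id": "run_17", "average_rmse": 13.831, "metric": "minkowski",
     "n_jobs": -1, "n_neighbors": 90, "p": 2, "weights": "uniform"},
    {"run_id": "run_19", "average_rmse": 13.834, "metric": "minkowski",
     "n_jobs": -1, "n_neighbors": 90, "p": 2, "weights": "distance"},
]

print(pick_best_params(rows, metric="average_rmse"))
```

For an error metric such as RMSE the default `higher_is_better=False` applies; for accuracy the flag would be set to `True`.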


In [13]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_rmse')

# save the dictionary to a file
with open('../../hyperparameters/KNN_pm_best_params_selected_fw_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)

<a name="3.-Target:-Game-Winner-(moneyline)"></a>
# 3. Target: Game Winner (moneyline)

[Return to top](#Models)

In [14]:
# configuration for expanding window
results = utl.train_models_over_grid(
    model_class=KNeighborsClassifier, # model class
    target_col='GAME_RESULT', # target column name
    df=res_sub_fw_features, # data set to use
    initial_train_size=season_22_ngames, # starting size of the training set
    test_size=season_23_ngames,  # size of the validation window (all 2022-23 games)
    gap_size=0,  # should there be a gap between train and test sets?
    expansion_limit=None, # maximum number of new training observations in expansion
    constant_params={
        'n_jobs': -1,
        'metric': 'minkowski'
    },
    explore_params={
        'n_neighbors': [50, 60, 70, 80, 90, 100],    # tried: 50, 60, 70, 80, 90, 100
        'weights': ['uniform', 'distance'], # tried: 'uniform', 'distance'
        'p': [1, 2]                         # tried: 1, 2
    }
)

Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 1}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'uniform', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 1}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 50, 'weights': 'distance', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 1}
Total time taken: 0.06 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'uniform', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 1}
Total time taken: 0.05 seconds
Parameters currently explored: {'n_neighbors': 60, 'weights': 'distance', 'p': 2}
Total time taken: 0.03 seconds
Parameters currently explored: {'n_neighbors': 70, 'weights': 'uniform', 'p': 1}
Total time taken: 0

In [15]:
# get metrics for each combination of parameter values
results_df = utl.compile_results_to_dataframe(results)

# print best hyperparameter settings
results_df.sort_values(by='average_accuracy', ascending=False).head()

| | run_id | average_accuracy | average_f1_score | baseline_accuracy | metric | n_jobs | n_neighbors | overall_auc | p | pred_labels | weights |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | run_12 | 0.595 | 0.69 | 0.573 | minkowski | -1 | 80 | 0.607 | 1 | [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,... | uniform |
| 16 | run_16 | 0.593 | 0.69 | 0.573 | minkowski | -1 | 90 | 0.606 | 1 | [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,... | uniform |
| 4 | run_4 | 0.592 | 0.683 | 0.573 | minkowski | -1 | 60 | 0.603 | 1 | [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,... | uniform |
| 13 | run_13 | 0.591 | 0.684 | 0.573 | minkowski | -1 | 80 | 0.613 | 2 | [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,... | uniform |
| 19 | run_19 | 0.591 | 0.69 | 0.573 | minkowski | -1 | 90 | 0.612 | 2 | [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,... | distance |
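
The gap between `average_accuracy` and `baseline_accuracy` is the quantity of interest here. The sketch below computes both on made-up toy labels, assuming the baseline is a majority-class predictor; the notebook does not show how `utl` computes `baseline_accuracy`, so that definition and the helper names are assumptions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class (assumption)."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Toy labels for illustration only (not taken from the data).
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))            # 0.7
print(majority_baseline_accuracy(y_true))  # 0.6

# Lift of the best row in the table above over its baseline.
print(round(0.595 - 0.573, 3))             # 0.022
```

In the table above the best configuration therefore beats its baseline by about 2.2 percentage points.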


In [16]:
# get best parameters from validation as dictionary
best_params = utl.get_best_params(results_df, metric='average_accuracy')

# save the dictionary to a file
with open('../../hyperparameters/KNN_res_best_params_selected_fw_features.json', 'w') as json_file:
    json.dump(best_params, json_file, default=utl.handle_non_serializable, indent=4)