## Creating the Training Data

In this notebook, we will create the training data used by the various models for predicting scores in the NCAA basketball tournament. We will start by generating features for each dataset and joining the datasets into one. Then, we will use blocking to reduce the training data to games between tournament-caliber teams. Finally, we will select a reduced feature set and save the training data for a machine learning model.

In [60]:
# Import packages
import sys
sys.path.append('./College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'2023'

## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each statistical attribute, there are three features: the attribute for the favored team (suffix `_Fav`), the attribute for the underdog (no suffix), and the difference between the two (suffix `_Diff`). In every dataset, the favored team is the team with the higher Kenpom AdjEM. Using this system, a label of '1' represents an upset and a label of '0' means that the favored team won the game.

We will create a dataset with these features for each set of statistics (Kenpom, T-Rank, basic) for each year that these stats are available. Additionally, we will create a single dataset that includes all three sets of statistics.
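
As a rough illustration of the favored/underdog layout described above, here is a minimal sketch on toy data. The team names, the `Home`/`Away` columns, and the values are all hypothetical, and this is not how `cbb.gen_kenpom_features` is implemented. It only shows the shape of the resulting features and how the label is assigned.

```python
import numpy as np
import pandas as pd

# Toy game-level data; names, columns, and values are made up for illustration
games = pd.DataFrame({
    'Home': ['A', 'C'],
    'Away': ['B', 'D'],
    'AdjEM_Home': [20.0, 3.0],
    'AdjEM_Away': [-5.0, 15.0],
    'Home_Won': [True, True],
})

# The favorite is the team with the higher AdjEM
home_is_fav = games['AdjEM_Home'] >= games['AdjEM_Away']

toy_vecs = pd.DataFrame({
    'Favored': np.where(home_is_fav, games['Home'], games['Away']),
    'Underdog': np.where(home_is_fav, games['Away'], games['Home']),
    'AdjEM_Fav': np.where(home_is_fav, games['AdjEM_Home'], games['AdjEM_Away']),
    'AdjEM': np.where(home_is_fav, games['AdjEM_Away'], games['AdjEM_Home']),
})
toy_vecs['AdjEM_Diff'] = toy_vecs['AdjEM_Fav'] - toy_vecs['AdjEM']

# Label is 1 when the favorite lost (an upset) and 0 otherwise
toy_vecs['Label'] = (games['Home_Won'] != home_is_fav).astype(int)
```

In the second toy game the home team won despite having the lower AdjEM, so that row gets a label of 1.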

In [61]:
# Load the joined datasets
load_path = './Data/Combined_Data/'
kenpom = pd.read_csv(f'{load_path}Kenpom.csv')
TRank = pd.read_csv(f'{load_path}TRank.csv')
stats = pd.read_csv(f'{load_path}Basic.csv')

In [62]:
# Generate features for Kenpom data
kenpom_vecs = cbb.gen_kenpom_features(kenpom)

# Take a look
print(f'There are {len(kenpom_vecs)} games in the Kenpom dataset.')
print(f'There are {len(cbb.filter_tournament(kenpom_vecs))} tournament games in the Kenpom dataset.')
kenpom_vecs.head(3)

There are 5528 games in the Kenpom dataset.
There are 5498 tournament games in the Kenpom dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,...,OppD Rank_Fav,OppD Rank,OppD Rank_Diff,NCSOS AdjEM_Fav,NCSOS AdjEM,NCSOS AdjEM_Diff,NCSOS AdjEM Rank_Fav,NCSOS AdjEM Rank,NCSOS AdjEM Rank_Diff,Label
0,Kansas,North Carolina Central,2024,,0.6875,0.580645,0.106855,23,255,-232,...,2,284,-282,5.25,0.58,4.67,68,181,-113,0
1,Duke,Dartmouth,2024,,0.75,0.222222,0.527778,8,336,-328,...,64,200,-136,0.22,-0.56,0.78,200,220,-20,0
2,Purdue,Samford,2024,,0.878788,0.852941,0.025847,3,81,-78,...,8,215,-207,10.36,-5.2,15.56,13,323,-310,0


In [63]:
# Generate features for T-Rank data
TRank_vecs = cbb.gen_TRank_features(TRank)

# Take a look
print(f'There are {len(TRank_vecs)} games in the T-Rank dataset.')
print(f'There are {len(cbb.filter_tournament(TRank_vecs))} games in the march T-Rank dataset.')
TRank_vecs.head(3)

There are 5528 games in the T-Rank dataset.
There are 5498 games in the march T-Rank dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rk_Fav,Rk,Rk_Diff,...,WAB_Fav,WAB,WAB_Diff,WAB Rank_Fav,WAB Rank,WAB Rank_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,Kansas,North Carolina Central,2024,,0.6875,0.580645,0.106855,17,265,-248,...,4.1,-9.6,13.7,14,196,-182,18.96,-6.74,25.7,0
1,Duke,Dartmouth,2024,,0.75,0.222222,0.527778,10,332,-322,...,3.2,-15.8,19.0,17,316,-299,24.84,-16.98,41.82,0
2,Purdue,Samford,2024,,0.878788,0.852941,0.025847,3,89,-86,...,10.9,0.1,10.8,1,55,-54,29.07,9.87,19.2,0


In [64]:
# Generate features for basic stats data
stats_vecs = cbb.gen_basic_features(stats)

# Take a look
print(f'There are {len(stats_vecs)} games in the basic stats dataset.')
print(f'There are {len(cbb.filter_tournament(stats_vecs))} games in the march basic stats dataset.')
stats_vecs.head(3)

There are 5501 games in the basic stats dataset.
There are 5471 games in the march basic stats dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Tm._Fav,Tm.,Tm._Diff,...,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,Kansas,North Carolina Central,2024,,0.6875,0.580645,0.106855,75.25,76.612903,-1.362903,...,14.59375,17.806452,-3.212702,16.15625,18.193548,-2.037298,18.96,-6.74,25.7,0
1,Duke,Dartmouth,2024,,0.75,0.222222,0.527778,79.84375,61.851852,17.991898,...,15.78125,13.851852,1.929398,16.09375,15.962963,0.130787,24.84,-16.98,41.82,0
2,Purdue,Samford,2024,,0.878788,0.852941,0.025847,83.393939,85.970588,-2.576649,...,14.363636,19.029412,-4.665775,20.666667,19.0,1.666667,29.07,9.87,19.2,0


Now that the features for each dataset have been generated, we can join them to form one larger set of training data that contains all of their features. Since the basic stats dataset only goes back to 2010, this larger set of data is restricted to games from 2010 onward.

Unfortunately, I ran into an issue: the winning percentage features in the Kenpom and T-Rank datasets sometimes differ slightly. As a temporary fix, I decided to use the Kenpom winning percentage for this larger set of data.
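
To see why the winning percentage columns are kept out of the join keys, consider this minimal sketch on toy data (the frames and values are made up). If a column that differs between sources were used as a key, an inner merge would silently drop the mismatched games. Leaving it out keeps both copies under suffixed names, which makes the disagreement easy to count.

```python
import pandas as pd

# Toy stand-ins for two feature tables; values are made up for illustration
kp = pd.DataFrame({'Favored': ['A', 'C'], 'Underdog': ['B', 'D'],
                   'Year': [2024, 2024], 'Win_Loss_Fav': [0.75, 0.80]})
tr = pd.DataFrame({'Favored': ['A', 'C'], 'Underdog': ['B', 'D'],
                   'Year': [2024, 2024], 'Win_Loss_Fav': [0.75, 0.78]})

# Win_Loss_Fav is not a join key, so both versions survive with suffixes
merged = kp.merge(tr, on=['Favored', 'Underdog', 'Year'], suffixes=('_kp', '_tr'))

# Flag rows where the two sources disagree by more than a small tolerance
mismatch = (merged['Win_Loss_Fav_kp'] - merged['Win_Loss_Fav_tr']).abs() > 1e-6
n_mismatch = int(mismatch.sum())
```

The explicit `suffixes` argument is a choice made for this sketch. With the pandas defaults, the overlapping columns come back as `_x` and `_y`.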

In [65]:
# Columns shared across the datasets, used as join keys
on_cols_kp_tr = ['Favored', 'Underdog', 'Year', 'Tournament', 'Seed_Fav', 'Seed', 'Label', 'AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']
on_cols_stats = on_cols_kp_tr + ['Win_Loss_Fav', 'Win_Loss', 'Win_Loss_Diff']

In [66]:
# Add an id column ('index') to the Kenpom dataset so duplicated rows can be dropped after the merges
all_vecs = kenpom_vecs[kenpom_vecs['Year'] > 2009].reset_index()

# Create a set of training data for years with all features
all_vecs = all_vecs.merge(TRank_vecs[TRank_vecs['Year'] > 2009], on=on_cols_kp_tr)
all_vecs = all_vecs.rename(columns={'Win_Loss_Fav_x': 'Win_Loss_Fav', 'Win_Loss_x': 'Win_Loss', 
                                    'Win_Loss_Diff_x': 'Win_Loss_Diff'})
all_vecs = all_vecs.drop(['Win_Loss_Fav_y', 'Win_Loss_y', 'Win_Loss_Diff_y'], axis=1)
all_vecs = all_vecs.merge(stats_vecs.drop(['ORB', 'ORB_Fav'], axis=1), on=on_cols_stats)
all_vecs = all_vecs.drop_duplicates('index').drop('index', axis=1)

# Take a look
print("There are {} games in the dataset.".format(len(all_vecs)))
print("There are {} games in the march dataset.".format(len(cbb.filter_tournament(all_vecs))))
all_vecs.head()

There are 5501 games in the dataset.
There are 5471 games in the march dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,...,TOV_Diff,TOV_opp_Fav,TOV_opp,TOV_opp_Diff,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff
0,Kansas,North Carolina Central,2024,,0.6875,0.580645,0.106855,23,255,-232,...,-0.40625,11.8125,13.935484,-2.122984,14.59375,17.806452,-3.212702,16.15625,18.193548,-2.037298
1,Duke,Dartmouth,2024,,0.75,0.222222,0.527778,8,336,-328,...,-3.021991,11.4375,9.333333,2.104167,15.78125,13.851852,1.929398,16.09375,15.962963,0.130787
2,Purdue,Samford,2024,,0.878788,0.852941,0.025847,3,81,-78,...,-1.691622,9.757576,16.617647,-6.860071,14.363636,19.029412,-4.665775,20.666667,19.0,1.666667
3,Michigan State,James Madison,2024,,0.575758,0.911765,-0.336007,20,57,-37,...,-0.976827,12.484848,14.441176,-1.956328,16.545455,17.176471,-0.631016,17.060606,17.235294,-0.174688
4,Marquette,Northern Illinois,2024,,0.735294,0.354839,0.380455,13,306,-293,...,-2.768501,14.705882,10.903226,3.802657,15.352941,17.387097,-2.034156,14.647059,18.741935,-4.094877
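
The `index` column added before the merges and dropped afterwards appears to guard against rows being multiplied when a join key matches more than one row on the other side. A minimal sketch of that pattern on toy data (the `key`, `x`, and `y` columns are hypothetical):

```python
import pandas as pd

# reset_index() adds an 'index' column that identifies each original left row
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]}).reset_index()

# The right side has two rows for key 'a', so the merge duplicates that left row
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'y': [10, 11, 20]})
merged = left.merge(right, on='key')

# Keep the first match per original row, then drop the helper column
deduped = merged.drop_duplicates('index').drop('index', axis=1)
```

Note that `drop_duplicates` keeps the first match by default, so which duplicate survives depends on row order in the right-hand frame.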


### Filter Training Set

A problem with our training set is that many of the games we have data for don't provide useful information about how games in the NCAA Tournament tend to play out. Games between teams that aren't even close to tournament caliber are not very predictive of tournament results, so we can filter these games out of our training set.

We previously ran an analysis of how best to filter the training set so that it more closely matches actual tournament games without reducing the number of games too severely. Check out the `covariate_shift.ipynb` notebook for more detailed information.

In [67]:
# First keep all tournament games
t_vecs = cbb.filter_tournament(all_vecs)
r_vecs = cbb.filter_tournament(all_vecs, drop=True)

print("There are {} games in the march dataset.".format(len(t_vecs)))

There are 5471 games in the march dataset.


In [68]:
# Our filtering rule is that each of the features below must be at least in the 
# 2nd percentile of the tournament data
threshold_features = ['Win_Loss', 'AdjEM', 'Barthag']
threshold_quantile = 0.02

In [69]:
thresholds = t_vecs[threshold_features].quantile(threshold_quantile)
thresholds

Win_Loss     0.1250
AdjEM      -22.9000
Barthag      0.0638
Name: 0.02, dtype: float64

In [70]:
# Filter regular season data
thresholds = t_vecs[threshold_features].quantile(threshold_quantile)
is_above_threshold = pd.concat([r_vecs[feat] >= thresh for feat, thresh in thresholds.items()], axis=1)
filtered_r_vecs = r_vecs[is_above_threshold.all(axis=1)]

print("There are {} games in the filtered regular season data".format(len(filtered_r_vecs)))
filtered_r_vecs.head(3)

There are 5357 games in the filtered regular season data


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,...,TOV_Diff,TOV_opp_Fav,TOV_opp,TOV_opp_Diff,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff
0,Kansas,North Carolina Central,2024,,0.6875,0.580645,0.106855,23,255,-232,...,-0.40625,11.8125,13.935484,-2.122984,14.59375,17.806452,-3.212702,16.15625,18.193548,-2.037298
1,Duke,Dartmouth,2024,,0.75,0.222222,0.527778,8,336,-328,...,-3.021991,11.4375,9.333333,2.104167,15.78125,13.851852,1.929398,16.09375,15.962963,0.130787
2,Purdue,Samford,2024,,0.878788,0.852941,0.025847,3,81,-78,...,-1.691622,9.757576,16.617647,-6.860071,14.363636,19.029412,-4.665775,20.666667,19.0,1.666667
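
For intuition, here is a minimal sketch of the same quantile-threshold rule on toy data. The values are made up, only two of the three threshold features are included, and the quantile is raised to 0.25 so the effect is visible on five rows. It compares the frame to the thresholds directly instead of using `pd.concat`, which should give the same mask.

```python
import pandas as pd

# Toy tournament and regular season data; values are made up for illustration
tourney = pd.DataFrame({'Win_Loss': [0.5, 0.6, 0.7, 0.8, 0.9],
                        'AdjEM': [0.0, 5.0, 10.0, 15.0, 20.0]})
regular = pd.DataFrame({'Win_Loss': [0.4, 0.65, 0.95],
                        'AdjEM': [12.0, -3.0, 8.0]})

# One cutoff per feature, taken from the tournament games
cutoffs = tourney.quantile(0.25)

# Comparing a DataFrame to a Series aligns the Series index with the columns,
# so each column is checked against its own cutoff
keep = (regular[cutoffs.index] >= cutoffs).all(axis=1)
filtered = regular[keep]
```

Only the last toy row passes: the first falls below the `Win_Loss` cutoff and the second falls below the `AdjEM` cutoff.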


In [71]:
# Combine all games in training data
filtered_vecs = pd.concat([filtered_r_vecs, t_vecs])
print("There are {} games in the training data".format(len(filtered_vecs)))

There are 10828 games in the training data


### Feature Reduction

Now that we've joined the data from each source into a single set of features for each game, we've got a dataset with over 250 columns. At least for the models we'll be looking at, having so many features, many of which are correlated with each other, can be a problem and lead to a less robust model. To reduce the correlation among the features in our training data, we will keep only the most important features, chosen so that the information they provide overlaps as little as possible.

Luckily, we've already done an analysis of which features should be in the feature set in the `feature_reduction.ipynb` notebook in the analysis directory. For more detailed information on why these features were chosen, check out that notebook.
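
As a rough illustration of what "correlated features" means here, the sketch below flags nearly redundant columns in synthetic data using pairwise correlation. This is not the method used in `feature_reduction.ipynb`, which is not shown in this notebook. The 0.9 cutoff and the synthetic columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'Rank' is built to be almost a mirror image of 'AdjEM'
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'AdjEM': base,
    'Rank': -base + rng.normal(scale=0.05, size=200),
    'AdjT': rng.normal(size=200),  # unrelated to the other two
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair of features is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is redundant if it is highly correlated with an earlier column
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
```

Here `redundant` contains only `'Rank'`, since it carries almost the same information as `'AdjEM'`.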

In [72]:
feature_names = ['Win_Loss', 'AdjEM', 'AdjO', 'AdjD', 'AdjT', 'Luck', 'OppAdjEM',
               'NCSOS AdjEM', 'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P%',
               '2P%D', '3P%D', '3PA', '3PA_opp', 'FT%', 'FT%_opp', 'AST', 'AST_opp',
               'BLK', 'BLK_opp']

# For each feature, include the favored team's column followed by the underdog's
feature_set = [col for n in feature_names for col in (n + '_Fav', n)]

needed_cols = ['Favored', 'Underdog', 'Year', 'Tournament', 'Label']

In [73]:
feature_vecs = filtered_vecs[needed_cols + feature_set]
feature_vecs.head(3)

Unnamed: 0,Favored,Underdog,Year,Tournament,Label,Win_Loss_Fav,Win_Loss,AdjEM_Fav,AdjEM,AdjO_Fav,...,FT%_opp_Fav,FT%_opp,AST_Fav,AST,AST_opp_Fav,AST_opp,BLK_Fav,BLK,BLK_opp_Fav,BLK_opp
0,Kansas,North Carolina Central,2024,,0,0.6875,0.580645,18.96,-6.74,113.2,...,0.701,0.727,18.8125,12.83871,12.375,12.290323,3.875,2.677419,2.75,3.419355
1,Duke,Dartmouth,2024,,0,0.75,0.222222,24.84,-16.98,121.8,...,0.69,0.706,15.40625,12.296296,12.65625,14.111111,3.6875,3.518519,4.375,2.814815
2,Purdue,Samford,2024,,0,0.878788,0.852941,29.07,9.87,125.0,...,0.724,0.688,18.393939,17.529412,14.424242,13.529412,3.787879,3.823529,2.272727,3.764706


In [75]:
# Save final training set
feature_vecs.to_csv('./Data/Training/training.csv', index=False)