# NBA Game Predictor Model
### CMPE 257 Project
Authors: Kaushika Uppu, Miranda Billawala, Yun Ei Hlaing, Iris Cheung

## Imports

In [158]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns

import random
from datetime import datetime, timedelta
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import mean_squared_error
import itertools


## NBA Game Data

First, we load in all of the NBA game data from the CSV file. Exact code for gathering data is in a separate file and use the nba_api file. Only games from the 1985-1986 season and afterward are loaded in as the seasons before that are missing a very significant portion of the game statistics' data. We also want to be able to map from team id to abbreviation and back easily.

In [98]:
all_stats_cleaned = pd.read_csv('all_stats_cleaned.csv')
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WIN,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,0,ATL,Atlanta Hawks,1986-04-12,1,IND,1,240.0,38,88,...,39,59,22,6,3,12.0,21,108,17.0,1985
1,0,ATL,Atlanta Hawks,1986-04-10,1,NJN,1,240.0,44,87,...,27,42,30,15,5,22.0,26,126,9.0,1985
2,0,ATL,Atlanta Hawks,1986-04-08,1,CHI,1,240.0,52,98,...,25,42,33,13,6,10.0,22,131,13.0,1985
3,0,ATL,Atlanta Hawks,1986-04-05,0,CHI,0,240.0,40,76,...,25,38,17,7,7,21.0,28,97,-5.0,1985
4,0,ATL,Atlanta Hawks,1986-04-04,0,WAS,0,265.0,54,100,...,28,45,24,6,7,14.0,37,129,-6.0,1985


In [99]:
all_stats_cleaned.shape

(89542, 28)

In [100]:
# convert date to datetime object
all_stats_cleaned['GAME_DATE'] = pd.to_datetime(all_stats_cleaned['GAME_DATE'])

In [101]:
team_id_to_abb = {} # dictionary to convert from team_id to team_abbreviation
team_abb_to_id = {} # dictionary to convert from team_abbreviation to team_id

teams = (all_stats_cleaned[['TEAM_ID', 'TEAM_ABBREVIATION']]).drop_duplicates()

for index, row in teams.iterrows() :
    if row['TEAM_ID'] not in team_id_to_abb.keys():
        team_id_to_abb[row['TEAM_ID']] = []
    team_id_to_abb[row['TEAM_ID']].append(row['TEAM_ABBREVIATION'])
    team_abb_to_id[row['TEAM_ABBREVIATION']] = row['TEAM_ID']

### Merging Home and Away Team Stats Into One Row

Currently, each game is represented by two separate rows in the dataset - one for the home team and one for the away team. To make the data more clear, we decided to combine the two rows into a single row with statistics for both teams. Since predicting with our model will pass one set order of team one and team two (i.e. Lakers as Team One, Warriors as Team Two), we want to make sure that the model realizes games with the Lakers as Team Two and Warriors as Team One are more similar than may appear by the data. To do this, we will duplicate the rows and flip the teams. Then, we will have each game listed twice with the teams flipped. 

Firstly, we split the dataset into two : home games and away games. Then, we performed a join on these two datasets, matching each home team with its corresponding opponent based on the same dates. 

In [102]:
home = all_stats_cleaned[all_stats_cleaned.HOME == 1]
away = all_stats_cleaned[all_stats_cleaned.HOME == 0]

In [103]:
combined_stats_home = pd.merge(home, away, 
                          left_on=['GAME_DATE', 'OPPONENT'], 
                          right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('_ONE', '_TWO'))
combined_stats_away = pd.merge(away, home, 
                          left_on=['GAME_DATE', 'OPPONENT'], 
                          right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('_ONE', '_TWO'))

combined_stats = pd.concat([combined_stats_home, combined_stats_away], ignore_index = True)

In [104]:
combined_stats.head(5)

Unnamed: 0,TEAM_ID_ONE,TEAM_ABBREVIATION_ONE,TEAM_NAME_ONE,GAME_DATE,HOME_ONE,OPPONENT_ONE,WIN_ONE,MIN_ONE,FGM_ONE,FGA_ONE,...,DREB_TWO,REB_TWO,AST_TWO,STL_TWO,BLK_TWO,TOV_TWO,PF_TWO,PTS_TWO,PLUS_MINUS_TWO,SEASON_YEAR_TWO
0,0,ATL,Atlanta Hawks,1986-04-12,1,IND,1,240.0,38,88,...,36,43,22,7,3,13.0,33,91,-17.0,1985
1,0,ATL,Atlanta Hawks,1986-04-10,1,NJN,1,240.0,44,87,...,30,44,25,10,1,24.0,30,117,-9.0,1985
2,0,ATL,Atlanta Hawks,1986-04-08,1,CHI,1,240.0,52,98,...,35,44,29,5,1,17.0,26,118,-13.0,1985
3,0,ATL,Atlanta Hawks,1986-04-01,1,WAS,1,240.0,41,90,...,30,46,19,10,6,17.0,22,91,-16.0,1985
4,0,ATL,Atlanta Hawks,1986-03-29,1,CLE,0,240.0,36,84,...,25,33,31,8,5,16.0,32,123,18.0,1985


Comparing the number of rows in the combined dataset to the original shows that the dataset row have been reduced by half, as each game is now represented by a single row instead of two.

After merging the rows, there are some columns that appear twice or are now unneccessary to the dataset. These columns include `MIN_ONE`/`MIN_TWO` (length of game in minutes), `SEASON_YEAR_ONE`/`SEASON_YEAR_TWO`, `OPPONENT_ONE` and `OPPONENT_TWO`.

We first checked if the `MIN_ONE` and `MIN_TWO` for each row has the same values. As seen below, there are 24 games where the minutes differed slightly. However, since the difference did not seem to be significant, we decided to retain one column and rename it `MIN`.

In [105]:
(combined_stats['MIN_ONE'] != combined_stats['MIN_TWO']).sum()

np.int64(48)

In [106]:
combined_stats[combined_stats['MIN_ONE'] != combined_stats['MIN_TWO']][['MIN_ONE','MIN_TWO']]

Unnamed: 0,MIN_ONE,MIN_TWO
455,48.0,47.448
3612,48.0,47.637333
6039,48.0,47.906667
7608,48.0,47.517333
12325,48.0,47.357667
19857,48.0,47.599333
24354,48.0,47.813333
25946,48.0,47.456
30645,53.0,52.906667
32173,47.881,48.0


In [107]:
combined_stats = combined_stats.drop(columns = ['MIN_TWO', 'OPPONENT_ONE', 'OPPONENT_TWO', 'SEASON_YEAR_ONE'])
combined_stats.rename(columns={'MIN_ONE': 'MIN', 'SEASON_YEAR_TWO': 'SEASON_YEAR'}, inplace=True)

## Feature Engineering

Features to add : 
1) Win streak
2) Win percentage
3) ELO Scores
4) EFG%
5) TS%

### Win Streak and Win Percentage

In [12]:
def add_win_streak_and_percentage(df, combined=False):
    """
    Input: Dataframe with team one and team two data for each game and boolean to check if dataframe is combined with both team data
    Output: New dataframe with added win streak and win percentage for both teams
    """
    team_date_stats = all_stats_cleaned[['TEAM_ID', 'GAME_DATE', 'WIN']].sort_values(by=['TEAM_ID', 'GAME_DATE']).reset_index(drop=True)
    team_date_stats['WIN_STREAK'] = 0
    team_date_stats['WIN_PERCENTAGE'] = 0.0
    
    for team_id, group in team_date_stats.groupby('TEAM_ID'):
        streak = 0
        wins = 0
        total_games = 0
        indices = group.index
    
        for i in range(len(indices)):
            idx = indices[i]
    
            # WIN STREAK
            team_date_stats.at[idx, 'WIN_STREAK'] = streak
    
            if team_date_stats.at[idx, 'WIN'] == 1:
                streak += 1
            else: 
                streak = 0
    
            # WIN PERCENTAGE
            if total_games == 0:
                team_date_stats.at[idx, 'WIN_PERCENTAGE'] = 0.0
            else: 
                team_date_stats.at[idx, 'WIN_PERCENTAGE'] = wins / total_games
    
            total_games += 1
            if team_date_stats.at[idx, 'WIN'] == 1:
                wins += 1

    if combined:
    # Join Win streak and Win percentage of team one and team two into the merged table
        team_date_stats.drop('WIN', axis=1, inplace=True)
        df = pd.merge(df, team_date_stats,
                              how='left', 
                              left_on = ['TEAM_ID_ONE', 'GAME_DATE'],
                              right_on=['TEAM_ID', 'GAME_DATE'])
        df.drop('TEAM_ID', axis=1, inplace=True)
        df.rename(columns = {'WIN_STREAK': 'WIN_STREAK_ONE',
                                     'WIN_PERCENTAGE': 'WIN_PERCENTAGE_ONE'}, inplace=True)
        df = pd.merge(df, team_date_stats,
                              how='left', 
                              left_on = ['TEAM_ID_TWO', 'GAME_DATE'],
                              right_on=['TEAM_ID', 'GAME_DATE'])
        df.drop('TEAM_ID', axis=1, inplace=True)
        df.rename(columns = {'WIN_STREAK': 'WIN_STREAK_TWO',
                                     'WIN_PERCENTAGE': 'WIN_PERCENTAGE_TWO'}, inplace=True)
    else:
        # Join Win streak and Win percentage into the dataframe
        team_date_stats.drop('WIN', axis=1, inplace=True)
        df = pd.merge(df, team_date_stats,
                              how='left', 
                              on = ['TEAM_ID', 'GAME_DATE'])
    
    return df

### ELO Score Before Current Game

In [13]:
def merge_opponent_points(df):
    df_opp = df[['TEAM_ABBREVIATION', 'GAME_DATE', 'PTS', 'TEAM_ID']].copy()
    merged_df = pd.merge(df, df_opp, 
                         how='left',
                          left_on=['GAME_DATE', 'OPPONENT'],
                            right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('', '_OPPONENT'))
    merged_df.drop(columns=['TEAM_ABBREVIATION_OPPONENT'], inplace=True)
    return merged_df

In [14]:
def add_elo_score(df, combined=False):
    """
    Input: Dataframe with team one and team two data for each game and boolean to check if dataframe is combined with both team data
    Output: New dataframe with elo scores for both teams added 
    """
    if combined:
        df['GAME_ID'] = df.apply(
        lambda row: '_'.join(sorted([str(row['TEAM_ID_ONE']), str(row['TEAM_ID_TWO'])]) + [str(row['GAME_DATE'])]),
        axis=1
    )
        df['ELO_ONE'] = np.nan
        df['ELO_TWO'] = np.nan
    else:
        df = merge_opponent_points(df)
        df['ELO'] = np.nan
        df['GAME_ID'] = df.apply(
        lambda row: '_'.join(sorted([str(row['TEAM_ID']), str(row['TEAM_ID_OPPONENT'])]) + [str(row['GAME_DATE'])]),
        axis=1
    )
    
    team_elos = {} # to use for checking if a team has appeared and track team last elo scores
    team_last_season = {} # to track last seasons of teams
    processed_games = set() # to track game id - handle duplicate game columns
    elo_map = {} # for faster computation
    df = df.sort_values(by='GAME_DATE').reset_index(drop=True)
    
    for i,row in df.iterrows():
        season = row['SEASON_YEAR']
        game_id = row['GAME_ID']

        if game_id in processed_games:
            continue
        processed_games.add(game_id)

        if combined:
            team_one, team_two = row['TEAM_ID_ONE'], row['TEAM_ID_TWO']
            points_one, points_two = row['PTS_ONE'], row['PTS_TWO']
            home_one = row['HOME_ONE']
        
            # Season adjustment formula for ELO : New Season ELO = 0.75 * Last Season ELO + 0.25 * Mean ELO, Mean ELO = 1505
            for team in [team_one, team_two]:
                # check if team has not appeared yet in the dataset
                if team not in team_elos:
                    team_elos[team] = 1505 
                    team_last_season[team] = season
                # check for new season, if yes, apply season adjustment
                elif team_last_season[team] != season:
                    team_elos[team] = 0.75 * team_elos[team] + 0.25 * 1505
                    team_last_season[team] = season
        
            # elo scores before game
            elo_one = team_elos[team_one]
            elo_two = team_elos[team_two]
        
            # Add 100 score to home team
            if home_one == 1:
                elo_one_after_home_adv = elo_one + 100 
                elo_two_after_home_adv = elo_two
            else:
                elo_one_after_home_adv = elo_one 
                elo_two_after_home_adv = elo_two + 100
        
            # Expected score of game formula : exp = 1/ (1+10^((ELO two after home advantage - ELO one after home advantage) / 400))
            exp = 1/ (1+10**((elo_two_after_home_adv - elo_one_after_home_adv) / 400))
        
            actual = 1 if points_one > points_two else 0
            margin_of_victory = abs(points_one - points_two)
        
            # Margin of Victory Multiplier formula : ((MOV + 3) ** 0.8) / (7.5 + 0.006 * (Elo team one - Elo team two))
            MOVM = ((margin_of_victory + 3) ** 0.8) / (7.5 + 0.006 * (elo_one - elo_two))
        
            # change in ELO: K * MOVM * (actual - exp), k -> attenuation factor -> higher means elo score adjusts quickly to changes in strength of team
            K = 20 # 20 is optimal for nba 
            change = K * MOVM * (actual - exp)
    
            # Update data for ELO ratings
            team_elos[team_one] += change
            team_elos[team_two] -= change
        
            # store elo score for game id at the table
            # df.at[i, 'ELO_ONE'] = elo_one
            # df.at[i, 'ELO_TWO'] = elo_two
            # df.loc[(df['GAME_ID'] == game_id) & df['TEAM_ID_ONE'] == team_two, 'ELO_ONE'] = elo_two
            # df.loc[(df['GAME_ID'] == game_id) & df['TEAM_ID_TWO'] == team_one, 'ELO_TWO'] = elo_one

            # store elo scores in dictionary
            elo_map[(game_id, team_one, team_two)] = elo_one
            elo_map[(game_id, team_two, team_one)] = elo_two
     
        else:
            team, team_opp = row['TEAM_ID'], row['TEAM_ID_OPPONENT']
            points_team, points_opp = row['PTS'], row['PTS_OPPONENT']
            home = row['HOME']
        
            # Season adjustment formula for ELO : New Season ELO = 0.75 * Last Season ELO + 0.25 * Mean ELO, Mean ELO = 1505
            for t in [team, team_opp]:
                # check if team has not appeared yet in the dataset
                if t not in team_elos:
                    team_elos[t] = 1505 
                    team_last_season[t] = season
                # check for new season, if yes, apply season adjustment
                elif team_last_season[t] != season:
                    team_elos[t] = 0.75 * team_elos[t] + 0.25 * 1505
                    team_last_season[t] = season
        
            # elo scores before game
            elo_team = team_elos[team]
            elo_opponent = team_elos[team_opp]
        
            # Add 100 score to home team
            if home == 1:
                elo_team_home = elo_team + 100 
                elo_opp_home = elo_opponent
            else:
                elo_team_home = elo_team 
                elo_opp_home = elo_opponent + 100
        
            # Expected score of game formula : exp = 1/ (1+10^((ELO two after home advantage - ELO one after home advantage) / 400))
            exp = 1/ (1+10**((elo_opp_home - elo_team_home) / 400))
        
            actual = 1 if points_team > points_opp else 0
            margin_of_victory = abs(points_team - points_opp)
        
            # Margin of Victory Multiplier formula : ((MOV + 3) ** 0.8) / (7.5 + 0.006 * (Elo team one - Elo team two))
            MOVM = ((margin_of_victory + 3) ** 0.8) / (7.5 + 0.006 * (elo_team - elo_opponent))
        
            # change in ELO: K * MOVM * (actual - exp), k -> attenuation factor -> higher means elo score adjusts quickly to changes in strength of team
            K = 20 # 20 is optimal for nba 
            change = K * MOVM * (actual - exp)

            # Update data for ELO ratings
            team_elos[team] += change
            team_elos[team_opp] -= change
        
            # store elo score for both row of game at the table
            # df.at[i, 'ELO'] = elo_team
            # df.loc[(df['GAME_ID'] == game_id) & df['TEAM_ID'] == team_opp, 'ELO'] = elo_opponent
            elo_map[(game_id, team)] = elo_team
            elo_map[(game_id, team_opp)] = elo_opponent

    # add data from elo dictionary into dataframe
    if not combined:
        df['ELO'] = df.apply(lambda x: elo_map.get((x['GAME_ID'], x['TEAM_ID']), np.nan), axis=1)
        df.drop(columns=['PTS_OPPONENT', 'TEAM_ID_OPPONENT'], axis=1, inplace=True)
    else: 
        df['ELO_ONE'] = df.apply(lambda x: elo_map.get((x['GAME_ID'], x['TEAM_ID_ONE'], x['TEAM_ID_TWO']), np.nan), axis=1)
        df['ELO_TWO'] = df.apply(lambda x: elo_map.get((x['GAME_ID'], x['TEAM_ID_TWO'], x['TEAM_ID_ONE']), np.nan), axis=1)
    df.drop(columns=['GAME_ID'], axis=1, inplace=True)
    
            
    return df                                   

In [15]:
# test for single team data
test_1 = add_win_streak_and_percentage(all_stats_cleaned)
test_1 = add_elo_score(test_1)
print(test_1.columns)
test_1[test_1['TEAM_ID'] == 14].head(5)

Index(['TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_DATE', 'HOME',
       'OPPONENT', 'WIN', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A',
       'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'SEASON_YEAR', 'WIN_STREAK',
       'WIN_PERCENTAGE', 'ELO'],
      dtype='object')


Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WIN,MIN,FGM,FGA,...,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR,WIN_STREAK,WIN_PERCENTAGE,ELO
0,14,NJN,New Jersey Nets,1985-10-25,1,BOS,1,265.0,44,101,...,14,3,24.0,30,113,4.0,1985,0,0.0,1505.0
21,14,NJN,New Jersey Nets,1985-10-26,0,IND,0,240.0,31,79,...,10,0,19.0,28,92,-27.0,1985,1,1.0,1509.552723
54,14,NJN,New Jersey Nets,1985-10-29,0,DET,0,240.0,40,98,...,6,4,20.0,39,107,-17.0,1985,0,0.5,1494.776564
58,14,NJN,New Jersey Nets,1985-10-30,1,IND,1,315.0,55,121,...,14,6,20.0,36,147,9.0,1985,0,0.333333,1484.682913
82,14,NJN,New Jersey Nets,1985-11-01,1,PHL,1,240.0,45,86,...,15,6,21.0,20,106,4.0,1985,1,0.5,1492.848408


In [16]:
# test for combined team and opponent data
test_2 = add_win_streak_and_percentage(combined_stats, True)
test_2 = add_elo_score(test_2, True)
test_2[test_2['TEAM_ID_ONE'] == 14].head(5)

Unnamed: 0,TEAM_ID_ONE,TEAM_ABBREVIATION_ONE,TEAM_NAME_ONE,GAME_DATE,HOME_ONE,WIN_ONE,MIN,FGM_ONE,FGA_ONE,FG_PCT_ONE,...,PF_TWO,PTS_TWO,PLUS_MINUS_TWO,SEASON_YEAR,WIN_STREAK_ONE,WIN_PERCENTAGE_ONE,WIN_STREAK_TWO,WIN_PERCENTAGE_TWO,ELO_ONE,ELO_TWO
11,14,NJN,New Jersey Nets,1985-10-25,1,1,265.0,44,101,0.436,...,27,109,-4.0,1985,0,0.0,0,0.0,1505.0,1505.0
16,14,NJN,New Jersey Nets,1985-10-26,0,0,240.0,31,79,0.392,...,27,119,27.0,1985,1,1.0,0,0.0,1509.552723,1505.0
42,14,NJN,New Jersey Nets,1985-10-29,0,0,240.0,40,98,0.408,...,34,124,17.0,1985,0,0.5,0,0.5,1494.776564,1504.443525
59,14,NJN,New Jersey Nets,1985-10-30,1,1,315.0,55,121,0.455,...,35,138,-9.0,1985,0,0.333333,1,1.0,1484.682913,1519.776159
80,14,NJN,New Jersey Nets,1985-11-01,1,1,240.0,45,86,0.523,...,20,102,-4.0,1985,1,0.5,1,0.666667,1492.848408,1520.826747


### Effective Field Goal Percentage and True Shooting Percentage

In [17]:
def add_shooting_percentages(df, combined=False):
    if combined: 
        df['EFG%_ONE'] = (df['FGM_ONE'] + 1.5 * df['FG3M_ONE']) / df['FGA_ONE']
        df['EFG%_TWO'] = (df['FGM_TWO'] + 1.5 * df['FG3M_TWO']) / df['FGA_TWO']
        df['TS%_ONE'] = df['PTS_ONE'] / (2 * (df['FGA_ONE'] + 0.44 * df['FTA_ONE']))
        df['TS%_TWO'] = df['PTS_TWO'] / (2 * (df['FGA_TWO'] + 0.44 * df['FTA_TWO']))
    else:
        df['EFG%'] = (df['FGM'] + 1.5 * df['FG3M']) / df['FGA']
        df['TS%'] = df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA']))
    return df    

### Point Differential

In [18]:
def add_point_differential(df, window = 5, combined=False):
        #  add opponent points to all_stats_cleaned table
    #for team_id in all_stats_cleaned['TEAM_ID'].unique() :
    #    team_data = all_stats_cleaned[all_stats_cleaned['TEAM_ID'] == team_id].sort_values(by='GAME_DATE')
    #    for col in cols :
    #        shift = team_data[col].shift(1)
    #        team_data[col] = shift.rolling(window = n).mean()
    #    if result is None :
    #        result = team_data
    #    else :
    #        result = pd.concat([result, team_data])
    
    
    if combined:
        df['PTS_DIFF_ONE'] = df['PTS_ONE'] - df['PTS_TWO']
        df['PTS_DIFF_TWO'] = df['PTS_TWO'] - df['PTS_ONE']
    else:
        df = merge_opponent_points(df)
        df['PTS_DIFF'] = df['PTS'] - df['PTS_OPPONENT']
        df.drop(columns=['PTS_OPPONENT', 'TEAM_ID_OPPONENT'], axis=1, inplace=True)
    return df

### Win for Last Matchup Game

In [77]:
def add_win_last_game(df, combined=False):
    if combined:
        sorted_df = df.sort_values(by=['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE'])
        sorted_df['WIN_LAST_ONE'] = sorted_df.groupby(['TEAM_ID_ONE', 'TEAM_ID_TWO'])['WIN_ONE'].shift(1)
        sorted_df['WIN_LAST_TWO'] = sorted_df.groupby(['TEAM_ID_ONE', 'TEAM_ID_TWO'])['WIN_TWO'].shift(1)
        df = df.merge(sorted_df[['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE', 'WIN_LAST_ONE', 'WIN_LAST_TWO']],
                      on=['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE'],
                      how = 'left')
    else:
        sorted_df = df.sort_values(by=['TEAM_ID', 'OPPONENT', 'GAME_DATE'])
        sorted_df['WIN_LAST'] = sorted_df.groupby(['TEAM_ID', 'OPPONENT'])['WIN'].shift(1)
        df = df.merge(sorted_df[['TEAM_ID', 'OPPONENT', 'GAME_DATE', 'WIN_LAST']],
                      on=['TEAM_ID', 'OPPONENT', 'GAME_DATE'],
                      how = 'left')
    
    return df

In [45]:
# test
test_4 = add_win_last_game(all_stats_cleaned)
print(test_4.columns)
test_4[(test_4['TEAM_ID'] == 14) & (test_4['OPPONENT'] == 'BOS')].sort_values(by='GAME_DATE').head(5)

Index(['TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_DATE', 'HOME',
       'OPPONENT', 'WIN', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A',
       'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'SEASON_YEAR', 'WIN_LAST'],
      dtype='object')


Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WIN,MIN,FGM,FGA,...,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR,WIN_LAST
41935,14,NJN,New Jersey Nets,1985-10-25,1,BOS,1,265.0,44,101,...,52,17,14,3,24.0,30,113,4.0,1985,
41916,14,NJN,New Jersey Nets,1985-12-04,1,BOS,0,240.0,46,91,...,38,23,10,1,21.0,22,111,-19.0,1985,1.0
41902,14,NJN,New Jersey Nets,1986-01-03,0,BOS,0,240.0,50,96,...,43,34,9,5,20.0,19,117,-12.0,1985,0.0
41860,14,NJN,New Jersey Nets,1986-03-30,0,BOS,0,240.0,49,100,...,49,32,10,2,13.0,22,117,-5.0,1985,0.0
41856,14,NJN,New Jersey Nets,1986-04-09,1,BOS,1,240.0,41,95,...,57,20,9,1,10.0,22,108,10.0,1985,0.0


## Building Training Set

### Train-Test Split
Since our data follows a time-series format, we are implementing a different method of testing as follows:
1. Save the last 4 years of games as a test set.
2. Choose a few days from each season, except the last four, for our validation set.
3. Run a GridSearchCV for each day of validation games to find best parameters.
4. Test the model with best parameters on the test set.

We need to retrain the model for each different day we test because we input into the model the number of days since a game occurred as a way to convert the timestamp into a numerical variable that is understandable to the model. 

For validating and choosing proper parameters, we will test on three days from each season: beginning, after the trade-deadline (middle), and near playoffs. This way, we can see how the model handles different times of the season. 

In [21]:
def get_val_set (first_season, last_season, n = 1) :
    dates = []
    for season in range(first_season, last_season) :
        season_data = all_stats_cleaned[all_stats_cleaned['SEASON_YEAR'] == season]
        start_date = season_data['GAME_DATE'].min()
        end_date = season_data['GAME_DATE'].max()

        # day around the beginning of the season
        beg = season_data[season_data['GAME_DATE'].between(start_date, start_date + timedelta(weeks = 4))]

        # day around trade deadline (after about 2/3 of the season)
        delta = round((2/3)*(end_date-start_date).days)
        approx_deadline = start_date + timedelta(days = delta)
        mid = season_data[season_data['GAME_DATE'].between(approx_deadline, approx_deadline + timedelta(weeks = 4))]
        
        # day around the end of the season
        end = season_data[season_data['GAME_DATE'].between(end_date - timedelta(weeks = 4), end_date)]

        dates.extend(list(pd.concat([beg.sample(n)['GAME_DATE'], mid.sample(n)['GAME_DATE'], end.sample(n)['GAME_DATE']])))

    return dates

In [22]:
first_season = all_stats_cleaned['SEASON_YEAR'].min() + 1
last_season = all_stats_cleaned['SEASON_YEAR'].max() - 4
val_set = get_val_set(first_season, last_season)

We attempt two different methods for predicting game statistics. As a baseline, we use a regular rolling window. Then, we implement a model which predicts a team's statistics. We use both of these values to test an outcome predictor model after.

In [23]:
# added shooting percentage
all_stats_cleaned = add_shooting_percentages(all_stats_cleaned)
# added win streak and win percentage
all_stats_cleaned = add_win_streak_and_percentage(all_stats_cleaned)
# added ELO score
all_stats_cleaned = add_elo_score(all_stats_cleaned)
# added point differential
# all_stats_cleaned = add_point_differential(all_stats_cleaned)
# added win for last game
all_stats_cleaned = add_win_last_game(all_stats_cleaned)

In [24]:
def rolling_window(n, cols) :
    pred = None
    for team_id in all_stats_cleaned['TEAM_ID'].unique() :
        team_data = all_stats_cleaned[all_stats_cleaned['TEAM_ID'] == team_id].sort_values(by='GAME_DATE')
        for col in cols :
            shift = team_data[col].shift(1)
            team_data[col] = shift.rolling(window = n).mean()
        if pred is None :
            pred = team_data
        else :
            pred = pd.concat([pred, team_data])
    pred = pred.dropna(axis = 0)

    home = pred[pred['HOME'] == 1]
    away = pred[pred['HOME'] == 0]

    combined_pred_stats_home = pd.merge(home, away, 
                          left_on=['GAME_DATE', 'OPPONENT'], 
                          right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('_ONE', '_TWO'))
    combined_pred_stats_away = pd.merge(away, home, 
                          left_on=['GAME_DATE', 'OPPONENT'], 
                          right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('_ONE', '_TWO'))

    combined_pred_stats = pd.concat([combined_pred_stats_home, combined_pred_stats_away], ignore_index = True)
    combined_pred_stats.rename(columns={'MIN_ONE': 'MIN', 'SEASON_YEAR_TWO': 'SEASON_YEAR'}, inplace=True)
    combined_pred_stats = combined_pred_stats.drop(columns = ['MIN_TWO', 'OPPONENT_ONE', 'OPPONENT_TWO', 'SEASON_YEAR_ONE', 
                                                              'TEAM_ABBREVIATION_ONE', 'TEAM_NAME_ONE', 'MIN', 'FGM_ONE', 
                                                              'FGA_ONE', 'FG3M_ONE', 'FG3A_ONE', 'FTM_ONE', 'FTA_ONE', 'PTS_ONE', 
                                                              'PLUS_MINUS_ONE', 'TEAM_ABBREVIATION_TWO', 'TEAM_NAME_TWO', 'HOME_TWO',
                                                              'WIN_TWO', 'FGM_TWO', 'FGA_TWO', 'FG3M_TWO', 'FG3A_TWO', 'FTM_TWO', 
                                                              'FTA_TWO', 'PTS_TWO', 'PLUS_MINUS_TWO'])

    return combined_pred_stats

### Rolling Window Statistics (Baseline)

In [25]:
cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'EFG%', 'TS%']
combined_pred_stats = rolling_window(5, cols)
combined_pred_stats.head()

Unnamed: 0,TEAM_ID_ONE,GAME_DATE,HOME_ONE,WIN_ONE,FG_PCT_ONE,FG3_PCT_ONE,FT_PCT_ONE,OREB_ONE,DREB_ONE,REB_ONE,...,BLK_TWO,TOV_TWO,PF_TWO,SEASON_YEAR,EFG%_TWO,TS%_TWO,WIN_STREAK_TWO,WIN_PERCENTAGE_TWO,ELO_TWO,WIN_LAST_TWO
0,14,1985-11-09,1,1,0.477,0.4,0.7348,14.6,34.2,48.8,...,5.0,20.6,29.4,1985,0.512,0.540427,4,0.75,1555.712573,1.0
1,14,1985-11-27,1,0,0.5222,0.4,0.7618,13.6,26.8,40.4,...,5.0,17.2,20.8,1985,0.496664,0.539412,0,0.428571,1485.199131,0.0
2,14,1985-12-04,1,0,0.5022,0.3,0.7216,12.4,29.6,42.0,...,7.6,17.2,22.0,1985,0.509258,0.56303,8,0.888889,1616.849213,0.0
3,14,1985-12-07,1,1,0.4696,0.3,0.7154,13.2,29.6,42.8,...,5.8,18.2,22.8,1985,0.492694,0.542552,1,0.565217,1522.94658,0.0
4,14,1985-12-10,1,1,0.4842,0.2,0.7518,12.8,31.2,44.0,...,3.2,19.8,28.2,1985,0.497839,0.54526,0,0.333333,1428.104711,0.0


### Predicting Using ML Model

In [26]:
# get actual stats
combined_stats_training = add_shooting_percentages(combined_stats, combined = True)
combined_stats_training = combined_stats[['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE', 'FG_PCT_ONE',
                                          'FG3_PCT_ONE','FT_PCT_ONE', 'OREB_ONE', 'DREB_ONE', 'REB_ONE',
                                          'AST_ONE', 'STL_ONE', 'BLK_ONE', 'TOV_ONE', 'PF_ONE', 'EFG%_ONE', 'TS%_ONE']]

In [27]:
# get rolling window stats
cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'EFG%', 'TS%']
rolling_stats_training = rolling_window(10, cols)

In [28]:
# combine 
model_training_set = pd.merge(rolling_stats_training, combined_stats_training, 
                          left_on=['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE'], 
                          right_on=['TEAM_ID_ONE', 'TEAM_ID_TWO', 'GAME_DATE'],
                          suffixes=('_PRED', '_ACT'))

In [172]:
act_cols = ['FG_PCT_ONE_ACT', 'FG3_PCT_ONE_ACT', 'FT_PCT_ONE_ACT', 'OREB_ONE_ACT', 'DREB_ONE_ACT', 
                'REB_ONE_ACT','AST_ONE_ACT', 'STL_ONE_ACT', 'BLK_ONE_ACT', 'TOV_ONE_ACT', 
                'PF_ONE_ACT', 'EFG%_ONE_ACT', 'TS%_ONE_ACT'] 

def train_model(df, team_id, game_date, model_params = None):
    """
    Trains a model to predict team stats for a given game using past rolling averages of both teams.
    Features: past performance of TEAM_ONE and TEAM_TWO.
    Targets: actual stats of TEAM_ONE in the current game.
    """
    df = df[(df['GAME_DATE'] < game_date) & (df['TEAM_ID_ONE'] == team_id)]
    X = df.drop(columns = act_cols+['GAME_DATE'])

    # fitting a XGBoost model for each stat
    models = {}
    for col in act_cols:
        y = df[col]
        if model_params is None :
            model = XGBRegressor(n_estimators = 100, random_state = 33)
        else :
            model = XGBRegressor(**model_params[col], random_state = 33)
        model.fit(X, y)
        models[col] = model

    return models

def predict_game_stats(df, team_id, game_date, model_params = None) :
    """
    Predicts the statistics for given game.
    """
    df = pd.get_dummies(df, columns=['TEAM_ID_TWO'], drop_first=True)
    if model_params is None :
        models = train_model(df, team_id, game_date)
    else :
        models = train_model(df, team_id, game_date, model_params)

    pred = df[(df['GAME_DATE'] == game_date) & (df['TEAM_ID_ONE'] == team_id)].drop(columns = act_cols+['GAME_DATE'])
    
    prediction = {}
    for stat, model in models.items():
        prediction[stat] = model.predict(pred)[0]
   
    return prediction

def evaluate_stats_model(df, test_set, model_params = None):
    """
    Evaluates predicting stats model by testing on last `test_seasons` seasons using RMSE.
    """
    predictions = []
    actuals = []

    for day in test_set :
        print("Predicting...", day)
        games_on_day = df[df['GAME_DATE'] == day]
        for index, row in games_on_day.iterrows() :
            if model_params is None :
                pred = predict_game_stats(df, row['TEAM_ID_ONE'], day)
            else :
                pred = predict_game_stats(df, row['TEAM_ID_ONE'], day, model_params)
            pred = [pred[col] for col in act_cols]
            act = [row[col] for col in act_cols]
            predictions.append(pred)
            actuals.append(act)

    # evaluating model's predictions
    y_true = np.array(actuals)
    y_pred = np.array(predictions)
    total_rmse = np.sqrt(mean_squared_error(y_true.flatten(), y_pred.flatten()))
    return total_rmse

In [173]:
rmse = evaluate_stats_model(model_training_set, val_set)

Predicting... 1986-11-01 00:00:00


KeyboardInterrupt: 

In [90]:
rmse # np.float64(3.7292723986015033)

np.float64(3.7292723986015033)

### Hyperparameter Tuning
To perform hyperparameter tuning, we are going to look at only a small subset of the validation set since each game to be predicted requires fitting a number of different models. For computational efficiency, we are going to make the validation subset include only dates from 2018.

In [174]:
param_test_set = [d for d in val_set if d.year == 2018]
param_test_set

[Timestamp('2018-02-24 00:00:00'),
 Timestamp('2018-03-15 00:00:00'),
 Timestamp('2018-11-08 00:00:00')]

In [212]:
def hyperparameter_tuning (df, params, test_set) :
    index = 1
    param_perf = None
    for p in params :
        print(f"Iteration {index} / {len(params)}")
        predictions = None 
        actual = None
        for day in test_set :
            games_on_day = df[df['GAME_DATE'] == day]
            for _, row in games_on_day.iterrows() :
                model_params = {col : p for col in act_cols}
                pred = pd.DataFrame([predict_game_stats(df, row['TEAM_ID_ONE'], day, model_params)])
                act = pd.DataFrame([{col : row[col] for col in act_cols}])
                predictions = pred if predictions is None else pd.concat([predictions, pred], ignore_index = True)
                actual = act if predictions is None else pd.concat([actual, act], ignore_index = True)

        scores = {'params': p}
        for col in act_cols :
            scores[col] = np.sqrt(mean_squared_error(predictions[col], actual[col]))
        scores = pd.DataFrame([scores])
        param_perf = scores if param_perf is None else pd.concat([param_perf, scores], ignore_index = True)
        index += 1

    best_params = {}
    for col in act_cols :
        best_params[col] = param_perf.loc[param_perf[col].idxmin(), 'params']
    return best_params

In [213]:
param_grid = {
    "n_estimators": [50, 100, 150],
    "eta": [0.01, 0.05, 0.1], # learning_rate
    "max_depth": [4, 6, 8], # maximum depth of a tree
    "subsample": [0.5, 0.7, 1], # fraction of observation to be radnomly sampled for each tree
    "colsample_bytree": [0.5, 0.7, 1], # fraction of columns to be random samples for each tree
    }

params = []
# Iterate over all combinations of hyperparameters
for values in itertools.product(*param_grid.values()):
    params.append(dict(zip(param_grid.keys(), values)))

best_params = hyperparameter_tuning(model_training_set, params, param_test_set)

Iteration 1 / 243
Iteration 2 / 243
Iteration 3 / 243
Iteration 4 / 243
Iteration 5 / 243
Iteration 6 / 243
Iteration 7 / 243
Iteration 8 / 243
Iteration 9 / 243
Iteration 10 / 243
Iteration 11 / 243
Iteration 12 / 243
Iteration 13 / 243
Iteration 14 / 243
Iteration 15 / 243
Iteration 16 / 243
Iteration 17 / 243
Iteration 18 / 243
Iteration 19 / 243
Iteration 20 / 243
Iteration 21 / 243
Iteration 22 / 243
Iteration 23 / 243
Iteration 24 / 243
Iteration 25 / 243
Iteration 26 / 243
Iteration 27 / 243
Iteration 28 / 243
Iteration 29 / 243
Iteration 30 / 243
Iteration 31 / 243
Iteration 32 / 243
Iteration 33 / 243
Iteration 34 / 243
Iteration 35 / 243
Iteration 36 / 243
Iteration 37 / 243
Iteration 38 / 243
Iteration 39 / 243
Iteration 40 / 243
Iteration 41 / 243
Iteration 42 / 243
Iteration 43 / 243
Iteration 44 / 243
Iteration 45 / 243
Iteration 46 / 243
Iteration 47 / 243
Iteration 48 / 243
Iteration 49 / 243
Iteration 50 / 243
Iteration 51 / 243
Iteration 52 / 243
Iteration 53 / 243
It

In [214]:
for k, v in best_params.items():
    print(k,v)

#FG_PCT_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.7, 'colsample_bytree': 0.5}
#FG3_PCT_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 6, 'subsample': 0.7, 'colsample_bytree': 1}
#FT_PCT_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 4, 'subsample': 0.5, 'colsample_bytree': 0.5}
#OREB_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
#DREB_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 4, 'subsample': 0.5, 'colsample_bytree': 1}
#REB_ONE_ACT {'n_estimators': 150, 'eta': 0.05, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.7}
#AST_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
#STL_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
#BLK_ONE_ACT {'n_estimators': 150, 'eta': 0.1, 'max_depth': 6, 'subsample': 0.7, 'colsample_bytree': 1}
#TOV_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 6, 'subsample': 0.7, 'colsample_bytree': 1}
#PF_ONE_ACT {'n_estimators': 150, 'eta': 0.05, 'max_depth': 4, 'subsample': 0.7, 'colsample_bytree': 1}
#EFG%_ONE_ACT {'n_estimators': 50, 'eta': 0.05, 'max_depth': 8, 'subsample': 0.7, 'colsample_bytree': 1}
#TS%_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 6, 'subsample': 1, 'colsample_bytree': 0.5}

FG_PCT_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.7, 'colsample_bytree': 0.5}
FG3_PCT_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 6, 'subsample': 0.7, 'colsample_bytree': 1}
FT_PCT_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 4, 'subsample': 0.5, 'colsample_bytree': 0.5}
OREB_ONE_ACT {'n_estimators': 50, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
DREB_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 4, 'subsample': 0.5, 'colsample_bytree': 1}
REB_ONE_ACT {'n_estimators': 150, 'eta': 0.05, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.7}
AST_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
STL_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_depth': 8, 'subsample': 0.5, 'colsample_bytree': 0.5}
BLK_ONE_ACT {'n_estimators': 150, 'eta': 0.1, 'max_depth': 6, 'subsample': 0.7, 'colsample_bytree': 1}
TOV_ONE_ACT {'n_estimators': 100, 'eta': 0.1, 'max_

In [215]:
rmse_tuned = evaluate_stats_model(model_training_set, val_set, best_params)

Predicting... 1986-11-01 00:00:00
Predicting... 1987-03-07 00:00:00
Predicting... 1987-04-17 00:00:00
Predicting... 1987-11-25 00:00:00
Predicting... 1988-03-11 00:00:00
Predicting... 1988-04-02 00:00:00
Predicting... 1988-11-04 00:00:00
Predicting... 1989-03-23 00:00:00
Predicting... 1989-04-15 00:00:00
Predicting... 1989-11-24 00:00:00
Predicting... 1990-03-16 00:00:00
Predicting... 1990-03-30 00:00:00
Predicting... 1990-11-24 00:00:00
Predicting... 1991-03-09 00:00:00
Predicting... 1991-04-14 00:00:00
Predicting... 1991-11-06 00:00:00
Predicting... 1992-03-11 00:00:00
Predicting... 1992-03-31 00:00:00
Predicting... 1992-11-11 00:00:00
Predicting... 1993-03-02 00:00:00
Predicting... 1993-04-10 00:00:00
Predicting... 1993-11-17 00:00:00
Predicting... 1994-03-24 00:00:00
Predicting... 1994-04-21 00:00:00
Predicting... 1994-11-30 00:00:00
Predicting... 1995-03-24 00:00:00
Predicting... 1995-04-15 00:00:00
Predicting... 1995-11-08 00:00:00
Predicting... 1996-02-25 00:00:00
Predicting... 

In [216]:
rmse_tuned

np.float64(3.509182892225601)

### Predict Training Set for Outcome Model

In [30]:
static_cols = ['TEAM_ID_ONE', 'SEASON_YEAR', 'HOME_ONE', 'WIN_ONE', 'ELO_ONE', 'WIN_STREAK_ONE', 'WIN_PERCENTAGE_ONE', 'WIN_LAST_ONE']
def pred_training_set (df, first_season, last_season) :
    all_predictions = None
    days = all_stats_cleaned[all_stats_cleaned['SEASON_YEAR'].between(first_season, last_season)]['GAME_DATE'].unique()
    rows = []
    current_day = 1
    total_days = len(days)
    for d in days:
        print(f"Predicting Day {current_day} / {total_days} ")
        games_on_day = df[df['GAME_DATE'] == d]
        for _, row in games_on_day.iterrows() :
            pred = predict_game_stats(df, row['TEAM_ID_ONE'], d)
            pred['GAME_DATE'] = d
            pred['OPP'] = row['TEAM_ID_TWO']
            for s in static_cols :
                pred[s] = row[s]
            rows.append(pred)
        current_day += 1
            
    all_predictions = pd.DataFrame(rows)
    all_predictions.rename(columns=lambda col: col.replace('_ONE', ''), inplace=True)
    all_predictions.rename(columns=lambda col: col.replace('_ACT', ''), inplace=True)

    home = all_predictions[all_predictions.HOME == 1]
    away = all_predictions[all_predictions.HOME == 0]

    combined_pred_home = pd.merge(home, away, 
                          left_on=['GAME_DATE', 'OPP'], 
                          right_on=['GAME_DATE', 'TEAM_ID'],
                          suffixes=('_ONE', '_TWO'))
    combined_pred_away = pd.merge(away, home, 
                          left_on=['GAME_DATE', 'OPP'], 
                          right_on=['GAME_DATE', 'TEAM_ID'],
                          suffixes=('_ONE', '_TWO'))

    combined_pred = pd.concat([combined_pred_home, combined_pred_away], ignore_index = True)
    combined_pred = combined_pred.drop(columns = ['OPP_ONE', 'OPP_TWO', 'HOME_TWO', 'WIN_TWO', 'SEASON_YEAR_TWO'])
    combined_pred.rename(columns = {'SEASON_YEAR_ONE': 'SEASON_YEAR'}, inplace=True)

    return combined_pred

## Outcome Model
Since it is computationally expensive to run the second model to predict all the values in the dataset, we will perform feature selection and hyperparameter tuning on the model trained on the basic rolling statistics. Then, we will predict on the test set with both types of models to see which performs better

In [46]:
time_horizon = 5
df_rolling = combined_pred_stats

In [47]:
def get_training_set (df, date, num_seasons) :
    """
    Input: Date of games and number of seasons to include in dataset
    Output: All rows from the last num_seasons and all games in the current season up till the given date
    """
    # determine season of the game
    season = date.year if date.month >= 10 else date.year - 1
    
    # get games for training
    data = df[df['SEASON_YEAR'].between(season - num_seasons, season)].copy()
    data['DAYS_SINCE_GAME'] = [(date-game_day).days for game_day in data['GAME_DATE']]
    data = data[data['DAYS_SINCE_GAME'] > 0]

    data = data.sort_values(by = 'DAYS_SINCE_GAME')

    # split into X and y and only look at relevant columns
    X = data.drop(columns = ['WIN_ONE', 'GAME_DATE'])
    y = data['WIN_ONE']

    return (X,y)

def pred_by_date (df, model, date) :
    """
    Predict the outcome of all games on the given date. 
    """
    n = time_horizon # how many years in the past for training
    
    # determine season of the game
    season = date.year if date.month >= 10 else date.year - 1

    # get data in relevant time frame
    X, y = get_training_set(df, date, n)

    #df = pd.DataFrame(columns=X.columns)

    games_on_day = df[df['GAME_DATE'] == date].copy()
    games_on_day['DAYS_SINCE_GAME'] = np.zeros(len(games_on_day))

    test = games_on_day.drop(columns = ['WIN_ONE', 'GAME_DATE'])

    model.fit(X,y)
    pred = model.predict(test)
    correct = np.sum(pred == games_on_day['WIN_ONE'])
    games = len(pred)
    return correct, games

def test_model(df, model, dates) :
    total_correct = total_games = 0

    for d in dates:
        correct, games = pred_by_date(df, model, d)

        total_correct += correct
        total_games += games
    return total_correct, total_games

### Initial Model With Rolling Statistics

In [51]:
model = XGBClassifier(objective='binary:logistic', base_score = 0.5, random_state = 33)
correct,games = test_model(df_rolling, model, val_set)
correct / games

np.float64(0.6698113207547169)

### Feature Selection

The average feature importance scores is calculated for the three games for each season using XG Boost built-in feature importance 

In [None]:
def pred_by_date_with_importance(model, date):
    n = 5 
    season = date.year if date.month >= 10 else date.year - 1
    X, y = get_training_set(date, n)
    # one hot encoding on the Home feature 
    games_on_day = df[df['GAME_DATE'] == date].copy()
    games_on_day['DAYS_SINCE_GAME'] = np.zeros(len(games_on_day))

    test = games_on_day.drop(columns = ['WIN_ONE', 'GAME_DATE'])

    model.fit(X,y)
    pred = model.predict(test)
    correct = np.sum(pred == games_on_day['WIN_ONE'])
    games = len(pred)
    importance_scores = model.get_booster().get_score(importance_type='gain')
    return correct, games, importance_scores

In [None]:
def test_model_with_importance(model) :
    """
    Outputs the average feature importance scores of game predictions
    """
    total_correct = total_games = 0
    feature_scores = {}
    for t in test:
        correct, games, importance_scores = pred_by_date_with_importance(model, t)
        
        for feature, score in importance_scores.items():
            if feature not in feature_scores:
                feature_scores[feature] = []
            feature_scores[feature].append(score)
            

        total_correct += correct
        total_games += games

    average_importance = {features: sum(scores)/len(scores) for features, scores in feature_scores.items()}  
    sorted_features = sorted(average_importance.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_features

In [None]:
model = XGBClassifier(objective='binary:logistic')
importance_scores = test_model_with_importance(model)
print(importance_scores)

Testing the model with the feature importance scores by iteratively removing the least important features and comparing the accuracy:

In [None]:
def get_training_set_with_features (date, num_seasons, features) :
    """
    Input: Date of games, number of seasons and feature subset to include in dataset
    Output: All rows from the last num_seasons and all games in the current season up till the given date
    """
    season = date.year if date.month >= 10 else date.year - 1
    data = df[df['SEASON_YEAR'].between(season - num_seasons, season)].copy()
    data['DAYS_SINCE_GAME'] = [(date-game_day).days for game_day in data['GAME_DATE']]
    data = data[data['DAYS_SINCE_GAME'] > 0]

    data = data.sort_values(by = 'DAYS_SINCE_GAME')

    X = data[features]
    y = data['WIN_ONE']

    return (X,y)

def pred_by_date_with_features (model, date, features) :
    n = 5 
    season = date.year if date.month >= 10 else date.year - 1

    X, y = get_training_set_with_features(date, n, features)

    games_on_day = df[df['GAME_DATE'] == date].copy()
    games_on_day['DAYS_SINCE_GAME'] = np.zeros(len(games_on_day))

    test = games_on_day[features]
    model.fit(X,y)
    pred = model.predict(test)
    correct = np.sum(pred == games_on_day['WIN_ONE'])
    games = len(pred)
    return correct, games

In [None]:
def feature_selection_with_importance(model, current_features, min_subset_size, top_n) :
    """
    Iterates through the feature importance scores and iteratively remove the least importance features
    """
    results = []
    # current_features = [f[0] for f in feature_importance]
    
    while len(current_features) >= min_subset_size:
        total_correct = total_games = 0
        print(f"Evaluating with {len(current_features)} features...")
        for t in test:    
            correct, games = pred_by_date_with_features(model, t, features = current_features)
        
            total_correct += correct
            total_games += games
        print(current_features, ':', total_correct/total_games)
        results.append((current_features.copy(), total_correct/total_games))
        current_features.pop(-1)
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_n]

In [None]:
model = XGBClassifier(objective='binary:logistic')
sorted_features = [f[0] for f in importance_scores]
print(sorted_features)
top_subsets = feature_selection_with_importance(model, sorted_features, min_subset_size=20, top_n=10)

for i, (subset, acc) in enumerate(top_subsets, 1):
    print(f"#{i}: Features = {subset}, Accuracy = {acc:.4f}")

In [None]:
# best performing feature subset
best_feature_subset = top_subsets[0][0]
print('Best feature subset: ', best_feature_subset)
total_correct = total_games = 0
for t in test:
    correct, games = pred_by_date_with_features(model, t, best_feature_subset)

    total_correct += correct
    total_games += games
print('Accuracy:', total_correct / total_games)

In [None]:
# from itertools import combinations
# def feature_selection(model, feature_names, min_subset_size, max_subset_size, top_n) :
#     """
#     Iterates through the feature subsets and returns the top n subsets that gives the best scores
#     """
#     print('start')
#     results = []
#     for n in range(min_subset_size, max_subset_size + 1):
#         print(n)
#         for subset in combinations(feature_names, n):
#             print(subset)
#             total_correct = total_games = 0
#             for t in test:
#                 print('test')
#                 correct, games = pred_by_date_with_features(model, t, features = list(subset))
        
#                 total_correct += correct
#                 total_games += games
#             print(subset, ':', total_correct/total_games)
#             results.append((subset, correct/games))
#     results.sort(key=lambda x: x[1], reverse=True)
#     return results[:top_n]

In [None]:
# goes through every combinations of size 40; takes too long (5+ hours)
# model = XGBClassifier(objective='binary:logistic')
# all_features = [col for col in df.columns if col not in ['WIN_ONE', 'GAME_DATE', 'SEASON_YEAR']]
# top_subsets = feature_selection(model, all_features, min_subset_size=40, max_subset_size=40, top_n=10)

# for i, (subset, acc, total) in enumerate(top_subsets, 1):
#     print(f"#{i}: Features = {subset}, Accuracy = {acc:.4f}")

### Hyperparameter Tuning for XGBoost

In [None]:
def pred_by_date_multiple_models (models_dict, date) :
    """
    Predict the outcome of all games on the given date for all models given. Used specifically to make
    cross validation more efficient
    """
    n = 5 # how many years in the past for training
    
    # determine season of the game
    season = date.year if date.month >= 10 else date.year - 1

    # get data in relevant time frame
    X, y = get_training_set(date, n)

    games_on_day = df[df['GAME_DATE'] == date].copy()
    games_on_day['DAYS_SINCE_GAME'] = np.zeros(len(games_on_day))

    test = games_on_day.drop(columns = ['WIN_ONE', 'GAME_DATE'])

    scores = np.zeros(len(models_dict))
    for k, v in models_dict.items() :
        v.fit(X,y)
        pred = v.predict(val_set)
        scores[k] = np.sum(pred == games_on_day['WIN_ONE'])
    return scores, len(games_on_day)

In [None]:
# XGBoost parameters
param_grid = {
    "n_estimators": [50, 100, 200, 400],
    "eta": [0.01, 0.05, 0.1, 0.2], # learning_rate
    "max_depth": [4, 6, 8, 10], # maximum depth of a tree
    "subsample": [0.5, 0.7, 1], # fraction of observation to be radnomly sampled for each tree
    "colsample_bytree": [0.5, 0.7, 1], # fraction of columns to be random samples for each tree
    "alpha": [0.5, 1, 2, 5] # lasso regression
}

param_dict = {} # store params with key corresponding to index of score in np.array
index = 0

# Iterate over all combinations of hyperparameters
for values in itertools.product(*param_grid.values()):
    param_dict[index] = XGBClassifier(objective='binary:logistic', random_state = 33, **dict(zip(param_grid.keys(), values)))
    index += 1

scores = np.zeros(len(param_dict))
total_games = 0

first_season = df['SEASON_YEAR'].min()
last_season = df['SEASON_YEAR'].max()-4

for t in test:
    s, g = pred_by_date_multiple_models(param_dict, t)

    scores += s
    total_games += g
    print(scores / total_games)

print('final scores: ', scores / total_games)

In [None]:
all_scores = scores / total_games
best_model = param_dict[all_scores.argmax()]
best_model.get_params() #'n_estimators': 400, eta: 0.01, max_depth: 4, subsample: 0.7, colsample_bytree: 0.7, alpha: 2

In [None]:
top_five_models = np.argpartition(all_scores, -5)[-5:]
top_five_models = top_five_models[np.argsort(-all_scores[top_five_models])]
top_five_scores = all_scores[top_five_models]
print(top_five_scores)
for i in top_five_models : 
    p = param_dict[i].get_params()
    print(f"n_estimators = {p['n_estimators']}, eta = {p['eta']}, max_depth = {p['max_depth']}, subsample = {p['subsample']}, colsample_bytree = {p['colsample_bytree']}, alpha = {p['alpha']}")

### Test Models
We want to test the model trained on rolling averages and the predicted statistics from the second model. We will predict every game in the last 4 seasons. This means we need to predict all the statistics for the games in the last 9 seasons using the second model.

In [32]:
first_test_season = all_stats_cleaned['SEASON_YEAR'].max() - 4
last_test_season =  all_stats_cleaned['SEASON_YEAR'].max()
test_set = all_stats_cleaned[all_stats_cleaned['SEASON_YEAR'] >= first_test_season]['GAME_DATE'].unique()
train_set = all_stats_cleaned[all_stats_cleaned['SEASON_YEAR'] >= first_test_season - 5]['GAME_DATE'].unique()

In [60]:
#if import in from csv
#df_model = pd.read_csv('df_model_basic_parameters.csv')
#df_model['GAME_DATE'] = pd.to_datetime(df_model['GAME_DATE'])

In [159]:
df_model = pred_training_set(model_training_set, first_test_season - time_horizon, last_test_season)

Predicting Day 1 / 971 
Predicting Day 2 / 971 
Predicting Day 3 / 971 
Predicting Day 4 / 971 
Predicting Day 5 / 971 
Predicting Day 6 / 971 
Predicting Day 7 / 971 
Predicting Day 8 / 971 
Predicting Day 9 / 971 
Predicting Day 10 / 971 
Predicting Day 11 / 971 
Predicting Day 12 / 971 
Predicting Day 13 / 971 
Predicting Day 14 / 971 
Predicting Day 15 / 971 
Predicting Day 16 / 971 
Predicting Day 17 / 971 
Predicting Day 18 / 971 
Predicting Day 19 / 971 
Predicting Day 20 / 971 
Predicting Day 21 / 971 
Predicting Day 22 / 971 
Predicting Day 23 / 971 
Predicting Day 24 / 971 
Predicting Day 25 / 971 
Predicting Day 26 / 971 
Predicting Day 27 / 971 
Predicting Day 28 / 971 
Predicting Day 29 / 971 
Predicting Day 30 / 971 
Predicting Day 31 / 971 
Predicting Day 32 / 971 
Predicting Day 33 / 971 
Predicting Day 34 / 971 
Predicting Day 35 / 971 
Predicting Day 36 / 971 
Predicting Day 37 / 971 
Predicting Day 38 / 971 
Predicting Day 39 / 971 
Predicting Day 40 / 971 
Predictin

In [44]:
final_model = XGBClassifier(n_estimators = 100, eta = 0.05, max_depth = 4, subsample = 0.5, colsample_bytree = 0.5, alpha = 1, random_state=33)

#### Rolling Statistics

In [52]:
correct, games = test_model(df_rolling, final_model, test_set)
print("Score:", correct / games)

Score: 0.6349288042545891


#### ML Model

In [129]:
correct, games = test_model(df_model, final_model, test_set)
print("Score:", correct / games)

Score: 0.979241722422371


In [130]:
df_model['GAME_DATE'].max()

Timestamp('2024-04-14 00:00:00')

In [64]:
df_model = df_model[df_rolling.columns]

In [110]:
df_model = df_model.sort_values(by = ['TEAM_ID_ONE', 'GAME_DATE'])
df_rolling_part = df_rolling[df_rolling['SEASON_YEAR'].between(2014, 2023)].sort_values(by = ['TEAM_ID_ONE', 'GAME_DATE'])
combined_stats_part = combined_stats[combined_stats['SEASON_YEAR'].between(2014, 2023)].sort_values(by = ['TEAM_ID_ONE', 'GAME_DATE'])

In [123]:
df_model.iloc[:5, 20:]

Unnamed: 0,WIN_LAST_ONE,TEAM_ID_TWO,FG_PCT_TWO,FG3_PCT_TWO,FT_PCT_TWO,OREB_TWO,DREB_TWO,REB_TWO,AST_TWO,STL_TWO,BLK_TWO,TOV_TWO,PF_TWO,SEASON_YEAR,EFG%_TWO,TS%_TWO,WIN_STREAK_TWO,WIN_PERCENTAGE_TWO,ELO_TWO,WIN_LAST_TWO
7213,0.0,24,0.489568,0.335449,0.665901,12.724613,34.93801,44.36896,19.67755,6.821535,3.224573,14.741632,18.017159,2014,0.673542,0.559697,0,0.416556,1574.766174,1.0
26,1.0,17,0.420344,0.29053,0.829226,10.063592,36.17952,44.801605,20.174774,5.874753,4.138094,16.903025,21.174635,2014,0.578456,0.521106,0,0.529816,1533.826551,0.0
7265,0.0,22,0.528347,0.45558,0.777013,8.303092,38.414825,49.08824,27.503584,5.48794,7.622308,15.457068,18.050951,2014,0.710087,0.57446,0,0.629931,1674.141487,1.0
7280,0.0,29,0.460664,0.377649,0.773966,11.642198,36.50787,44.827686,24.433992,7.021293,6.240212,12.641404,19.289635,2014,0.603931,0.5652,1,0.434805,1510.10097,1.0
81,1.0,15,0.413949,0.362901,0.759866,10.345777,30.776129,41.95529,21.843782,5.277457,1.829193,13.85477,20.912514,2014,0.530434,0.499021,0,0.499144,1488.887976,0.0


In [124]:
df_rolling_part.iloc[:5, 20:]

Unnamed: 0,WIN_LAST_ONE,TEAM_ID_TWO,FG_PCT_TWO,FG3_PCT_TWO,FT_PCT_TWO,OREB_TWO,DREB_TWO,REB_TWO,AST_TWO,STL_TWO,BLK_TWO,TOV_TWO,PF_TWO,SEASON_YEAR,EFG%_TWO,TS%_TWO,WIN_STREAK_TWO,WIN_PERCENTAGE_TWO,ELO_TWO,WIN_LAST_TWO
48278,0.0,24,0.459,0.4148,0.7706,11.8,30.4,42.2,21.6,6.4,3.8,14.4,23.4,2014,0.681119,0.582714,0,0.416556,1574.766174,1.0
4219,1.0,17,0.4798,0.44,0.76,11.0,34.0,45.0,22.0,4.6,5.0,17.4,20.0,2014,0.653826,0.570423,0,0.529816,1533.826551,0.0
48279,0.0,22,0.4478,0.3642,0.8108,8.6,34.8,43.4,23.2,4.8,3.8,11.4,20.4,2014,0.622998,0.538621,0,0.629931,1674.141487,1.0
48280,0.0,29,0.4256,0.3072,0.7412,9.4,33.6,43.0,22.0,7.0,5.0,14.2,18.2,2014,0.526557,0.501114,1,0.434805,1510.10097,1.0
4220,1.0,15,0.45,0.4808,0.764,11.6,28.2,39.8,23.2,5.8,3.2,14.8,24.4,2014,0.593362,0.529088,0,0.499144,1488.887976,0.0


In [115]:
combined_stats_part = combined_stats_part[df_rolling.columns]

In [125]:
combined_stats_part.iloc[:5, 20:]

Unnamed: 0,WIN_LAST_ONE,TEAM_ID_TWO,FG_PCT_TWO,FG3_PCT_TWO,FT_PCT_TWO,OREB_TWO,DREB_TWO,REB_TWO,AST_TWO,STL_TWO,BLK_TWO,TOV_TWO,PF_TWO,SEASON_YEAR,EFG%_TWO,TS%_TWO,WIN_STREAK_TWO,WIN_PERCENTAGE_TWO,ELO_TWO,WIN_LAST_TWO
65595,0.0,24,0.411,0.308,0.818,16,32,48,26,13,9,10.0,22,2014,0.544444,0.521431,0,0.416556,1575.940065,1.0
65649,1.0,17,0.383,0.375,0.857,11,33,44,25,5,5,18.0,26,2014,0.604938,0.509752,0,0.529816,1541.58247,0.0
65710,0.0,22,0.449,0.294,0.711,11,39,50,25,7,9,21.0,15,2014,0.557971,0.548297,0,0.629931,1664.654161,1.0
65731,0.0,29,0.495,0.286,0.741,11,40,51,31,6,7,21.0,30,2014,0.587629,0.56025,1,0.434805,1518.897711,1.0
65749,1.0,15,0.476,0.381,0.727,13,31,44,26,2,6,15.0,29,2014,0.619048,0.540297,0,0.499144,1487.58517,0.0


In [126]:
df_model.shape

(23958, 40)

In [127]:
df_model.shape

(23958, 40)

In [128]:
combined_stats_part.shape

(23958, 40)