<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (2022-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Raw Data](#Raw-Data)

**[1. Create Team Matchups and Targets](#1.-Create-Team-Matchups-and-Targets)**

- [1.1. Clean Game Data](#1.1.-Clean-Game-Data)

- [1.2. Reshape to Game Matchups](#1.2.-Reshape-to-Game-Matchups)

- [1.3. Create Target Variables](#1.3.-Create-Target-Variables)

**[2. Step by Step Calculations for Rolling Difference (Average Over/Under Performance)](#2.Create-Rolling-Difference-Window-Statistics)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline



# Raw Data

[Return to top](#Feature-Engineering)

In [2]:
team_bs_df = pd.read_csv('../../data/original/nba_games_box_scores_2022_2024.csv')

In [3]:
team_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7525 entries, 0 to 7524
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SEASON_ID          7525 non-null   int64  
 1   TEAM_ID            7525 non-null   int64  
 2   TEAM_ABBREVIATION  7525 non-null   object 
 3   TEAM_NAME          7525 non-null   object 
 4   GAME_ID            7525 non-null   int64  
 5   GAME_DATE          7525 non-null   object 
 6   MATCHUP            7525 non-null   object 
 7   WL                 7514 non-null   object 
 8   MIN                7525 non-null   int64  
 9   PTS                7525 non-null   int64  
 10  FGM                7525 non-null   int64  
 11  FGA                7525 non-null   int64  
 12  FG_PCT             7523 non-null   float64
 13  FG3M               7525 non-null   int64  
 14  FG3A               7525 non-null   float64
 15  FG3_PCT            7523 non-null   float64
 16  FTM                7525 

<a name="1.-Create-Team-Matchups-and-Targets"></a>
# 1. Create Team Matchups and Targets

[Return to top](#Feature-Engineering)

<a name="1.1.-Clean-Game-Data"></a>
## 1.1. Clean Game Data

[Return to top](#Feature-Engineering)

We need to do three key things to clean the data:

1. Remove rows where the team's aggregated minutes (`MIN`) total less than 238, which filters out exhibition matches.
2. Retain only games that are part of the regular season.
3. Remove any orphans (i.e., game IDs that have no partner row for the opposing team) when reshaping to matchups.

Start and end dates of the last three NBA regular seasons:

- 2021-22 season: 2021-10-19 to 2022-04-10
- 2022-23 season: 2022-10-18 to 2023-04-09
- 2023-24 season: 2023-10-24 to 2024-04-14
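
The first two cleaning steps can be sketched roughly as follows. This is only an illustration on toy data: `clean_team_box_scores` is a hypothetical stand-in, not the actual `utl.clean_team_bs_data` helper, and it assumes the `MIN` and `GAME_DATE` columns behave as described above.

```python
import pandas as pd

def clean_team_box_scores(df, season_start_dates, season_end_dates, season_labels, min_minutes=238):
    """Hypothetical stand-in for the first two cleaning steps (not the real utl helper)."""
    out = df.copy()
    out["GAME_DATE"] = pd.to_datetime(out["GAME_DATE"])

    # step 1: drop rows whose team-aggregated minutes fall below the threshold
    out = out[out["MIN"] >= min_minutes]

    # step 2: label each row with its regular season; rows outside every window stay unlabeled
    season = pd.Series(index=out.index, dtype="object")
    for start, end, label in zip(season_start_dates, season_end_dates, season_labels):
        in_season = out["GAME_DATE"].between(pd.Timestamp(start), pd.Timestamp(end))
        season[in_season] = label
    return out.assign(SEASON_ID=season).dropna(subset=["SEASON_ID"])

# toy data: game 1 is pre-season, game 2 has one short row, game 3 is a normal game
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2, 3, 3],
    "TEAM_ID": [10, 20, 10, 30, 20, 30],
    "GAME_DATE": ["2021-10-05", "2021-10-05", "2021-10-20",
                  "2021-10-20", "2022-10-18", "2022-10-18"],
    "MIN": [240, 240, 240, 200, 265, 265],
})
cleaned = clean_team_box_scores(
    toy,
    season_start_dates=["2021-10-19", "2022-10-18"],
    season_end_dates=["2022-04-10", "2023-04-09"],
    season_labels=["2021-22", "2022-23"],
)
print(cleaned[["GAME_ID", "TEAM_ID", "SEASON_ID"]])
```

Game 2 is left with a single row in this toy example, which is the kind of orphan the third step removes during reshaping.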

In [4]:
# last 3 seasons start and end dates and labels
season_start_dates = ['2021-10-19', '2022-10-18', '2023-10-24']
season_end_dates   = ['2022-04-10', '2023-04-09', '2024-04-14']
season_labels      = ['2021-22', '2022-23', '2023-24']

In [5]:
# clean up the data
team_bs_df_cleaned = utl.clean_team_bs_data(team_bs_df, season_start_dates=season_start_dates, 
                                            season_end_dates=season_end_dates, season_labels=season_labels)

Season 2021-22: 1230 games
Season 2022-23: 1230 games
Season 2023-24: 736 games


<a name="1.2.-Reshape-to-Game-Matchups"></a>
## 1.2. Reshape to Game Matchups

[Return to top](#Feature-Engineering)

In [6]:
# identify non-stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# reshape team box score data to wide format so each row is a game matchup
team_bs_matchups_df = utl.reshape_team_bs_to_matchups(team_bs_df_cleaned, non_stats_cols)

Season 2021-22: 1222 games
Season 2022-23: 1221 games
Season 2023-24: 729 games
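
A reshape of this kind might look like the sketch below. It assumes that `MATCHUP` reads `ATL vs. DAL` on the home team's row and `DAL @ ATL` on the away team's row. `reshape_to_matchups` is a hypothetical name, and the real `utl.reshape_team_bs_to_matchups` may work differently.

```python
import pandas as pd

def reshape_to_matchups(df, non_stats_cols):
    """Hypothetical sketch: one row per game with HOME_ and AWAY_ prefixed columns."""
    # assumption: 'vs.' marks the home team's row, '@' marks the away team's row
    is_home = df["MATCHUP"].str.contains("vs.", regex=False)
    value_cols = [c for c in df.columns if c not in non_stats_cols]

    home = (df[is_home].drop(columns="MATCHUP")
                       .rename(columns={c: f"HOME_{c}" for c in value_cols}))
    away = (df[~is_home][["GAME_ID"] + value_cols]
                        .rename(columns={c: f"AWAY_{c}" for c in value_cols}))

    # the inner join keeps only games that have both rows, which drops orphans
    return home.merge(away, on="GAME_ID", how="inner", validate="one_to_one")

toy = pd.DataFrame({
    "SEASON_ID": ["2021-22"] * 3,
    "GAME_ID": [1, 1, 2],
    "GAME_DATE": ["2021-10-21", "2021-10-21", "2021-10-23"],
    "MATCHUP": ["ATL vs. DAL", "DAL @ ATL", "ATL @ CLE"],  # game 2 has no home row
    "TEAM_ABBREVIATION": ["ATL", "DAL", "ATL"],
    "PTS": [113, 87, 95],
})
matchups = reshape_to_matchups(toy, ["SEASON_ID", "GAME_ID", "GAME_DATE", "MATCHUP"])
print(matchups)
```

The `validate="one_to_one"` argument makes the merge raise if a game ID appears more than once on either side, which is a cheap guard against duplicated rows.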


In [7]:
team_bs_matchups_df.head()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,AWAY_PLUS_MINUS
0,2021-22,1610612737,ATL,Atlanta Hawks,22100014,2021-10-21,W,242,113,45,94,0.479,15,35.0,0.429,8,9,0.889,6.0,49.0,55.0,31,8.0,9,13,...,26.0,1610612742,DAL,Dallas Mavericks,L,240,87,31,93,0.333,13,43.0,0.302,12,13,0.923,10.0,40.0,50.0,16,7.0,3,15,21,-26.0
1,2021-22,1610612737,ATL,Atlanta Hawks,22100043,2021-10-25,W,238,122,46,90,0.511,12,32.0,0.375,18,21,0.857,10.0,39.0,49.0,24,11.0,3,13,...,18.0,1610612765,DET,Detroit Pistons,L,239,104,40,91,0.44,9,33.0,0.273,15,18,0.833,11.0,25.0,36.0,26,7.0,6,14,15,-18.0
2,2021-22,1610612737,ATL,Atlanta Hawks,22100097,2021-11-01,W,240,118,38,83,0.458,13,34.0,0.382,29,29,1.0,13.0,34.0,47.0,24,9.0,5,11,...,7.0,1610612764,WAS,Washington Wizards,L,240,111,41,86,0.477,13,39.0,0.333,16,16,1.0,7.0,29.0,36.0,27,7.0,4,12,24,-7.0
3,2021-22,1610612737,ATL,Atlanta Hawks,22100120,2021-11-04,L,240,98,35,82,0.427,7,28.0,0.25,21,26,0.808,5.0,27.0,32.0,18,11.0,4,9,...,-18.0,1610612762,UTA,Utah Jazz,W,240,116,41,81,0.506,17,41.0,0.415,17,23,0.739,8.0,38.0,46.0,30,6.0,4,14,20,18.0
4,2021-22,1610612737,ATL,Atlanta Hawks,22100193,2021-11-14,W,241,120,47,97,0.485,15,35.0,0.429,11,13,0.846,15.0,36.0,51.0,21,6.0,1,12,...,20.0,1610612749,MIL,Milwaukee Bucks,L,240,100,38,84,0.452,14,41.0,0.341,10,16,0.625,4.0,26.0,30.0,24,8.0,3,11,17,-20.0


<a name="1.3.-Create-Target-Variables"></a>
## 1.3. Create Target Variables

[Return to top](#Feature-Engineering)

There are three targets of interest:

1. **Total Game Points (over / under):** This is the sum `HOME_PTS + AWAY_PTS`.
2. **Difference in Game Points (plus / minus):** This is calculated relative to the home team as the difference `HOME_PTS - AWAY_PTS`.
3. **Game Winner (moneyline):** This is defined relative to the home team using the `HOME_WL` column, where a home win is coded as 1 and a home loss as 0. We will create a new column called `GAME_RESULT` for this indicator.
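
As a minimal sketch, the three definitions above translate to pandas as shown below. `add_targets` is a hypothetical helper written for illustration, and the real `utl.create_target_variables` may differ in its details.

```python
import pandas as pd

def add_targets(df, wl_col="HOME_WL", home_pts_col="HOME_PTS", away_pts_col="AWAY_PTS"):
    """Hypothetical sketch of the three targets, all defined relative to the home team."""
    out = df.copy()
    out["TOTAL_PTS"] = out[home_pts_col] + out[away_pts_col]    # over / under
    out["PLUS_MINUS"] = out[home_pts_col] - out[away_pts_col]   # plus / minus
    # moneyline: 1 for a home win, 0 otherwise (a missing WL value would also be coded 0 here)
    out["GAME_RESULT"] = (out[wl_col] == "W").astype(int)
    return out

toy = pd.DataFrame({
    "HOME_WL": ["W", "L"],
    "HOME_PTS": [113, 107],
    "AWAY_PTS": [87, 118],
})
targets = add_targets(toy)
print(targets)
```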

In [8]:
# create the above three target variables
team_bs_matchups_df = utl.create_target_variables(team_bs_matchups_df, 'HOME_WL', 'HOME_PTS', 'AWAY_PTS')

In [9]:
team_bs_matchups_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS']].tail()

Unnamed: 0,GAME_DATE,GAME_ID,HOME_TEAM_NAME,AWAY_TEAM_NAME,HOME_PTS,AWAY_PTS,GAME_RESULT,TOTAL_PTS,PLUS_MINUS
3167,2024-01-24,22300620,Washington Wizards,Minnesota Timberwolves,107,118,0,225,-11.0
3168,2024-01-25,22300628,Washington Wizards,Utah Jazz,108,123,0,231,-15.0
3169,2024-01-31,22300676,Washington Wizards,LA Clippers,109,125,0,234,-16.0
3170,2024-02-02,22300689,Washington Wizards,Miami Heat,102,110,0,212,-8.0
3171,2024-02-04,22300705,Washington Wizards,Phoenix Suns,112,140,0,252,-28.0


<a name="2.Create-Rolling-Difference-Window-Statistics"></a>
# 2. Step by Step Calculations for Rolling Difference (Average Over/Under Performance)

[Return to top](#Feature-Engineering)

Here we compute the average difference in box-score statistics between the two teams over a rolling window of the previous $n$ games.
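
To make the idea concrete, here is a small sketch on toy data. The `TEAM` and `PTS_DIFF` columns and the window of 3 are illustrative assumptions. Shifting and rolling inside a single `groupby(...).transform` call is one way to keep each window within one team.

```python
import pandas as pd

toy = pd.DataFrame({
    "TEAM": ["A"] * 4 + ["B"] * 3,
    "GAME_DATE": pd.to_datetime([
        "2021-10-21", "2021-10-25", "2021-11-01", "2021-11-04",
        "2021-10-22", "2021-10-26", "2021-11-02",
    ]),
    "PTS_DIFF": [26, 18, 7, -18, 10, -4, 6],  # per-game points difference (team minus opponent)
}).sort_values(["TEAM", "GAME_DATE"])

# within each team: lag by one game so the current game is excluded,
# then average over a window of up to 3 previous games
toy["ROLLDIFF_PTS"] = (
    toy.groupby("TEAM")["PTS_DIFF"]
       .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
print(toy)
```

The first game of each team has no history, so its feature is NaN. Team B's first window starts fresh and does not pick up team A's last games.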

In [10]:
team_bs_matchups_df.shape

(3172, 52)

In [11]:
# declare variables
team_col = "HOME_TEAM_NAME"
stratify_by_season = True

In [12]:
# determine whether to use 'HOME' or 'AWAY' stats
prefix = 'HOME_' if 'HOME' in team_col else 'AWAY_'
prefix

'HOME_'

In [13]:
# identify stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'AWAY_TEAM_ID',
                  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_WL', 'AWAY_WL', 'HOME_MIN', 
                  'AWAY_MIN', 'HOME_TEAM_ABBREVIATION', 'AWAY_TEAM_ABBREVIATION']
stats_cols = [col for col in team_bs_matchups_df.columns if col not in non_stats_cols]

In [14]:
# filter the stats columns into three lists: HOME_ only, AWAY_ only, and both HOME_ and AWAY_
# all three lists are needed later to select columns when taking the difference
filtered_stats_cols_home = [col for col in stats_cols if col.startswith('HOME_')]
print(filtered_stats_cols_home)    

filtered_stats_cols_away = [col for col in stats_cols if col.startswith('AWAY_')]
print(filtered_stats_cols_away)    

filtered_stats_cols_both = [col for col in stats_cols if col.startswith('HOME_') or col.startswith('AWAY_')]
print(filtered_stats_cols_both)    



['HOME_PTS', 'HOME_FGM', 'HOME_FGA', 'HOME_FG_PCT', 'HOME_FG3M', 'HOME_FG3A', 'HOME_FG3_PCT', 'HOME_FTM', 'HOME_FTA', 'HOME_FT_PCT', 'HOME_OREB', 'HOME_DREB', 'HOME_REB', 'HOME_AST', 'HOME_STL', 'HOME_BLK', 'HOME_TOV', 'HOME_PF']
['AWAY_PTS', 'AWAY_FGM', 'AWAY_FGA', 'AWAY_FG_PCT', 'AWAY_FG3M', 'AWAY_FG3A', 'AWAY_FG3_PCT', 'AWAY_FTM', 'AWAY_FTA', 'AWAY_FT_PCT', 'AWAY_OREB', 'AWAY_DREB', 'AWAY_REB', 'AWAY_AST', 'AWAY_STL', 'AWAY_BLK', 'AWAY_TOV', 'AWAY_PF']
['HOME_PTS', 'HOME_FGM', 'HOME_FGA', 'HOME_FG_PCT', 'HOME_FG3M', 'HOME_FG3A', 'HOME_FG3_PCT', 'HOME_FTM', 'HOME_FTA', 'HOME_FT_PCT', 'HOME_OREB', 'HOME_DREB', 'HOME_REB', 'HOME_AST', 'HOME_STL', 'HOME_BLK', 'HOME_TOV', 'HOME_PF', 'AWAY_PTS', 'AWAY_FGM', 'AWAY_FGA', 'AWAY_FG_PCT', 'AWAY_FG3M', 'AWAY_FG3A', 'AWAY_FG3_PCT', 'AWAY_FTM', 'AWAY_FTA', 'AWAY_FT_PCT', 'AWAY_OREB', 'AWAY_DREB', 'AWAY_REB', 'AWAY_AST', 'AWAY_STL', 'AWAY_BLK', 'AWAY_TOV', 'AWAY_PF']


In [15]:
# ensure data is sorted by team, season (if stratified), and date for accurate rolling calculation
# set GAME_ID and GAME_DATE as the indices to preserve them through the rolling operation
sort_cols = [team_col, 'SEASON_ID', 'GAME_DATE'] if stratify_by_season else [team_col, 'GAME_DATE']
sort_cols

['HOME_TEAM_NAME', 'SEASON_ID', 'GAME_DATE']

In [16]:
df_sorted = team_bs_matchups_df.sort_values(by=sort_cols).set_index(['GAME_ID', 'GAME_DATE'])
df_sorted.shape

(3172, 50)

In [17]:
df_sorted.head(10)

GAME_ID,GAME_DATE,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,HOME_PF,PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,GAME_RESULT,TOTAL_PTS
22100014,2021-10-21,2021-22,1610612737,ATL,Atlanta Hawks,W,242,113,45,94,0.479,15,35.0,0.429,8,9,0.889,6.0,49.0,55.0,31,8.0,9,13,16,26.0,1610612742,DAL,Dallas Mavericks,L,240,87,31,93,0.333,13,43.0,0.302,12,13,0.923,10.0,40.0,50.0,16,7.0,3,15,21,1,200
22100043,2021-10-25,2021-22,1610612737,ATL,Atlanta Hawks,W,238,122,46,90,0.511,12,32.0,0.375,18,21,0.857,10.0,39.0,49.0,24,11.0,3,13,19,18.0,1610612765,DET,Detroit Pistons,L,239,104,40,91,0.44,9,33.0,0.273,15,18,0.833,11.0,25.0,36.0,26,7.0,6,14,15,1,226
22100097,2021-11-01,2021-22,1610612737,ATL,Atlanta Hawks,W,240,118,38,83,0.458,13,34.0,0.382,29,29,1.0,13.0,34.0,47.0,24,9.0,5,11,17,7.0,1610612764,WAS,Washington Wizards,L,240,111,41,86,0.477,13,39.0,0.333,16,16,1.0,7.0,29.0,36.0,27,7.0,4,12,24,1,229
22100120,2021-11-04,2021-22,1610612737,ATL,Atlanta Hawks,L,240,98,35,82,0.427,7,28.0,0.25,21,26,0.808,5.0,27.0,32.0,18,11.0,4,9,24,-18.0,1610612762,UTA,Utah Jazz,W,240,116,41,81,0.506,17,41.0,0.415,17,23,0.739,8.0,38.0,46.0,30,6.0,4,14,20,0,214
22100193,2021-11-14,2021-22,1610612737,ATL,Atlanta Hawks,W,241,120,47,97,0.485,15,35.0,0.429,11,13,0.846,15.0,36.0,51.0,21,6.0,1,12,19,20.0,1610612749,MIL,Milwaukee Bucks,L,240,100,38,84,0.452,14,41.0,0.341,10,16,0.625,4.0,26.0,30.0,24,8.0,3,11,17,1,220
22100202,2021-11-15,2021-22,1610612737,ATL,Atlanta Hawks,W,239,129,47,85,0.553,14,30.0,0.467,21,32,0.656,9.0,37.0,46.0,32,10.0,7,11,17,18.0,1610612753,ORL,Orlando Magic,L,240,111,43,95,0.453,16,43.0,0.372,9,14,0.643,12.0,29.0,41.0,30,8.0,5,16,27,1,240
22100215,2021-11-17,2021-22,1610612737,ATL,Atlanta Hawks,W,240,110,41,81,0.506,13,37.0,0.351,15,18,0.833,6.0,34.0,40.0,28,9.0,4,11,17,11.0,1610612738,BOS,Boston Celtics,L,242,99,37,84,0.44,11,41.0,0.268,14,17,0.824,12.0,30.0,42.0,24,9.0,2,14,17,1,209
22100242,2021-11-20,2021-22,1610612737,ATL,Atlanta Hawks,W,241,115,43,82,0.524,12,34.0,0.353,17,21,0.81,8.0,38.0,46.0,24,6.0,6,12,22,10.0,1610612766,CHA,Charlotte Hornets,L,240,105,43,102,0.422,10,40.0,0.25,9,15,0.6,20.0,31.0,51.0,26,6.0,2,10,22,1,220
22100255,2021-11-22,2021-22,1610612737,ATL,Atlanta Hawks,W,239,113,42,87,0.483,14,34.0,0.412,15,16,0.938,8.0,36.0,44.0,25,6.0,6,7,16,12.0,1610612760,OKC,Oklahoma City Thunder,L,240,101,39,97,0.402,10,38.0,0.263,13,14,0.929,16.0,34.0,50.0,24,5.0,5,8,15,1,214
22100293,2021-11-27,2021-22,1610612737,ATL,Atlanta Hawks,L,240,90,33,93,0.355,9,37.0,0.243,15,20,0.75,13.0,39.0,52.0,18,8.0,6,6,17,-9.0,1610612752,NYK,New York Knicks,W,241,99,36,82,0.439,11,29.0,0.379,16,22,0.727,8.0,42.0,50.0,20,4.0,3,10,17,0,189


In [18]:
# apply grouping for rolling calculation
group_cols = [team_col, 'SEASON_ID'] if stratify_by_season else [team_col]
print(group_cols)

# group by 'group_cols' so that you only have home or away team stats stratified by season
rolling_stats = df_sorted.groupby(group_cols)


['HOME_TEAM_NAME', 'SEASON_ID']


In [19]:
## this cell is commented out because its output is extremely long
## the purpose of this cell is to double check grouping and stratification works

## check that grouping works
## df should be grouped by team (HOME or AWAY), and should be stratified by season
#rolling_stats.apply(display)

In [20]:
# this cell double-checks the difference step used in the cell below
# keep only the numerical stats columns (18 HOME_ columns followed by their 18 AWAY_ counterparts)
# diff(axis=1, periods=-18) subtracts from each column the column 18 positions to its right,
# i.e., HOME_PTS - AWAY_PTS, HOME_FGM - AWAY_FGM, and so on
# then keep only the HOME_ columns, because the AWAY_ columns have no partner and become NaN
rolling_stats[filtered_stats_cols_both].diff(axis=1,periods=-18)[filtered_stats_cols_home]

HOME_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,HOME_PF
Atlanta Hawks,2021-22,22100014,2021-10-21,26,14,1,0.146,2,-8.0,0.127,-4,-4,-0.034,-4.0,9.0,5.0,15,1.0,6,-2,-5
Atlanta Hawks,2021-22,22100043,2021-10-25,18,6,-1,0.071,3,-1.0,0.102,3,3,0.024,-1.0,14.0,13.0,-2,4.0,-3,-1,4
Atlanta Hawks,2021-22,22100097,2021-11-01,7,-3,-3,-0.019,0,-5.0,0.049,13,13,0.000,6.0,5.0,11.0,-3,2.0,1,-1,-7
Atlanta Hawks,2021-22,22100120,2021-11-04,-18,-6,1,-0.079,-10,-13.0,-0.165,4,3,0.069,-3.0,-11.0,-14.0,-12,5.0,0,-5,4
Atlanta Hawks,2021-22,22100193,2021-11-14,20,9,13,0.033,1,-6.0,0.088,1,-3,0.221,11.0,10.0,21.0,-3,-2.0,-2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Washington Wizards,2023-24,22300620,2024-01-24,-11,-5,-23,0.061,1,-8.0,0.136,-2,0,-0.069,-8.0,9.0,1.0,5,-6.0,-4,14,0
Washington Wizards,2023-24,22300628,2024-01-25,-15,-7,-9,-0.026,-6,-1.0,-0.186,5,8,-0.083,-8.0,-9.0,-17.0,-1,5.0,5,-4,-5
Washington Wizards,2023-24,22300676,2024-01-31,-16,-2,-3,-0.006,-3,-1.0,-0.090,-9,-7,-0.197,-6.0,-1.0,-7.0,-6,-5.0,3,5,3
Washington Wizards,2023-24,22300689,2024-02-02,-8,-1,2,-0.021,4,11.0,0.036,-10,-14,0.039,-8.0,-8.0,-16.0,4,1.0,3,-4,7
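
The same home-minus-away difference can also be computed by pairing columns by stat name, which avoids relying on the column order and on the hard-coded offset of 18. The sketch below uses a two-stat toy frame. It is an alternative illustration, not the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({
    "HOME_PTS": [113, 98], "HOME_AST": [31, 18],
    "AWAY_PTS": [87, 116], "AWAY_AST": [16, 30],
})

home_cols = [c for c in toy.columns if c.startswith("HOME_")]
# build the matching AWAY_ name for each HOME_ column so stats are paired by name
away_cols = ["AWAY_" + c[len("HOME_"):] for c in home_cols]

# subtract the raw arrays so pandas does not try to align on the differing column labels
home_minus_away = pd.DataFrame(
    toy[home_cols].to_numpy() - toy[away_cols].to_numpy(),
    index=toy.index,
    columns=home_cols,
)
print(home_minus_away)
```

If a `HOME_` column had no `AWAY_` counterpart, the column selection would raise a `KeyError` rather than silently differencing mismatched stats.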


In [21]:
# run the full pipeline from df_sorted: difference, rolling average, then lag
rolling_diff_stats_home = (df_sorted.groupby(group_cols)[filtered_stats_cols_both]
                                .diff(axis=1,periods=-18)[filtered_stats_cols_home]
                                .groupby(group_cols, group_keys=False) # group again so each window stays within one team-season
                                .apply(lambda g: g.rolling(window=5, min_periods=1)
                                                  .mean()
                                                  .shift(1))  # lag to exclude the current game from the rolling average
                                .round(3)
                                .add_prefix('ROLLDIFF_')
                                .reset_index() # reset the index to convert GAME_ID back into a column
                     )

# check if rolling diff stats has worked
rolling_diff_stats_home

Unnamed: 0,HOME_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,ROLLDIFF_HOME_PTS,ROLLDIFF_HOME_FGM,ROLLDIFF_HOME_FGA,ROLLDIFF_HOME_FG_PCT,ROLLDIFF_HOME_FG3M,ROLLDIFF_HOME_FG3A,ROLLDIFF_HOME_FG3_PCT,ROLLDIFF_HOME_FTM,ROLLDIFF_HOME_FTA,ROLLDIFF_HOME_FT_PCT,ROLLDIFF_HOME_OREB,ROLLDIFF_HOME_DREB,ROLLDIFF_HOME_REB,ROLLDIFF_HOME_AST,ROLLDIFF_HOME_STL,ROLLDIFF_HOME_BLK,ROLLDIFF_HOME_TOV,ROLLDIFF_HOME_PF
0,Atlanta Hawks,2021-22,22100014,2021-10-21,,,,,,,,,,,,,,,,,,
1,Atlanta Hawks,2021-22,22100043,2021-10-25,26.00,14.000,1.0,0.146,2.000,-8.000,0.127,-4.0,-4.00,-0.034,-4.000,9.000,5.000,15.000,1.000,6.000,-2.000,-5.000
2,Atlanta Hawks,2021-22,22100097,2021-11-01,22.00,10.000,0.0,0.108,2.500,-4.500,0.114,-0.5,-0.50,-0.005,-2.500,11.500,9.000,6.500,2.500,1.500,-1.500,-0.500
3,Atlanta Hawks,2021-22,22100120,2021-11-04,17.00,5.667,-1.0,0.066,1.667,-4.667,0.093,4.0,4.00,-0.003,0.333,9.333,9.667,3.333,2.333,1.333,-1.333,-2.667
4,Atlanta Hawks,2021-22,22100193,2021-11-14,8.25,2.750,-0.5,0.030,-1.250,-6.750,0.028,4.0,3.75,0.015,-0.500,4.250,3.750,-0.500,3.000,1.000,-2.250,-1.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3167,Washington Wizards,2023-24,22300620,2024-01-24,-9.80,-4.600,2.8,-0.066,0.800,3.600,-0.028,-1.4,-3.40,0.055,-0.400,-6.400,-6.800,-1.400,-0.600,-2.400,-1.400,4.200
3168,Washington Wizards,2023-24,22300628,2024-01-25,-8.80,-5.000,-1.4,-0.049,1.200,1.800,0.007,0.0,-2.60,0.097,-2.200,-4.400,-6.600,-1.600,-1.000,-3.400,0.200,3.400
3169,Washington Wizards,2023-24,22300676,2024-01-31,-10.20,-4.800,-3.4,-0.035,-0.200,0.800,-0.024,-0.4,-2.00,0.058,-4.200,-5.400,-9.600,-0.800,0.800,-2.000,-0.600,2.000
3170,Washington Wizards,2023-24,22300689,2024-02-02,-11.00,-4.000,-5.0,-0.017,-2.200,-1.400,-0.057,-0.8,-1.00,-0.016,-4.600,-2.600,-7.200,-2.000,-1.000,-0.800,1.800,1.400


## Check to see if AWAY works

In [22]:
# check that the same steps work for the away team
# declare variables
team_col = "AWAY_TEAM_NAME"
stratify_by_season = True

# determine whether to use 'HOME' or 'AWAY' stats
prefix = 'HOME_' if 'HOME' in team_col else 'AWAY_'
prefix

'AWAY_'

In [23]:
# ensure data is sorted by team, season (if stratified), and date for accurate rolling calculation
# set GAME_ID and GAME_DATE as the indices to preserve them through the rolling operation
sort_cols = [team_col, 'SEASON_ID', 'GAME_DATE'] if stratify_by_season else [team_col, 'GAME_DATE']
sort_cols

['AWAY_TEAM_NAME', 'SEASON_ID', 'GAME_DATE']

In [24]:
df_sorted = team_bs_matchups_df.sort_values(by=sort_cols).set_index(['GAME_ID', 'GAME_DATE'])
df_sorted.shape

(3172, 50)

In [25]:
df_sorted.head(10)

GAME_ID,GAME_DATE,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,HOME_PF,PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,GAME_RESULT,TOTAL_PTS
22100027,2021-10-23,2021-22,1610612739,CLE,Cleveland Cavaliers,W,242,101,37,89,0.416,7,28.0,0.25,20,27,0.741,12.0,42.0,54.0,23,6.0,6,10,16,6.0,1610612737,ATL,Atlanta Hawks,L,241,95,38,99,0.384,10,34.0,0.294,9,15,0.6,17.0,37.0,54.0,20,5.0,3,9,23,1,196
22100059,2021-10-27,2021-22,1610612740,NOP,New Orleans Pelicans,L,240,99,40,93,0.43,11,36.0,0.306,8,9,0.889,9.0,35.0,44.0,24,5.0,6,9,19,-3.0,1610612737,ATL,Atlanta Hawks,W,240,102,40,96,0.417,8,30.0,0.267,14,17,0.824,21.0,34.0,55.0,21,4.0,4,11,14,0,201
22100066,2021-10-28,2021-22,1610612764,WAS,Washington Wizards,W,240,122,46,99,0.465,13,34.0,0.382,17,21,0.81,14.0,37.0,51.0,26,6.0,4,6,16,11.0,1610612737,ATL,Atlanta Hawks,L,241,111,48,88,0.545,6,21.0,0.286,9,14,0.643,6.0,37.0,43.0,26,4.0,3,13,16,1,233
22100082,2021-10-30,2021-22,1610612755,PHI,Philadelphia 76ers,W,241,122,46,86,0.535,12,38.0,0.316,18,19,0.947,5.0,32.0,37.0,24,11.0,6,12,24,28.0,1610612737,ATL,Atlanta Hawks,L,239,94,36,95,0.379,8,22.0,0.364,14,19,0.737,20.0,29.0,49.0,24,6.0,1,19,17,1,216
22100113,2021-11-03,2021-22,1610612751,BKN,Brooklyn Nets,W,240,117,43,88,0.489,22,48.0,0.458,9,12,0.75,7.0,35.0,42.0,34,7.0,8,14,18,9.0,1610612737,ATL,Atlanta Hawks,L,241,108,41,94,0.436,13,35.0,0.371,13,15,0.867,12.0,40.0,52.0,23,8.0,4,14,16,1,225
22100137,2021-11-06,2021-22,1610612756,PHX,Phoenix Suns,W,240,121,45,88,0.511,14,43.0,0.326,17,21,0.81,5.0,37.0,42.0,26,8.0,6,12,22,4.0,1610612737,ATL,Atlanta Hawks,L,240,117,42,97,0.433,13,33.0,0.394,20,23,0.87,14.0,36.0,50.0,18,3.0,1,12,16,1,238
22100152,2021-11-08,2021-22,1610612744,GSW,Golden State Warriors,W,241,127,44,92,0.478,18,44.0,0.409,21,22,0.955,10.0,29.0,39.0,31,13.0,4,13,22,14.0,1610612737,ATL,Atlanta Hawks,L,239,113,38,82,0.463,17,43.0,0.395,20,26,0.769,3.0,32.0,35.0,21,6.0,2,18,16,1,240
22100156,2021-11-09,2021-22,1610612762,UTA,Utah Jazz,W,240,110,41,81,0.506,15,38.0,0.395,13,16,0.813,8.0,36.0,44.0,21,8.0,4,14,20,12.0,1610612737,ATL,Atlanta Hawks,L,240,98,37,85,0.435,18,35.0,0.514,6,10,0.6,6.0,26.0,32.0,22,6.0,5,11,18,1,208
22100182,2021-11-12,2021-22,1610612743,DEN,Denver Nuggets,W,240,105,40,90,0.444,10,35.0,0.286,15,20,0.75,10.0,39.0,49.0,26,6.0,5,7,21,9.0,1610612737,ATL,Atlanta Hawks,L,238,96,38,93,0.409,5,28.0,0.179,15,21,0.714,14.0,36.0,50.0,20,2.0,5,8,20,1,201
22100277,2021-11-24,2021-22,1610612759,SAS,San Antonio Spurs,L,240,106,43,100,0.43,12,33.0,0.364,8,10,0.8,15.0,29.0,44.0,27,8.0,4,12,16,-18.0,1610612737,ATL,Atlanta Hawks,W,239,124,45,88,0.511,12,26.0,0.462,22,24,0.917,8.0,36.0,44.0,26,10.0,5,9,11,0,230


In [26]:
# apply grouping for rolling calculation
group_cols = [team_col, 'SEASON_ID'] if stratify_by_season else [team_col]
print(group_cols)

# group by 'group_cols' so that you only have home or away team stats stratified by season
rolling_stats = df_sorted.groupby(group_cols)


['AWAY_TEAM_NAME', 'SEASON_ID']


In [27]:
# this cell double-checks the difference step used in the full rolling diff

# keep only the numerical stats columns (18 HOME_ columns followed by their 18 AWAY_ counterparts)
# diff(axis=1, periods=18) subtracts from each column the column 18 positions to its left,
# i.e., AWAY_PTS - HOME_PTS, AWAY_FGM - HOME_FGM, and so on
# the HOME_ columns have no partner and become NaN, so only the AWAY_ columns carry differences
rolling_stats[filtered_stats_cols_both].diff(axis=1,periods=18)

AWAY_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,HOME_PF,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF
Atlanta Hawks,2021-22,22100027,2021-10-23,,,,,,,,,,,,,,,,,,,-6,1,10,-0.032,3,6.0,0.044,-11,-12,-0.141,5.0,-5.0,0.0,-3,-1.0,-3,-1,7
Atlanta Hawks,2021-22,22100059,2021-10-27,,,,,,,,,,,,,,,,,,,3,0,3,-0.013,-3,-6.0,-0.039,6,8,-0.065,12.0,-1.0,11.0,-3,-1.0,-2,2,-5
Atlanta Hawks,2021-22,22100066,2021-10-28,,,,,,,,,,,,,,,,,,,-11,2,-11,0.080,-7,-13.0,-0.096,-8,-7,-0.167,-8.0,0.0,-8.0,0,-2.0,-1,7,0
Atlanta Hawks,2021-22,22100082,2021-10-30,,,,,,,,,,,,,,,,,,,-28,-10,9,-0.156,-4,-16.0,0.048,-4,0,-0.210,15.0,-3.0,12.0,0,-5.0,-5,7,-7
Atlanta Hawks,2021-22,22100113,2021-11-03,,,,,,,,,,,,,,,,,,,-9,-2,6,-0.053,-9,-13.0,-0.087,4,3,0.117,5.0,5.0,10.0,-11,1.0,-4,0,-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Washington Wizards,2023-24,22300520,2024-01-10,,,,,,,,,,,,,,,,,,,-8,-3,-4,-0.014,-1,6.0,-0.094,-1,6,-0.217,-1.0,-7.0,-8.0,-7,1.0,2,0,1
Washington Wizards,2023-24,22300543,2024-01-13,,,,,,,,,,,,,,,,,,,28,15,4,0.141,1,-8.0,0.100,-3,-7,0.081,-4.0,8.0,4.0,11,1.0,2,-2,2
Washington Wizards,2023-24,22300579,2024-01-18,,,,,,,,,,,,,,,,,,,-4,2,4,0.003,-3,-7.0,-0.019,-5,-4,-0.073,2.0,-3.0,-1.0,9,3.0,-1,-5,6
Washington Wizards,2023-24,22300642,2024-01-27,,,,,,,,,,,,,,,,,,,14,8,15,0.015,2,2.0,0.043,-4,-6,0.032,5.0,0.0,5.0,4,8.0,-4,-8,-1


In [28]:
# run the full pipeline from df_sorted: difference, rolling average, then lag
rolling_diff_stats_away = (df_sorted.groupby(group_cols)[filtered_stats_cols_both]
                                .diff(axis=1,periods=18)[filtered_stats_cols_away]
                                .groupby(group_cols, group_keys=False) # group again so each window stays within one team-season
                                .apply(lambda g: g.rolling(window=5, min_periods=1)
                                                  .mean()
                                                  .shift(1))  # lag to exclude the current game from the rolling average
                                .round(5)
                                .add_prefix('ROLLDIFF_')
                                .reset_index() # reset the index to convert GAME_ID back into a column
                     )

# check if rolling diff stats has worked
rolling_diff_stats_away

Unnamed: 0,AWAY_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,ROLLDIFF_AWAY_PTS,ROLLDIFF_AWAY_FGM,ROLLDIFF_AWAY_FGA,ROLLDIFF_AWAY_FG_PCT,ROLLDIFF_AWAY_FG3M,ROLLDIFF_AWAY_FG3A,ROLLDIFF_AWAY_FG3_PCT,ROLLDIFF_AWAY_FTM,ROLLDIFF_AWAY_FTA,ROLLDIFF_AWAY_FT_PCT,ROLLDIFF_AWAY_OREB,ROLLDIFF_AWAY_DREB,ROLLDIFF_AWAY_REB,ROLLDIFF_AWAY_AST,ROLLDIFF_AWAY_STL,ROLLDIFF_AWAY_BLK,ROLLDIFF_AWAY_TOV,ROLLDIFF_AWAY_PF
0,Atlanta Hawks,2021-22,22100027,2021-10-23,,,,,,,,,,,,,,,,,,
1,Atlanta Hawks,2021-22,22100059,2021-10-27,-6.000,1.00,10.000,-0.032,3.000,6.000,0.044,-11.000,-12.000,-0.141,5.0,-5.00,0.00,-3.0,-1.000,-3.00,-1.000,7.000
2,Atlanta Hawks,2021-22,22100066,2021-10-28,-1.500,0.50,6.500,-0.022,0.000,0.000,0.003,-2.500,-2.000,-0.103,8.5,-3.00,5.50,-3.0,-1.000,-2.50,0.500,1.000
3,Atlanta Hawks,2021-22,22100082,2021-10-30,-4.667,1.00,0.667,0.012,-2.333,-4.333,-0.030,-4.333,-3.667,-0.124,3.0,-2.00,1.00,-2.0,-1.333,-2.00,2.667,0.667
4,Atlanta Hawks,2021-22,22100113,2021-11-03,-10.500,-1.75,2.750,-0.030,-2.750,-7.250,-0.011,-4.250,-2.750,-0.146,6.0,-2.25,3.75,-1.5,-2.250,-2.75,3.750,-1.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3167,Washington Wizards,2023-24,22300520,2024-01-10,-13.000,-5.40,-4.200,-0.032,-1.800,-1.400,-0.026,-0.400,0.400,-0.061,-6.2,-8.00,-14.20,-7.0,0.400,0.00,-2.000,-0.800
3168,Washington Wizards,2023-24,22300543,2024-01-13,-13.800,-5.60,-6.000,-0.026,-3.400,-3.200,-0.060,0.800,3.600,-0.111,-5.8,-8.00,-13.80,-7.4,0.600,0.40,-0.600,-2.400
3169,Washington Wizards,2023-24,22300579,2024-01-18,-5.800,-0.80,-2.200,0.005,-3.400,-3.800,-0.056,-0.800,1.400,-0.108,-5.8,-6.60,-12.40,-3.4,1.800,0.60,-1.800,-1.400
3170,Washington Wizards,2023-24,22300642,2024-01-27,-6.800,-0.20,0.800,-0.004,-3.600,-5.200,-0.046,-2.800,-1.200,-0.112,-2.8,-7.00,-9.80,-2.0,2.000,0.40,-2.000,1.800


## Checking by game ID

In [41]:
df_check_away = rolling_diff_stats_away[rolling_diff_stats_away['GAME_ID'] == 22300704]
df_check_home = rolling_diff_stats_home[rolling_diff_stats_home['GAME_ID'] == 22300704]

In [42]:
df_check_home

Unnamed: 0,HOME_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,ROLLDIFF_HOME_PTS,ROLLDIFF_HOME_FGM,ROLLDIFF_HOME_FGA,ROLLDIFF_HOME_FG_PCT,ROLLDIFF_HOME_FG3M,ROLLDIFF_HOME_FG3A,ROLLDIFF_HOME_FG3_PCT,ROLLDIFF_HOME_FTM,ROLLDIFF_HOME_FTA,ROLLDIFF_HOME_FT_PCT,ROLLDIFF_HOME_OREB,ROLLDIFF_HOME_DREB,ROLLDIFF_HOME_REB,ROLLDIFF_HOME_AST,ROLLDIFF_HOME_STL,ROLLDIFF_HOME_BLK,ROLLDIFF_HOME_TOV,ROLLDIFF_HOME_PF
954,Detroit Pistons,2023-24,22300704,2024-02-04,-2.2,-2.0,-3.6,-0.005,0.8,-0.4,0.023,1.0,2.8,-0.044,-0.6,-0.2,-0.8,0.2,-4.0,-1.0,3.0,-2.2


In [43]:
df_check_away

Unnamed: 0,AWAY_TEAM_NAME,SEASON_ID,GAME_ID,GAME_DATE,ROLLDIFF_AWAY_PTS,ROLLDIFF_AWAY_FGM,ROLLDIFF_AWAY_FGA,ROLLDIFF_AWAY_FG_PCT,ROLLDIFF_AWAY_FG3M,ROLLDIFF_AWAY_FG3A,ROLLDIFF_AWAY_FG3_PCT,ROLLDIFF_AWAY_FTM,ROLLDIFF_AWAY_FTA,ROLLDIFF_AWAY_FT_PCT,ROLLDIFF_AWAY_OREB,ROLLDIFF_AWAY_DREB,ROLLDIFF_AWAY_REB,ROLLDIFF_AWAY_AST,ROLLDIFF_AWAY_STL,ROLLDIFF_AWAY_BLK,ROLLDIFF_AWAY_TOV,ROLLDIFF_AWAY_PF
2320,Orlando Magic,2023-24,22300704,2024-02-04,1.4,0.8,6.2,-0.031,-0.8,-2.0,-0.008,0.6,1.2,-0.008,4.8,-4.0,0.8,-1.4,2.4,-0.6,-4.4,0.8
