<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (2021-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Create Team Matchups and Targets](#1.-Create-Team-Matchups-and-Targets)**

- [1.1. Clean Game Data](#1.1.-Clean-Game-Data)

- [1.2. Reshape to Game Matchups](#1.2.-Reshape-to-Game-Matchups)

- [1.3. Create Target Variables](#1.3.-Create-Target-Variables)

**[2. Create Rolling Window Statistics](#2.-Create-Rolling-Window-Statistics)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline

# Data

[Return to top](#Feature-Engineering)

In [25]:
# load team hustle statistics and rename MINUTES to MIN to match the box score data
hustle_stats_df = pd.read_csv('../../data/original/nba_hustle_statistics_2021_2024.csv')
hustle_stats_df = hustle_stats_df.rename(columns={'MINUTES': 'MIN'})
# minutes are stored as strings; keep only the first three characters
hustle_stats_df['MIN'] = hustle_stats_df['MIN'].str.slice(0, 3)
# drop rows whose truncated minutes are '0.0' or '80.', which cannot be cast to int
hustle_stats_df = hustle_stats_df[~hustle_stats_df['MIN'].isin(['0.0', '80.'])]
hustle_stats_df['MIN'] = hustle_stats_df['MIN'].astype(int)
hustle_stats_df.head()

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_NAME,TEAM_ABBREVIATION,MIN,PTS,CONTESTED_SHOTS,CONTESTED_SHOTS_2PT,CONTESTED_SHOTS_3PT,DEFLECTIONS,CHARGES_DRAWN,SCREEN_ASSISTS,SCREEN_AST_PTS,OFF_LOOSE_BALLS_RECOVERED,DEF_LOOSE_BALLS_RECOVERED,LOOSE_BALLS_RECOVERED,OFF_BOXOUTS,DEF_BOXOUTS,BOX_OUT_PLAYER_TEAM_REBS,BOX_OUT_PLAYER_REBS,BOX_OUTS,SEASON_ID,GAME_DATE,MATCHUP
0,22101221,1610612737,Atlanta Hawks,ATL,240,130,45,21,24,12,0,4,10,0,1,1,1,11,9,3,12,22021,2022-04-10,ATL @ HOU
1,22101221,1610612745,Houston Rockets,HOU,240,114,42,24,18,3,0,18,44,5,1,6,1,4,4,3,5,22021,2022-04-10,HOU vs. ATL
2,22101207,1610612737,Atlanta Hawks,ATL,240,109,40,25,15,19,1,7,15,4,3,7,1,6,6,4,7,22021,2022-04-08,ATL @ MIA
3,22101207,1610612748,Miami Heat,MIA,240,113,53,31,22,17,0,9,21,3,2,5,0,10,10,2,10,22021,2022-04-08,MIA vs. ATL
4,22101192,1610612764,Washington Wizards,WAS,240,103,49,27,22,9,0,7,15,2,2,4,2,14,16,10,16,22021,2022-04-06,WAS @ ATL


In [27]:
box_score_df = pd.read_csv('../../data/original/nba_games_box_scores_2022_2024.csv')

In [28]:
box_score_df.tail()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
7520,22023,1610612764,WAS,Washington Wizards,22300642,2024-01-27,WAS @ DET,W,240,118,45,100,0.45,11,34.0,0.324,17,21,0.81,16.0,34.0,50.0,26,10.0,4,9,19,14.0
7521,22023,1610612764,WAS,Washington Wizards,22300665,2024-01-29,WAS @ SAS,W,240,118,46,86,0.535,9,25.0,0.36,17,24,0.708,14.0,31.0,45.0,32,9.0,8,18,15,5.0
7522,22023,1610612764,WAS,Washington Wizards,22300676,2024-01-31,WAS vs. LAC,L,239,109,45,97,0.464,9,29.0,0.31,10,15,0.667,12.0,33.0,45.0,19,4.0,10,13,19,-16.0
7523,22023,1610612764,WAS,Washington Wizards,22300689,2024-02-02,WAS vs. MIA,L,239,102,37,90,0.411,11,42.0,0.262,17,21,0.81,6.0,37.0,43.0,28,5.0,4,8,25,-8.0
7524,22023,1610612764,WAS,Washington Wizards,22300705,2024-02-04,WAS vs. PHX,L,240,112,47,96,0.49,7,32.0,0.219,11,17,0.647,13.0,22.0,35.0,32,11.0,4,18,19,-28.0


<a name="1.-Create-Team-Matchups-and-Targets"></a>
# 1. Create Team Matchups and Targets

[Return to top](#Feature-Engineering)

<a name="1.1.-Clean-Game-Data"></a>
## 1.1. Clean Game Data

[Return to top](#Feature-Engineering)

We need to do three key things to clean the data:

1. Remove games in which a team's aggregated playing time is less than 238 minutes. A regulation game totals 240 team minutes (5 players × 48 minutes), so this threshold removes exhibition matches.
2. Retain only games that are part of the regular season.
3. Remove any orphans (i.e., game IDs that do not have a partner) when reshaping to matchups.

Start and end dates of the last three NBA regular seasons:

- 2021-22 season: 2021-10-19 to 2022-04-10
- 2022-23 season: 2022-10-18 to 2023-04-09
- 2023-24 season: 2023-10-24 to 2024-04-14
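
The three cleaning steps can be sketched in pandas as follows. This is only an illustration: `clean_games_sketch` is a hypothetical stand-in, not the actual `utl.clean_team_bs_data` (whose internals are not shown in this notebook), and the toy data is made up. It assumes the column names `GAME_ID`, `TEAM_ID`, `GAME_DATE`, and `MIN` seen in the data above, and it applies the orphan filter immediately for illustration, whereas the notebook removes orphans during the reshape step.

```python
import pandas as pd

def clean_games_sketch(df, season_start_dates, season_end_dates, season_labels,
                       min_minutes=238):
    """Hypothetical sketch of the cleaning steps; not the notebook's utl function."""
    out = df.copy()
    out["GAME_DATE"] = pd.to_datetime(out["GAME_DATE"])

    # 1. drop team rows with fewer than `min_minutes` aggregated minutes
    out = out[out["MIN"] >= min_minutes].copy()

    # 2. keep only regular-season dates and label each row with its season
    out["SEASON_ID"] = None
    for start, end, label in zip(season_start_dates, season_end_dates, season_labels):
        in_season = out["GAME_DATE"].between(pd.Timestamp(start), pd.Timestamp(end))
        out.loc[in_season, "SEASON_ID"] = label
    out = out[out["SEASON_ID"].notna()]

    # 3. drop orphans: game IDs that no longer have exactly two team rows
    paired = out.groupby("GAME_ID")["TEAM_ID"].transform("nunique") == 2
    return out[paired].reset_index(drop=True)

# made-up toy data: game 2 is preseason, game 3 has one row, game 4 has a short row
toy = pd.DataFrame({
    "GAME_ID":   [1, 1, 2, 2, 3, 4, 4],
    "TEAM_ID":   [10, 20, 10, 30, 20, 10, 20],
    "GAME_DATE": ["2021-10-20", "2021-10-20", "2021-10-05", "2021-10-05",
                  "2022-11-01", "2023-01-05", "2023-01-05"],
    "MIN":       [240, 240, 240, 240, 240, 120, 240],
})
cleaned = clean_games_sketch(
    toy,
    season_start_dates=["2021-10-19", "2022-10-18", "2023-10-24"],
    season_end_dates=["2022-04-10", "2023-04-09", "2024-04-14"],
    season_labels=["2021-22", "2022-23", "2023-24"],
)
print(cleaned)
```

Only game 1 survives in the toy data: it falls inside a regular season, both team rows meet the minutes threshold, and it has a partner row.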

In [29]:
# start dates, end dates, and labels for the last three regular seasons
season_start_dates = ['2021-10-19', '2022-10-18', '2023-10-24']
season_end_dates   = ['2022-04-10', '2023-04-09', '2024-04-14']
season_labels      = ['2021-22', '2022-23', '2023-24']

In [30]:
# clean up the data
hustle_stats_df_cleaned = utl.clean_team_bs_data(hustle_stats_df, season_start_dates=season_start_dates, 
                                            season_end_dates=season_end_dates, season_labels=season_labels)

Season 2021-22: 1230 games
Season 2022-23: 1230 games
Season 2023-24: 842 games


In [31]:
# clean up the data
box_score_df_cleaned = utl.clean_team_bs_data(box_score_df, season_start_dates=season_start_dates, 
                                            season_end_dates=season_end_dates, season_labels=season_labels)

Season 2021-22: 1230 games
Season 2022-23: 1230 games
Season 2023-24: 736 games


In [32]:
# add WL and PLUS_MINUS from the cleaned box scores, matching on game and team
hustle_stats_df_cleaned = pd.merge(
    hustle_stats_df_cleaned,
    box_score_df_cleaned[['WL', 'GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'PLUS_MINUS']],
    on=['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION']
)

hustle_stats_df_cleaned.sort_values(by=['PTS'])

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_NAME,TEAM_ABBREVIATION,MIN,PTS,CONTESTED_SHOTS,CONTESTED_SHOTS_2PT,CONTESTED_SHOTS_3PT,DEFLECTIONS,CHARGES_DRAWN,SCREEN_ASSISTS,SCREEN_AST_PTS,OFF_LOOSE_BALLS_RECOVERED,DEF_LOOSE_BALLS_RECOVERED,LOOSE_BALLS_RECOVERED,OFF_BOXOUTS,DEF_BOXOUTS,BOX_OUT_PLAYER_TEAM_REBS,BOX_OUT_PLAYER_REBS,BOX_OUTS,SEASON_ID,GAME_DATE,MATCHUP,WL,PLUS_MINUS
889,22100075,1610612742,Dallas Mavericks,DAL,240,75,38,22,16,18,0,6,15,2,1,3,0,8,4,1,8,2021-22,2021-10-29,DAL @ DEN,L,-31.0
224,22100717,1610612758,Sacramento Kings,SAC,240,75,72,39,33,17,0,4,9,4,5,9,1,6,6,2,7,2021-22,2022-01-25,SAC @ BOS,L,-53.0
242,22100595,1610612752,New York Knicks,NYK,240,75,46,27,19,14,0,10,22,1,3,4,4,11,14,6,15,2021-22,2022-01-08,NYK @ BOS,L,-24.0
787,22100988,1610612742,Dallas Mavericks,DAL,240,77,46,28,18,14,1,10,22,2,1,3,1,9,8,3,10,2021-22,2022-03-09,DAL vs. NYK,L,-30.0
727,22100257,1610612741,Chicago Bulls,CHI,240,77,51,33,18,15,0,7,15,1,0,1,2,9,9,0,11,2021-22,2021-11-22,CHI vs. IND,L,-32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2316,22100899,1610612759,San Antonio Spurs,SAS,290,157,68,40,28,13,0,17,37,4,5,9,3,6,8,5,9,2021-22,2022-02-25,SAS @ WAS,W,4.0
3480,22201230,1610612744,Golden State Warriors,GSW,240,157,59,37,22,14,0,8,21,2,4,6,4,4,7,5,8,2022-23,2023-04-09,GSW @ POR,W,56.0
2048,22100723,1610612766,Charlotte Hornets,CHA,240,158,44,25,19,15,1,7,18,5,5,10,1,5,4,0,6,2021-22,2022-01-26,CHA @ IND,W,32.0
3751,22200902,1610612746,LA Clippers,LAC,290,175,51,27,24,18,1,4,9,3,4,7,1,5,6,3,6,2022-23,2023-02-24,LAC vs. SAC,L,-1.0


<a name="1.2.-Reshape-to-Game-Matchups"></a>
## 1.2. Reshape to Game Matchups

[Return to top](#Feature-Engineering)

In [33]:
# identify non-stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# reshape team box score data to wide format so each row is a game matchup
hustle_stats_matchups_df = utl.reshape_team_bs_to_matchups(hustle_stats_df_cleaned, non_stats_cols)

Season 2021-22: 1222 games
Season 2022-23: 1221 games
Season 2023-24: 728 games


In [34]:
hustle_stats_matchups_df.head()

Unnamed: 0,GAME_ID,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBREVIATION,HOME_MIN,HOME_PTS,HOME_CONTESTED_SHOTS,HOME_CONTESTED_SHOTS_2PT,HOME_CONTESTED_SHOTS_3PT,HOME_DEFLECTIONS,HOME_CHARGES_DRAWN,HOME_SCREEN_ASSISTS,HOME_SCREEN_AST_PTS,HOME_OFF_LOOSE_BALLS_RECOVERED,HOME_DEF_LOOSE_BALLS_RECOVERED,HOME_LOOSE_BALLS_RECOVERED,HOME_OFF_BOXOUTS,HOME_DEF_BOXOUTS,HOME_BOX_OUT_PLAYER_TEAM_REBS,HOME_BOX_OUT_PLAYER_REBS,HOME_BOX_OUTS,SEASON_ID,GAME_DATE,HOME_WL,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBREVIATION,AWAY_MIN,AWAY_PTS,AWAY_CONTESTED_SHOTS,AWAY_CONTESTED_SHOTS_2PT,AWAY_CONTESTED_SHOTS_3PT,AWAY_DEFLECTIONS,AWAY_CHARGES_DRAWN,AWAY_SCREEN_ASSISTS,AWAY_SCREEN_AST_PTS,AWAY_OFF_LOOSE_BALLS_RECOVERED,AWAY_DEF_LOOSE_BALLS_RECOVERED,AWAY_LOOSE_BALLS_RECOVERED,AWAY_OFF_BOXOUTS,AWAY_DEF_BOXOUTS,AWAY_BOX_OUT_PLAYER_TEAM_REBS,AWAY_BOX_OUT_PLAYER_REBS,AWAY_BOX_OUTS,AWAY_WL,AWAY_PLUS_MINUS
0,22101221,1610612745,Houston Rockets,HOU,240,114,42,24,18,3,0,18,44,5,1,6,1,4,4,3,5,2021-22,2022-04-10,L,-16.0,1610612737,Atlanta Hawks,ATL,240,130,45,21,24,12,0,4,10,0,1,1,1,11,9,3,12,W,16.0
1,22101207,1610612748,Miami Heat,MIA,240,113,53,31,22,17,0,9,21,3,2,5,0,10,10,2,10,2021-22,2022-04-08,W,4.0,1610612737,Atlanta Hawks,ATL,240,109,40,25,15,19,1,7,15,4,3,7,1,6,6,4,7,L,-4.0
2,22101192,1610612737,Atlanta Hawks,ATL,240,118,48,28,20,14,0,10,26,2,1,3,3,7,10,6,10,2021-22,2022-04-06,W,15.0,1610612764,Washington Wizards,WAS,240,103,49,27,22,9,0,7,15,2,2,4,2,14,16,10,16,L,-15.0
3,22101182,1610612761,Toronto Raptors,TOR,240,118,60,39,21,6,0,4,8,2,4,6,3,11,13,6,14,2021-22,2022-04-05,W,10.0,1610612737,Atlanta Hawks,ATL,240,108,64,42,22,7,0,7,16,3,2,5,0,13,12,7,13,L,-10.0
4,22101163,1610612737,Atlanta Hawks,ATL,240,122,58,37,21,13,0,6,13,2,6,8,2,9,11,7,11,2021-22,2022-04-02,W,7.0,1610612751,Brooklyn Nets,BKN,240,115,51,28,23,11,0,14,34,2,3,5,3,12,15,5,15,L,-7.0


<a name="1.3.-Create-Target-Variables"></a>
## 1.3. Create Target Variables

[Return to top](#Feature-Engineering)

There are three targets of interest:

1. **Total Game Points (over / under):** The sum `HOME_PTS + AWAY_PTS`.
2. **Difference in Game Points (plus / minus):** The difference `HOME_PTS - AWAY_PTS`, expressed relative to the home team.
3. **Game Winner (moneyline):** Defined relative to the home team using the `HOME_WL` column, where a home win equals 1 and a home loss equals 0. We will create a new column called `GAME_RESULT` for this indicator.
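
A minimal sketch of how these three targets could be derived is shown below. `create_targets_sketch` is a hypothetical helper written for illustration; it is not the notebook's `utl.create_target_variables`, and it assumes `HOME_WL` holds the strings `'W'` and `'L'` as in the matchup table above.

```python
import pandas as pd

def create_targets_sketch(df, wl_col="HOME_WL", home_pts_col="HOME_PTS",
                          away_pts_col="AWAY_PTS"):
    """Hypothetical sketch; not the notebook's utl.create_target_variables."""
    out = df.copy()
    # moneyline: 1 if the home team won, 0 otherwise
    out["GAME_RESULT"] = (out[wl_col] == "W").astype(int)
    # over / under: combined points of both teams
    out["TOTAL_PTS"] = out[home_pts_col] + out[away_pts_col]
    # plus / minus: point margin from the home team's perspective
    out["PLUS_MINUS"] = out[home_pts_col] - out[away_pts_col]
    return out

# two toy rows using scores of the form shown in the matchup data
toy = pd.DataFrame({
    "HOME_WL":  ["W", "L"],
    "HOME_PTS": [117, 117],
    "AWAY_PTS": [114, 124],
})
targets = create_targets_sketch(toy)
print(targets[["GAME_RESULT", "TOTAL_PTS", "PLUS_MINUS"]])
```

Note that this sketch yields an integer `PLUS_MINUS`, while the notebook's output shows a float column.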

In [35]:
# create the above three target variables
hustle_stats_matchups_df = utl.create_target_variables(hustle_stats_matchups_df, 'HOME_WL', 'HOME_PTS', 'AWAY_PTS')

In [36]:
hustle_stats_matchups_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS']].tail()

Unnamed: 0,GAME_DATE,GAME_ID,HOME_TEAM_NAME,AWAY_TEAM_NAME,HOME_PTS,AWAY_PTS,GAME_RESULT,TOTAL_PTS,PLUS_MINUS
3166,2023-11-22,22300225,Charlotte Hornets,Washington Wizards,117,114,1,231,3.0
3167,2023-11-10,22300009,Washington Wizards,Charlotte Hornets,117,124,0,241,-7.0
3168,2023-11-08,22300157,Charlotte Hornets,Washington Wizards,116,132,0,248,-16.0
3169,2024-01-24,22300619,Detroit Pistons,Charlotte Hornets,113,106,1,219,7.0
3170,2023-10-27,22300077,Charlotte Hornets,Detroit Pistons,99,111,0,210,-12.0


<a name="2.-Create-Rolling-Window-Statistics"></a>
# 2. Create Rolling Window Statistics

[Return to top](#Feature-Engineering)

Here we create average hustle statistics for each team over a rolling window of the previous $n$ games.
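
The idea of a rolling average over previous games can be sketched on a long-format table (one row per team per game). This is an illustration under assumptions, not the implementation of `utl.process_rolling_stats`: the helper name `rolling_team_means_sketch` is hypothetical, and shifting by one game so that the current game is excluded from its own average is an assumption made here to avoid leaking the outcome into the features.

```python
import pandas as pd

def rolling_team_means_sketch(long_df, stat_cols, window_size=5, min_obs=1,
                              stratify_by_season=True):
    """Hypothetical sketch of per-team rolling means over previous games."""
    # restart the window each season if requested, otherwise run across seasons
    keys = ["TEAM_ID", "SEASON_ID"] if stratify_by_season else ["TEAM_ID"]
    out = long_df.sort_values("GAME_DATE").copy()
    for col in stat_cols:
        out[f"ROLL_{col}"] = out.groupby(keys)[col].transform(
            # shift(1) drops the current game, so only earlier games are averaged
            lambda s: s.shift(1).rolling(window_size, min_periods=min_obs).mean()
        )
    return out

# made-up toy data: one team, three games in season A and one in season B
toy = pd.DataFrame({
    "TEAM_ID":   [1, 1, 1, 1],
    "SEASON_ID": ["A", "A", "A", "B"],
    "GAME_DATE": ["2022-01-01", "2022-01-03", "2022-01-05", "2022-10-20"],
    "PTS":       [100, 110, 120, 130],
})
rolled = rolling_team_means_sketch(toy, ["PTS"], window_size=2)
print(rolled[["SEASON_ID", "GAME_DATE", "PTS", "ROLL_PTS"]])
```

In the toy data the first game of each season has no previous games, so its rolling value is missing; the second and third games of season A average the one and two preceding games, respectively.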

In [37]:
# identify non-stats columns; all remaining columns are treated as stats
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'AWAY_TEAM_ID',
                  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_WL', 'AWAY_WL', 'HOME_MIN', 
                  'AWAY_MIN', 'HOME_TEAM_ABBREVIATION', 'AWAY_TEAM_ABBREVIATION']
stats_cols = [col for col in hustle_stats_matchups_df.columns if col not in non_stats_cols]

In [44]:
# calculate rolling averages for each statistic and add them to the DataFrame
hustle_stats_matchups_roll_df = utl.process_rolling_stats(
    hustle_stats_matchups_df, 
    stats_cols, 
    target_cols=['GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS'],
    window_size=5,            # number of games in the rolling window
    min_obs=1,                # minimum observations required in the window to yield a value
    stratify_by_season=True,  # True: reset the rolling calculations each season; False: run contiguously across seasons
    exclude_initial_games=0   # number of initial games to exclude from the rolling averages (optionally by season)
)

In [45]:
hustle_stats_matchups_roll_df.tail()

Unnamed: 0,GAME_ID,GAME_RESULT,TOTAL_PTS,PLUS_MINUS,HOME_TEAM_NAME,SEASON_ID,GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_CONTESTED_SHOTS,ROLL_HOME_CONTESTED_SHOTS_2PT,ROLL_HOME_CONTESTED_SHOTS_3PT,ROLL_HOME_DEFLECTIONS,ROLL_HOME_CHARGES_DRAWN,ROLL_HOME_SCREEN_ASSISTS,ROLL_HOME_SCREEN_AST_PTS,ROLL_HOME_OFF_LOOSE_BALLS_RECOVERED,ROLL_HOME_DEF_LOOSE_BALLS_RECOVERED,ROLL_HOME_LOOSE_BALLS_RECOVERED,ROLL_HOME_OFF_BOXOUTS,ROLL_HOME_DEF_BOXOUTS,ROLL_HOME_BOX_OUT_PLAYER_TEAM_REBS,ROLL_HOME_BOX_OUT_PLAYER_REBS,ROLL_HOME_BOX_OUTS,AWAY_TEAM_NAME,ROLL_AWAY_PTS,ROLL_AWAY_CONTESTED_SHOTS,ROLL_AWAY_CONTESTED_SHOTS_2PT,ROLL_AWAY_CONTESTED_SHOTS_3PT,ROLL_AWAY_DEFLECTIONS,ROLL_AWAY_CHARGES_DRAWN,ROLL_AWAY_SCREEN_ASSISTS,ROLL_AWAY_SCREEN_AST_PTS,ROLL_AWAY_OFF_LOOSE_BALLS_RECOVERED,ROLL_AWAY_DEF_LOOSE_BALLS_RECOVERED,ROLL_AWAY_LOOSE_BALLS_RECOVERED,ROLL_AWAY_OFF_BOXOUTS,ROLL_AWAY_DEF_BOXOUTS,ROLL_AWAY_BOX_OUT_PLAYER_TEAM_REBS,ROLL_AWAY_BOX_OUT_PLAYER_REBS,ROLL_AWAY_BOX_OUTS
2540,22300703,0,218,-16.0,San Antonio Spurs,2023-24,2024-02-03,114.6,43.65,28.85,14.8,13.95,0.3,7.3,16.65,2.85,2.6,5.45,1.15,4.1,5.1,2.2,5.25,Cleveland Cavaliers,112.8,43.65,26.8,16.85,14.2,0.5,10.1,23.4,2.45,2.65,5.1,1.15,7.4,8.3,3.85,8.55
3077,22300705,0,252,-28.0,Washington Wizards,2023-24,2024-02-04,115.1,44.0,29.05,14.95,12.95,0.4,9.65,21.7,2.4,2.75,5.15,1.05,5.8,6.45,2.95,6.85,Phoenix Suns,119.85,42.35,27.75,14.6,14.5,0.25,10.6,23.45,3.45,3.5,6.95,1.45,5.95,6.8,3.35,7.4
3017,22300704,0,210,-12.0,Detroit Pistons,2023-24,2024-02-04,113.05,52.45,33.7,18.75,11.85,0.1,9.8,22.05,3.5,2.6,6.1,1.85,5.8,7.55,4.1,7.65,Orlando Magic,111.7,40.0,23.7,16.3,14.9,0.55,8.55,18.7,2.3,2.9,5.2,2.3,5.35,7.3,3.25,7.65
3036,22300707,0,214,-18.6,Charlotte Hornets,2023-24,2024-02-04,107.55,46.25,27.2,19.05,12.1,0.75,9.65,22.0,2.35,2.4,4.75,1.75,5.45,6.95,3.35,7.2,Indiana Pacers,121.9,46.7,30.45,16.25,13.9,0.35,7.75,16.6,2.7,3.0,5.7,1.75,4.8,6.1,3.4,6.55
2492,22300706,1,222,40.0,Boston Celtics,2023-24,2024-02-04,121.05,47.15,30.1,17.05,14.1,0.45,8.3,20.4,2.7,2.8,5.5,1.4,6.9,8.0,4.05,8.3,Memphis Grizzlies,110.1,44.6,28.4,16.2,13.7,0.7,7.5,17.25,2.65,2.85,5.5,1.05,4.6,5.55,2.75,5.65


In [46]:
# write out the matchups with rolling features
hustle_stats_matchups_roll_df.to_csv('../../data/processed/nba_team_matchups_rolling_hustle_stats_2021_2024_r05.csv', index=False)