<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (2022-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Create Team Matchups and Targets](#1.-Create-Team-Matchups-and-Targets)**

- [1.1. Clean Game Data](#1.1.-Clean-Game-Data)

- [1.2. Reshape to Game Matchups](#1.2.-Reshape-to-Game-Matchups)

- [1.3. Create Target Variables](#1.3.-Create-Target-Variables)

**[2. Create Rolling Window Statistics](#2.-Create-Rolling-Window-Statistics)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
# basic modules
import os
import time
import random as rn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotting style
plt.style.use('seaborn-v0_8-notebook')
sns.set_style('white')
#sns.set_style('darkgrid')

# pandas tricks for better display
pd.options.display.max_columns = 50  
pd.options.display.max_rows = 500     
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

# preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# warnings
import warnings
warnings.filterwarnings("ignore")

# user defined functions
import utility_functions as utl

# Data

[Return to top](#Feature-Engineering)

In [2]:
team_bs_df = pd.read_csv('../data/original/nba_games_box_scores_2022_2024.csv')

In [3]:
team_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7525 entries, 0 to 7524
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SEASON_ID          7525 non-null   int64  
 1   TEAM_ID            7525 non-null   int64  
 2   TEAM_ABBREVIATION  7525 non-null   object 
 3   TEAM_NAME          7525 non-null   object 
 4   GAME_ID            7525 non-null   int64  
 5   GAME_DATE          7525 non-null   object 
 6   MATCHUP            7525 non-null   object 
 7   WL                 7514 non-null   object 
 8   MIN                7525 non-null   int64  
 9   PTS                7525 non-null   int64  
 10  FGM                7525 non-null   int64  
 11  FGA                7525 non-null   int64  
 12  FG_PCT             7523 non-null   float64
 13  FG3M               7525 non-null   int64  
 14  FG3A               7525 non-null   float64
 15  FG3_PCT            7523 non-null   float64
 16  FTM                7525 

<a name="1.-Create-Team-Matchups-and-Targets"></a>
# 1. Create Team Matchups and Targets

[Return to top](#Feature-Engineering)

<a name="1.1.-Clean-Game-Data"></a>
## 1.1. Clean Game Data

[Return to top](#Feature-Engineering)

Cleaning the data involves three steps:

1. Remove rows where a team's aggregated minutes played (`MIN`) are below 238. A regulation game totals 240 team minutes (5 players × 48 minutes), so this filters out exhibition matches.
2. Retain only games that are part of the regular season.
3. Remove any orphans (i.e., game IDs left with only one team's row, and so no partner) when reshaping to matchups.

Start and end dates of the last three NBA regular seasons:

- 2021-22 season: 2021-10-19 to 2022-04-10
- 2022-23 season: 2022-10-18 to 2023-04-09
- 2023-24 season: 2023-10-24 to 2024-04-14

In [4]:
# last 3 seasons start and end dates and labels
season_start_dates = ['2021-10-19', '2022-10-18', '2023-10-24']
season_end_dates   = ['2022-04-10', '2023-04-09', '2024-04-14']
season_labels      = ['2021-22', '2022-23', '2023-24']

In [5]:
# clean up the data
team_bs_df_cleaned = utl.clean_team_bs_data(team_bs_df, season_start_dates=season_start_dates, 
                                            season_end_dates=season_end_dates, season_labels=season_labels)

Season 2021-22: 1230 games
Season 2022-23: 1230 games
Season 2023-24: 736 games
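
The internals of `utl.clean_team_bs_data` are not shown in this notebook. As a rough sketch of the first two cleaning steps, assuming the helper filters on `MIN` and on the season date windows, something like the following would work. The function name `clean_box_scores_sketch`, its `min_minutes` parameter, and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def clean_box_scores_sketch(df, start_dates, end_dates, labels, min_minutes=238):
    """Hypothetical stand-in for utl.clean_team_bs_data (not the real helper)."""
    out = df.copy()
    out["GAME_DATE"] = pd.to_datetime(out["GAME_DATE"])
    # step 1: drop rows whose aggregated team minutes fall below the threshold
    out = out.loc[out["MIN"] >= min_minutes].copy()
    # step 2: label rows that fall inside a regular-season window (bounds inclusive)
    season = pd.Series(pd.NA, index=out.index, dtype="object")
    for start, end, label in zip(start_dates, end_dates, labels):
        in_season = out["GAME_DATE"].between(pd.Timestamp(start), pd.Timestamp(end))
        season[in_season] = label
    out["SEASON_ID"] = season
    # rows outside every window (preseason, playoffs) stay unlabeled and are dropped
    return out.loc[out["SEASON_ID"].notna()].reset_index(drop=True)

# toy data: a preseason game, a game with one short-minutes row, and a postseason game
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2, 3, 3],
    "GAME_DATE": ["2021-10-05", "2021-10-05", "2021-10-21",
                  "2021-10-21", "2022-04-20", "2022-04-20"],
    "MIN": [240, 240, 240, 120, 240, 240],
})
cleaned = clean_box_scores_sketch(toy, ["2021-10-19"], ["2022-04-10"], ["2021-22"])
print(cleaned)
```

In this toy example only one row of game 2 survives, which leaves it as an orphan for the reshaping step to remove.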


<a name="1.2.-Reshape-to-Game-Matchups"></a>
## 1.2. Reshape to Game Matchups

[Return to top](#Feature-Engineering)

In [6]:
# identify non-stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# reshape team box score data to wide format so each row is a game matchup
team_bs_matchups_df = utl.reshape_team_bs_to_matchups(team_bs_df_cleaned, non_stats_cols)

Season 2021-22: 1222 games
Season 2022-23: 1221 games
Season 2023-24: 729 games
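
The reshaping helper is also defined outside this notebook. A minimal sketch of one way to pair the two team rows of each game is given below. It assumes that `MATCHUP` marks home rows with `vs.` (e.g. `ATL vs. DAL`) and away rows with `@` (e.g. `DAL @ ATL`), and that `GAME_ID` is among the shared columns. The function name `reshape_to_matchups_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def reshape_to_matchups_sketch(df, shared_cols):
    """Hypothetical stand-in for utl.reshape_team_bs_to_matchups."""
    # assumption: home rows read 'ATL vs. DAL', away rows read 'DAL @ ATL'
    is_home = df["MATCHUP"].str.contains("vs.", regex=False)
    team_cols = [c for c in df.columns if c not in shared_cols]
    keys = [c for c in shared_cols if c != "MATCHUP"]
    home = df.loc[is_home, keys + team_cols].rename(
        columns={c: f"HOME_{c}" for c in team_cols})
    away = df.loc[~is_home, ["GAME_ID"] + team_cols].rename(
        columns={c: f"AWAY_{c}" for c in team_cols})
    # the inner join keeps only games with both a home and an away row (drops orphans)
    return home.merge(away, on="GAME_ID", how="inner")

# toy data: game 10 has both rows, game 11 is an orphan
toy = pd.DataFrame({
    "SEASON_ID": ["2021-22"] * 3,
    "GAME_ID": [10, 10, 11],
    "GAME_DATE": ["2021-10-21"] * 3,
    "MATCHUP": ["ATL vs. DAL", "DAL @ ATL", "BOS vs. NYK"],
    "TEAM_ABBREVIATION": ["ATL", "DAL", "BOS"],
    "PTS": [113, 87, 99],
})
matchups = reshape_to_matchups_sketch(
    toy, ["SEASON_ID", "GAME_ID", "GAME_DATE", "MATCHUP"])
print(matchups)
```

An inner join is used here because it removes orphaned game IDs as a side effect of the pairing, which would be consistent with the lower game counts printed after reshaping.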


In [7]:
team_bs_matchups_df.head()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,AWAY_PLUS_MINUS
0,2021-22,1610612737,ATL,Atlanta Hawks,22100014,2021-10-21,W,242,113,45,94,0.479,15,35.0,0.429,8,9,0.889,6.0,49.0,55.0,31,8.0,9,13,...,26.0,1610612742,DAL,Dallas Mavericks,L,240,87,31,93,0.333,13,43.0,0.302,12,13,0.923,10.0,40.0,50.0,16,7.0,3,15,21,-26.0
1,2021-22,1610612737,ATL,Atlanta Hawks,22100043,2021-10-25,W,238,122,46,90,0.511,12,32.0,0.375,18,21,0.857,10.0,39.0,49.0,24,11.0,3,13,...,18.0,1610612765,DET,Detroit Pistons,L,239,104,40,91,0.44,9,33.0,0.273,15,18,0.833,11.0,25.0,36.0,26,7.0,6,14,15,-18.0
2,2021-22,1610612737,ATL,Atlanta Hawks,22100097,2021-11-01,W,240,118,38,83,0.458,13,34.0,0.382,29,29,1.0,13.0,34.0,47.0,24,9.0,5,11,...,7.0,1610612764,WAS,Washington Wizards,L,240,111,41,86,0.477,13,39.0,0.333,16,16,1.0,7.0,29.0,36.0,27,7.0,4,12,24,-7.0
3,2021-22,1610612737,ATL,Atlanta Hawks,22100120,2021-11-04,L,240,98,35,82,0.427,7,28.0,0.25,21,26,0.808,5.0,27.0,32.0,18,11.0,4,9,...,-18.0,1610612762,UTA,Utah Jazz,W,240,116,41,81,0.506,17,41.0,0.415,17,23,0.739,8.0,38.0,46.0,30,6.0,4,14,20,18.0
4,2021-22,1610612737,ATL,Atlanta Hawks,22100193,2021-11-14,W,241,120,47,97,0.485,15,35.0,0.429,11,13,0.846,15.0,36.0,51.0,21,6.0,1,12,...,20.0,1610612749,MIL,Milwaukee Bucks,L,240,100,38,84,0.452,14,41.0,0.341,10,16,0.625,4.0,26.0,30.0,24,8.0,3,11,17,-20.0


<a name="1.3.-Create-Target-Variables"></a>
## 1.3. Create Target Variables

[Return to top](#Feature-Engineering)

There are three targets of interest:

1. **Total Game Points (over / under):** The sum `HOME_PTS + AWAY_PTS`.
2. **Difference in Game Points (plus / minus):** The difference `HOME_PTS - AWAY_PTS`, defined relative to the home team.
3. **Game Winner (moneyline):** A binary indicator derived from the `HOME_WL` column, equal to 1 if the home team won and 0 if it lost. We store this indicator in a new column called `GAME_RESULT`.
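
The three definitions above can be sketched in a few lines of pandas. This is only an illustration of the arithmetic, not the body of `utl.create_target_variables`. The function name `create_targets_sketch` is hypothetical, and the toy rows reuse the scores of two Atlanta home games from the matchup table in section 1.2.

```python
import pandas as pd

def create_targets_sketch(df, wl_col="HOME_WL", home_pts="HOME_PTS", away_pts="AWAY_PTS"):
    """Hypothetical sketch of the three targets, all relative to the home team."""
    out = df.copy()
    # moneyline: 1 if the home team won, 0 otherwise (a missing WL would also map to 0)
    out["GAME_RESULT"] = (out[wl_col] == "W").astype(int)
    # over / under: combined points of both teams
    out["TOTAL_PTS"] = out[home_pts] + out[away_pts]
    # plus / minus: home margin, negative when the home team lost
    out["PLUS_MINUS"] = out[home_pts] - out[away_pts]
    return out

toy = pd.DataFrame({
    "HOME_WL": ["W", "L"],
    "HOME_PTS": [113, 98],
    "AWAY_PTS": [87, 116],
})
targets = create_targets_sketch(toy)
print(targets)
```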

In [8]:
# create the above three target variables
team_bs_matchups_df = utl.create_target_variables(team_bs_matchups_df, 'HOME_WL', 'HOME_PTS', 'AWAY_PTS')

In [9]:
team_bs_matchups_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS']].tail()

| | GAME_DATE | GAME_ID | HOME_TEAM_NAME | AWAY_TEAM_NAME | HOME_PTS | AWAY_PTS | GAME_RESULT | TOTAL_PTS | PLUS_MINUS |
|---|---|---|---|---|---|---|---|---|---|
| 3167 | 2024-01-24 | 22300620 | Washington Wizards | Minnesota Timberwolves | 107 | 118 | 0 | 225 | -11.0 |
| 3168 | 2024-01-25 | 22300628 | Washington Wizards | Utah Jazz | 108 | 123 | 0 | 231 | -15.0 |
| 3169 | 2024-01-31 | 22300676 | Washington Wizards | LA Clippers | 109 | 125 | 0 | 234 | -16.0 |
| 3170 | 2024-02-02 | 22300689 | Washington Wizards | Miami Heat | 102 | 110 | 0 | 212 | -8.0 |
| 3171 | 2024-02-04 | 22300705 | Washington Wizards | Phoenix Suns | 112 | 140 | 0 | 252 | -28.0 |


<a name="2.-Create-Rolling-Window-Statistics"></a>
# 2. Create Rolling Window Statistics

[Return to top](#Feature-Engineering)

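Here we compute, for each team, the average of its box-score statistics over a rolling window of its previous $n$ games. The current game is excluded from its own window, so a team's first row has no rolling values (`NaN`).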

In [10]:
# identify stats columns (every column that is not an identifier or metadata column)
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'AWAY_TEAM_ID',
                  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_WL', 'AWAY_WL', 'HOME_MIN', 
                  'AWAY_MIN', 'HOME_TEAM_ABBREVIATION', 'AWAY_TEAM_ABBREVIATION']
stats_cols = [col for col in team_bs_matchups_df.columns if col not in non_stats_cols]

In [11]:
# calculate rolling averages for each statistic and add them to the DataFrame
team_bs_matchups_roll_df = utl.process_rolling_stats(
    team_bs_matchups_df, 
    stats_cols, 
    target_cols=['GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS'],
    window_size=5,   # the number of games to include in the rolling window
    min_obs=1,       # the minimum number of observations present within the window to yield an aggregate value
    stratify_by_season=True,  # should the rolling calculations be reset at the start of each new season or be contiguous across seasons? 
    exclude_initial_games=0   # number of initial games to exclude from the rolling averages (optionally by season)
)
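
The implementation of `utl.process_rolling_stats` is not shown here. As a minimal sketch of the core idea, assuming a shifted rolling mean within each team (and season, when stratifying), the following reproduces the first `ROLL_HOME_PTS` values for the Atlanta rows in this section's `head()` output. The function name `rolling_prior_mean_sketch` and its `group_cols` parameter are hypothetical, and the `exclude_initial_games` option is not covered.

```python
import pandas as pd

def rolling_prior_mean_sketch(df, stat_cols, group_cols, window_size=5, min_obs=1):
    """Hypothetical sketch, not utl.process_rolling_stats itself."""
    out = df.sort_values("GAME_DATE").copy()
    for col in stat_cols:
        # shift(1) drops the current game from its own window,
        # so only earlier games in the same group are averaged
        out[f"ROLL_{col}"] = out.groupby(group_cols)[col].transform(
            lambda s: s.shift(1).rolling(window_size, min_periods=min_obs).mean()
        )
    return out

# toy data: HOME_PTS of the five Atlanta home rows shown in section 1.2
toy = pd.DataFrame({
    "SEASON_ID": ["2021-22"] * 5,
    "HOME_TEAM_NAME": ["Atlanta Hawks"] * 5,
    "GAME_DATE": ["2021-10-21", "2021-10-25", "2021-11-01", "2021-11-04", "2021-11-14"],
    "HOME_PTS": [113, 122, 118, 98, 120],
})
# grouping by team and season mimics stratify_by_season=True
rolled = rolling_prior_mean_sketch(toy, ["HOME_PTS"], ["HOME_TEAM_NAME", "SEASON_ID"])
print(rolled[["GAME_DATE", "HOME_PTS", "ROLL_HOME_PTS"]])
```

Shifting before rolling matters for modelling: without it, each row's features would include the box score of the very game whose outcome is being predicted.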

In [12]:
team_bs_matchups_roll_df.head()

Unnamed: 0,GAME_ID,GAME_RESULT,TOTAL_PTS,PLUS_MINUS,HOME_TEAM_NAME,SEASON_ID,GAME_DATE,ROLL_HOME_PTS,ROLL_HOME_FGM,ROLL_HOME_FGA,ROLL_HOME_FG_PCT,ROLL_HOME_FG3M,ROLL_HOME_FG3A,ROLL_HOME_FG3_PCT,ROLL_HOME_FTM,ROLL_HOME_FTA,ROLL_HOME_FT_PCT,ROLL_HOME_OREB,ROLL_HOME_DREB,ROLL_HOME_REB,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,AWAY_TEAM_NAME,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF
0,22100014,1,200,26.0,Atlanta Hawks,2021-22,2021-10-21,,,,,,,,,,,,,,,,,,,Dallas Mavericks,,,,,,,,,,,,,,,,,,
1,22100043,1,226,18.0,Atlanta Hawks,2021-22,2021-10-25,113.0,45.0,94.0,0.479,15.0,35.0,0.429,8.0,9.0,0.889,6.0,49.0,55.0,31.0,8.0,9.0,13.0,16.0,Detroit Pistons,82.0,34.0,88.0,0.386,5.0,28.0,0.179,9.0,15.0,0.6,11.0,42.0,53.0,14.0,9.0,8.0,20.0,17.0
2,22100097,1,229,7.0,Atlanta Hawks,2021-22,2021-11-01,117.5,45.5,92.0,0.495,13.5,33.5,0.402,13.0,15.0,0.873,8.0,44.0,52.0,27.5,9.5,6.0,13.0,17.5,Washington Wizards,101.333,37.333,86.667,0.434,7.667,30.333,0.259,19.0,25.0,0.761,9.333,41.0,50.333,18.0,9.0,6.667,16.0,18.0
3,22100120,0,214,-18.0,Atlanta Hawks,2021-22,2021-11-04,117.667,43.0,89.0,0.483,13.333,33.667,0.395,18.333,19.667,0.915,9.667,40.667,50.333,26.333,9.333,5.667,12.333,17.333,Utah Jazz,109.5,37.0,85.5,0.435,13.75,40.75,0.336,21.75,24.75,0.886,11.0,40.75,51.75,17.0,8.25,6.0,16.25,17.75
4,22100193,1,220,20.0,Atlanta Hawks,2021-22,2021-11-14,112.75,41.0,87.25,0.469,11.75,32.25,0.359,19.0,21.25,0.889,8.5,37.25,45.75,24.25,9.75,5.25,11.5,19.0,Milwaukee Bucks,110.8,41.4,92.2,0.45,16.8,41.8,0.396,11.2,16.0,0.703,12.4,36.0,48.4,23.6,7.4,4.8,12.0,15.8


In [13]:
# write out the matchups with rolling features
team_bs_matchups_roll_df.to_csv('../data/processed/nba_team_matchups_rolling_box_scores_2022_2024_r05.csv', index=False)