<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (2015-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Create Team Matchups and Targets](#1.-Create-Team-Matchups-and-Targets)**

- [1.1. Clean Game Data](#1.1.-Clean-Game-Data)

- [1.2. Reshape to Game Matchups](#1.2.-Reshape-to-Game-Matchups)

- [1.3. Create Target Variables](#1.3.-Create-Target-Variables)

**[2. Create Rolling Window Statistics](#2.-Create-Rolling-Window-Statistics)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
# basic modules
import os
import time
import random as rn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotting style
plt.style.use('seaborn-v0_8-notebook')
sns.set_style('white')
#sns.set_style('darkgrid')

# pandas tricks for better display
pd.options.display.max_columns = 50  
pd.options.display.max_rows = 500     
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

# preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# warnings
import warnings
warnings.filterwarnings("ignore")

# user defined functions
import utility_functions as utl

# Data

[Return to top](#Feature-Engineering)

In [2]:
team_bs_df = pd.read_csv('../data/original/nba_games_box_scores_2015_2024.csv')

In [3]:
team_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27592 entries, 0 to 27591
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SEASON_ID          27592 non-null  int64  
 1   TEAM_ID            27592 non-null  int64  
 2   TEAM_ABBREVIATION  27592 non-null  object 
 3   TEAM_NAME          27592 non-null  object 
 4   GAME_ID            27592 non-null  int64  
 5   GAME_DATE          27592 non-null  object 
 6   MATCHUP            27592 non-null  object 
 7   WL                 27580 non-null  object 
 8   MIN                27592 non-null  int64  
 9   PTS                27592 non-null  int64  
 10  FGM                27592 non-null  int64  
 11  FGA                27592 non-null  int64  
 12  FG_PCT             27589 non-null  float64
 13  FG3M               27592 non-null  int64  
 14  FG3A               27592 non-null  float64
 15  FG3_PCT            27588 non-null  float64
 16  FTM                275

<a name="1.-Create-Team-Matchups-and-Targets"></a>
# 1. Create Team Matchups and Targets

[Return to top](#Feature-Engineering)

<a name="1.1.-Clean-Game-Data"></a>
## 1.1. Clean Game Data

[Return to top](#Feature-Engineering)

Cleaning the data involves three steps:

1. Remove games in which a team's aggregated playing time is less than 238 minutes. A regulation game totals 5 players × 48 minutes = 240 team minutes, so this threshold filters out exhibition matches.
2. Retain only games that are part of the regular season.
3. Remove any orphans (i.e., game IDs that do not have a partner row) when reshaping to matchups.

Start and end dates of the last 10 NBA regular seasons:

- 2014-15 season: 2014-10-28 to 2015-04-15
- 2015-16 season: 2015-10-27 to 2016-04-13
- 2016-17 season: 2016-10-25 to 2017-04-12
- 2017-18 season: 2017-10-17 to 2018-04-11
- 2018-19 season: 2018-10-16 to 2019-04-07
- 2019-20 season: 2019-10-22 to 2020-03-11 and 2020-07-30 to 2020-08-14
- 2020-21 season: 2020-12-22 to 2021-05-16
- 2021-22 season: 2021-10-19 to 2022-04-10
- 2022-23 season: 2022-10-18 to 2023-04-09
- 2023-24 season: 2023-10-24 to 2024-04-14

In [4]:
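# start dates, end dates, and labels for the last 10 regular seasons
# (2019-20 has two windows because play was suspended in March 2020 and resumed in July)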
season_start_dates = ['2014-10-28', '2015-10-27', '2016-10-25', '2017-10-17', '2018-10-16', 
                      '2019-10-22', '2020-07-30', '2020-12-22', '2021-10-19', '2022-10-18', '2023-10-24']
season_end_dates   = ['2015-04-15', '2016-04-13', '2017-04-12', '2018-04-11', '2019-04-07', 
                      '2020-03-11', '2020-08-14', '2021-05-16', '2022-04-10', '2023-04-09', '2024-04-14']
season_labels      = ['2014-15', '2015-16', '2016-17', '2017-18', '2018-19',
                      '2019-20', '2019-20', '2020-21', '2021-22', '2022-23', '2023-24']

In [5]:
# clean up the data
team_bs_df_cleaned = utl.clean_team_bs_data(team_bs_df, season_start_dates=season_start_dates, 
                                            season_end_dates=season_end_dates, season_labels=season_labels)

Season 2014-15: 1230 games
Season 2015-16: 1230 games
Season 2016-17: 1230 games
Season 2017-18: 1230 games
Season 2018-19: 1208 games
Season 2019-20: 1059 games
Season 2020-21: 1080 games
Season 2021-22: 1230 games
Season 2022-23: 1230 games
Season 2023-24: 736 games
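
The internals of `utl.clean_team_bs_data` are not shown in this notebook. As a rough sketch of the first two cleaning steps listed above, a hypothetical helper (`clean_sketch`, not part of `utility_functions`) could filter on `MIN` and then keep only rows whose `GAME_DATE` falls inside one of the season windows. Writing the label into `SEASON_ID` is an assumption based on the labels shown in the matchup table; the toy data is invented for illustration.

```python
import pandas as pd

def clean_sketch(df, starts, ends, labels, min_minutes=238):
    """Hypothetical stand-in for utl.clean_team_bs_data (not the actual implementation)."""
    out = df.copy()
    out['GAME_DATE'] = pd.to_datetime(out['GAME_DATE'])
    # step 1: drop rows whose aggregated team minutes fall below the threshold
    out = out[out['MIN'] >= min_minutes].copy()
    # step 2: label rows inside a regular-season window; all other rows stay unlabeled
    out['SEASON_ID'] = None
    for start, end, label in zip(starts, ends, labels):
        in_window = out['GAME_DATE'].between(pd.Timestamp(start), pd.Timestamp(end))
        out.loc[in_window, 'SEASON_ID'] = label
    # rows outside every window (e.g., preseason or playoffs) are removed
    return out.dropna(subset=['SEASON_ID']).reset_index(drop=True)

# toy data: game 1 is before the season starts, game 3 is too short
toy = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 2, 3, 3],
    'GAME_DATE': ['2014-10-10', '2014-10-10', '2014-11-01',
                  '2014-11-01', '2015-01-05', '2015-01-05'],
    'MIN': [240, 240, 240, 240, 200, 200],
})
cleaned = clean_sketch(toy, ['2014-10-28'], ['2015-04-15'], ['2014-15'])
print(cleaned)
```

Because the two 2019-20 windows share one label, looping over the zipped lists assigns both windows to the same season without any special handling.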


<a name="1.2.-Reshape-to-Game-Matchups"></a>
## 1.2. Reshape to Game Matchups

[Return to top](#Feature-Engineering)

In [6]:
# identify non-stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# reshape team box score data to wide format so each row is a game matchup
team_bs_matchups_df = utl.reshape_team_bs_to_matchups(team_bs_df_cleaned, non_stats_cols)

Season 2014-15: 1225 games
Season 2015-16: 1225 games
Season 2016-17: 1229 games
Season 2017-18: 1223 games
Season 2018-19: 1204 games
Season 2019-20: 1059 games
Season 2020-21: 1074 games
Season 2021-22: 1222 games
Season 2022-23: 1221 games
Season 2023-24: 729 games
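
How `utl.reshape_team_bs_to_matchups` pairs rows is not shown here. One plausible approach, sketched below with a hypothetical `reshape_sketch` function, relies on the NBA API `MATCHUP` convention (home rows read like `ATL vs. IND`, away rows like `IND @ ATL`), splits the rows accordingly, and inner-joins them on `GAME_ID`. The inner join would also account for the drop in game counts above, since any game ID without a partner row is discarded. The toy rows are for illustration only.

```python
import pandas as pd

def reshape_sketch(df):
    """Hypothetical stand-in for utl.reshape_team_bs_to_matchups."""
    # assumption: home rows contain 'vs.' in MATCHUP, away rows contain '@'
    is_home = df['MATCHUP'].str.contains('vs.', regex=False)
    shared = ['SEASON_ID', 'GAME_ID', 'GAME_DATE']
    home = df[is_home].drop(columns='MATCHUP')
    away = df[~is_home].drop(columns=['MATCHUP', 'SEASON_ID', 'GAME_DATE'])
    # prefix team-specific columns so both teams fit on one row
    home = home.rename(columns={c: f'HOME_{c}' for c in home.columns if c not in shared})
    away = away.rename(columns={c: f'AWAY_{c}' for c in away.columns if c != 'GAME_ID'})
    # inner join drops orphans: game IDs with only a home row or only an away row
    return home.merge(away, on='GAME_ID', how='inner')

toy = pd.DataFrame({
    'SEASON_ID': ['2014-15'] * 3,
    'GAME_ID': [21400032, 21400032, 21400099],
    'GAME_DATE': ['2014-11-01', '2014-11-01', '2014-11-02'],
    'MATCHUP': ['ATL vs. IND', 'IND @ ATL', 'NYK vs. BOS'],
    'TEAM_ABBREVIATION': ['ATL', 'IND', 'NYK'],
    'PTS': [102, 92, 99],
})
matchups = reshape_sketch(toy)
print(matchups)
```

Here the third row has no away partner, so only one matchup row survives the join.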


In [7]:
team_bs_matchups_df.head()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,AWAY_PLUS_MINUS
0,2014-15,1610612737,ATL,Atlanta Hawks,21400032,2014-11-01,W,240,102,35,69,0.507,7,20.0,0.35,25,33,0.758,3.0,34.0,37.0,26,10.0,6,12,...,10.0,1610612754,IND,Indiana Pacers,L,240,92,31,81,0.383,12,32.0,0.375,18,21,0.857,11.0,33.0,44.0,25,5.0,5,18,26,-10.0
1,2014-15,1610612737,ATL,Atlanta Hawks,21400084,2014-11-08,W,240,103,33,81,0.407,9,22.0,0.409,28,36,0.778,12.0,29.0,41.0,18,10.0,5,8,...,7.0,1610612752,NYK,New York Knicks,L,241,96,40,84,0.476,8,21.0,0.381,8,11,0.727,13.0,31.0,44.0,26,2.0,6,15,29,-7.0
2,2014-15,1610612737,ATL,Atlanta Hawks,21400110,2014-11-12,W,240,100,39,76,0.513,9,20.0,0.45,13,18,0.722,13.0,33.0,46.0,23,8.0,4,18,...,3.0,1610612762,UTA,Utah Jazz,L,240,97,43,86,0.5,5,23.0,0.217,6,12,0.5,8.0,22.0,30.0,28,12.0,8,11,17,-3.0
3,2014-15,1610612737,ATL,Atlanta Hawks,21400124,2014-11-14,W,240,114,42,75,0.56,11,28.0,0.393,19,23,0.826,3.0,33.0,36.0,33,10.0,5,13,...,11.0,1610612748,MIA,Miami Heat,L,240,103,35,74,0.473,10,21.0,0.476,23,25,0.92,5.0,27.0,32.0,27,10.0,3,14,20,-11.0
4,2014-15,1610612737,ATL,Atlanta Hawks,21400155,2014-11-18,L,239,109,41,85,0.482,9,27.0,0.333,18,23,0.783,13.0,25.0,38.0,22,7.0,3,10,...,-5.0,1610612747,LAL,Los Angeles Lakers,W,240,114,47,87,0.54,6,17.0,0.353,14,22,0.636,13.0,31.0,44.0,24,7.0,0,11,24,5.0


<a name="1.3.-Create-Target-Variables"></a>
## 1.3. Create Target Variables

[Return to top](#Feature-Engineering)

There are three targets of interest:

1. **Total Game Points (over / under):** Calculated as the sum `HOME_PTS + AWAY_PTS`.
2. **Difference in Game Points (plus / minus):** Calculated relative to the home team as the difference `HOME_PTS - AWAY_PTS`.
3. **Game Winner (moneyline):** Defined relative to the home team using the `HOME_WL` column, where a home win is coded as 1 and a home loss as 0. We will create a new column called `GAME_RESULT` for this indicator.
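
The three definitions above translate directly into column arithmetic. The sketch below uses a hypothetical `targets_sketch` function, not the actual `utl.create_target_variables`; the output column names `TOTAL_PTS`, `PLUS_MINUS`, and `GAME_RESULT` follow those displayed later in this notebook, and the two toy rows are for illustration.

```python
import pandas as pd

def targets_sketch(df, wl_col='HOME_WL', home_pts='HOME_PTS', away_pts='AWAY_PTS'):
    """Hypothetical stand-in for utl.create_target_variables."""
    out = df.copy()
    out['TOTAL_PTS'] = out[home_pts] + out[away_pts]   # over / under
    out['PLUS_MINUS'] = out[home_pts] - out[away_pts]  # home-team margin
    # 1 if the home team won, else 0 (a missing WL value would also map to 0 here)
    out['GAME_RESULT'] = (out[wl_col] == 'W').astype(int)
    return out

toy = pd.DataFrame({'HOME_WL': ['L', 'W'],
                    'HOME_PTS': [107, 102],
                    'AWAY_PTS': [118, 92]})
with_targets = targets_sketch(toy)
print(with_targets)
```

Since `WL` has a few missing values in the raw data, a production version might prefer to derive `GAME_RESULT` from the sign of `PLUS_MINUS` or to drop those rows explicitly.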

In [8]:
# create the above three target variables
team_bs_matchups_df = utl.create_target_variables(team_bs_matchups_df, 'HOME_WL', 'HOME_PTS', 'AWAY_PTS')

In [9]:
team_bs_matchups_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS']].tail()

| | GAME_DATE | GAME_ID | HOME_TEAM_NAME | AWAY_TEAM_NAME | HOME_PTS | AWAY_PTS | GAME_RESULT | TOTAL_PTS | PLUS_MINUS |
|---|---|---|---|---|---|---|---|---|---|
| 11406 | 2024-01-24 | 22300620 | Washington Wizards | Minnesota Timberwolves | 107 | 118 | 0 | 225 | -11 |
| 11407 | 2024-01-25 | 22300628 | Washington Wizards | Utah Jazz | 108 | 123 | 0 | 231 | -15 |
| 11408 | 2024-01-31 | 22300676 | Washington Wizards | LA Clippers | 109 | 125 | 0 | 234 | -16 |
| 11409 | 2024-02-02 | 22300689 | Washington Wizards | Miami Heat | 102 | 110 | 0 | 212 | -8 |
| 11410 | 2024-02-04 | 22300705 | Washington Wizards | Phoenix Suns | 112 | 140 | 0 | 252 | -28 |


<a name="2.-Create-Rolling-Window-Statistics"></a>
# 2. Create Rolling Window Statistics

[Return to top](#Feature-Engineering)

Here we compute each team's average box-score statistics over a rolling window of its previous $n$ games.

In [10]:
# identify stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'AWAY_TEAM_ID',
                  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_WL', 'AWAY_WL', 'HOME_MIN', 
                  'AWAY_MIN', 'HOME_TEAM_ABBREVIATION', 'AWAY_TEAM_ABBREVIATION']
stats_cols = [col for col in team_bs_matchups_df.columns if col not in non_stats_cols]

In [11]:
# calculate rolling averages for each statistic and add them to the DataFrame
team_bs_matchups_roll_df = utl.process_rolling_stats(
    team_bs_matchups_df, 
    stats_cols, 
    window_size=5,  # the number of games to include in the rolling window
    min_obs=1       # the minimum number of observations present within the window to yield an aggregate value
)
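
The implementation of `utl.process_rolling_stats` is not shown, so the following is only a minimal sketch of the idea on long-format data (one row per team per game). The hypothetical `rolling_mean_sketch` function maps `window_size` and `min_obs` onto the `window` and `min_periods` arguments of pandas `rolling`. It also shifts each team's series by one game so that a game's own box score never enters its features; whether the actual helper does this is an assumption worth checking.

```python
import pandas as pd

def rolling_mean_sketch(df, team_col, stat_col, window_size=5, min_obs=1):
    """Hypothetical per-team rolling mean over previous games only."""
    out = df.sort_values([team_col, 'GAME_DATE']).copy()
    out[f'ROLL_{stat_col}'] = (
        out.groupby(team_col)[stat_col]
           # shift(1) excludes the current game; rolling then averages up to
           # window_size earlier games, requiring at least min_obs of them
           .transform(lambda s: s.shift(1).rolling(window_size, min_periods=min_obs).mean())
    )
    return out

toy = pd.DataFrame({
    'TEAM_ID': ['A', 'A', 'A', 'B', 'B'],
    'GAME_DATE': ['2024-01-01', '2024-01-03', '2024-01-05', '2024-01-02', '2024-01-04'],
    'PTS': [100, 110, 120, 90, 80],
})
rolled = rolling_mean_sketch(toy, 'TEAM_ID', 'PTS', window_size=2)
print(rolled)
```

With the shift in place, each team's first game has no history and its rolling value is missing, so downstream models need to impute or drop those rows.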

In [12]:
team_bs_matchups_roll_df.tail()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,ROLL_HOME_PLUS_MINUS,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF,ROLL_AWAY_PLUS_MINUS
11406,2023-24,1610612764,WAS,Washington Wizards,22300620,2024-01-24,L,240,107,37,77,0.481,10,24.0,0.417,23,29,0.793,7.0,42.0,49.0,31,2.0,3,21,...,30.8,7.2,5.6,13.2,21.2,-9.8,117.4,44.2,90.0,0.493,13.6,33.6,0.411,15.4,20.2,0.764,9.8,38.2,48.0,28.2,7.4,5.2,12.6,19.8,8.2
11407,2023-24,1610612764,WAS,Washington Wizards,22300628,2024-01-25,L,240,108,43,88,0.489,7,30.0,0.233,15,20,0.75,5.0,30.0,35.0,33,11.0,7,11,...,30.4,6.6,5.6,13.4,21.6,-8.8,119.8,43.6,92.2,0.477,11.8,37.4,0.307,20.8,26.8,0.783,11.4,38.4,49.8,28.2,5.8,6.0,15.2,19.2,-6.4
11408,2023-24,1610612764,WAS,Washington Wizards,22300676,2024-01-31,L,239,109,45,97,0.464,9,29.0,0.31,10,15,0.667,12.0,33.0,45.0,19,4.0,10,13,...,30.6,7.4,6.0,12.8,19.8,-10.2,114.2,41.4,87.6,0.472,13.2,34.6,0.381,18.2,21.8,0.84,9.8,31.0,40.8,25.6,8.8,5.2,11.8,19.4,2.4
11409,2023-24,1610612764,WAS,Washington Wizards,22300689,2024-02-02,L,239,102,37,90,0.411,11,42.0,0.262,17,21,0.81,6.0,37.0,43.0,28,5.0,4,8,...,28.4,6.6,6.6,13.8,19.8,-11.0,97.2,35.8,85.6,0.417,10.0,34.0,0.289,15.6,18.6,0.828,9.8,32.0,41.8,24.0,6.0,3.2,12.6,18.0,-14.6
11410,2023-24,1610612764,WAS,Washington Wizards,22300705,2024-02-04,L,240,112,47,96,0.49,7,32.0,0.219,11,17,0.647,13.0,22.0,35.0,32,11.0,4,18,...,28.4,5.6,6.6,13.2,20.6,-11.8,120.6,46.8,84.6,0.556,11.8,29.8,0.389,15.2,20.4,0.755,8.0,32.8,40.8,27.8,6.2,5.8,16.0,16.2,0.6


In [13]:
# write out the matchups with rolling features
team_bs_matchups_roll_df.to_csv('../data/processed/nba_team_matchups_rolling_box_scores_2015_2024_r05.csv', index=False)