<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (1984-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Create Team Matchups and Targets](#1.-Create-Team-Matchups-and-Targets)**

- [1.1. Remove any Orphan Matchups and Low Game Times](#1.1.-Remove-any-Orphan-Matchups-and-Low-Game-Times)

- [1.2. Reshape to Game Matchups](#1.2.-Reshape-to-Game-Matchups)

- [1.3. Create Target Variables](#1.3.-Create-Target-Variables)

**[2. Create Rolling Window Statistics](#2.-Create-Rolling-Window-Statistics)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
# basic modules
import os
import time
import random as rn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotting style
plt.style.use('seaborn-v0_8-notebook')
sns.set_style('white')
#sns.set_style('darkgrid')

# pandas tricks for better display
pd.options.display.max_columns = 50  
pd.options.display.max_rows = 500     
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

# preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# warnings
import warnings
warnings.filterwarnings("ignore")

# user defined functions
import utility_functions as utl

# Data

[Return to top](#Feature-Engineering)

In [2]:
team_bs_df = pd.read_csv('../data/original/nba_games_box_scores_1984_2024.csv')
player_bs_df = pd.read_csv('../data/original/nba_players_statistics_1946_2024.csv')

In [3]:
# convert the 'GAME_DATE' to datetime
team_bs_df['GAME_DATE'] = pd.to_datetime(team_bs_df['GAME_DATE'])

In [4]:
team_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104650 entries, 0 to 104649
Data columns (total 28 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   SEASON_ID          104650 non-null  int64         
 1   TEAM_ID            104650 non-null  int64         
 2   TEAM_ABBREVIATION  104650 non-null  object        
 3   TEAM_NAME          104650 non-null  object        
 4   GAME_ID            104650 non-null  int64         
 5   GAME_DATE          104650 non-null  datetime64[ns]
 6   MATCHUP            104650 non-null  object        
 7   WL                 104631 non-null  object        
 8   MIN                104650 non-null  int64         
 9   PTS                104650 non-null  int64         
 10  FGM                104650 non-null  int64         
 11  FGA                104650 non-null  int64         
 12  FG_PCT             104643 non-null  float64       
 13  FG3M               104650 non-null  int64   

In [5]:
player_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29746 entries, 0 to 29745
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PLAYER_ID          29746 non-null  int64  
 1   SEASON_ID          29746 non-null  object 
 2   LEAGUE_ID          29746 non-null  int64  
 3   TEAM_ID            29746 non-null  int64  
 4   TEAM_ABBREVIATION  29738 non-null  object 
 5   PLAYER_AGE         29746 non-null  float64
 6   GP                 29746 non-null  int64  
 7   GS                 23264 non-null  float64
 8   MIN                28977 non-null  float64
 9   FGM                29746 non-null  int64  
 10  FGA                29746 non-null  int64  
 11  FG_PCT             29732 non-null  float64
 12  FG3M               23713 non-null  float64
 13  FG3A               23713 non-null  float64
 14  FG3_PCT            23491 non-null  float64
 15  FTM                29746 non-null  int64  
 16  FTA                297

<a name="1.-Create-Team-Matchups-and-Targets"></a>
# 1. Create Team Matchups and Targets

[Return to top](#Feature-Engineering)

<a name="1.1.-Remove-any-Orphan-Matchups-and-Low-Game-Times"></a>
## 1.1. Remove any Orphan Matchups and Low Game Times

[Return to top](#Feature-Engineering)

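Each game should appear exactly twice in `team_bs_df`: once for the home team and once for the away team. We first check whether that holds by counting the rows recorded for each `GAME_ID`.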

In [6]:
# do we have sets of home and away team matches?
team_bs_df['GAME_ID'].value_counts().value_counts()

count
2    52192
1      209
3       19
Name: count, dtype: int64
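
The double `value_counts()` call can be hard to read at first glance. Here is a minimal sketch on toy data (the `toy` frame and its values are made up for illustration): the inner call counts rows per `GAME_ID`, and the outer call counts how many games have each row count. In the output above, a row count of 2 is a complete home/away pair, 1 is an orphan row, and 3 indicates an extra row for that game.

```python
import pandas as pd

# Toy data (hypothetical): game 1 is a complete home/away pair,
# game 2 is an orphan, and game 3 has one extra row.
toy = pd.DataFrame({"GAME_ID": [1, 1, 2, 3, 3, 3]})

# Inner call: number of rows recorded for each game.
rows_per_game = toy["GAME_ID"].value_counts()

# Outer call: number of games that have 1, 2, or 3 rows.
games_per_row_count = rows_per_game.value_counts().sort_index()

print(games_per_row_count)
```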

In [7]:
# clean up the data
team_bs_df_cleaned = utl.clean_team_bs_data(team_bs_df)

In [8]:
# re-check the number of rows per GAME_ID after cleaning
team_bs_df_cleaned['GAME_ID'].value_counts().value_counts()

count
2    50164
1      423
Name: count, dtype: int64
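
The body of `utl.clean_team_bs_data` is not shown in this notebook, so the following is only a sketch of one way such a cleaning step could be written, not the author's implementation. The function name `keep_complete_games` and the `min_minutes` threshold are assumptions made for illustration. The sketch applies the pair filter last, so that rows removed by the minutes filter cannot leave a new orphan behind.

```python
import pandas as pd

def keep_complete_games(df, min_minutes=200):
    """Hypothetical cleaner: drop duplicate and short-game rows, then keep complete pairs."""
    # Remove repeated rows for the same team in the same game.
    out = df.drop_duplicates(subset=["GAME_ID", "TEAM_ID"])
    # Drop rows with implausibly low team minutes (threshold is an assumption).
    out = out[out["MIN"] >= min_minutes]
    # Keep only games that still have exactly two rows (home and away).
    rows_per_game = out["GAME_ID"].map(out["GAME_ID"].value_counts())
    return out[rows_per_game == 2].reset_index(drop=True)

# Toy data (hypothetical): game 2 is an orphan, game 3 has a duplicated row,
# and game 4 has one row with too few minutes.
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 3, 3, 3, 4, 4],
    "TEAM_ID": [10, 20, 10, 10, 20, 20, 10, 20],
    "MIN":     [240, 240, 240, 240, 240, 240, 240, 48],
})

cleaned = keep_complete_games(toy)
print(cleaned["GAME_ID"].value_counts().value_counts())
```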

<a name="1.2.-Reshape-to-Game-Matchups"></a>
## 1.2. Reshape to Game Matchups

[Return to top](#Feature-Engineering)

In [9]:
# identify non-stats columns
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# reshape team box score data to wide format so each row is a game matchup
team_bs_matchups_df = utl.reshape_team_bs_to_matchups(team_bs_df_cleaned, non_stats_cols)
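
Since `utl.reshape_team_bs_to_matchups` is defined outside this notebook, here is a hedged sketch of how a long-to-wide reshape of this kind might work. It assumes the NBA API convention that the home team's `MATCHUP` string contains `vs.` and the away team's contains `@`; the function name `to_matchups` is hypothetical. The toy rows reuse the ATL vs. DET game shown in the output further down.

```python
import pandas as pd

def to_matchups(df, shared_cols):
    """Hypothetical reshape: one row per game with HOME_/AWAY_ prefixed team columns."""
    # Assumption: home rows look like "ATL vs. DET", away rows like "DET @ ATL".
    is_home = df["MATCHUP"].str.contains("vs.", regex=False)
    # MATCHUP differs between the two rows of a game, so it cannot be a join key.
    keys = [c for c in shared_cols if c != "MATCHUP"]
    team_cols = [c for c in df.columns if c not in shared_cols]

    home = df.loc[is_home, keys + team_cols].rename(
        columns={c: f"HOME_{c}" for c in team_cols})
    away = df.loc[~is_home, keys + team_cols].rename(
        columns={c: f"AWAY_{c}" for c in team_cols})
    # An inner join keeps only games that have both a home and an away row.
    return home.merge(away, on=keys, how="inner")

toy = pd.DataFrame({
    "SEASON_ID": [21983, 21983],
    "GAME_ID": [28300014, 28300014],
    "GAME_DATE": pd.to_datetime(["1983-10-29", "1983-10-29"]),
    "MATCHUP": ["ATL vs. DET", "DET @ ATL"],
    "TEAM_ABBREVIATION": ["ATL", "DET"],
    "PTS": [117, 115],
})

matchups = to_matchups(toy, ["SEASON_ID", "GAME_ID", "GAME_DATE", "MATCHUP"])
print(matchups)
```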

In [10]:
team_bs_matchups_df.head()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,AWAY_PLUS_MINUS
0,21983,1610612737,ATL,Atlanta Hawks,28300014,1983-10-29,W,240,117,49,94,0.521,0,1.0,0.0,19,30,0.633,27.0,21.0,48.0,28,14.0,7,23,...,,1610612765,DET,Detroit Pistons,L,240,115,40,88,0.455,3,4.0,0.75,32,38,0.842,23.0,22.0,45.0,22,10.0,2,21,29,
1,21983,1610612737,ATL,Atlanta Hawks,28300027,1983-11-01,W,240,95,38,81,0.469,0,0.0,,19,30,0.633,12.0,29.0,41.0,20,7.0,10,16,...,,1610612764,WAS,Washington Bullets,L,240,92,35,74,0.473,0,0.0,,22,34,0.647,10.0,37.0,47.0,20,5.0,3,22,26,
2,21983,1610612737,ATL,Atlanta Hawks,28300041,1983-11-04,W,240,103,42,86,0.488,1,1.0,1.0,18,26,0.692,19.0,27.0,46.0,31,14.0,13,18,...,,1610612741,CHI,Chicago Bulls,L,240,90,30,80,0.375,0,3.0,0.0,30,41,0.732,25.0,28.0,53.0,13,6.0,3,26,25,
3,21983,1610612737,ATL,Atlanta Hawks,28300101,1983-11-15,W,240,107,45,84,0.536,0,0.0,,17,21,0.81,17.0,24.0,41.0,24,7.0,10,18,...,,1610612746,SDC,San Diego Clippers,L,240,102,40,78,0.513,0,2.0,0.0,22,33,0.667,17.0,22.0,39.0,28,5.0,6,19,20,
4,21983,1610612737,ATL,Atlanta Hawks,28300112,1983-11-17,W,240,99,35,66,0.53,1,2.0,0.5,28,40,0.7,10.0,35.0,45.0,20,5.0,12,23,...,,1610612755,PHL,Philadelphia 76ers,L,240,94,35,87,0.402,2,5.0,0.4,22,27,0.815,16.0,24.0,40.0,17,7.0,5,16,32,


<a name="1.3.-Create-Target-Variables"></a>
## 1.3. Create Target Variables

[Return to top](#Feature-Engineering)

There are three targets of interest:

1. **Total Game Points (over / under):** The sum `HOME_PTS + AWAY_PTS`.
2. **Difference in Game Points (plus / minus):** The difference `HOME_PTS - AWAY_PTS`, measured from the home team's perspective.
3. **Game Winner (moneyline):** Defined from the home team's perspective using the `HOME_WL` column: a home win is coded as 1 and a home loss as 0. We store this indicator in a new column called `GAME_RESULT`.
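
The three definitions above can be sketched in a few lines of pandas. This is an illustration of the arithmetic only, not the body of `utl.create_target_variables`; the function name `add_targets` is hypothetical. The output column names `TOTAL_PTS` and `PLUS_MINUS` match those displayed later in the notebook, and the toy rows reuse two games from its outputs.

```python
import pandas as pd

def add_targets(df, wl_col="HOME_WL", home_pts="HOME_PTS", away_pts="AWAY_PTS"):
    """Hypothetical helper: add the three targets described above."""
    out = df.copy()
    # 1 if the home team won, else 0. Note: a missing WL value is also
    # coded 0 here, so such rows may need to be handled beforehand.
    out["GAME_RESULT"] = (out[wl_col] == "W").astype(int)
    out["TOTAL_PTS"] = out[home_pts] + out[away_pts]    # over / under
    out["PLUS_MINUS"] = out[home_pts] - out[away_pts]   # home margin
    return out

toy = pd.DataFrame({
    "HOME_WL": ["W", "L"],
    "HOME_PTS": [117, 107],
    "AWAY_PTS": [115, 118],
})

targets = add_targets(toy)
print(targets)
```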

In [11]:
# create the above three target variables
team_bs_matchups_df = utl.create_target_variables(team_bs_matchups_df, 'HOME_WL', 'HOME_PTS', 'AWAY_PTS')

In [12]:
team_bs_matchups_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'GAME_RESULT', 'TOTAL_PTS', 'PLUS_MINUS']].tail()

Unnamed: 0,GAME_DATE,GAME_ID,HOME_TEAM_NAME,AWAY_TEAM_NAME,HOME_PTS,AWAY_PTS,GAME_RESULT,TOTAL_PTS,PLUS_MINUS
50158,2024-01-24,22300620,Washington Wizards,Minnesota Timberwolves,107,118,0,225,-11
50159,2024-01-25,22300628,Washington Wizards,Utah Jazz,108,123,0,231,-15
50160,2024-01-31,22300676,Washington Wizards,LA Clippers,109,125,0,234,-16
50161,2024-02-02,22300689,Washington Wizards,Miami Heat,102,110,0,212,-8
50162,2024-02-04,22300705,Washington Wizards,Phoenix Suns,112,140,0,252,-28


<a name="2.-Create-Rolling-Window-Statistics"></a>
# 2. Create Rolling Window Statistics

[Return to top](#Feature-Engineering)

Here we create average box scores for each team over a rolling window of the previous $n$ games ($n = 5$ below).

In [13]:
# identify non-stats columns; every remaining column is treated as a statistic
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'AWAY_TEAM_ID',
                  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_WL', 'AWAY_WL', 'HOME_MIN', 
                  'AWAY_MIN', 'HOME_TEAM_ABBREVIATION', 'AWAY_TEAM_ABBREVIATION']
stats_cols = [col for col in team_bs_matchups_df.columns if col not in non_stats_cols]

In [14]:
# calculate rolling averages for each statistic and add them to the DataFrame
team_bs_matchups_roll_df = utl.process_rolling_stats(
    team_bs_matchups_df, 
    stats_cols, 
    window_size=5,  # the number of games to include in the rolling window
    min_obs=1       # the minimum number of observations present within the window to yield an aggregate value
)
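
The internals of `utl.process_rolling_stats` are not shown here, so the following is a simplified sketch of the underlying idea rather than the actual implementation. It works on a long, one-row-per-team-game frame instead of the wide matchup frame, and the function name `rolling_prior_mean` is hypothetical. It assumes `window_size` and `min_obs` correspond to the `window` and `min_periods` arguments of pandas `rolling`, and it shifts by one game so that each average uses only previous games; whether the utility function does the same is an assumption.

```python
import pandas as pd

def rolling_prior_mean(df, group_col, date_col, stat_cols, window_size=5, min_obs=1):
    """Hypothetical helper: per-team mean of each stat over the previous games."""
    out = df.sort_values([group_col, date_col]).copy()
    for col in stat_cols:
        out[f"ROLL_{col}"] = out.groupby(group_col)[col].transform(
            # shift(1) excludes the current game from its own average
            lambda s: s.shift(1).rolling(window_size, min_periods=min_obs).mean()
        )
    return out

toy = pd.DataFrame({
    "TEAM": ["A", "A", "A", "B", "B"],
    "GAME_DATE": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-02", "2024-01-04"]),
    "PTS": [100, 110, 120, 90, 80],
})

rolled = rolling_prior_mean(toy, "TEAM", "GAME_DATE", ["PTS"], window_size=2)
print(rolled)
```

With this design, a team's first game has no prior data and gets a missing value, which then has to be imputed or dropped before modelling.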

In [15]:
team_bs_matchups_roll_df.tail()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,ROLL_HOME_AST,ROLL_HOME_STL,ROLL_HOME_BLK,ROLL_HOME_TOV,ROLL_HOME_PF,ROLL_HOME_PLUS_MINUS,ROLL_AWAY_PTS,ROLL_AWAY_FGM,ROLL_AWAY_FGA,ROLL_AWAY_FG_PCT,ROLL_AWAY_FG3M,ROLL_AWAY_FG3A,ROLL_AWAY_FG3_PCT,ROLL_AWAY_FTM,ROLL_AWAY_FTA,ROLL_AWAY_FT_PCT,ROLL_AWAY_OREB,ROLL_AWAY_DREB,ROLL_AWAY_REB,ROLL_AWAY_AST,ROLL_AWAY_STL,ROLL_AWAY_BLK,ROLL_AWAY_TOV,ROLL_AWAY_PF,ROLL_AWAY_PLUS_MINUS
50158,22023,1610612764,WAS,Washington Wizards,22300620,2024-01-24,L,240,107,37,77,0.481,10,24.0,0.417,23,29,0.793,7.0,42.0,49.0,31,2.0,3,21,...,30.8,7.2,5.6,13.2,21.2,-9.8,117.4,44.2,90.0,0.493,13.6,33.6,0.411,15.4,20.2,0.764,9.8,38.2,48.0,28.2,7.4,5.2,12.6,19.8,8.2
50159,22023,1610612764,WAS,Washington Wizards,22300628,2024-01-25,L,240,108,43,88,0.489,7,30.0,0.233,15,20,0.75,5.0,30.0,35.0,33,11.0,7,11,...,30.4,6.6,5.6,13.4,21.6,-8.8,119.8,43.6,92.2,0.477,11.8,37.4,0.307,20.8,26.8,0.783,11.4,38.4,49.8,28.2,5.8,6.0,15.2,19.2,-6.4
50160,22023,1610612764,WAS,Washington Wizards,22300676,2024-01-31,L,239,109,45,97,0.464,9,29.0,0.31,10,15,0.667,12.0,33.0,45.0,19,4.0,10,13,...,30.6,7.4,6.0,12.8,19.8,-10.2,114.2,41.4,87.6,0.472,13.2,34.6,0.381,18.2,21.8,0.84,9.8,31.0,40.8,25.6,8.8,5.2,11.8,19.4,2.4
50161,22023,1610612764,WAS,Washington Wizards,22300689,2024-02-02,L,239,102,37,90,0.411,11,42.0,0.262,17,21,0.81,6.0,37.0,43.0,28,5.0,4,8,...,28.4,6.6,6.6,13.8,19.8,-11.0,97.2,35.8,85.6,0.417,10.0,34.0,0.289,15.6,18.6,0.828,9.8,32.0,41.8,24.0,6.0,3.2,12.6,18.0,-14.6
50162,22023,1610612764,WAS,Washington Wizards,22300705,2024-02-04,L,240,112,47,96,0.49,7,32.0,0.219,11,17,0.647,13.0,22.0,35.0,32,11.0,4,18,...,28.4,5.6,6.6,13.2,20.6,-11.8,120.6,46.8,84.6,0.556,11.8,29.8,0.389,15.2,20.4,0.755,8.0,32.8,40.8,27.8,6.2,5.8,16.0,16.2,0.6


In [16]:
# write out the matchups with rolling features
team_bs_matchups_roll_df.to_csv('../data/processed/nba_team_matchups_rolling_box_scores_1984_2024_r05.csv', index=False)