<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Engineering</h4>
    <p style="font-size: 20px;">NBA API Data (1984-2024)</p>
</div>

<a name="Feature-Engineering"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Create Team Matchups with Boxscores](#1.-Create-Team-Matchups-with-Boxscores)**

- [1.1. Remove any Orphan Matchups](#1.1.-Remove-any-Orphan-Matchups)

- [1.2. Reshape Matchups with Team Box Scores](#1.2.-Reshape-Matchups-with-Team-Box-Scores)
  
**[2. Create Team Targets](#2.-Create-Team-Targets)**

- [2.1. Game Winner](#2.1.-Game-Winner)

- [2.2. Total Points (over / under)](#2.2.-Total-Points-(over-/-under))

- [2.3. Difference in Points (plus / minus)](#2.3.-Difference-in-Points-(plus-/-minus))

**[3. Additional Team Level Features](#3.-Additional-Team-Level-Features)**
  
**[4. Player Level Features](#4.-Player-Level-Features)**

**[5. Time Windowed Features](#5.-Time-Windowed-Features)**

# Setup

[Return to top](#Feature-Engineering)

In [1]:
# basic modules
import os
import time
import random as rn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotting style
plt.style.use('seaborn-v0_8-notebook')
sns.set_style('white')
#sns.set_style('darkgrid')

# pandas tricks for better display
pd.options.display.max_columns = 50  
pd.options.display.max_rows = 500     
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

# preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# warnings
import warnings
warnings.filterwarnings("ignore")

# user defined functions
import utility_functions as utl

# Data

[Return to top](#Feature-Engineering)

In [2]:
team_bs_df = pd.read_csv('../data/original/nba_games_1984_2024.csv')
player_bs_df = pd.read_csv('../data/original/nba_players_statistics_1946_2024.csv')

In [3]:
# convert the 'GAME_DATE' to datetime
team_bs_df['GAME_DATE'] = pd.to_datetime(team_bs_df['GAME_DATE'])

In [4]:
team_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104437 entries, 0 to 104436
Data columns (total 28 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   SEASON_ID          104437 non-null  int64         
 1   TEAM_ID            104437 non-null  int64         
 2   TEAM_ABBREVIATION  104437 non-null  object        
 3   TEAM_NAME          104437 non-null  object        
 4   GAME_ID            104437 non-null  int64         
 5   GAME_DATE          104437 non-null  datetime64[ns]
 6   MATCHUP            104437 non-null  object        
 7   WL                 104427 non-null  object        
 8   MIN                104437 non-null  int64         
 9   PTS                104437 non-null  int64         
 10  FGM                104437 non-null  int64         
 11  FGA                104437 non-null  int64         
 12  FG_PCT             104430 non-null  float64       
 13  FG3M               104437 non-null  int64   

In [5]:
player_bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29746 entries, 0 to 29745
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PLAYER_ID          29746 non-null  int64  
 1   SEASON_ID          29746 non-null  object 
 2   LEAGUE_ID          29746 non-null  int64  
 3   TEAM_ID            29746 non-null  int64  
 4   TEAM_ABBREVIATION  29738 non-null  object 
 5   PLAYER_AGE         29746 non-null  float64
 6   GP                 29746 non-null  int64  
 7   GS                 23264 non-null  float64
 8   MIN                28977 non-null  float64
 9   FGM                29746 non-null  int64  
 10  FGA                29746 non-null  int64  
 11  FG_PCT             29732 non-null  float64
 12  FG3M               23713 non-null  float64
 13  FG3A               23713 non-null  float64
 14  FG3_PCT            23491 non-null  float64
 15  FTM                29746 non-null  int64  
 16  FTA                297

<a name="1.-Create-Team-Matchups-with-Boxscores"></a>
# 1. Create Team Matchups with Boxscores

[Return to top](#Feature-Engineering)

<a name="1.1.-Remove-any-Orphan-Matchups"></a>
## 1.1. Remove any Orphan Matchups

[Return to top](#Feature-Engineering)

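Each game should appear exactly twice in `team_bs_df`: once for the home team and once for the away team. We first check whether that holds. The nested `value_counts()` call below reports how many `GAME_ID` values have one, two, or three rows.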

In [6]:
# do we have home and away team matches?
team_bs_df['GAME_ID'].value_counts().value_counts()

count
2    52085
1      210
3       19
Name: count, dtype: int64

In [7]:
# count the occurrences of each GAME_ID
game_id_counts = team_bs_df['GAME_ID'].value_counts()

# filter GAME_IDs that occur exactly twice
game_ids_to_keep = game_id_counts[game_id_counts == 2].index.tolist()

# keep rows in 'team_bs_df' where GAME_ID occurs twice
# (.copy() makes the result an independent DataFrame, so adding columns later is safe)
team_bs_df = team_bs_df[team_bs_df['GAME_ID'].isin(game_ids_to_keep)].copy()

In [8]:
# check that it worked
team_bs_df['GAME_ID'].value_counts().value_counts()

count
2    52085
Name: count, dtype: int64
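
The count-then-filter step above can also be written with a group-wise transform, which skips the intermediate list of IDs. This is only a minimal sketch on a toy frame: `toy_df` and its values are made up for illustration and are not taken from the NBA data.

```python
import pandas as pd

# toy data (hypothetical): game 1 is complete, game 2 is an orphan,
# game 3 has an extra row
toy_df = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 3, 3, 3],
    'TEAM_ABBREVIATION': ['ATL', 'DET', 'CHI', 'BOS', 'NYK', 'BOS'],
})

# number of rows sharing each row's GAME_ID, aligned to the original index
rows_per_game = toy_df.groupby('GAME_ID')['GAME_ID'].transform('size')

# keep only games with exactly two rows (one per team)
paired_df = toy_df[rows_per_game == 2].copy()
print(paired_df)
```

Both approaches should select the same rows; the transform version keeps the row count next to each row, which can help when inspecting the games that get dropped.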

In [9]:
# create a unique identifier from GAME_ID and TEAM_ABBREVIATION; each team should appear once per game
team_bs_df['UNIQUE_ID'] = team_bs_df['GAME_ID'].astype(str) + '_' + team_bs_df['TEAM_ABBREVIATION']

In [10]:
# do we have any duplicates?
team_bs_df['UNIQUE_ID'].value_counts().value_counts()

count
1    104168
2         1
Name: count, dtype: int64

In [11]:
# drop duplicates
team_bs_df = team_bs_df.drop_duplicates('UNIQUE_ID')

In [12]:
# check that it worked
team_bs_df['UNIQUE_ID'].value_counts().value_counts()

count
1    104169
Name: count, dtype: int64

In [13]:
# drop 'UNIQUE_ID'
team_bs_df = team_bs_df.drop('UNIQUE_ID', axis=1)
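
Before dropping duplicates, it can be useful to look at every copy of a duplicated row. The sketch below does this on a toy frame (hypothetical values, not from the NBA data). It also passes the key columns through `subset`, so it does not need a helper column like `UNIQUE_ID`.

```python
import pandas as pd

# toy data (hypothetical): both rows of game 1 belong to the same team
toy_df = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 2],
    'TEAM_ABBREVIATION': ['ATL', 'ATL', 'CHI', 'BOS'],
    'PTS': [117, 117, 103, 90],
})

key_cols = ['GAME_ID', 'TEAM_ABBREVIATION']

# keep=False flags every member of a duplicated group, so all copies can be compared
dupes_df = toy_df[toy_df.duplicated(subset=key_cols, keep=False)]
print(dupes_df)

# drop_duplicates keeps the first occurrence of each key by default
deduped_df = toy_df.drop_duplicates(subset=key_cols)

# how many rows each game has left after deduplication
rows_left = deduped_df['GAME_ID'].value_counts()
print(rows_left)
```

In the toy frame, game 1 has a single row after deduplication. The same applies whenever a game that passed the two-rows filter turns out to hold the same team twice. A game in that state will not survive an inner merge on `GAME_ID`.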

<a name="1.2.-Reshape-Matchups-with-Team-Box-Scores"></a>
## 1.2. Reshape Matchups with Team Box Scores

[Return to top](#Feature-Engineering)

In [14]:
# game-level columns shared by both teams (these are not prefixed with HOME_ / AWAY_)
non_stats_cols = ['SEASON_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# filter 'team_bs_df' for home games, where 'MATCHUP' contains ' vs. ',
# and prefix every column outside 'non_stats_cols' with 'HOME_'
# (regex=False matches ' vs. ' literally, so the '.' is not treated as a wildcard)
home_games_df = team_bs_df[team_bs_df['MATCHUP'].str.contains(' vs. ', regex=False)].rename(
    columns={col: f'HOME_{col}' for col in team_bs_df.columns if col not in non_stats_cols}
).drop('MATCHUP', axis=1)

# filter 'team_bs_df' for away games, where 'MATCHUP' contains ' @ ',
# and prefix every column outside 'non_stats_cols' with 'AWAY_'
away_games_df = team_bs_df[team_bs_df['MATCHUP'].str.contains(' @ ', regex=False)].rename(
    columns={col: f'AWAY_{col}' for col in team_bs_df.columns if col not in non_stats_cols}
).drop('MATCHUP', axis=1)

# before merging, drop the columns the home DataFrame already carries,
# so they are not duplicated with '_x' / '_y' suffixes
away_games_df = away_games_df.drop(['SEASON_ID', 'GAME_DATE'], axis=1)

# merge home and away DataFrames on 'GAME_ID'
team_bs_wide_df = pd.merge(home_games_df, away_games_df, on='GAME_ID')
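
The merge above relies on each `GAME_ID` appearing once in the home half and once in the away half. One way to make that assumption explicit is the `validate` argument of `pd.merge`. The sketch below shows the idea on a toy long-format frame with only a `PTS` column. The frame and the `home_df` / `away_df` / `wide_df` names are illustrative, not part of the notebook.

```python
import pandas as pd

# toy long-format box scores (hypothetical): one row per team per game
toy_df = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 2],
    'MATCHUP': ['ATL vs. DET', 'DET @ ATL', 'CHI vs. BOS', 'BOS @ CHI'],
    'PTS': [117, 115, 103, 90],
})

# regex=False treats the pattern as a literal string
home_df = toy_df[toy_df['MATCHUP'].str.contains(' vs. ', regex=False)]
away_df = toy_df[toy_df['MATCHUP'].str.contains(' @ ', regex=False)]

home_df = home_df.drop(columns='MATCHUP').rename(columns={'PTS': 'HOME_PTS'})
away_df = away_df.drop(columns='MATCHUP').rename(columns={'PTS': 'AWAY_PTS'})

# validate='one_to_one' raises pandas.errors.MergeError
# if a GAME_ID repeats on either side of the merge
wide_df = pd.merge(home_df, away_df, on='GAME_ID', how='inner', validate='one_to_one')
print(wide_df)
```

With `how='inner'`, a game that is missing either its home or its away row is dropped without an error, so comparing `len(wide_df)` with the number of distinct `GAME_ID` values is a cheap extra check.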

In [15]:
team_bs_wide_df.head()

Unnamed: 0,SEASON_ID,HOME_TEAM_ID,HOME_TEAM_ABBREVIATION,HOME_TEAM_NAME,GAME_ID,GAME_DATE,HOME_WL,HOME_MIN,HOME_PTS,HOME_FGM,HOME_FGA,HOME_FG_PCT,HOME_FG3M,HOME_FG3A,HOME_FG3_PCT,HOME_FTM,HOME_FTA,HOME_FT_PCT,HOME_OREB,HOME_DREB,HOME_REB,HOME_AST,HOME_STL,HOME_BLK,HOME_TOV,...,HOME_PLUS_MINUS,AWAY_TEAM_ID,AWAY_TEAM_ABBREVIATION,AWAY_TEAM_NAME,AWAY_WL,AWAY_MIN,AWAY_PTS,AWAY_FGM,AWAY_FGA,AWAY_FG_PCT,AWAY_FG3M,AWAY_FG3A,AWAY_FG3_PCT,AWAY_FTM,AWAY_FTA,AWAY_FT_PCT,AWAY_OREB,AWAY_DREB,AWAY_REB,AWAY_AST,AWAY_STL,AWAY_BLK,AWAY_TOV,AWAY_PF,AWAY_PLUS_MINUS
0,21983,1610612737,ATL,Atlanta Hawks,28300014,1983-10-29,W,240,117,49,94,0.521,0,1.0,0.0,19,30,0.633,27.0,21.0,48.0,28,14.0,7,23,...,,1610612765,DET,Detroit Pistons,L,240,115,40,88,0.455,3,4.0,0.75,32,38,0.842,23.0,22.0,45.0,22,10.0,2,21,29,
1,21983,1610612737,ATL,Atlanta Hawks,28300027,1983-11-01,W,240,95,38,81,0.469,0,0.0,,19,30,0.633,12.0,29.0,41.0,20,7.0,10,16,...,,1610612764,WAS,Washington Bullets,L,240,92,35,74,0.473,0,0.0,,22,34,0.647,10.0,37.0,47.0,20,5.0,3,22,26,
2,21983,1610612737,ATL,Atlanta Hawks,28300041,1983-11-04,W,240,103,42,86,0.488,1,1.0,1.0,18,26,0.692,19.0,27.0,46.0,31,14.0,13,18,...,,1610612741,CHI,Chicago Bulls,L,240,90,30,80,0.375,0,3.0,0.0,30,41,0.732,25.0,28.0,53.0,13,6.0,3,26,25,
3,21983,1610612737,ATL,Atlanta Hawks,28300101,1983-11-15,W,240,107,45,84,0.536,0,0.0,,17,21,0.81,17.0,24.0,41.0,24,7.0,10,18,...,,1610612746,SDC,San Diego Clippers,L,240,102,40,78,0.513,0,2.0,0.0,22,33,0.667,17.0,22.0,39.0,28,5.0,6,19,20,
4,21983,1610612737,ATL,Atlanta Hawks,28300112,1983-11-17,W,240,99,35,66,0.53,1,2.0,0.5,28,40,0.7,10.0,35.0,45.0,20,5.0,12,23,...,,1610612755,PHL,Philadelphia 76ers,L,240,94,35,87,0.402,2,5.0,0.4,22,27,0.815,16.0,24.0,40.0,17,7.0,5,16,32,


<a name="2.-Create-Team-Targets"></a>
# 2. Create Team Targets

[Return to top](#Feature-Engineering)

<a name="2.1.-Game-Winner"></a>
## 2.1. Game Winner

[Return to top](#Feature-Engineering)

The game-winner target is already present in the merged data: `team_bs_wide_df['HOME_WL']` and `team_bs_wide_df['AWAY_WL']` hold `W` or `L` for each side.

In [16]:
team_bs_wide_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'HOME_WL', 'AWAY_TEAM_NAME', 'AWAY_WL']].head()

| | GAME_DATE | GAME_ID | HOME_TEAM_NAME | HOME_WL | AWAY_TEAM_NAME | AWAY_WL |
|---|---|---|---|---|---|---|
| 0 | 1983-10-29 | 28300014 | Atlanta Hawks | W | Detroit Pistons | L |
| 1 | 1983-11-01 | 28300027 | Atlanta Hawks | W | Washington Bullets | L |
| 2 | 1983-11-04 | 28300041 | Atlanta Hawks | W | Chicago Bulls | L |
| 3 | 1983-11-15 | 28300101 | Atlanta Hawks | W | San Diego Clippers | L |
| 4 | 1983-11-17 | 28300112 | Atlanta Hawks | W | Philadelphia 76ers | L |
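
Most estimators expect a numeric target, so the `W` / `L` strings usually need to be encoded. The sketch below shows one possible encoding; the `HOME_WIN` column name is an assumption, not something defined in this notebook. The `team_bs_df.info()` output shows 104,427 non-null `WL` values out of 104,437 rows, so some results may still be missing after filtering. The sketch therefore uses the pandas nullable integer dtype to keep missing results missing instead of coding them as losses.

```python
import pandas as pd

# toy matchups (hypothetical): the last game has no recorded result
toy_df = pd.DataFrame({
    'GAME_ID': [1, 2, 3],
    'HOME_WL': ['W', 'L', None],
})

# map 'W' / 'L' to 1 / 0; missing or unexpected values become NaN
# 'Int64' (capital I) is the nullable integer dtype, so missing stays <NA>
toy_df['HOME_WIN'] = toy_df['HOME_WL'].map({'W': 1, 'L': 0}).astype('Int64')
print(toy_df)
```

Rows with a missing target can then be dropped or inspected explicitly before modelling.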


<a name="2.2.-Total-Points-(over-/-under)"></a>
## 2.2. Total Points (over / under)

[Return to top](#Feature-Engineering)

This can be calculated as `team_bs_wide_df['HOME_PTS'] + team_bs_wide_df['AWAY_PTS']`.

In [17]:
# create a new column with the total points scored
team_bs_wide_df['TOTAL_PTS'] = team_bs_wide_df['HOME_PTS'] + team_bs_wide_df['AWAY_PTS']

In [18]:
team_bs_wide_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'TOTAL_PTS']].head()

| | GAME_DATE | GAME_ID | HOME_TEAM_NAME | AWAY_TEAM_NAME | HOME_PTS | AWAY_PTS | TOTAL_PTS |
|---|---|---|---|---|---|---|---|
| 0 | 1983-10-29 | 28300014 | Atlanta Hawks | Detroit Pistons | 117 | 115 | 232 |
| 1 | 1983-11-01 | 28300027 | Atlanta Hawks | Washington Bullets | 95 | 92 | 187 |
| 2 | 1983-11-04 | 28300041 | Atlanta Hawks | Chicago Bulls | 103 | 90 | 193 |
| 3 | 1983-11-15 | 28300101 | Atlanta Hawks | San Diego Clippers | 107 | 102 | 209 |
| 4 | 1983-11-17 | 28300112 | Atlanta Hawks | Philadelphia 76ers | 99 | 94 | 193 |
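
`TOTAL_PTS` can be used directly as a regression target. If a binary over / under label is wanted instead, the total has to be compared with a threshold. The sketch below uses the first three rows of the output above. The `line` value and the `OVER` column name are purely hypothetical, since this notebook does not load any betting lines.

```python
import pandas as pd

# first three rows of the output above
toy_df = pd.DataFrame({
    'HOME_PTS': [117, 95, 103],
    'AWAY_PTS': [115, 92, 90],
})
toy_df['TOTAL_PTS'] = toy_df['HOME_PTS'] + toy_df['AWAY_PTS']

# hypothetical threshold; a half-point value means a total can never equal it
line = 200.5

# 1 if the combined score went over the line, 0 if it stayed under
toy_df['OVER'] = (toy_df['TOTAL_PTS'] > line).astype(int)
print(toy_df)
```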


<a name="2.3.-Difference-in-Points-(plus-/-minus)"></a>
## 2.3. Difference in Points (plus / minus)

[Return to top](#Feature-Engineering)

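The home team's margin is `team_bs_wide_df['HOME_PTS'] - team_bs_wide_df['AWAY_PTS']`; the away team's margin is the same value with the sign flipped. Computing it from the scores also covers rows where the API's own `PLUS_MINUS` columns are empty, as they are in the rows displayed in section 1.2.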

In [19]:
# create new columns with the score differences
team_bs_wide_df['HOME_PM'] = team_bs_wide_df['HOME_PTS'] - team_bs_wide_df['AWAY_PTS']
team_bs_wide_df['AWAY_PM'] = team_bs_wide_df['AWAY_PTS'] - team_bs_wide_df['HOME_PTS']

In [20]:
team_bs_wide_df[['GAME_DATE', 'GAME_ID',  'HOME_TEAM_NAME', 'AWAY_TEAM_NAME', 'HOME_PTS', 'AWAY_PTS', 'HOME_PM', 'AWAY_PM']].head()

| | GAME_DATE | GAME_ID | HOME_TEAM_NAME | AWAY_TEAM_NAME | HOME_PTS | AWAY_PTS | HOME_PM | AWAY_PM |
|---|---|---|---|---|---|---|---|---|
| 0 | 1983-10-29 | 28300014 | Atlanta Hawks | Detroit Pistons | 117 | 115 | 2 | -2 |
| 1 | 1983-11-01 | 28300027 | Atlanta Hawks | Washington Bullets | 95 | 92 | 3 | -3 |
| 2 | 1983-11-04 | 28300041 | Atlanta Hawks | Chicago Bulls | 103 | 90 | 13 | -13 |
| 3 | 1983-11-15 | 28300101 | Atlanta Hawks | San Diego Clippers | 107 | 102 | 5 | -5 |
| 4 | 1983-11-17 | 28300112 | Atlanta Hawks | Philadelphia 76ers | 99 | 94 | 5 | -5 |
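
Because the win / loss flag and the margins come from different columns, a quick consistency check can catch rows where they disagree. The sketch below runs two such checks on the first three rows of the output above. The variable names `margins_mirror` and `sign_matches_wl` are illustrative, and the second check assumes the rows have no ties and no missing `HOME_WL` values.

```python
import pandas as pd

# first three rows of the output above, with their recorded results
toy_df = pd.DataFrame({
    'HOME_WL': ['W', 'W', 'W'],
    'HOME_PTS': [117, 95, 103],
    'AWAY_PTS': [115, 92, 90],
})
toy_df['HOME_PM'] = toy_df['HOME_PTS'] - toy_df['AWAY_PTS']
toy_df['AWAY_PM'] = toy_df['AWAY_PTS'] - toy_df['HOME_PTS']

# the two margins must be exact negatives of each other
margins_mirror = (toy_df['HOME_PM'] == -toy_df['AWAY_PM']).all()

# a positive home margin should coincide with a recorded home win
sign_matches_wl = ((toy_df['HOME_PM'] > 0) == (toy_df['HOME_WL'] == 'W')).all()

print(margins_mirror, sign_matches_wl)
```

On the full data, any row failing the second check would be worth inspecting before the targets are written to disk.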


In [21]:
# write out the matchups with merged features
team_bs_wide_df.to_csv('../data/processed/nba_team_matchups_1984_2024.csv', index=False)

<a name="3.-Additional-Team-Level-Features"></a>
# 3. Additional Team Level Features

[Return to top](#Feature-Engineering)

<a name="4.-Player-Level-Features"></a>
# 4. Player Level Features

[Return to top](#Feature-Engineering)

<a name="5.-Time-Windowed-Features"></a>
# 5. Time Windowed Features

[Return to top](#Feature-Engineering)