# NBA Game Predictor Model
### CMPE 257 Project
Authors: Kaushika Uppu, Miranda Billawala, Yun Ei Hlaing, Iris Cheung

## Imports

In [399]:
import pandas as pd
import numpy as np
import time

## NBA Game Data

First, we load all of the NBA game data from the CSV file. The exact code for gathering the data is in a separate file and uses `nba_api`. Only games from the 1985-86 season onward are loaded, as earlier seasons are missing a very significant portion of the game statistics.

In [400]:
all_stats = pd.read_csv('all_game_stats.csv')
all_stats.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,AVAILABLE_FLAG
0,21985.0,1610612737,ATL,Atlanta Hawks,28500933,1986-04-12,ATL vs. IND,W,240.0,38,...,,,,,,,,,,
1,21985.0,1610612737,ATL,Atlanta Hawks,28500921,1986-04-10,ATL vs. NJN,W,240.0,44,...,,,,,,,,,,
2,21985.0,1610612737,ATL,Atlanta Hawks,28500908,1986-04-08,ATL vs. CHI,W,240.0,52,...,,,,,,,,,,
3,21985.0,1610612737,ATL,Atlanta Hawks,28500891,1986-04-05,ATL @ CHI,L,240.0,40,...,,,,,,,,,,
4,21985.0,1610612737,ATL,Atlanta Hawks,28500884,1986-04-04,ATL @ WAS,L,265.0,54,...,,,,,,,,,,


In [401]:
all_stats.shape

(89542, 59)

## Data Cleaning and Pre-Processing

### Imputing Missing Values

As shown below, there are a number of rows with the `SEASON_YEAR` variable missing. Therefore, we will calculate the `SEASON_YEAR` based on the `GAME_DATE` variable and fill in those missing values.

In [402]:
all_stats['SEASON_YEAR'].isna().sum()

np.int64(23370)

In [403]:
# Rows whose SEASON_YEAR still has to be derived from GAME_DATE
missing = all_stats['SEASON_YEAR'].isna()
date_parts = all_stats.loc[missing, 'GAME_DATE'].str.split('-')
year = date_parts.str[0].astype(int)
month = date_parts.str[1].astype(int)
# Games from October onward belong to the season that starts that year;
# earlier months belong to the season that started the previous year
start = year.where(month >= 10, year - 1)
all_stats.loc[missing, 'SEASON_YEAR'] = start.astype(str) + '-' + (start + 1).astype(str).str[2:]

In [404]:
len(all_stats[all_stats['SEASON_YEAR'].isna()])

0
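
The season rule applied above can be sketched for a single date as follows. `season_label` is a hypothetical helper name, not part of the project code, and it assumes the date is a string that starts with `YYYY-MM`.

```python
def season_label(game_date: str) -> str:
    """Return a label such as '1985-86' for a 'YYYY-MM-DD' string (hypothetical helper)."""
    year, month = int(game_date[:4]), int(game_date[5:7])
    # Games from October onward count toward the season that starts that year;
    # games earlier in the calendar year belong to the season that started the year before.
    start = year if month >= 10 else year - 1
    return f"{start}-{str(start + 1)[2:]}"

examples = {d: season_label(d) for d in ["1986-04-12", "1999-11-02", "2023-01-13"]}
print(examples)
```

The October cutoff is a heuristic: a season whose games run into October or later would be mislabeled, so such rows would need separate handling if their `SEASON_YEAR` were missing.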

Then, we convert the `SEASON_YEAR` variable into an integer variable of just the year that the season started (e.g., 1985 for '1985-86').

In [405]:
all_stats['SEASON_YEAR'] = all_stats['SEASON_YEAR'].str.split('-').str[0].astype(int)

A look at the new `SEASON_YEAR` column:

In [406]:
all_stats[['SEASON_YEAR']].sample(10)

Unnamed: 0,SEASON_YEAR
75650,1991
5029,2008
30487,1991
28920,2011
419,1990
12786,2005
66470,2022
68785,2012
80491,2022
65180,2006


In [407]:
all_stats.isna().sum()

SEASON_ID            66172
TEAM_ID                  0
TEAM_ABBREVIATION        0
TEAM_NAME                0
GAME_ID                  0
GAME_DATE                0
MATCHUP                  0
WL                       0
MIN                      0
FGM                      0
FGA                      0
FG_PCT                   0
FG3M                     0
FG3A                     0
FG3_PCT                475
FTM                      0
FTA                      0
FT_PCT                   0
OREB                     0
DREB                     0
REB                      0
AST                      0
STL                      0
BLK                      0
TOV                      0
PF                       0
PTS                      0
PLUS_MINUS               0
VIDEO_AVAILABLE      66172
SEASON_YEAR              0
BLKA                 23370
PFD                  23370
GP_RANK              23370
W_RANK               23370
L_RANK               23370
W_PCT_RANK           23370
MIN_RANK             23370
...

As seen above, there are also 475 missing values in the `FG3_PCT` column. Looking at the `FG3A` column for these rows, we can see that it is always 0: with no three-point attempts, the percentage is undefined and is stored as NaN. Therefore, we fill these missing values with 0.

In [408]:
all_stats[all_stats['FG3_PCT'].isna()]['FG3A'].unique()

array([0])

In [409]:
missing_indices = all_stats[all_stats['FG3_PCT'].isna()].index

In [410]:
# Fill every missing FG3_PCT value in a single vectorized assignment
all_stats.loc[missing_indices, 'FG3_PCT'] = 0
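
To illustrate why zero attempts produce NaN, here is a minimal sketch on toy data (the values are made up and do not come from the CSV). It assumes the percentage is computed as makes divided by attempts, which is how `FG3_PCT` is conventionally defined.

```python
import pandas as pd

# Toy data (not from the project's CSV): the second row has no three-point attempts.
toy = pd.DataFrame({"FG3M": [5, 0], "FG3A": [12, 0]})

# 0 / 0 yields NaN in pandas rather than raising an error.
toy["FG3_PCT"] = toy["FG3M"] / toy["FG3A"]
n_missing = int(toy["FG3_PCT"].isna().sum())

# Replace the undefined percentage with 0, mirroring the fill above.
toy["FG3_PCT"] = toy["FG3_PCT"].fillna(0)
print(n_missing, toy["FG3_PCT"].tolist())
```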

### Dropping Irrelevant Columns

There are a lot of other columns in the dataset that have a significant number of missing values. We will drop these columns, as most of them are also rankings for stats that are already in the dataset.

In [411]:
to_drop = ['SEASON_ID', 'GAME_ID', 'VIDEO_AVAILABLE', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'FGM_RANK', 'FGA_RANK',
           'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK', 'FG3_PCT_RANK', 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK',
           'DREB_RANK', 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK', 'BLKA_RANK', 'PF_RANK', 'PFD_RANK',
           'PTS_RANK', 'PLUS_MINUS_RANK', 'AVAILABLE_FLAG', 'BLKA', 'PFD' ]

In [412]:
all_stats_cleaned = all_stats.drop(columns = to_drop)

In [413]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FG_PCT,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986-04-12,ATL vs. IND,W,240.0,38,88,0.432,...,39,59,22,6,3,12.0,21,108,17.0,1985
1,1610612737,ATL,Atlanta Hawks,1986-04-10,ATL vs. NJN,W,240.0,44,87,0.506,...,27,42,30,15,5,22.0,26,126,9.0,1985
2,1610612737,ATL,Atlanta Hawks,1986-04-08,ATL vs. CHI,W,240.0,52,98,0.531,...,25,42,33,13,6,10.0,22,131,13.0,1985
3,1610612737,ATL,Atlanta Hawks,1986-04-05,ATL @ CHI,L,240.0,40,76,0.526,...,25,38,17,7,7,21.0,28,97,-5.0,1985
4,1610612737,ATL,Atlanta Hawks,1986-04-04,ATL @ WAS,L,265.0,54,100,0.54,...,28,45,24,6,7,14.0,37,129,-6.0,1985
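
The drop list above was written by hand. As an alternative sketch (not the approach used in this notebook), columns with many missing values could also be flagged programmatically. The toy values and the 0.5 cutoff below are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly empty ranking column.
toy = pd.DataFrame({
    "PTS": [108, 126, 131, 97],
    "PTS_RANK": [np.nan, np.nan, np.nan, 3.0],
    "FG3_PCT": [0.25, np.nan, 0.4, 0.5],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Assumed cutoff: flag columns with more than half of their values missing.
THRESHOLD = 0.5
candidates = missing_share[missing_share > THRESHOLD].index.tolist()
print(candidates)
```

A list produced this way would still need a manual review, since a column can be complete and still irrelevant to the model.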


### Cleaning Matchup Column

Next, the `MATCHUP` column contains information on the opponent as well as if it was a home or away game. To make sure these features are clear for the model, we split this information into two separate columns: `OPPONENT` and `HOME`. `HOME` is a binary variable where a value of 1 indicates a home game and a value of 0 indicates an away game. `OPPONENT` contains the team abbreviation of the other team.

Creating `HOME` variable:

In [414]:
# 1 for a home game ('vs.'), 0 for an away game ('@')
home_away = (~all_stats_cleaned['MATCHUP'].str.contains('@', regex=False)).astype(int)

In [415]:
all_stats_cleaned.insert(5, 'HOME', home_away)

Creating `OPPONENT` variable:

In [416]:
# The opponent abbreviation is the last three characters of MATCHUP
opp = all_stats_cleaned['MATCHUP'].str[-3:]

In [417]:
all_stats_cleaned.insert(6, 'OPPONENT', opp)

Finally, we drop the `MATCHUP` column, as its information is now redundant.

In [418]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['MATCHUP'])

In [419]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WL,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986-04-12,1,IND,W,240.0,38,88,...,39,59,22,6,3,12.0,21,108,17.0,1985
1,1610612737,ATL,Atlanta Hawks,1986-04-10,1,NJN,W,240.0,44,87,...,27,42,30,15,5,22.0,26,126,9.0,1985
2,1610612737,ATL,Atlanta Hawks,1986-04-08,1,CHI,W,240.0,52,98,...,25,42,33,13,6,10.0,22,131,13.0,1985
3,1610612737,ATL,Atlanta Hawks,1986-04-05,0,CHI,L,240.0,40,76,...,25,38,17,7,7,21.0,28,97,-5.0,1985
4,1610612737,ATL,Atlanta Hawks,1986-04-04,0,WAS,L,265.0,54,100,...,28,45,24,6,7,14.0,37,129,-6.0,1985
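
The split of `MATCHUP` can be checked on a few sample strings. The sketch below is a variant, not the notebook's code: it takes the last whitespace-separated token as the opponent, so it does not rely on every abbreviation being exactly three characters long.

```python
import pandas as pd

# Sample MATCHUP strings in the two formats seen in the data.
matchup = pd.Series(["ATL vs. IND", "ATL @ CHI", "ATL @ WAS"])

# '@' marks an away game; 'vs.' marks a home game.
home = (~matchup.str.contains("@", regex=False)).astype(int)

# Take the last whitespace-separated token instead of the last three characters,
# so the result does not depend on the abbreviation length.
opponent = matchup.str.split().str[-1]

print(home.tolist(), opponent.tolist())
```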


### Cleaning up Game Date Column

In order for the model to interpret the date of the games, we decided to change the `GAME_DATE` column into datetime objects rather than keeping them as strings.

In [420]:
all_stats_cleaned['GAME_DATE'] = pd.to_datetime(all_stats_cleaned['GAME_DATE'], yearfirst=True, format='ISO8601')

A look at the new `GAME_DATE` column:

In [421]:
all_stats_cleaned[['GAME_DATE']].sample(10)

Unnamed: 0,GAME_DATE
42980,2000-03-31
41448,2018-12-03
25247,2004-11-30
60435,1987-01-01
87535,1997-01-06
11636,1990-11-02
14143,2023-01-13
43011,2000-01-24
69555,2023-02-11
20819,1989-01-30
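
As a small illustration of what the conversion enables, the sketch below parses a few date strings in the same year-month-day layout and reads calendar parts from them. The sample values are taken from the output above; the variable names are illustrative only.

```python
import pandas as pd

# A few date strings in the same year-month-day layout as GAME_DATE.
dates = pd.Series(["2000-03-31", "1987-01-01", "2018-12-03"])
parsed = pd.to_datetime(dates)

# Once parsed, calendar parts and chronological ordering are directly available.
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
earliest = parsed.min()
print(years, months, earliest.date())
```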


### Cleaning up WL Column

The `WL` column states whether the team won or lost that specific game. However, we decided to convert this information into a binary variable `WIN`, which holds 1 for a win and 0 for a loss.

In [422]:
# 1 for a win ('W'), 0 otherwise
win = (all_stats_cleaned['WL'] == 'W').astype(int)

In [423]:
all_stats_cleaned.insert(6, 'WIN', win)

Dropping `WL` column:

In [424]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['WL'])

In [425]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WIN,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986-04-12,1,IND,1,240.0,38,88,...,39,59,22,6,3,12.0,21,108,17.0,1985
1,1610612737,ATL,Atlanta Hawks,1986-04-10,1,NJN,1,240.0,44,87,...,27,42,30,15,5,22.0,26,126,9.0,1985
2,1610612737,ATL,Atlanta Hawks,1986-04-08,1,CHI,1,240.0,52,98,...,25,42,33,13,6,10.0,22,131,13.0,1985
3,1610612737,ATL,Atlanta Hawks,1986-04-05,0,CHI,0,240.0,40,76,...,25,38,17,7,7,21.0,28,97,-5.0,1985
4,1610612737,ATL,Atlanta Hawks,1986-04-04,0,WAS,0,265.0,54,100,...,28,45,24,6,7,14.0,37,129,-6.0,1985
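
One possible variant of this encoding, shown here on toy data, maps the labels through an explicit dictionary. Unlike a plain equality test, it turns any value other than `'W'` or `'L'` into NaN, which can then be detected before casting. This is a sketch, not the notebook's method.

```python
import pandas as pd

wl = pd.Series(["W", "L", "W", "L"])

# Mapping through an explicit dictionary turns any unexpected label into NaN,
# which makes bad values easy to detect before casting to int.
win = wl.map({"W": 1, "L": 0})
assert win.notna().all(), "unexpected value in WL"
win = win.astype(int)
print(win.tolist())
```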


### Merging Home and Away Team Stats of Each Game into a Single Row

Currently, each game is represented by two separate rows in the dataset - one for the home team and one for the away team. To make the data clearer, we decided to combine the two rows into a single row per game.

First, we split the dataset into two parts: home games and away games. Then, we join the two parts, matching each home team with its opponent's row on the same date.

In [426]:
home = all_stats_cleaned[all_stats_cleaned.HOME == 1]
away = all_stats_cleaned[all_stats_cleaned.HOME == 0]

In [427]:
combined_stats = pd.merge(home, away, 
                          left_on=['GAME_DATE', 'OPPONENT'], 
                          right_on=['GAME_DATE', 'TEAM_ABBREVIATION'],
                          suffixes=('_HOME', '_AWAY'))

In [428]:
combined_stats.head(5)

Unnamed: 0,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,GAME_DATE,HOME_HOME,OPPONENT_HOME,WIN_HOME,MIN_HOME,FGM_HOME,FGA_HOME,...,DREB_AWAY,REB_AWAY,AST_AWAY,STL_AWAY,BLK_AWAY,TOV_AWAY,PF_AWAY,PTS_AWAY,PLUS_MINUS_AWAY,SEASON_YEAR_AWAY
0,1610612737,ATL,Atlanta Hawks,1986-04-12,1,IND,1,240.0,38,88,...,36,43,22,7,3,13.0,33,91,-17.0,1985
1,1610612737,ATL,Atlanta Hawks,1986-04-10,1,NJN,1,240.0,44,87,...,30,44,25,10,1,24.0,30,117,-9.0,1985
2,1610612737,ATL,Atlanta Hawks,1986-04-08,1,CHI,1,240.0,52,98,...,35,44,29,5,1,17.0,26,118,-13.0,1985
3,1610612737,ATL,Atlanta Hawks,1986-04-01,1,WAS,1,240.0,41,90,...,30,46,19,10,6,17.0,22,91,-16.0,1985
4,1610612737,ATL,Atlanta Hawks,1986-03-29,1,CLE,0,240.0,36,84,...,25,33,31,8,5,16.0,32,123,18.0,1985


Comparing the number of rows in the combined dataset to the original shows that the row count has been halved (from 89542 to 44771), as each game is now represented by a single row instead of two.

In [429]:
combined_stats.shape

(44771, 55)

In [430]:
all_stats_cleaned.shape

(89542, 28)
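
The join can be reproduced on a toy dataset of two games (four rows). The sketch below uses illustrative values and adds `validate="one_to_one"`, which is not in the notebook's merge call; with it, pandas raises an error if a key appears more than once on either side, for example if a team had two rows for the same date.

```python
import pandas as pd

# Toy version of the per-team rows (two games, four rows).
games = pd.DataFrame({
    "TEAM_ABBREVIATION": ["ATL", "IND", "CHI", "ATL"],
    "GAME_DATE": pd.to_datetime(["1986-04-12", "1986-04-12", "1986-04-05", "1986-04-05"]),
    "HOME": [1, 0, 1, 0],
    "OPPONENT": ["IND", "ATL", "ATL", "CHI"],
    "PTS": [108, 91, 102, 97],
})

home = games[games["HOME"] == 1]
away = games[games["HOME"] == 0]

# validate="one_to_one" makes pandas raise MergeError if a key is duplicated on either side.
combined = pd.merge(
    home, away,
    left_on=["GAME_DATE", "OPPONENT"],
    right_on=["GAME_DATE", "TEAM_ABBREVIATION"],
    suffixes=("_HOME", "_AWAY"),
    validate="one_to_one",
)

# Every home row should have found exactly one away row.
assert len(combined) == len(home) == len(away)
print(combined[["TEAM_ABBREVIATION_HOME", "TEAM_ABBREVIATION_AWAY", "PTS_HOME", "PTS_AWAY"]])
```

Because the merge is an inner join, a home row without a matching away row would be silently dropped, so comparing the row counts as above is a useful check.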

In [431]:
combined_stats.columns

Index(['TEAM_ID_HOME', 'TEAM_ABBREVIATION_HOME', 'TEAM_NAME_HOME', 'GAME_DATE',
       'HOME_HOME', 'OPPONENT_HOME', 'WIN_HOME', 'MIN_HOME', 'FGM_HOME',
       'FGA_HOME', 'FG_PCT_HOME', 'FG3M_HOME', 'FG3A_HOME', 'FG3_PCT_HOME',
       'FTM_HOME', 'FTA_HOME', 'FT_PCT_HOME', 'OREB_HOME', 'DREB_HOME',
       'REB_HOME', 'AST_HOME', 'STL_HOME', 'BLK_HOME', 'TOV_HOME', 'PF_HOME',
       'PTS_HOME', 'PLUS_MINUS_HOME', 'SEASON_YEAR_HOME', 'TEAM_ID_AWAY',
       'TEAM_ABBREVIATION_AWAY', 'TEAM_NAME_AWAY', 'HOME_AWAY',
       'OPPONENT_AWAY', 'WIN_AWAY', 'MIN_AWAY', 'FGM_AWAY', 'FGA_AWAY',
       'FG_PCT_AWAY', 'FG3M_AWAY', 'FG3A_AWAY', 'FG3_PCT_AWAY', 'FTM_AWAY',
       'FTA_AWAY', 'FT_PCT_AWAY', 'OREB_AWAY', 'DREB_AWAY', 'REB_AWAY',
       'AST_AWAY', 'STL_AWAY', 'BLK_AWAY', 'TOV_AWAY', 'PF_AWAY', 'PTS_AWAY',
       'PLUS_MINUS_AWAY', 'SEASON_YEAR_AWAY'],
      dtype='object')

### Dropping Duplicate Columns

After merging the rows, some columns appear twice or are no longer necessary. These columns include `MIN_HOME`/`MIN_AWAY` (length of game in minutes), `SEASON_YEAR_HOME`/`SEASON_YEAR_AWAY`, `OPPONENT_HOME`, `OPPONENT_AWAY`, `HOME_HOME` and `HOME_AWAY`.

We first checked whether `MIN_HOME` and `MIN_AWAY` have the same value in each row. As seen below, there are 24 games where the minutes differ slightly. Since the differences are small, we kept `MIN_HOME`, renamed it `MIN`, and dropped `MIN_AWAY`.

In [432]:
(combined_stats['MIN_HOME'] != combined_stats['MIN_AWAY']).sum()

np.int64(24)

In [433]:
combined_stats[combined_stats['MIN_HOME'] != combined_stats['MIN_AWAY']][['MIN_HOME','MIN_AWAY']]

Unnamed: 0,MIN_HOME,MIN_AWAY
455,48.0,47.448
3612,48.0,47.637333
6039,48.0,47.906667
7608,48.0,47.517333
12325,48.0,47.357667
19857,48.0,47.599333
24354,48.0,47.813333
25946,48.0,47.456
30645,53.0,52.906667
32173,47.881,48.0


In [434]:
combined_stats = combined_stats.drop(columns = ['MIN_AWAY', 'OPPONENT_HOME', 'OPPONENT_AWAY', 'HOME_HOME', 'HOME_AWAY', 'SEASON_YEAR_AWAY'])
combined_stats.rename(columns={'MIN_HOME': 'MIN', 'SEASON_YEAR_HOME': 'SEASON_YEAR'}, inplace=True)

In [435]:
combined_stats.columns

Index(['TEAM_ID_HOME', 'TEAM_ABBREVIATION_HOME', 'TEAM_NAME_HOME', 'GAME_DATE',
       'WIN_HOME', 'MIN', 'FGM_HOME', 'FGA_HOME', 'FG_PCT_HOME', 'FG3M_HOME',
       'FG3A_HOME', 'FG3_PCT_HOME', 'FTM_HOME', 'FTA_HOME', 'FT_PCT_HOME',
       'OREB_HOME', 'DREB_HOME', 'REB_HOME', 'AST_HOME', 'STL_HOME',
       'BLK_HOME', 'TOV_HOME', 'PF_HOME', 'PTS_HOME', 'PLUS_MINUS_HOME',
       'SEASON_YEAR', 'TEAM_ID_AWAY', 'TEAM_ABBREVIATION_AWAY',
       'TEAM_NAME_AWAY', 'WIN_AWAY', 'FGM_AWAY', 'FGA_AWAY', 'FG_PCT_AWAY',
       'FG3M_AWAY', 'FG3A_AWAY', 'FG3_PCT_AWAY', 'FTM_AWAY', 'FTA_AWAY',
       'FT_PCT_AWAY', 'OREB_AWAY', 'DREB_AWAY', 'REB_AWAY', 'AST_AWAY',
       'STL_AWAY', 'BLK_AWAY', 'TOV_AWAY', 'PF_AWAY', 'PTS_AWAY',
       'PLUS_MINUS_AWAY'],
      dtype='object')

## Exploratory Data Analysis
In this section, we take a look at the data to better understand the different features as well as any possible trends.

## Feature Selection / Feature Importance 

In [436]:
all_stats_cleaned.columns

Index(['TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_DATE', 'HOME',
       'OPPONENT', 'WIN', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A',
       'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'SEASON_YEAR'],
      dtype='object')