# NBA Game Predictor Model
### CMPE 257 Project
Authors: Kaushika Uppu, Miranda Billawala, Yun Ei Hlaing, Iris Cheung

## Imports

In [1]:
import pandas as pd
import numpy as np
import time

## NBA Game Data

First, we load all of the NBA game data from the CSV file. The code for gathering the data is in a separate file and uses the `nba_api` package. Only games from the 1985-86 season onward are loaded, as earlier seasons are missing a significant portion of the game statistics.

In [11]:
all_stats = pd.read_csv('all_game_stats.csv')
all_stats.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,AVAILABLE_FLAG
0,21985.0,1610612737,ATL,Atlanta Hawks,28500933,1986-04-12,ATL vs. IND,W,240.0,38,...,,,,,,,,,,
1,21985.0,1610612737,ATL,Atlanta Hawks,28500921,1986-04-10,ATL vs. NJN,W,240.0,44,...,,,,,,,,,,
2,21985.0,1610612737,ATL,Atlanta Hawks,28500908,1986-04-08,ATL vs. CHI,W,240.0,52,...,,,,,,,,,,
3,21985.0,1610612737,ATL,Atlanta Hawks,28500891,1986-04-05,ATL @ CHI,L,240.0,40,...,,,,,,,,,,
4,21985.0,1610612737,ATL,Atlanta Hawks,28500884,1986-04-04,ATL @ WAS,L,265.0,54,...,,,,,,,,,,


In [12]:
all_stats.shape

(89542, 59)

## Data Cleaning and Pre-Processing

### Imputing Missing Values

As shown below, there are a number of rows with the `SEASON_YEAR` variable missing. Therefore, we will calculate the `SEASON_YEAR` based on the `GAME_DATE` variable and fill in those missing values.

In [14]:
all_stats['SEASON_YEAR'].isna().sum()

23370

In [15]:
# Derive SEASON_YEAR (e.g. '1985-86') from GAME_DATE for rows where it is missing.
# Games played from October onward belong to the season starting that year;
# games played earlier in the year belong to the season that started the previous year.
for index, row in all_stats.iterrows():
    if pd.isna(row['SEASON_YEAR']):
        year, month = (int(part) for part in row['GAME_DATE'].split('-')[:2])
        start_year = year if month >= 10 else year - 1
        season = str(start_year) + "-" + str(start_year + 1)[2:]
        all_stats.loc[index, 'SEASON_YEAR'] = season

In [16]:
len(all_stats[all_stats['SEASON_YEAR'].isna()])

0
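
The same rule can also be applied to the whole column at once. The following is a minimal sketch, assuming `GAME_DATE` holds `YYYY-MM-DD` strings; the helper name `season_label` and the toy frame `sample` are hypothetical and stand in for the real data.

```python
import pandas as pd

def season_label(game_dates: pd.Series) -> pd.Series:
    """Map 'YYYY-MM-DD' strings to season labels such as '1985-86'."""
    dates = pd.to_datetime(game_dates)
    # Same rule as the loop above: months 10-12 start a new season.
    start_year = dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)
    # Two-digit suffix of the following year, zero-padded (2000 -> '00').
    end_suffix = ((start_year + 1) % 100).astype(str).str.zfill(2)
    return start_year.astype(str) + "-" + end_suffix

# Hypothetical toy frame standing in for all_stats.
sample = pd.DataFrame({
    "GAME_DATE": ["1986-04-12", "1985-11-02", "1999-12-25"],
    "SEASON_YEAR": [None, "1985-86", None],
})

# Only the missing entries are filled; existing labels are left untouched.
sample["SEASON_YEAR"] = sample["SEASON_YEAR"].fillna(season_label(sample["GAME_DATE"]))
print(sample)
```

Because `fillna` aligns on the index, rows that already have a label keep it, which mirrors the `pd.isna` check in the loop.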

In [17]:
all_stats.isna().sum()

SEASON_ID            66172
TEAM_ID                  0
TEAM_ABBREVIATION        0
TEAM_NAME                0
GAME_ID                  0
GAME_DATE                0
MATCHUP                  0
WL                       0
MIN                      0
FGM                      0
FGA                      0
FG_PCT                   0
FG3M                     0
FG3A                     0
FG3_PCT                475
FTM                      0
FTA                      0
FT_PCT                   0
OREB                     0
DREB                     0
REB                      0
AST                      0
STL                      0
BLK                      0
TOV                      0
PF                       0
PTS                      0
PLUS_MINUS               0
VIDEO_AVAILABLE      66172
SEASON_YEAR              0
BLKA                 23370
PFD                  23370
GP_RANK              23370
W_RANK               23370
L_RANK               23370
W_PCT_RANK           23370
MIN_RANK             23370
...

As seen above, there are also 475 missing values in the `FG3_PCT` column. In every one of these rows `FG3A` is 0, so the three-point percentage is undefined (no shots were attempted), which is why `FG3_PCT` is NaN. We therefore fill these missing values with 0.

In [18]:
all_stats[all_stats['FG3_PCT'].isna()]['FG3A'].unique()

array([0], dtype=int64)

In [19]:
missing_indices = all_stats[all_stats['FG3_PCT'].isna()].index

In [20]:
# FG3_PCT is NaN only where FG3A is 0, so 0 is used as the fill value.
all_stats.loc[missing_indices, 'FG3_PCT'] = 0
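
As a small illustration of the reasoning above, the fill can be restricted to rows where no three-pointers were attempted, so that any other NaN would remain visible. This is a sketch on a hypothetical toy frame (`sample`), not the project's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the three-point columns matter here.
sample = pd.DataFrame({
    "FG3M": [0, 5, 0],
    "FG3A": [0, 12, 3],
    "FG3_PCT": [np.nan, 0.417, 0.0],
})

# Fill only where the percentage is missing AND no attempts were recorded,
# so a NaN with FG3A > 0 would stay visible instead of being silently zeroed.
no_attempts = sample["FG3_PCT"].isna() & (sample["FG3A"] == 0)
sample.loc[no_attempts, "FG3_PCT"] = 0.0
print(sample)
```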

### Dropping Irrelevant Columns

There are a lot of other columns in the dataset that have a significant number of missing values. We will drop these columns, as most of them are also rankings for stats that are already in the dataset.

In [21]:
to_drop = ['SEASON_ID', 'GAME_ID', 'VIDEO_AVAILABLE', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'FGM_RANK', 'FGA_RANK',
           'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK', 'FG3_PCT_RANK', 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK',
           'DREB_RANK', 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK', 'BLKA_RANK', 'PF_RANK', 'PFD_RANK',
           'PTS_RANK', 'PLUS_MINUS_RANK', 'AVAILABLE_FLAG', 'BLKA', 'PFD' ]

In [22]:
all_stats_cleaned = all_stats.drop(columns = to_drop)
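
One way to sanity-check a drop list like the one above is to look at the share of missing values per column. The sketch below uses a hypothetical toy frame (`sample`), and the 0.25 cutoff is an arbitrary illustration rather than a threshold used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete column and two sparse ones.
sample = pd.DataFrame({
    "PTS": [108, 126, 131, 97],
    "PTS_RANK": [np.nan, np.nan, np.nan, 3.0],
    "VIDEO_AVAILABLE": [np.nan, np.nan, 1.0, 1.0],
})

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)

# Columns whose missing share exceeds the (illustrative) cutoff.
sparse_columns = missing_share[missing_share > 0.25].index.tolist()
print(missing_share)
print(sparse_columns)
```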

In [23]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FG_PCT,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986-04-12,ATL vs. IND,W,240.0,38,88,0.432,...,39,59,22,6,3,12.0,21,108,17.0,1985-86
1,1610612737,ATL,Atlanta Hawks,1986-04-10,ATL vs. NJN,W,240.0,44,87,0.506,...,27,42,30,15,5,22.0,26,126,9.0,1985-86
2,1610612737,ATL,Atlanta Hawks,1986-04-08,ATL vs. CHI,W,240.0,52,98,0.531,...,25,42,33,13,6,10.0,22,131,13.0,1985-86
3,1610612737,ATL,Atlanta Hawks,1986-04-05,ATL @ CHI,L,240.0,40,76,0.526,...,25,38,17,7,7,21.0,28,97,-5.0,1985-86
4,1610612737,ATL,Atlanta Hawks,1986-04-04,ATL @ WAS,L,265.0,54,100,0.54,...,28,45,24,6,7,14.0,37,129,-6.0,1985-86


### Cleaning Matchup Column

Next, the `MATCHUP` column encodes both the opponent and whether the game was played at home or away. To make these features explicit for the model, we split this information into two separate columns: `OPPONENT` and `HOME`. `HOME` is a binary variable where a value of 1 indicates a home game and a value of 0 indicates an away game. `OPPONENT` contains the team abbreviation of the other team.

Creating `HOME` variable:

In [27]:
# '@' in MATCHUP marks an away game; 'vs.' marks a home game.
home_away = [0 if '@' in matchup else 1 for matchup in all_stats_cleaned['MATCHUP']]

In [28]:
all_stats_cleaned.insert(5, 'HOME', home_away)

Creating `OPPONENT` variable:

In [29]:
# The opponent's abbreviation is the last three characters of MATCHUP.
opp = [matchup[-3:] for matchup in all_stats_cleaned['MATCHUP']]

In [30]:
all_stats_cleaned.insert(6, 'OPPONENT', opp)

Finally, we drop the `MATCHUP` column, as its information is now redundant.

In [31]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['MATCHUP'])
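
For reference, the same split can be written with pandas string methods instead of list comprehensions. This is a minimal sketch on a hypothetical toy frame (`sample`), assuming every `MATCHUP` value ends in a three-letter team abbreviation and uses `@` only for away games.

```python
import pandas as pd

# Hypothetical toy rows in the same format as the MATCHUP column.
sample = pd.DataFrame({"MATCHUP": ["ATL vs. IND", "ATL @ CHI", "ATL @ WAS"]})

# '@' marks an away game; anything else ('vs.') is treated as a home game.
sample["HOME"] = (~sample["MATCHUP"].str.contains("@", regex=False)).astype(int)

# The opponent is assumed to be the final three-letter abbreviation.
sample["OPPONENT"] = sample["MATCHUP"].str[-3:]
print(sample)
```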

In [32]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,HOME,OPPONENT,WL,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986-04-12,1,IND,W,240.0,38,88,...,39,59,22,6,3,12.0,21,108,17.0,1985-86
1,1610612737,ATL,Atlanta Hawks,1986-04-10,1,NJN,W,240.0,44,87,...,27,42,30,15,5,22.0,26,126,9.0,1985-86
2,1610612737,ATL,Atlanta Hawks,1986-04-08,1,CHI,W,240.0,52,98,...,25,42,33,13,6,10.0,22,131,13.0,1985-86
3,1610612737,ATL,Atlanta Hawks,1986-04-05,0,CHI,L,240.0,40,76,...,25,38,17,7,7,21.0,28,97,-5.0,1985-86
4,1610612737,ATL,Atlanta Hawks,1986-04-04,0,WAS,L,265.0,54,100,...,28,45,24,6,7,14.0,37,129,-6.0,1985-86


### Cleaning up Game Date Column

In order for the model to interpret the date of the games, we decided to separate the string `GAME_DATE` column into three separate numerical columns: `YEAR`, `MONTH`, and `DAY`.

In [33]:
# GAME_DATE is a 'YYYY-MM-DD' string; characters 0-3 hold the year.
year = [int(date[:4]) for date in all_stats_cleaned['GAME_DATE']]

In [34]:
# Characters 5-6 of GAME_DATE hold the month.
month = [int(date[5:7]) for date in all_stats_cleaned['GAME_DATE']]

In [35]:
# Characters 8-9 of GAME_DATE hold the day.
day = [int(date[8:10]) for date in all_stats_cleaned['GAME_DATE']]

In [36]:
all_stats_cleaned.insert(4, 'YEAR', year)
all_stats_cleaned.insert(5, 'MONTH', month)
all_stats_cleaned.insert(6, 'DAY', day)

Dropping the now redundant `GAME_DATE` column:

In [37]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['GAME_DATE'])
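
An alternative to slicing the string is to parse the dates once and read the components from the `.dt` accessor. The sketch below assumes the dates are plain `YYYY-MM-DD` strings (the explicit `format` would need adjusting otherwise) and uses a hypothetical toy frame (`sample`).

```python
import pandas as pd

# Hypothetical toy rows in the same format as the GAME_DATE column.
sample = pd.DataFrame({"GAME_DATE": ["1986-04-12", "1986-04-08"]})

# Parse once, then pull out the numeric components.
parsed = pd.to_datetime(sample["GAME_DATE"], format="%Y-%m-%d")
sample["YEAR"] = parsed.dt.year
sample["MONTH"] = parsed.dt.month
sample["DAY"] = parsed.dt.day
print(sample)
```

Parsing also validates the input: a malformed date raises an error instead of silently producing a wrong slice.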

In [38]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,YEAR,MONTH,DAY,HOME,OPPONENT,WL,MIN,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986,4,12,1,IND,W,240.0,...,39,59,22,6,3,12.0,21,108,17.0,1985-86
1,1610612737,ATL,Atlanta Hawks,1986,4,10,1,NJN,W,240.0,...,27,42,30,15,5,22.0,26,126,9.0,1985-86
2,1610612737,ATL,Atlanta Hawks,1986,4,8,1,CHI,W,240.0,...,25,42,33,13,6,10.0,22,131,13.0,1985-86
3,1610612737,ATL,Atlanta Hawks,1986,4,5,0,CHI,L,240.0,...,25,38,17,7,7,21.0,28,97,-5.0,1985-86
4,1610612737,ATL,Atlanta Hawks,1986,4,4,0,WAS,L,265.0,...,28,45,24,6,7,14.0,37,129,-6.0,1985-86


### Cleaning up WL Column

The `WL` column states whether the team won or lost that specific game. We convert this information into a binary variable `WIN`, which holds 1 for a win and 0 for a loss.

In [39]:
# Encode the outcome as 1 for a win ('W') and 0 otherwise.
win = [1 if wl == 'W' else 0 for wl in all_stats_cleaned['WL']]

In [40]:
all_stats_cleaned.insert(8, 'WIN', win)

Dropping `WL` column:

In [41]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['WL'])
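
The same encoding can be expressed as a vectorized comparison. A minimal sketch on a hypothetical toy frame (`sample`), assuming `WL` contains only `'W'` and `'L'`:

```python
import pandas as pd

# Hypothetical toy rows in the same format as the WL column.
sample = pd.DataFrame({"WL": ["W", "L", "W"]})

# The comparison yields booleans; casting to int gives 1 for a win, 0 for a loss.
sample["WIN"] = (sample["WL"] == "W").astype(int)
print(sample)
```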

In [42]:
all_stats_cleaned.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,YEAR,MONTH,DAY,HOME,OPPONENT,WIN,MIN,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,SEASON_YEAR
0,1610612737,ATL,Atlanta Hawks,1986,4,12,1,IND,1,240.0,...,39,59,22,6,3,12.0,21,108,17.0,1985-86
1,1610612737,ATL,Atlanta Hawks,1986,4,10,1,NJN,1,240.0,...,27,42,30,15,5,22.0,26,126,9.0,1985-86
2,1610612737,ATL,Atlanta Hawks,1986,4,8,1,CHI,1,240.0,...,25,42,33,13,6,10.0,22,131,13.0,1985-86
3,1610612737,ATL,Atlanta Hawks,1986,4,5,0,CHI,0,240.0,...,25,38,17,7,7,21.0,28,97,-5.0,1985-86
4,1610612737,ATL,Atlanta Hawks,1986,4,4,0,WAS,0,265.0,...,28,45,24,6,7,14.0,37,129,-6.0,1985-86


## Exploratory Data Analysis
In this section, we take a look at the data to better understand the different features as well as any possible trends.