# NBA Game Predictor -- Import & Clean Data
Because the data takes a long time to download, a cleaned copy is already included in the project as a csv file. If that file is not accessible, you can run the code below to download the data, clean it, and export it to a csv file.
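The load-or-download choice described above could be wrapped in a small helper. This is a minimal sketch and not part of the original notebook: `load_or_build` and its parameters are hypothetical names, and the caller supplies both the loader and the download function.

```python
from pathlib import Path


def load_or_build(path, load, build):
    """Return load(path) if the file exists; otherwise fall back to build().

    `load` and `build` are caller-supplied callables (assumed names):
    `load` reads the cached file, `build` regenerates the data.
    """
    if Path(path).exists():
        return load(path)
    return build()
```

In this notebook, `load` could be `pd.read_csv`, `path` could be `'all_stats_cleaned.csv'`, and `build` could be a function that wraps the download and cleaning cells below.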

In [None]:
# to install the nba_api package, uncomment line below
#! pip install nba_api

In [None]:
# import necessary packages
import pandas as pd
import numpy as np
import datetime  # used by get_all_nba_seasons below
import time
from nba_api.stats.static import teams
from nba_api.stats.endpoints import teamgamelogs

## Gather NBA Data

Only games from the 1985-86 season onward are loaded, because earlier seasons are missing a large share of the game statistics. The following code downloads all of the data and saves it to a csv file for later use.

In [None]:
nba_teams = teams.get_teams()
# collect the ID of every team
all_teamids = [team['id'] for team in nba_teams]

In [None]:
def get_all_nba_seasons(start_year = 2024, end_year = None):
    # Build season labels such as '2024-25' for every season that starts
    # in start_year through end_year (inclusive)
    if end_year is None:
        end_year = datetime.datetime.now().year

    return [f"{year}-{str(year + 1)[2:]}" for year in range(start_year, end_year + 1)]

In [None]:
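# start at the 1985-86 season, the first one with reasonably complete statistics
all_seasons = get_all_nba_seasons(start_year = 1985, end_year = 2025)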

In [None]:
def get_team_logs(team_id, season):
    team_log = teamgamelogs.TeamGameLogs(team_id_nullable = team_id, season_nullable = season)
    games = team_log.get_data_frames()[0]
    return games

In [None]:
dfs = []
for t_id in all_teamids:
    for season in all_seasons:
        curr_game_logs = get_team_logs(t_id, season)
        if not curr_game_logs.empty:  # skip team/season pairs with no games
            dfs.append(curr_game_logs)
        time.sleep(.600)
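The loop above pauses 0.6 seconds between requests, but a single failed request would still stop the whole download. One possible safeguard, which the original code does not include, is to retry a failed call a few times with a growing pause. The sketch below is generic: `fetch_with_retry`, `retries`, and `delay` are hypothetical names, and `fetch` is any zero-argument callable.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=0.6):
    """Call fetch() up to `retries` times, pausing longer after each failure."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(delay * attempt)
```

Inside the download loop, this could be used as `fetch_with_retry(lambda: get_team_logs(t_id, season))`.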

In [None]:
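all_stats = pd.concat(dfs, ignore_index = True)
all_stats.to_csv('all_game_stats.csv', index = False)  # save the raw data for the cleaning steps below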

In [None]:
all_stats.head()

## Clean Data

### Impute Missing Values
As shown below, a number of rows are missing the `SEASON_YEAR` variable. We therefore derive `SEASON_YEAR` from the `GAME_DATE` variable and fill in the missing values.

In [None]:
all_stats = pd.read_csv('all_game_stats.csv')
all_stats.head()

In [None]:
all_stats['SEASON_YEAR'].isna().sum()

In [None]:
# Fill in missing SEASON_YEAR values from GAME_DATE (a string starting with 'YYYY-MM').
# Games played in October or later belong to the season that starts that year;
# games in earlier months belong to the season that started the previous year.
for index in all_stats.index[all_stats['SEASON_YEAR'].isna()]:
    game_date = all_stats.loc[index, 'GAME_DATE']
    year_index = game_date.find('-')
    year = int(game_date[:year_index])
    month = int(game_date[year_index+1:year_index+3])
    if month >= 10:
        season = str(year) + "-" + str(year+1)[2:]
    else:
        season = str(year-1) + "-" + str(year)[2:]
    all_stats.loc[index, 'SEASON_YEAR'] = season

In [None]:
len(all_stats[all_stats['SEASON_YEAR'].isna()])
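The date-to-season rule used above can be isolated into a small function, which makes it easy to check on individual dates. This is a sketch under the same assumption the loop makes, namely that the date string starts with `YYYY-MM`; `season_from_date` is a hypothetical name.

```python
def season_from_date(game_date: str) -> str:
    """Map a 'YYYY-MM-...' date string to a season label such as '1996-97'."""
    year, month = int(game_date[:4]), int(game_date[5:7])
    # October or later: the season starts this year; otherwise it started last year
    start = year if month >= 10 else year - 1
    return f"{start}-{str(start + 1)[2:]}"
```

The same function could be applied to the whole column with `all_stats['GAME_DATE'].map(season_from_date)`.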

Then, we convert the `SEASON_YEAR` variable into an integer variable of just the year that the season started (e.g., 1985 for '1985-86').

In [None]:
all_stats['SEASON_YEAR'] = all_stats['SEASON_YEAR'].str.split('-').str[0].astype(int)

A look at the new `SEASON_YEAR` column:

In [None]:
all_stats[['SEASON_YEAR']].sample(10)

In [None]:
all_stats.isna().sum()

As seen above, there are also 475 missing values in the `FG3_PCT` column. For every one of these rows, `FG3A` is 0: the team attempted no three-pointers, so the percentage is undefined and appears as NaN. We therefore fill these missing values with 0.

In [None]:
all_stats[all_stats['FG3_PCT'].isna()]['FG3A'].unique()

In [None]:
missing_indices = all_stats[all_stats['FG3_PCT'].isna()].index

In [None]:
# no three-pointers were attempted in these games, so record the percentage as 0
all_stats.loc[missing_indices, 'FG3_PCT'] = 0
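The missing percentages come from dividing makes by zero attempts. A minimal sketch of the convention adopted here, where zero attempts yields 0 instead of an undefined value, is shown below; `shooting_pct` is a hypothetical helper, not a function from the dataset or from `nba_api`.

```python
def shooting_pct(made: int, attempted: int) -> float:
    """Return made / attempted, treating zero attempts as 0.0 (the convention used here)."""
    if attempted == 0:
        return 0.0
    return made / attempted
```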

### Dropping Irrelevant Columns
Many of the other columns in the dataset have a significant number of missing values. We drop these columns, since most of them are rankings of stats that are already in the dataset.

In [None]:
all_stats.columns

In [None]:
to_keep = ['SEASON_YEAR', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',
       'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',
       'TOV', 'STL', 'BLK', 'PF', 'PTS', 'PLUS_MINUS']

In [None]:
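# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
all_stats_cleaned = all_stats[to_keep].copy()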

In [None]:
all_stats_cleaned.head()

### Fixing Team ID
Since team IDs appear to start at 1610612737, we subtract this value from each `TEAM_ID` to get more readable numbers.

In [None]:
all_stats_cleaned['TEAM_ID'] = all_stats_cleaned['TEAM_ID'] - 1610612737

In [None]:
all_stats_cleaned.head(5)

We then create two dictionaries so we can look up a team's ID from its abbreviation and vice versa.

In [None]:
team_id_to_abb = {} # team_id -> list of abbreviations (an ID can appear under more than one)
team_abb_to_id = {} # team_abbreviation -> team_id

# named team_pairs so it does not shadow the `teams` module imported from nba_api
team_pairs = all_stats_cleaned[['TEAM_ID', 'TEAM_ABBREVIATION']].drop_duplicates()

for _, row in team_pairs.iterrows():
    team_id_to_abb.setdefault(row['TEAM_ID'], []).append(row['TEAM_ABBREVIATION'])
    team_abb_to_id[row['TEAM_ABBREVIATION']] = row['TEAM_ID']
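The two mappings are not symmetric: one ID can map to several abbreviations, while each abbreviation maps to exactly one ID. The toy example below illustrates that shape with `collections.defaultdict`. The `pairs` values are illustrative only and are not read from the dataset.

```python
from collections import defaultdict

# Illustrative (team_id, abbreviation) pairs; assumed values, not taken from the data
pairs = [(0, 'ATL'), (14, 'NJN'), (14, 'BKN')]

id_to_abbs = defaultdict(list)  # team_id -> list of abbreviations
abb_to_id = {}                  # abbreviation -> team_id

for team_id, abb in pairs:
    id_to_abbs[team_id].append(abb)
    abb_to_id[abb] = team_id
```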

### Cleaning Matchup Column
Next, the `MATCHUP` column contains information on the opponent as well as if it was a home or away game. To make sure these features are clear for the model, we split this information into two separate columns: `OPPONENT` and `HOME`. `HOME` is a binary variable where a value of 1 indicates a home game and a value of 0 indicates an away game. `OPPONENT` contains the team abbreviation of the other team.

Creating `HOME` variable:

In [None]:
# '@' marks an away game (0); anything else is a home game (1)
home_away = (~all_stats_cleaned['MATCHUP'].str.contains('@', regex=False)).astype(int).tolist()

In [None]:
all_stats_cleaned.insert(5, 'HOME', home_away)

Creating `OPPONENT` variable:

In [None]:
# the opponent's abbreviation is the last three characters of MATCHUP
opp = all_stats_cleaned['MATCHUP'].str[-3:].tolist()

In [None]:
all_stats_cleaned.insert(6, 'OPPONENT', opp)
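The two steps above can also be expressed as a single helper that parses one `MATCHUP` string, which is convenient for checking individual values. This is a sketch that assumes matchups look like `'ATL @ BOS'` (away) or `'ATL vs. BOS'` (home); `parse_matchup` is a hypothetical name.

```python
from typing import Tuple


def parse_matchup(matchup: str) -> Tuple[int, str]:
    """Split a MATCHUP string into (home flag, opponent abbreviation)."""
    home = 0 if '@' in matchup else 1   # '@' marks an away game
    opponent = matchup.split()[-1]      # the opponent is the last token
    return home, opponent
```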

Finally, we drop the `MATCHUP` column, as its information is now redundant.

In [None]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['MATCHUP'])

In [None]:
all_stats_cleaned.head()

### Cleaning up Game Date Column
So that the model can interpret game dates, we convert the `GAME_DATE` column from strings into datetime objects.

In [None]:
all_stats_cleaned['GAME_DATE'] = pd.to_datetime(all_stats_cleaned['GAME_DATE'], yearfirst=True, format='ISO8601')

A look at the new `GAME_DATE` column:

In [None]:
all_stats_cleaned[['GAME_DATE']].sample(5)
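Once dates are real datetime objects, they can be compared and subtracted, which plain strings do not support in a meaningful way. The sketch below uses the standard library on two illustrative ISO 8601 strings. The dates are made up for the example and are not taken from the dataset.

```python
from datetime import datetime

# Illustrative ISO 8601 date strings (assumed values)
first = datetime.fromisoformat("2024-10-22T00:00:00")
second = datetime.fromisoformat("2024-10-25T00:00:00")

# Subtracting datetimes gives a timedelta, here the number of days between games
days_between = (second - first).days
```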

### Cleaning up WL Column
The `WL` column states whether the team won or lost that game. We convert this information into a binary variable `WIN`, which holds 1 for a win and 0 for a loss.

In [None]:
# 1 for a win ('W'), 0 otherwise
win = (all_stats_cleaned['WL'] == 'W').astype(int).tolist()

In [None]:
all_stats_cleaned.insert(6, 'WIN', win)

Dropping `WL` column:

In [None]:
all_stats_cleaned = all_stats_cleaned.drop(columns = ['WL'])

In [None]:
all_stats_cleaned.head()

In [None]:
all_stats_cleaned.to_csv('all_stats_cleaned.csv', index = False)