# Cleaning
In this notebook, we clean several data sets and save the results as CSV files in the outputs directory.

## Section 1: Set up
This is the standard setup at the beginning of each notebook. We change the working directory to the parent of the current directory, import the relevant packages, and load the data frames.

In [1]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from feature_engine.selection import DropFeatures
from utils import get_df, get_info_df, info_dtype_dict, add_cat_date


games = get_df('game')
line_scores = get_df('line_score')
other_stats = get_df('other_stats')
team_info_common = get_df('team_info_common')

We will remove some of the categorical columns from the games data frame that are not pertinent to our investigation, such as team abbreviations.

In [None]:
#print(info_dtype_dict(games))
#print(info_dtype_dict(other_stats))
#print(info_dtype_dict(line_scores))
#print(info_dtype_dict(team_info_common))

In [2]:
games_drop_features = [
    'team_name_home', 'team_abbreviation_home', 'team_abbreviation_away', 'team_name_away',
    'video_available_home', 'video_available_away', 'matchup_home', 'matchup_away',
]
other_stats_drop_features = ['team_abbreviation_home', 'team_city_home', 'team_abbreviation_away', 'team_city_away']
# Also mark every percentage column for removal, since the raw data is more informative
percentage_stats = [key for key in games.keys() if '_pct_' in key]
games_drop_features.extend(percentage_stats)
percentage_stats = [key for key in other_stats.keys() if '_pct_' in key]
other_stats_drop_features.extend(percentage_stats)
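
The cell above only builds the lists of column names; it does not drop anything yet. As a minimal sketch of how such a list could be applied, the example below uses plain pandas `DataFrame.drop` on a made-up toy frame. This swaps in pandas for the imported `DropFeatures` transformer, and the toy frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the games frame; the values are made up for illustration.
toy_games = pd.DataFrame({
    "game_id": [1, 2],
    "team_name_home": ["A", "B"],
    "fgm_home": [40, 35],
    "fga_home": [80, 90],
    "fg_pct_home": [0.5, 0.389],
})

# 'matchup_home' is deliberately absent from the toy frame.
toy_drop_features = ["team_name_home", "matchup_home"]
toy_drop_features.extend(col for col in toy_games.columns if "_pct_" in col)

# errors="ignore" skips names that are not in the frame instead of raising a KeyError.
toy_cleaned = toy_games.drop(columns=toy_drop_features, errors="ignore")
print(list(toy_cleaned.columns))
```

Using `errors="ignore"` is a design choice: it lets one list be reused across frames that do not all share the same columns, at the cost of silently ignoring misspelled names.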



Inspect the different types of games. Note that all-star games are included. Their teams are completely different from the regular teams, so I remove them. I should potentially remove the pre-season and playoff games as well, since their results may be less stable than regular-season results. It would be interesting to leave them in for clustering, but perhaps that is another project. I also remove the pre-season games, as they may involve non-NBA teams.

In [3]:

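# Drop all-star games (both spellings are checked) and pre-season games
games = games[games['season_type'] != 'All-Star']
games = games[games['season_type'] != 'All Star']
games = games[games['season_type'] != 'Pre Season']
regular_games = games[games['season_type'] == 'Regular Season']
# Last expression in the cell, so the remaining labels are displayed
games['season_type'].unique()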
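
Before filtering, it can help to count how many games carry each label, so that unexpected spellings show up. The sketch below does this on a made-up toy frame and then removes several labels in one step with `isin`. The 'Playoffs' label and all values are assumptions for illustration, not taken from the data.

```python
import pandas as pd

# Toy frame; the labels other than those filtered in the notebook are assumed.
toy = pd.DataFrame({
    "game_id": [1, 2, 3, 4, 5],
    "season_type": ["Regular Season", "All-Star", "Pre Season", "Playoffs", "All Star"],
})

# Count games per label first, so every spelling is visible before filtering.
print(toy["season_type"].value_counts())

# Remove several labels in one step instead of chaining one filter per label.
excluded = ["All-Star", "All Star", "Pre Season"]
kept = toy[~toy["season_type"].isin(excluded)]
print(sorted(kept["season_type"].unique()))
```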

Now we want to find the range of dates over which we have NaN values. I expect the data to be complete after a certain date. At the very least, we should not impute the data globally: there are long-term trends, and we want to treat the data as evolving over time.
I should probably also remove the COVID games?

We create a test data frame that contains only the columns with NaN values.

In [4]:

games = add_cat_date(games, 'game_date')
# Display the column names to confirm that Day, Month, and Year were added
games.columns


Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type', 'Day', 'Month', 'Year'],
      dtype='object')
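
The output above shows three new columns, 'Day', 'Month', and 'Year', after the call to `add_cat_date`. The helper lives in `utils` and its implementation is not shown here, so the following is only a guess at what such a helper might do. The function name `add_date_parts` and the toy dates are hypothetical.

```python
import pandas as pd

def add_date_parts(df, date_col):
    """Hypothetical stand-in for add_cat_date: return a copy with Day, Month, Year columns."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Day"] = dates.dt.day
    out["Month"] = dates.dt.month
    out["Year"] = dates.dt.year
    return out

# Made-up dates for illustration.
toy = pd.DataFrame({"game_date": ["1946-11-01 00:00:00", "2020-03-11 00:00:00"]})
toy = add_date_parts(toy, "game_date")
print(toy[["Day", "Month", "Year"]])
```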

In [None]:
# Names of the columns that contain at least one missing value
null_cols = games.columns[games.isnull().any()].tolist()
#print(null_cols)
test_games = games[null_cols]
test_games.info()
test_games.head()
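
To see where the missing values sit in time, one option is to compute the share of missing entries per year and then look for the first year from which the data stays complete. The sketch below does this on a made-up toy frame; it assumes a 'Year' column like the one listed in the output above, and the values and the choice of columns are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; early years are missing one statistic.
toy = pd.DataFrame({
    "Year": [1950, 1950, 1980, 1980, 1990, 1990],
    "fg3m_home": [np.nan, np.nan, 2.0, np.nan, 5.0, 7.0],
    "pts_home": [80.0, 75.0, 101.0, 99.0, 110.0, 95.0],
})
stat_cols = ["fg3m_home", "pts_home"]

# Share of missing entries per year and column (0.0 means fully observed).
missing_by_year = toy[stat_cols].isnull().groupby(toy["Year"]).mean()
print(missing_by_year)

# Walk backwards from the latest year and stop at the first incomplete one.
first_complete_year = None
for year in sorted(missing_by_year.index, reverse=True):
    if missing_by_year.loc[year].eq(0).all():
        first_complete_year = year
    else:
        break
print(first_complete_year)
```

Restricting the analysis to games from `first_complete_year` onwards would avoid imputation altogether, at the cost of discarding the earlier games.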




At some point, I should transform team_id_home and team_id_away into categorical variables and one-hot encode them. I think I should check whether they are stable in some sense.
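
A minimal sketch of both steps, on a made-up toy frame: it reads "stable" as "the team ID appears in every season", which is only one possible interpretation, and then one-hot encodes the two ID columns with `pd.get_dummies`. The ID and season values are invented for illustration.

```python
import pandas as pd

# Toy frame with made-up IDs and seasons.
toy = pd.DataFrame({
    "season_id": [22019, 22019, 22020, 22020],
    "team_id_home": [101, 102, 101, 103],
    "team_id_away": [102, 101, 103, 101],
})

# Stability check (assumed meaning): which IDs appear, home or away, in every season?
long_ids = pd.concat([
    toy[["season_id", "team_id_home"]].rename(columns={"team_id_home": "team_id"}),
    toy[["season_id", "team_id_away"]].rename(columns={"team_id_away": "team_id"}),
])
seasons_per_team = long_ids.groupby("team_id")["season_id"].nunique()
n_seasons = toy["season_id"].nunique()
stable_ids = sorted(seasons_per_team[seasons_per_team == n_seasons].index)
print(stable_ids)

# Cast to 'category' so the IDs are treated as labels, then one-hot encode them.
encoded = pd.get_dummies(
    toy.astype({"team_id_home": "category", "team_id_away": "category"}),
    columns=["team_id_home", "team_id_away"],
    prefix=["home", "away"],
)
print(list(encoded.columns))
```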