# **Cleaning the Data**

## Objectives

* Do initial exploration of data and clean it. There may be some preparation of the data as well.

## Inputs

* We will require the CSV files obtained in the Data Collection notebook.

## Outputs

* We will have cleaned versions of several of our CSV files.

## Additional Comments

* Remember to use Python 3.8.18.


---

# Change working directory

We need to change the working directory to the one containing the raw form of the CSV files.

In [2]:
import os
home_dir = '/workspace/pp5-ml-dashboard'
csv_dir ='/workspace/pp5-ml-dashboard/outputs/datasets/raw/csv' 
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


# Section 1: Exploring the columns of `game.csv`

We have the following CSV files.

In [3]:
files = os.listdir(csv_dir)
for file in files:
    print(f"--{file}")

--game.csv
--line_score.csv
--other_stats.csv
--team_history.csv


In the future, we may perform more extensive analysis which takes into account some of the data in the other CSV files. For the time being, these 4 files will be more than enough. (Probably too much even.)

First, let's look at what columns we have in `game.csv`. 

Note: The utility function `get_df` doesn't require the file extension, and its default directory is the one containing the CSV files.
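For readers without access to `src.utils`, a helper with the behavior described in the note could look roughly like the sketch below. This is an assumption about how `get_df` works, not its actual source: the name `get_df_sketch`, the `directory` parameter and the `DEFAULT_CSV_DIR` constant are all hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical default; the real default is defined inside src.utils.
DEFAULT_CSV_DIR = "/workspace/pp5-ml-dashboard/outputs/datasets/raw/csv"


def get_df_sketch(name, directory=DEFAULT_CSV_DIR):
    """Read `<directory>/<name>.csv` into a DataFrame (assumed behavior)."""
    return pd.read_csv(os.path.join(directory, f"{name}.csv"))


# Demonstrate on a temporary file rather than the real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"game_id": [1, 2], "pts_home": [66, 56]}).to_csv(
        os.path.join(tmp_dir, "demo.csv"), index=False
    )
    demo_df = get_df_sketch("demo", directory=tmp_dir)

print(demo_df.shape)
```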

In [4]:
import pandas as pd
from src.utils import get_df
from src.inspection_tools import get_info_df, info_dtype_dict

game_df = get_df('game')
print(game_df.shape)
game_df.head()

(65698, 55)


Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
0,21946,1610610035,HUS,Toronto Huskies,24600001,1946-11-01 00:00:00,HUS vs. NYK,L,0,25.0,...,,,,,,,68.0,2,0,Regular Season
1,21946,1610610034,BOM,St. Louis Bombers,24600003,1946-11-02 00:00:00,BOM vs. PIT,W,0,20.0,...,,,,,,25.0,51.0,-5,0,Regular Season
2,21946,1610610032,PRO,Providence Steamrollers,24600002,1946-11-02 00:00:00,PRO vs. BOS,W,0,21.0,...,,,,,,,53.0,-6,0,Regular Season
3,21946,1610610025,CHS,Chicago Stags,24600004,1946-11-02 00:00:00,CHS vs. NYK,W,0,21.0,...,,,,,,22.0,47.0,-16,0,Regular Season
4,21946,1610610028,DEF,Detroit Falcons,24600005,1946-11-02 00:00:00,DEF vs. WAS,L,0,10.0,...,,,,,,,50.0,17,0,Regular Season


These games in the first few rows are very old. One consequence of this is that many of the common statistics of today were not tracked. One way to address this would be to impute the values, but the game of basketball evolved dramatically and so using the mean would not be very reflective of the actual game. We are missing a lot of data in these rows and so we will drop them. After all, the data frame contains data for approximately 65 thousand games.

Let's look at what columns are present.

In [5]:
game_df.columns

Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type'],
      dtype='object')

Many of these columns are redundant categorical data. For example:
- team_abbreviation_home
- team_name_home
- matchup_home
- video_available_home

as well as the away team versions, are not statistics that are relevant to what happened during the game. The team name is even less informative than it looks, because team names have changed over time while the `team_id` has not.
# Attention
To Sean: did I actually do the below? or am I doing this? Edit this section appropriately nearer to the end of the project.
This is why we have kept the `team_history.csv` file, so that we can see what the team name is at various points in history. We could also keep the name columns and drop them at the last moment.

Also, we can drop any percentage based statistic as it will be completely determined by other features. For example, the columns `fg3m_home` and `fg3a_home` completely determine `fg3_pct_home`. So we will also remove any column containing the string `'_pct_'` in it.

In [6]:
game_drop_features = ['team_name_home','team_abbreviation_home','team_abbreviation_away','team_name_away','video_available_home','video_available_away', 'matchup_home','matchup_away']
pct_columns = [col for col in game_df.columns if '_pct_' in col]
game_drop_features.extend(pct_columns)

We will drop these features later in the notebook. We may still want to look at the data they contain in order to measure the quality of the data.
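The claim that a percentage column is fully determined by the matching makes and attempts columns can be checked on a few toy rows. The values below are made up for illustration and are not taken from `game.csv`.

```python
import pandas as pd

# Toy rows, not taken from game.csv.
toy_shots = pd.DataFrame({
    "fg3m_home": [5, 10, 0],
    "fg3a_home": [20, 25, 0],
})

# Percentage = makes / attempts. With zero attempts the result is NaN.
toy_shots["fg3_pct_home"] = toy_shots["fg3m_home"] / toy_shots["fg3a_home"]
print(toy_shots)
```

The last row hints at one possible reason a percentage column can contain missing values even when the underlying counts are present. We have not verified that this is what happens in `game.csv`.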

# Section 2: Exploring the rows of `game.csv`
Consider the column `season_type`. It has the following unique values.

In [7]:
game_df['season_type'].unique()

array(['Regular Season', 'Playoffs', 'All-Star', 'All Star', 'Pre Season'],
      dtype=object)

Clearly, the All-Star games are out of place and don't represent a standard competition. Furthermore, the All-Star teams only play one game a season. Statistically, the number of these games is negligible compared to the 1230 games played during a regular season.

In [8]:
all_star_games = game_df.query('season_type in ["All-Star", "All Star"]')
print(all_star_games.shape[0])
game_df = game_df.query('season_type != "All-Star"')
game_df = game_df.query('season_type != "All Star"')


128
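The same filter can be written as a single boolean mask with `Series.isin`, which handles both spellings of the All-Star label at once. This is a sketch on a toy frame, and the names `toy_games` and `all_star_labels` are ours.

```python
import pandas as pd

toy_games = pd.DataFrame({
    "game_id": [1, 2, 3, 4],
    "season_type": ["Regular Season", "All-Star", "All Star", "Playoffs"],
})

# Both spellings that appear in the season_type column.
all_star_labels = ["All-Star", "All Star"]
is_all_star = toy_games["season_type"].isin(all_star_labels)

print(is_all_star.sum())                 # number of All-Star rows
filtered_games = toy_games[~is_all_star]  # everything else
print(filtered_games["season_type"].tolist())
```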


There are also preseason games. Many veterans don't take these games seriously, and some of them are exhibition matches against non-NBA teams. So we remove these games from the dataset as well.

In [9]:
preseason = game_df.query('season_type == "Pre Season"')
print(preseason.shape[0])
preseason.team_name_away.unique()
game_df = game_df.query('season_type != "Pre Season"')

1536


We are primarily interested in Regular Season games, but we will leave the Playoff games in for the time being.

Let's look back at the head of `game_df`.

In [10]:
game_df.head()

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
0,21946,1610610035,HUS,Toronto Huskies,24600001,1946-11-01 00:00:00,HUS vs. NYK,L,0,25.0,...,,,,,,,68.0,2,0,Regular Season
1,21946,1610610034,BOM,St. Louis Bombers,24600003,1946-11-02 00:00:00,BOM vs. PIT,W,0,20.0,...,,,,,,25.0,51.0,-5,0,Regular Season
2,21946,1610610032,PRO,Providence Steamrollers,24600002,1946-11-02 00:00:00,PRO vs. BOS,W,0,21.0,...,,,,,,,53.0,-6,0,Regular Season
3,21946,1610610025,CHS,Chicago Stags,24600004,1946-11-02 00:00:00,CHS vs. NYK,W,0,21.0,...,,,,,,22.0,47.0,-16,0,Regular Season
4,21946,1610610028,DEF,Detroit Falcons,24600005,1946-11-02 00:00:00,DEF vs. WAS,L,0,10.0,...,,,,,,,50.0,17,0,Regular Season


A lot of basic statistics, such as rebounds, are missing. Recall that the 3-point line wasn't introduced until 1979. The game of basketball has changed over the years. However, it probably changed most dramatically when it absorbed the ABA. Therefore, we will only consider games after the merger. 
# Attention
Or will I also get rid of games before the 3-point line. I don't think it is so crazy. It could be interesting to consider both. Then imputation will be slightly interesting as well. However, there may be other factors that impact our decision as to where to truncate our dataset.

In [11]:
from src.utils import add_cat_date

game_df = add_cat_date(game_df,'game_date')
print(game_df.shape)
game_df_after_1975 = game_df.query('Year >= 1975')
print(game_df_after_1975.shape)

(64034, 58)
(53349, 58)
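The column count grows from 55 to 58 because `add_cat_date` adds `Day`, `Month` and `Year` columns. A helper with that effect might look like the sketch below. This is an assumption about its behavior, not the code in `src.utils`, and the name `add_cat_date_sketch` is hypothetical.

```python
import pandas as pd


def add_cat_date_sketch(df, date_col):
    """Return a copy of df with Day, Month and Year columns (assumed behavior)."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Day"] = dates.dt.day
    out["Month"] = dates.dt.month
    out["Year"] = dates.dt.year
    return out


toy_dates = pd.DataFrame(
    {"game_date": ["1946-11-01 00:00:00", "1985-10-25 00:00:00"]}
)
toy_dates = add_cat_date_sketch(toy_dates, "game_date")

# Same style of filter as in the cell above.
recent = toy_dates.query("Year >= 1975")
print(recent)
```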


There are many missing values, as mentioned above. We would like to remove seasons to minimize the amount of missing data. To find an appropriate range, we will look at how the amount of missing data changes over time. We expect the amount of missing data to decrease as the NBA got better at keeping records. Notice that there is a pattern in the `season_id` column: a regular season game in the 1948-1949 season has `season_id` 21948, while a playoff game from that season has `season_id` 41948. This also means that the `'season_type'` column is redundant, so we will add it to the list of columns to be removed.

In [12]:
from src.inspection_tools import single_season, compute_years, season_data

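# Use a separate name so the imported season_data function is not overwritten
# (otherwise this cell fails if it is run a second time).
season_df = season_data(game_df)
print(season_df.head(10))
print(season_df.iloc[-5:])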

      season_id         years        game_types
0         21946  [1946, 1947]  [Regular Season]
331       41946        [1947]        [Playoffs]
350       21947  [1947, 1948]  [Regular Season]
544       41947        [1948]        [Playoffs]
565       21948  [1948, 1949]  [Regular Season]
925       41948        [1949]        [Playoffs]
945       21949  [1949, 1950]  [Regular Season]
1508      41949        [1950]        [Playoffs]
1538      21950  [1950, 1951]  [Regular Season]
1892      41950        [1951]        [Playoffs]
       season_id         years        game_types
62841      42020        [2021]        [Playoffs]
62928      22021  [2021, 2022]  [Regular Season]
64224      42021        [2022]        [Playoffs]
64312      22022  [2022, 2023]  [Regular Season]
65612      42022        [2023]        [Playoffs]
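The `season_id` pattern can be decoded with integer arithmetic: the leading digit gives the game type and the last four digits give the year in which the season started. The sketch below maps only the two prefixes seen in the output above; `decode_season_id` is a name we introduce for illustration.

```python
# Only the two prefixes seen in the output above are mapped.
SEASON_TYPE_BY_PREFIX = {2: "Regular Season", 4: "Playoffs"}


def decode_season_id(season_id):
    """Split a season_id such as 21948 into (game type, starting year)."""
    prefix, start_year = divmod(int(season_id), 10000)
    return SEASON_TYPE_BY_PREFIX.get(prefix, "unknown"), start_year


print(decode_season_id(21948))
print(decode_season_id(41948))
```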


We want to see how much each season contributes to the amount of missing data. We will make a dataframe consisting only of the number of missing values for each statistic. We then remove all of the 'good columns,' those where no data is missing.

In [13]:
game_df.isna().sum()
missing_season_data = game_df.isna().groupby(game_df['season_id']).sum()
missing_season_data.head(10)

good_cols = []
for col in missing_season_data.columns:
    unique_vals = missing_season_data[col].unique()
    # print(col, len(unique_vals))
    if len(unique_vals) == 1 and 0 in unique_vals:
        if col not in good_cols:
            good_cols.append(col)
    elif '_pct_' in col:
        good_cols.append(col)

print(good_cols)
missing_season_data.drop(labels=good_cols, inplace=True, axis=1)

['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home', 'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fg_pct_home', 'fg3_pct_home', 'ft_pct_home', 'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away', 'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away', 'fg_pct_away', 'fg3_pct_away', 'ft_pct_away', 'pts_away', 'plus_minus_away', 'video_available_away', 'season_type', 'Day', 'Month', 'Year', 'years']
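The per-season missing counts and the search for complete columns can also be expressed without an explicit loop. The sketch below uses a toy frame with made-up values. Unlike the cell above, it does not treat the `'_pct_'` columns as a special case.

```python
import numpy as np
import pandas as pd

toy_stats = pd.DataFrame({
    "season_id": [21946, 21946, 21985, 21985],
    "pts_home": [66.0, 56.0, 100.0, 108.0],
    "reb_home": [np.nan, np.nan, 40.0, 45.0],
})

# Count the missing values in each column, season by season.
missing_by_season = toy_stats.isna().groupby(toy_stats["season_id"]).sum()

# A column is complete if it has zero missing values in every season.
complete_cols = missing_by_season.columns[(missing_by_season == 0).all()].tolist()

print(missing_by_season)
print(complete_cols)
```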


In [14]:
missing_season_data.filter(['fgm_home','fgm_away','ftm_home','ftm_away']).head()

Unnamed: 0_level_0,fgm_home,fgm_away,ftm_home,ftm_away
season_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
21946,0,0,0,0
21947,5,5,6,5
21948,0,0,0,0
21949,3,3,3,3
21950,4,3,4,3


What do we have left now? Both `'fgm'` and `'ftm'` have missing values, but there are very few of them. They can likely be imputed somehow, or the rows can be dropped altogether.

In [15]:
#missing_season_data.drop(labels=['fgm_home','fgm_away','ftm_home','ftm_away'], inplace=True, axis=1)


Through inspection, we find that there is a dramatic change in record keeping at the beginning of the 1985-1986 season. This seems like a good place to cut things off.

In [16]:

total_missing = missing_season_data.sum(axis=1)
print(total_missing[30:36])
missing_season_data[30:36]

season_id
21982    10960
21983    10804
21984    10746
21985        0
21986        0
21987        0
dtype: int64


Unnamed: 0_level_0,fgm_home,fga_home,fg3m_home,fg3a_home,ftm_home,fta_home,oreb_home,dreb_home,reb_home,ast_home,...,ftm_away,fta_away,oreb_away,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away
season_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
21982,0,25,0,862,0,0,927,943,23,29,...,0,0,927,943,23,26,919,862,867,24
21983,0,0,0,862,0,0,920,936,19,21,...,0,0,920,936,17,21,908,862,865,10
21984,0,0,0,864,0,0,934,942,7,4,...,0,0,934,942,5,4,890,864,867,2
21985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
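The cutoff found by inspection can also be located programmatically. The sketch below uses only the six regular season totals printed above, so it shows the idea rather than checking the full table. It looks for the earliest season from which no missing values remain.

```python
import pandas as pd

# Totals copied from the printed output above (regular seasons only).
totals = pd.Series(
    [10960, 10804, 10746, 0, 0, 0],
    index=pd.Index([21982, 21983, 21984, 21985, 21986, 21987], name="season_id"),
)

# For each season, the number of missing values from that season onward.
missing_from_here_on = totals.iloc[::-1].cumsum().iloc[::-1]

# The earliest season after which nothing is missing.
first_complete_season = missing_from_here_on[missing_from_here_on == 0].index.min()
print(first_complete_season)
```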


We will now check that something similar is true for the playoff games.

In [17]:
play_offs_df = game_df.query('season_type == "Playoffs"')
missing_play_off_data = play_offs_df.isna().groupby(play_offs_df['season_id']).sum()
missing_play_off_data.head(10)

good_cols = []
for col in missing_play_off_data.columns:
    unique_vals = missing_play_off_data[col].unique()
    if len(unique_vals) == 1 and 0 in unique_vals:
        if col not in good_cols:
            good_cols.append(col)
    elif any([string in col for string in ['_pct_','fgm_','ftm_']]):
        good_cols.append(col)

print(good_cols)
missing_play_off_data.drop(labels=good_cols, inplace=True, axis=1)

['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home', 'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home', 'fg_pct_home', 'fg3_pct_home', 'ftm_home', 'ft_pct_home', 'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away', 'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away', 'fgm_away', 'fg_pct_away', 'fg3_pct_away', 'ftm_away', 'ft_pct_away', 'pts_away', 'plus_minus_away', 'video_available_away', 'season_type', 'Day', 'Month', 'Year', 'years']


In [18]:
total_missing_play_offs = missing_play_off_data.sum(axis=1)
print(total_missing_play_offs[30:36])
missing_play_off_data[30:36]

season_id
41981    502
41982    292
41983    817
41984      0
41985      0
41986      0
dtype: int64


Unnamed: 0_level_0,fga_home,fg3m_home,fg3a_home,fta_home,oreb_home,dreb_home,reb_home,ast_home,stl_home,blk_home,...,fg3a_away,fta_away,oreb_away,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away
season_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
41981,20,0,24,0,36,40,20,21,36,26,...,24,0,36,40,20,21,36,26,25,3
41982,0,0,22,0,25,26,2,2,24,22,...,22,0,25,26,2,2,24,22,23,0
41983,0,0,66,0,71,71,0,1,68,66,...,66,0,71,71,0,0,68,66,66,0
41984,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In fact, it seems that the improvement in record keeping began during the 1984-1985 playoffs. We will still exclude those postseason games, as they are "missing the context" of the season preceding them.

Now we need to remove the games whose `'season_id'` ends in a year earlier than 1985.

In [19]:
from src.inspection_tools import cutoff_year

game_df['after_1985'] = game_df.apply(lambda x: cutoff_year(x['season_id'], 1985), axis=1)
game_df_after_1985 = game_df.query('after_1985 == True')

game_df_after_1985.head()

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,pf_away,pts_away,plus_minus_away,video_available_away,season_type,Day,Month,Year,years,after_1985
19186,21985,1610612737,ATL,Atlanta Hawks,28500005,1985-10-25 00:00:00,ATL vs. WAS,L,240,41.0,...,19.0,100.0,9,0,Regular Season,25,10,1985,"[1985, 1986]",True
19187,21985,1610612758,SAC,Sacramento Kings,28500006,1985-10-25 00:00:00,SAC vs. LAC,L,240,39.0,...,32.0,108.0,4,0,Regular Season,25,10,1985,"[1985, 1986]",True
19188,21985,1610612765,DET,Detroit Pistons,28500010,1985-10-25 00:00:00,DET vs. MIL,W,240,39.0,...,32.0,116.0,-2,0,Regular Season,25,10,1985,"[1985, 1986]",True
19189,21985,1610612762,UTH,Utah Jazz,28500011,1985-10-25 00:00:00,UTH vs. HOU,L,240,42.0,...,28.0,112.0,4,0,Regular Season,25,10,1985,"[1985, 1986]",True
19190,21985,1610612744,GOS,Golden State Warriors,28500008,1985-10-25 00:00:00,GOS vs. DEN,L,240,36.0,...,40.0,119.0,14,0,Regular Season,25,10,1985,"[1985, 1986]",True
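Given the `season_id` pattern, a check like `cutoff_year` can be written with the modulo operator. The sketch below is our guess at the logic, not the implementation in `src.inspection_tools`, and `cutoff_year_sketch` is a hypothetical name. It also shows an equivalent vectorized comparison that avoids a row-wise `apply`.

```python
import pandas as pd


def cutoff_year_sketch(season_id, year):
    """True if the season encoded in season_id starts in `year` or later (assumed)."""
    return int(season_id) % 10000 >= year


toy_seasons = pd.DataFrame({"season_id": [21984, 41984, 21985, 41985]})

# Element-wise version, similar in spirit to the cell above.
toy_seasons["after_1985"] = toy_seasons["season_id"].apply(
    cutoff_year_sketch, year=1985
)

# Vectorized version of the same test.
vectorized = toy_seasons["season_id"] % 10000 >= 1985
print(toy_seasons)
```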


Let's briefly check to see how much missing data there is.

In [20]:
game_df_after_1985.isna().sum(axis=1).sum()

484

We will just double-check that our function worked properly and that we haven't missed any games from the 1985-1986 regular season or playoffs. This is probably overkill.

In [21]:
all_season_ids = game_df['season_id'].unique()
after_1985_ids = game_df_after_1985['season_id'].unique()
missing_season_ids = []
for id in all_season_ids:
    if id in after_1985_ids:
        actual = game_df.query(f'season_id == {id}')
        computed = game_df_after_1985.query(f'season_id == {id}')
        if not actual.equals(computed):
            raise ValueError(f'{id} is not equal')
    else:
        missing_season_ids.append(id)

regular_season_misses = [id for id in missing_season_ids if str(id)[0]=='2']
play_offs_misses = [id for id in missing_season_ids if str(id)[0]=='4']
last_regular_season = max(regular_season_misses)
last_play_offs = max(play_offs_misses)
print(last_regular_season, last_play_offs)

regular_season_1985 = game_df.query('season_id == 21985')
play_offs_1985 = game_df.query('season_id == 41985')
computed_regular_season_1985 = game_df_after_1985.query('season_id == 21985')
computed_play_offs_1985 = game_df_after_1985.query('season_id == 41985')

print(regular_season_1985.equals(computed_regular_season_1985))
print(play_offs_1985.equals(computed_play_offs_1985))


21984 41984
True
True


We are finally ready to remove the columns discussed above. We should also remove some of the columns we added to aid with our analysis. We will remove `'season_type'` since that data is contained in the `'season_id'` and we will remove `'game_date'` since that column is now superfluous.

In [22]:
print(game_df.shape)
game_drop_features.extend(['after_1985','years','season_type','game_date'])
try:
    game_df_after_1985_cleaned = game_df_after_1985.drop(labels=game_drop_features, axis=1)
except Exception as e:
    print(str(e))
else:
    print("The columns have been successfully removed.")

print(game_df_after_1985_cleaned.shape)


(64034, 60)
The columns have been successfully removed.
(44909, 42)


We have removed a lot of data. We dropped 18 columns that contained irrelevant or redundant data. Despite discarding almost 40 years' worth of games, we still have the data from almost 45 thousand games. There are also no missing values to impute. Some might argue that we were too hasty, but in the 1984-1985 regular season there were 10746 missing values across only 943 games, an average of more than 11 missing values per game. All of these missing values are in columns that we keep, so dropping columns alone would not have fixed the problem.

In [23]:
print(game_df.query('season_id == 21984').shape)
print(game_df.query('season_id == 21984').isna().sum(axis=1).sum())
for col in game_drop_features:
    game_df.drop(labels=col, axis=1, inplace=True)
print(game_df.query('season_id == 21984').isna().sum(axis=1).sum())

(943, 60)
12489
10746
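The arithmetic behind these numbers is short enough to spell out. The three constants below are copied from the output above.

```python
games_1984 = 943           # rows with season_id == 21984
missing_all_cols = 12489   # missing values before dropping columns
missing_kept_cols = 10746  # missing values after dropping columns

# Average number of missing values per game in the columns we keep.
avg_missing_per_game = missing_kept_cols / games_1984
print(round(avg_missing_per_game, 2))

# Missing values that sat in the dropped columns.
missing_in_dropped_cols = missing_all_cols - missing_kept_cols
print(missing_in_dropped_cols)
```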


---

Finally, we save our cleaned game statistics.

In [24]:
target_dir = '/workspace/pp5-ml-dashboard/outputs/datasets/clean/csv'
if not os.path.exists(target_dir):
    os.makedirs(target_dir)
os.chdir(target_dir)
game_df_after_1985_cleaned.to_csv('game_data_clean.csv', index=False)
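An alternative that avoids changing the working directory is to build the full output path and write to it directly. The sketch below uses `pathlib` and a temporary directory so that it can run anywhere. `save_clean_csv` is a hypothetical helper, not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean_csv(df, target_dir, filename):
    """Write df to target_dir/filename, creating the directory if needed."""
    target_path = Path(target_dir)
    target_path.mkdir(parents=True, exist_ok=True)
    out_file = target_path / filename
    df.to_csv(out_file, index=False)
    return out_file


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_clean = pd.DataFrame({"game_id": [28500005], "pts_home": [91.0]})
    saved = save_clean_csv(
        toy_clean, Path(tmp_dir) / "clean" / "csv", "game_data_clean.csv"
    )
    round_trip = pd.read_csv(saved)  # read back to confirm the write

print(round_trip.shape)
```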


# Attention
Do I want to clean the other statistics? Do I want to save them? I think it is simple enough to truncate them. I can also do that later I guess. I need to also add sections to the above appropriately as well as what comes next.

I could do a correlation study for each season, that would be interesting.

# Section 3: Exploring the columns of `line_score.csv` and `other_stats.csv`

There are two other datasets that we would like to use. We will need to truncate and clean them as well. But first, let's change the directory back and look at the `'other_stats.csv'` and `'line_score.csv'`.

This is a section I will only look at after the preliminary work is done. I would have to cut about 17k games to include `other_stats`, but only about 4k to include `line_score`. Alternatively, I could try to impute the missing rows, but that seems like a lot of imputation.

In [48]:
os.chdir(home_dir)
print(os.getcwd())
os.listdir(csv_dir)
line_score_df = get_df('line_score')
other_stats_df = get_df('other_stats')
print(line_score_df.columns)
print(other_stats_df.columns)
print(line_score_df.shape)
print(other_stats_df.shape)

/workspace/pp5-ml-dashboard
Index(['game_date_est', 'game_sequence', 'game_id', 'team_id_home',
       'team_abbreviation_home', 'team_city_name_home', 'team_nickname_home',
       'team_wins_losses_home', 'pts_qtr1_home', 'pts_qtr2_home',
       'pts_qtr3_home', 'pts_qtr4_home', 'pts_ot1_home', 'pts_ot2_home',
       'pts_ot3_home', 'pts_ot4_home', 'pts_ot5_home', 'pts_ot6_home',
       'pts_ot7_home', 'pts_ot8_home', 'pts_ot9_home', 'pts_ot10_home',
       'pts_home', 'team_id_away', 'team_abbreviation_away',
       'team_city_name_away', 'team_nickname_away', 'team_wins_losses_away',
       'pts_qtr1_away', 'pts_qtr2_away', 'pts_qtr3_away', 'pts_qtr4_away',
       'pts_ot1_away', 'pts_ot2_away', 'pts_ot3_away', 'pts_ot4_away',
       'pts_ot5_away', 'pts_ot6_away', 'pts_ot7_away', 'pts_ot8_away',
       'pts_ot9_away', 'pts_ot10_away', 'pts_away'],
      dtype='object')
Index(['game_id', 'league_id', 'team_id_home', 'team_abbreviation_home',
       'team_city_home', 'pts_paint_home'

In [50]:
line_score_game_ids = line_score_df['game_id'].unique()
game_df_after_1985_cleaned_game_ids = game_df_after_1985_cleaned['game_id'].unique()
other_stats_game_ids = other_stats_df['game_id'].unique()

print(len(line_score_game_ids), len(game_df_after_1985_cleaned_game_ids), len(other_stats_game_ids))

other_stats_missing_from_cleaned = []
for id in other_stats_game_ids:
    if id not in game_df_after_1985_cleaned_game_ids:
        other_stats_missing_from_cleaned.append(id)

cleaned_id_missing_from_other_stats = []
for id in game_df_after_1985_cleaned_game_ids:
    if id not in other_stats_game_ids:
        cleaned_id_missing_from_other_stats.append(id)

print(len(other_stats_missing_from_cleaned), len(cleaned_id_missing_from_other_stats))

line_id_missing_from_cleaned = []
for id in line_score_game_ids:
    if id not in game_df_after_1985_cleaned_game_ids:
        line_id_missing_from_cleaned.append(id)

cleaned_id_missing_from_line = []
for id in game_df_after_1985_cleaned_game_ids:
    if id not in line_score_game_ids:
        cleaned_id_missing_from_line.append(id)

print(len(line_id_missing_from_cleaned), len(cleaned_id_missing_from_line))


58013 44909 28261
853 17501
17743 4639
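The membership loops above can be replaced by set differences, which give the same counts without scanning an array for every ID. The sketch below uses a handful of toy IDs rather than the real ones.

```python
# Toy game IDs for illustration only.
cleaned_ids = {28500005, 28500006, 28500010}
other_stats_ids = {28500006, 29600001}

# IDs in other_stats that are absent from the cleaned game data, and vice versa.
other_missing_from_cleaned = other_stats_ids - cleaned_ids
cleaned_missing_from_other = cleaned_ids - other_stats_ids

print(len(other_missing_from_cleaned), len(cleaned_missing_from_other))
```

With the real data, the sets would be built with something like `set(other_stats_df['game_id'])`.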


Many of these columns are redundant categorical information. Some are redundant when taken together with the data in `'game.csv'`. Let us 

In [33]:
game_df.query('game_id == 32200001')

Unnamed: 0,season_id,team_id_home,game_id,wl_home,min,fgm_home,fga_home,fg3m_home,fg3a_home,ftm_home,...,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,Day,Month,Year


---


# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [26]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is not empty
except Exception as e:
  print(e)