# **Cleaning the Data**

## Objectives

* Inspect data to determine how best to clean it.

## Inputs

* `game.csv` obtained in the previous notebook.

## Outputs

* The output of this notebook is a cleaned version of `game.csv`.

## Additional Comments

* Remember to use Python 3.8.18.

* During our cluster analysis in notebook 07, we noticed that data for the 2012 season is missing from the original dataset. We won't worry about this, as it was discovered late in the life cycle of the project and retrieving the data is beyond its scope.


---

#### Change working directory

In [1]:
import os

home_dir = "/workspace/pp5-ml-dashboard"
csv_dir = "datasets/raw/csv"
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


## Section 1: Exploring the columns of our Dataset

In [2]:
import pandas as pd
from src.utils import get_df


game_df = get_df("game", "datasets/raw/csv")
print(game_df.shape)
game_df.head()

(65698, 54)


game_id,season_id,team_id_home,team_abbreviation_home,team_name_home,game_date,matchup_home,wl_home,min,fgm_home,fga_home,...,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
24600001,21946,1610610035,HUS,Toronto Huskies,1946-11-01 00:00:00,HUS vs. NYK,L,0,25.0,,...,,,,,,,68.0,2,0,Regular Season
24600003,21946,1610610034,BOM,St. Louis Bombers,1946-11-02 00:00:00,BOM vs. PIT,W,0,20.0,59.0,...,,,,,,25.0,51.0,-5,0,Regular Season
24600002,21946,1610610032,PRO,Providence Steamrollers,1946-11-02 00:00:00,PRO vs. BOS,W,0,21.0,,...,,,,,,,53.0,-6,0,Regular Season
24600004,21946,1610610025,CHS,Chicago Stags,1946-11-02 00:00:00,CHS vs. NYK,W,0,21.0,,...,,,,,,22.0,47.0,-16,0,Regular Season
24600005,21946,1610610028,DEF,Detroit Falcons,1946-11-02 00:00:00,DEF vs. WAS,L,0,10.0,,...,,,,,,,50.0,17,0,Regular Season


The games in the first few rows are very old. One consequence is that many statistics that are common today were not tracked. One way to address this would be to impute the values, but the game of basketball has evolved dramatically, so using the mean would not reflect the actual games. Filling in values from previous games would bias the data significantly, since much of the data is missing for decades at a time. Given the amount of missing data, we will drop these rows. We will still have plenty of games to look at afterwards.

Let's look at what columns are present.

In [3]:
game_df.columns

Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home', 'fga_home',
       'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home', 'ftm_home',
       'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home', 'reb_home',
       'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home', 'pts_home',
       'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type'],
      dtype='object')

Many of these columns are redundant categorical data. For example:
- team_abbreviation_home
- matchup_home
- video_available_home

These, as well as the away team versions, are not statistics that are relevant to what happened during the game. We will drop things like team name when we begin analyzing the data.

Also, we can drop any percentage-based statistic, as it is completely determined by other features. For example, the columns `fg3m_home` and `fg3a_home` completely determine `fg3_pct_home`. So we will also remove any column whose name contains the string `'_pct_'`.
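
To make the redundancy concrete, here is a minimal sketch on made-up values (only the column names come from `game.csv`). It recomputes a percentage from makes and attempts and then picks out the percentage columns by name. The real dataset may store rounded percentages, so an exact match is not guaranteed there.

```python
import pandas as pd

# Toy data: the column names follow game.csv, the values are made up.
toy = pd.DataFrame(
    {
        "fg3m_home": [5, 0, 12],
        "fg3a_home": [20, 4, 30],
    }
)

# The percentage is fully determined by makes and attempts.
toy["fg3_pct_home"] = toy["fg3m_home"] / toy["fg3a_home"]

# Select percentage columns by the "_pct_" substring in their name.
pct_columns = [col for col in toy.columns if "_pct_" in col]
print(pct_columns)
print(toy)
```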

Note that fields like `'wl_home'`/`'wl_away'` and `'plus_minus_home'`/`'plus_minus_away'` determine each other, so we only need to keep the home version of each pair.
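
A quick way to convince ourselves that these pairs mirror each other is sketched below on made-up rows. The `flip` mapping is a helper introduced for illustration; on the real data the same two checks could be run against `game_df`.

```python
import pandas as pd

# Toy rows: the column names follow game.csv, the values are made up.
toy = pd.DataFrame(
    {
        "wl_home": ["L", "W", "W"],
        "wl_away": ["W", "L", "L"],
        "plus_minus_home": [-2, 5, 6],
        "plus_minus_away": [2, -5, -6],
    }
)

# A home win is an away loss and vice versa.
flip = {"W": "L", "L": "W"}
wl_mirrored = bool((toy["wl_home"].map(flip) == toy["wl_away"]).all())

# The away margin is the negative of the home margin.
pm_mirrored = bool((toy["plus_minus_home"] == -toy["plus_minus_away"]).all())

print(wl_mirrored, pm_mirrored)
```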

In [3]:
game_drop_features = [
    "team_abbreviation_home",
    "team_abbreviation_away",
    "video_available_home",
    "video_available_away",
    "matchup_home",
    "matchup_away",
    "wl_away",
    "plus_minus_away",
]
pct_columns = [col for col in game_df.columns if "_pct_" in col]
game_drop_features.extend(pct_columns)

We will drop these features later in the notebook. We may still want to look at the data they contain in order to measure the quality of the data.

---

## Section 2: Exploring the rows of `game.csv`
Consider the column `season_type`. It has the following unique values.

In [5]:
game_df["season_type"].unique()

array(['Regular Season', 'Playoffs', 'All-Star', 'All Star', 'Pre Season'],
      dtype=object)

Note that All-Star games appear under two labels, `'All-Star'` and `'All Star'`. These games are out of place and don't represent a standard competition. Furthermore, the respective teams only play one game a season. Statistically, the number of games is negligible when compared to the roughly 1230 games played during a regular season, so we remove them.

In [4]:
all_star_games = game_df.query('season_type in ["All-Star", "All Star"]')
print(all_star_games.shape[0])
game_df = game_df.query('season_type != "All-Star"')
game_df = game_df.query('season_type != "All Star"')

128
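
As an alternative to chaining two `query` calls, both spellings can be handled with a single boolean mask. This is a sketch on a made-up frame; `all_star_labels` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with the season_type labels seen in the dataset.
toy = pd.DataFrame(
    {
        "season_type": [
            "Regular Season",
            "All-Star",
            "Playoffs",
            "All Star",
            "Pre Season",
        ]
    },
    index=pd.Index([1, 2, 3, 4, 5], name="game_id"),
)

# Both spellings of the All-Star label are matched at once.
all_star_labels = ["All-Star", "All Star"]
is_all_star = toy["season_type"].isin(all_star_labels)
print(is_all_star.sum())

# Keep every row that is not an All-Star game.
filtered = toy[~is_all_star]
print(filtered["season_type"].tolist())
```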


There are also preseason games. These are sometimes played with teams outside of the league. Many veterans don't take the games seriously. So we remove these games from the dataset as well.

In [5]:
preseason = game_df.query('season_type == "Pre Season"')
print(preseason.shape[0])
preseason.team_name_away.unique()
game_df = game_df.query('season_type != "Pre Season"')

1536


We are primarily interested in Regular Season games, but we will leave the playoff games in for the time being.

Let's look back at the head of `game_df`.

In [6]:
game_df.head()

game_id,season_id,team_id_home,team_abbreviation_home,team_name_home,game_date,matchup_home,wl_home,min,fgm_home,fga_home,...,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
24600001,21946,1610610035,HUS,Toronto Huskies,1946-11-01 00:00:00,HUS vs. NYK,L,0,25.0,,...,,,,,,,68.0,2,0,Regular Season
24600003,21946,1610610034,BOM,St. Louis Bombers,1946-11-02 00:00:00,BOM vs. PIT,W,0,20.0,59.0,...,,,,,,25.0,51.0,-5,0,Regular Season
24600002,21946,1610610032,PRO,Providence Steamrollers,1946-11-02 00:00:00,PRO vs. BOS,W,0,21.0,,...,,,,,,,53.0,-6,0,Regular Season
24600004,21946,1610610025,CHS,Chicago Stags,1946-11-02 00:00:00,CHS vs. NYK,W,0,21.0,,...,,,,,,22.0,47.0,-16,0,Regular Season
24600005,21946,1610610028,DEF,Detroit Falcons,1946-11-02 00:00:00,DEF vs. WAS,L,0,10.0,,...,,,,,,,50.0,17,0,Regular Season



---

### Finding a good cut off

A lot of basic statistics, such as rebounds, are missing. Recall that the 3-point line wasn't introduced until 1979. The game of basketball has changed over the years, but it probably changed most dramatically when the NBA absorbed the ABA. Therefore, 1976 is a good initial cutoff point.

In [9]:
from src.utils import add_cat_date

game_df = add_cat_date(game_df, "game_date")
print(game_df.shape)
game_df_after_1976 = game_df.query("year >= 1976")
print(game_df_after_1976.shape)
game_df_after_1976.head()

(64034, 57)
(52875, 57)


game_id,season_id,team_id_home,team_abbreviation_home,team_name_home,game_date,matchup_home,wl_home,min,fgm_home,fga_home,...,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type,day,month,year
47503001,41975,1610612739,CLE,Cleveland Cavaliers,1976-04-13 00:00:00,CLE vs. WAS,L,240,42.0,,...,,,25.0,100.0,5,0,Playoffs,13,4,1976
47500241,41975,1610612760,SEA,Seattle SuperSonics,1976-04-13 00:00:00,SEA vs. PHX,W,240,39.0,,...,,,34.0,99.0,-3,0,Playoffs,13,4,1976
47500121,41975,1610612749,MIL,Milwaukee Bucks,1976-04-13 00:00:00,MIL vs. DET,W,240,39.0,,...,,,29.0,107.0,-3,0,Playoffs,13,4,1976
47500122,41975,1610612765,DET,Detroit Pistons,1976-04-15 00:00:00,DET vs. MIL,W,240,50.0,,...,,,31.0,123.0,-3,0,Playoffs,15,4,1976
47500222,41975,1610612764,WAS,Washington Bullets,1976-04-15 00:00:00,WAS vs. CLE,L,240,28.0,,...,,,29.0,80.0,1,0,Playoffs,15,4,1976


There are still many missing values, as mentioned above. We would like to remove seasons to minimize the amount of missing data. To figure out an appropriate range, we will look at how the amount of missing data changes over time. We expect the amount of missing data to decrease as the NBA got better at keeping records.

Notice that there is a pattern in the `season_id` column. If it is a regular season game in the 1948-1949 season, then the `season_id` is 21948. If it is a playoff game from that season, then the `season_id` is 41948. This also means that the `'season_type'` column is redundant, so we will add it to the list of columns to be removed.
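
The pattern can be sketched as a small decoder. `decode_season_id` and `SEASON_TYPE_BY_PREFIX` are hypothetical names, not part of `src`. Only the prefixes 2 and 4 are described above, so any other prefix is reported as unknown rather than guessed.

```python
# Prefix meanings taken from the pattern described above.
SEASON_TYPE_BY_PREFIX = {"2": "Regular Season", "4": "Playoffs"}


def decode_season_id(season_id):
    """Split a season_id such as 21948 into (season type, starting year)."""
    text = str(season_id)
    prefix, start_year = text[0], int(text[1:])
    # Prefixes other than 2 and 4 are not covered here.
    season_type = SEASON_TYPE_BY_PREFIX.get(prefix, "Unknown")
    return season_type, start_year


print(decode_season_id(21948))
print(decode_season_id(41948))
```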

In [10]:
from src.inspection_tools import season_data

# Store the result under a new name so the imported function is not overwritten.
season_summary = season_data(game_df)
print(season_summary.head(10))
print(season_summary.iloc[-5:])

          season_id         years        game_types
game_id                                            
24600001      21946  [1946, 1947]  [Regular Season]
44601021      41946        [1947]        [Playoffs]
24700001      21947  [1947, 1948]  [Regular Season]
44700211      41947        [1948]        [Playoffs]
24800001      21948  [1948, 1949]  [Regular Season]
44800221      41948        [1949]        [Playoffs]
24900003      21949  [1949, 1950]  [Regular Season]
44900251      41949        [1950]        [Playoffs]
25000003      21950  [1950, 1951]  [Regular Season]
45000241      41950        [1951]        [Playoffs]
          season_id         years        game_types
game_id                                            
42000171      42020        [2021]        [Playoffs]
22100001      22021  [2021, 2022]  [Regular Season]
42100151      42021        [2022]        [Playoffs]
22200001      22022  [2022, 2023]  [Regular Season]
42200161      42022        [2023]        [Playoffs]


We want to see how much each season contributes to the amount of missing data. We will make a dataframe consisting only of the number of missing values for each statistic. We then remove all of the 'good columns,' those where no data is missing.

In [11]:
# Count the missing values in each column, grouped by season.
missing_season_data = game_df.isna().groupby(game_df["season_id"]).sum()
print(missing_season_data.head(10))

good_cols = []
for col in missing_season_data.columns:
    unique_vals = missing_season_data[col].unique()
    # print(col, len(unique_vals))
    if len(unique_vals) == 1 and 0 in unique_vals:
        if col not in good_cols:
            good_cols.append(col)
    elif "_pct_" in col:
        good_cols.append(col)

print(good_cols)
missing_season_data.drop(labels=good_cols, inplace=True, axis=1)

           season_id  team_id_home  team_abbreviation_home  team_name_home  \
season_id                                                                    
21946              0             0                       0               0   
21947              0             0                       0               0   
21948              0             0                       0               0   
21949              0             0                       0               0   
21950              0             0                       0               0   
21951              0             0                       0               0   
21952              0             0                       0               0   
21953              0             0                       0               0   
21954              0             0                       0               0   
21955              0             0                       0               0   

           game_date  matchup_home  wl_home  min  fgm_home  fga
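
The 'good column' selection can also be written without an explicit loop. The sketch below uses a made-up table of per-season missing counts and is an alternative formulation, not the notebook's own code. It marks a column as good if it is never missing or if its name contains `_pct_`.

```python
import pandas as pd

# Toy per-season missing counts: the column names follow game.csv.
missing = pd.DataFrame(
    {
        "pts_home": [0, 0, 0],
        "fga_home": [25, 0, 0],
        "fg_pct_home": [25, 0, 0],
    },
    index=pd.Index([21982, 21983, 21984], name="season_id"),
)

# True for columns that have zero missing values in every season.
never_missing = (missing == 0).all().to_numpy()

# True for percentage columns, which are dropped regardless.
is_pct = missing.columns.str.contains("_pct_", regex=False)

good_cols = missing.columns[never_missing | is_pct]
remaining = missing.drop(columns=good_cols)
print(good_cols.tolist())
print(remaining.columns.tolist())
```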

We looked at different ranges of years to see if there was a steady decline in the number of missing values per season. We found a very dramatic drop-off at the beginning of the 1985-1986 season. This seems like a good place to cut things off.

In [13]:
total_missing = missing_season_data.sum(axis=1)
print(total_missing[30:36])
missing_season_data[30:36]

season_id
21982    10960
21983    10804
21984    10746
21985        0
21986        0
21987        0
dtype: int64


season_id,fgm_home,fga_home,fg3m_home,fg3a_home,ftm_home,fta_home,oreb_home,dreb_home,reb_home,ast_home,...,ftm_away,fta_away,oreb_away,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away
21982,0,25,0,862,0,0,927,943,23,29,...,0,0,927,943,23,26,919,862,867,24
21983,0,0,0,862,0,0,920,936,19,21,...,0,0,920,936,17,21,908,862,865,10
21984,0,0,0,864,0,0,934,942,7,4,...,0,0,934,942,5,4,890,864,867,2
21985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
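
Instead of scanning windows by eye, the cutoff can be located programmatically. The sketch below uses the six totals printed above and finds the first season from which no later season has any missing values. The reverse cumulative maximum is an approach chosen here for illustration.

```python
import pandas as pd

# Total missing values per regular season, copied from the output above.
total_missing = pd.Series(
    [10960, 10804, 10746, 0, 0, 0],
    index=pd.Index([21982, 21983, 21984, 21985, 21986, 21987], name="season_id"),
)

# For each season, the largest missing count in that season or any later one.
missing_from_here_on = total_missing.iloc[::-1].cummax().iloc[::-1]

# The first season whose "from here on" maximum is zero is the cutoff.
clean_seasons = missing_from_here_on[missing_from_here_on == 0]
first_clean_season = clean_seasons.index.min()
print(first_clean_season)
```

Using the maximum over all later seasons guards against picking an isolated clean season that is followed by seasons with gaps.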


We will now check that something similar is true for the playoff games, that is, that the high-quality record keeping is also present there.

In [14]:
play_offs_df = game_df.query('season_type == "Playoffs"')
missing_play_off_data = play_offs_df.isna().groupby(play_offs_df["season_id"]).sum()
missing_play_off_data.head(10)

good_cols = []
for col in missing_play_off_data.columns:
    unique_vals = missing_play_off_data[col].unique()
    if len(unique_vals) == 1 and 0 in unique_vals:
        if col not in good_cols:
            good_cols.append(col)
    elif any(string in col for string in ["_pct_", "fgm_", "ftm_"]):
        good_cols.append(col)

print(good_cols)
missing_play_off_data.drop(labels=good_cols, inplace=True, axis=1)

['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home', 'fg_pct_home', 'fg3_pct_home', 'ftm_home', 'ft_pct_home', 'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away', 'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away', 'fgm_away', 'fg_pct_away', 'fg3_pct_away', 'ftm_away', 'ft_pct_away', 'pts_away', 'plus_minus_away', 'video_available_away', 'season_type', 'day', 'month', 'year', 'years']


In [16]:
total_missing_play_offs = missing_play_off_data.sum(axis=1)
print(total_missing_play_offs[30:36])
missing_play_off_data[30:36]

season_id
41981    502
41982    292
41983    817
41984      0
41985      0
41986      0
dtype: int64


season_id,fga_home,fg3m_home,fg3a_home,fta_home,oreb_home,dreb_home,reb_home,ast_home,stl_home,blk_home,...,fg3a_away,fta_away,oreb_away,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away
41981,20,0,24,0,36,40,20,21,36,26,...,24,0,36,40,20,21,36,26,25,3
41982,0,0,22,0,25,26,2,2,24,22,...,22,0,25,26,2,2,24,22,23,0
41983,0,0,66,0,71,71,0,1,68,66,...,66,0,71,71,0,0,68,66,66,0
41984,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In fact, it seems that the improvement in record keeping began during the 1984-1985 playoffs. We will still exclude those postseason games as they are "missing the context" of the season preceding them.

---

### Removing games before 1985

Now we remove all games whose `'season_id'` ends in a year earlier than 1985.

In [15]:
from src.inspection_tools import cutoff_year

game_df["after_1985"] = game_df.apply(
    lambda x: cutoff_year(x["season_id"], 1985), axis=1
)
game_df_after_1985 = game_df.query("after_1985 == True")

game_df_after_1985.head()

game_id,season_id,team_id_home,team_abbreviation_home,team_name_home,game_date,matchup_home,wl_home,min,fgm_home,fga_home,...,pf_away,pts_away,plus_minus_away,video_available_away,season_type,day,month,year,years,after_1985
28500005,21985,1610612737,ATL,Atlanta Hawks,1985-10-25 00:00:00,ATL vs. WAS,L,240,41.0,92.0,...,19.0,100.0,9,0,Regular Season,25,10,1985,"[1985, 1986]",True
28500006,21985,1610612758,SAC,Sacramento Kings,1985-10-25 00:00:00,SAC vs. LAC,L,240,39.0,88.0,...,32.0,108.0,4,0,Regular Season,25,10,1985,"[1985, 1986]",True
28500010,21985,1610612765,DET,Detroit Pistons,1985-10-25 00:00:00,DET vs. MIL,W,240,39.0,88.0,...,32.0,116.0,-2,0,Regular Season,25,10,1985,"[1985, 1986]",True
28500011,21985,1610612762,UTH,Utah Jazz,1985-10-25 00:00:00,UTH vs. HOU,L,240,42.0,82.0,...,28.0,112.0,4,0,Regular Season,25,10,1985,"[1985, 1986]",True
28500008,21985,1610612744,GOS,Golden State Warriors,1985-10-25 00:00:00,GOS vs. DEN,L,240,36.0,91.0,...,40.0,119.0,14,0,Regular Season,25,10,1985,"[1985, 1986]",True
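
The implementation of `cutoff_year` lives in `src.inspection_tools` and is not shown here. A vectorized filter with the same intent might look like the sketch below, which assumes that the last four digits of `season_id` are the season's starting year. It is a hypothetical alternative, not the project's function.

```python
import pandas as pd

# Toy season ids around the cutoff: game_id values are made up.
toy = pd.DataFrame(
    {"season_id": [21984, 41984, 21985, 41985, 22022]},
    index=pd.Index([1, 2, 3, 4, 5], name="game_id"),
)

# Assumption: season_id is a one-digit type prefix followed by the starting year.
start_year = toy["season_id"] % 10000

# Keep regular season and playoff games from the 1985-1986 season onwards.
toy_after_1985 = toy[start_year >= 1985]
print(toy_after_1985["season_id"].tolist())
```

Operating on the whole column avoids the row-by-row `apply` and the temporary `after_1985` column.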


Let's briefly check to see how much missing data there is.

In [18]:
game_df_after_1985.isna().sum(axis=1).sum()

484

We will see that these missing values are not in features we are interested in.

Let's double check that our function worked properly and that we haven't missed any games from the 1985-1986 regular season or playoffs.

In [19]:
all_season_ids = game_df["season_id"].unique()
after_1985_ids = game_df_after_1985["season_id"].unique()
missing_season_ids = []
for id in all_season_ids:
    if id in after_1985_ids:
        actual = game_df.query(f"season_id == {id}")
        computed = game_df_after_1985.query(f"season_id == {id}")
        if not actual.equals(computed):
            raise ValueError(f"{id} is not equal")
    else:
        missing_season_ids.append(id)

regular_season_misses = [id for id in missing_season_ids if str(id)[0] == "2"]
play_offs_misses = [id for id in missing_season_ids if str(id)[0] == "4"]
last_regular_season = max(regular_season_misses)
last_play_offs = max(play_offs_misses)
print(last_regular_season, last_play_offs)

regular_season_1985 = game_df.query("season_id == 21985")
play_offs_1985 = game_df.query("season_id == 41985")
computed_regular_season_1985 = game_df_after_1985.query("season_id == 21985")
computed_play_offs_1985 = game_df_after_1985.query("season_id == 41985")

print(regular_season_1985.equals(computed_regular_season_1985))
print(play_offs_1985.equals(computed_play_offs_1985))

21984 41984
True
True



---

### Dropping columns

We are finally ready to remove the columns discussed above. We should also remove the helper columns we added to aid our analysis, `'after_1985'` and `'years'`. We will remove `'season_type'` since that information is contained in `'season_id'`, and `'game_date'` since the `day`, `month`, and `year` columns now carry that information. The cell below also drops `'min'`.

In [20]:
print(game_df.shape)
game_drop_features.extend(["after_1985", "years", "season_type", "game_date", "min"])
try:
    game_df_after_1985_cleaned = game_df_after_1985.drop(
        labels=game_drop_features, axis=1
    )
except Exception as e:
    print(str(e))
else:
    print("The columns have been successfully removed.")

print(game_df_after_1985_cleaned.shape)
game_df_after_1985_cleaned.isna().sum(axis=1).sum()

(64034, 59)
The columns have been successfully removed.
(44909, 40)


0
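
When dropping a long list of labels, it can help to check up front which of them are absent, since `DataFrame.drop` raises a `KeyError` for unknown labels by default. A minimal sketch on a made-up frame (the names `to_drop` and `absent` are introduced here for illustration):

```python
import pandas as pd

# Toy frame: "after_1985" is deliberately not one of its columns.
toy = pd.DataFrame(
    {"pts_home": [100], "min": [240], "season_type": ["Playoffs"]}
)
to_drop = ["min", "season_type", "after_1985"]

# Labels requested for removal that do not exist in the frame.
absent = [col for col in to_drop if col not in toy.columns]
print(absent)

# errors="ignore" skips absent labels instead of raising a KeyError.
cleaned = toy.drop(columns=to_drop, errors="ignore")
print(cleaned.columns.tolist())
```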

We have removed a lot of data. We dropped 19 columns (from 59 to 40), which contained irrelevant, redundant, or helper data. Despite getting rid of almost 40 years' worth of games, we still have the data from almost 45 thousand games (about 2/3 of the original count). There are also no missing values to impute. Some might argue that we were too hasty, but in the 1984-1985 regular season there were 10746 missing values and only 943 games played, an average of more than 11 missing values per game. All of these missing values are in columns that we keep, as the following check shows.

In [21]:
print(game_df.query("season_id == 21984").shape)
print(game_df.query("season_id == 21984").isna().sum(axis=1).sum())
for col in game_drop_features:
    game_df.drop(labels=col, axis=1, inplace=True)
print(game_df.query("season_id == 21984").isna().sum(axis=1).sum())

(943, 59)
12489
10746
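
The figures quoted above can be reproduced with a few lines of arithmetic. The numbers below are copied from the outputs in this notebook; the variable names are introduced here for illustration.

```python
# Row counts taken from the shapes printed earlier in the notebook.
games_total = 65698
games_kept = 44909

# Missing values and game count for season_id 21984, from the cell above.
missing_1984 = 10746
games_1984 = 943

# Share of the original games that survive the cleaning.
kept_fraction = games_kept / games_total

# Average number of missing values per game in the 1984-1985 regular season.
avg_missing_per_game = missing_1984 / games_1984

print(round(kept_fraction, 3), round(avg_missing_per_game, 2))
```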


---

### Save cleaned game data

Finally, we save our cleaned game statistics.

In [22]:
from src.utils import save_df


save_df(game_df_after_1985_cleaned, "game_data_clean", "datasets/clean/csv")

---

## Conclusions

We found that a good cutoff year was 1985. The ABA-NBA merger in 1976 had a large impact on the NBA, both in terms of talent and style of play, and the NBA introduced the 3-point line in 1979. Either of these would be a reasonable cutoff. However, in 1985 the record keeping improved dramatically. This cutoff also leaves us with a lot of data to work with and no missing data to impute.


## Next Steps

In the next notebook, we will do some exploratory data analysis to see if there are strong relationships between any of the above features.
