<h1>Get Full Datasets</h1>

In [1]:
import numpy as np
import pandas as pd

In this notebook, I join the award data to the player season data and then split the result into one dataset per award. I also add a few new features, such as the percentage of games started and the season number.
<br><br>
This notebook should be run after you have successfully scraped the data using the 'scrape_data' notebook.

<h2>Load Data</h2>

In [2]:
award_data = pd.read_csv('data/award_data.csv')
award_data.head()

Unnamed: 0,player,season,award,first_place_votes,award_pts_won,award_pts_max
0,Bob Pettit,1956,mvp,33.0,33.0,80
1,Paul Arizin,1956,mvp,21.0,21.0,80
2,Bob Cousy,1956,mvp,11.0,11.0,80
3,Mel Hutchins,1956,mvp,9.0,9.0,80
4,Dolph Schayes,1956,mvp,2.0,2.0,80


In [3]:
player_season_data = pd.read_csv('data/player_seasons.csv')
player_season_data.tail()

Unnamed: 0,player,season,age,team,position,g,gs,mp,fg,fga,...,orb,drb,trb,ast,stl,blk,tov,pf,pts,trp_dbl
22383,Kelenna Azubuike,2008,24,GSW,SG,81,17.0,1732.0,261,587,...,107.0,217.0,324.0,75,45.0,34.0,58.0,159,692,
22384,Kelenna Azubuike,2009,25,GSW,SF,74,51.0,2375.0,392,845,...,112.0,258.0,370.0,117,57.0,50.0,95.0,183,1063,
22385,Kelenna Azubuike,2010,26,GSW,SF,9,7.0,231.0,48,88,...,12.0,29.0,41.0,10,5.0,9.0,7.0,16,125,
22386,Kelenna Azubuike,2012,28,DAL,SG,3,0.0,18.0,3,8,...,0.0,0.0,0.0,0,1.0,0.0,4.0,1,7,
22387,Udoka Azubuike,2021,21,UTA,C,12,0.0,49.0,4,7,...,4.0,9.0,13.0,0,1.0,4.0,3.0,8,12,


<h2>Extract new features from data</h2>

In [4]:
# add season number as column to data
player_season_data["season_num"] = player_season_data.groupby("player")["season"].rank(method="first", ascending=True)
# add games_started_pct feature
player_season_data["gs_pct"] = player_season_data["gs"] / player_season_data["g"]

Unnamed: 0,player,season,age,team,position,g,gs,mp,fg,fga,...,trb,ast,stl,blk,tov,pf,pts,trp_dbl,season_num,gs_pct
22383,Kelenna Azubuike,2008,24,GSW,SG,81,17.0,1732.0,261,587,...,324.0,75,45.0,34.0,58.0,159,692,,2.0,0.209877
22384,Kelenna Azubuike,2009,25,GSW,SF,74,51.0,2375.0,392,845,...,370.0,117,57.0,50.0,95.0,183,1063,,3.0,0.689189
22385,Kelenna Azubuike,2010,26,GSW,SF,9,7.0,231.0,48,88,...,41.0,10,5.0,9.0,7.0,16,125,,4.0,0.777778
22386,Kelenna Azubuike,2012,28,DAL,SG,3,0.0,18.0,3,8,...,0.0,0,1.0,0.0,4.0,1,7,,5.0,0.0
22387,Udoka Azubuike,2021,21,UTA,C,12,0.0,49.0,4,7,...,13.0,0,1.0,4.0,3.0,8,12,,1.0,0.0
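To make the two new features concrete, here is a minimal sketch on a made-up three-row frame (the player labels and numbers are illustrative and are not taken from the scraped data). It shows that `rank(method="first")` numbers each player's seasons in chronological order, so a skipped season does not leave a gap in `season_num`, and that `gs_pct` is just `gs / g`.

```python
import pandas as pd

# Toy data: hypothetical players, not taken from player_seasons.csv
toy = pd.DataFrame({
    "player": ["A", "A", "B"],
    "season": [2010, 2012, 2011],
    "g": [80, 40, 10],
    "gs": [20.0, 40.0, 0.0],
})

# Rank each player's seasons in ascending order: 1.0 marks the first season played
toy["season_num"] = toy.groupby("player")["season"].rank(method="first", ascending=True)
# Share of the games played that the player started
toy["gs_pct"] = toy["gs"] / toy["g"]

print(toy)
```

Because the grouping key is the player name, two different players who share a name would be numbered as one career. Whether that happens depends on how the scraped data labels players.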


In [6]:
award_data.set_index(['player', 'season'], inplace=True)
player_season_data.set_index(['player', 'season'], inplace=True)

# fg_pct, two_pct, efg, ft_pct - fill with 0 if NA (because there were no shots attempted - pct should be 0)
# trp dbl - if na, fill with 0
player_season_data[['fg_pct', 'two_pct', 'efg', 'ft_pct', 'trp_dbl']] = player_season_data[['fg_pct', 'two_pct', 'efg', 'ft_pct', 'trp_dbl']].fillna(0)
player_season_data.fillna('N/A', inplace=True)

<h2>Generate Award-Specific Datasets</h2>

First, I split up my award data into five datasets (one dataset for each award).

In [None]:
mvp_data = award_data[award_data['award'] == 'mvp']
dpoy_data = award_data[award_data['award'] == 'dpoy']
roy_data = award_data[award_data['award'] == 'roy']
mip_data = award_data[award_data['award'] == 'mip']
smoy_data = award_data[award_data['award'] == 'smoy']

Then, I join my player season data to each award's data in order to get my award-specific dataset.

In [8]:
mvp_data = player_season_data.join(mvp_data, on=['player', 'season'])
dpoy_data = player_season_data.join(dpoy_data, on=['player', 'season'])
roy_data = player_season_data.join(roy_data, on=['player', 'season'])
mip_data = player_season_data.join(mip_data, on=['player', 'season'])
smoy_data = player_season_data.join(smoy_data, on=['player', 'season'])

In [9]:
mvp_data = mvp_data.reset_index()
dpoy_data = dpoy_data.reset_index()
roy_data = roy_data.reset_index()
mip_data = mip_data.reset_index()
smoy_data = smoy_data.reset_index()
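The join keeps every player season, including seasons in which the player received no votes. A minimal sketch on toy frames (hypothetical players and values) shows the effect: `DataFrame.join` is a left join by default, so seasons without a matching award row get `NaN` in the award columns.

```python
import pandas as pd

# Toy player seasons, indexed the same way as player_season_data
seasons = pd.DataFrame({
    "player": ["A", "A", "B"],
    "season": [2010, 2011, 2010],
    "pts": [900, 1200, 300],
}).set_index(["player", "season"])

# Toy award votes: only one player season received votes
votes = pd.DataFrame({
    "player": ["A"],
    "season": [2011],
    "award_pts_won": [50.0],
}).set_index(["player", "season"])

# Left join on the shared (player, season) index, then restore the index as columns
joined = seasons.join(votes).reset_index()
print(joined)
```

Those `NaN` values are why the vote columns need to be filled with 0 later on.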

<h2>Filtering each dataset and adding a few more features</h2>

First, I drop player seasons from each award dataset that occurred before that award was established or after 2020, which was the last full season of data I had when I built these models.

In [None]:
mvp_data = mvp_data[mvp_data['season'] >= 1956]
dpoy_data = dpoy_data[dpoy_data['season'] >= 1983]
roy_data = roy_data[roy_data['season'] >= 1964]
mip_data = mip_data[mip_data['season'] >= 1986]
smoy_data = smoy_data[smoy_data['season'] >= 1984]

mvp_data = mvp_data[mvp_data['season'] <= 2020]
dpoy_data = dpoy_data[dpoy_data['season'] <= 2020]
roy_data = roy_data[roy_data['season'] <= 2020]
mip_data = mip_data[mip_data['season'] <= 2020]
smoy_data = smoy_data[smoy_data['season'] <= 2020]
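The same filter can be expressed once instead of ten times. The sketch below is one possible way to do it; `FIRST_SEASON`, `LAST_SEASON`, and `filter_seasons` are hypothetical names introduced for illustration, with the cutoff years copied from the cell above.

```python
import pandas as pd

# First season kept for each award (same cutoffs as the filters above)
FIRST_SEASON = {"mvp": 1956, "dpoy": 1983, "roy": 1964, "mip": 1986, "smoy": 1984}
LAST_SEASON = 2020

def filter_seasons(df, award):
    """Keep rows whose season lies in [FIRST_SEASON[award], LAST_SEASON], inclusive."""
    return df[df["season"].between(FIRST_SEASON[award], LAST_SEASON)]

# Toy check with made-up seasons
toy = pd.DataFrame({"season": [1980, 1983, 2020, 2021]})
kept = filter_seasons(toy, "dpoy")["season"].tolist()
print(kept)
```

`Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` comparisons used above.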

In [11]:
mvp_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
dpoy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
roy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
mip_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
smoy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)

Then, I calculate the maximum award points that a player could have received in a given season and add this feature to every row of the award datasets. This number varies because the number of voters changes from year to year: it ranges from under 100 in the early years, when few people voted on each award, to over 1,000 in recent years, when many media members vote. It is important to include this feature because it strongly affects how many award points the leading vote-getters receive.

In [12]:
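def get_yearly_max_pts(award_data):
    """Return a dict mapping each season to the largest award_pts_max seen in it."""
    # start every season at 0 so the comparison below always has a value
    max_pts_by_yr = {}
    for yr in range(1950, 2100):
        max_pts_by_yr[yr] = 0

    # keep the largest award_pts_max found among each season's rows
    for index, row in award_data.iterrows():
        year = row['season']
        award_pts_max = row['award_pts_max']
        if max_pts_by_yr[year] < award_pts_max:
            max_pts_by_yr[year] = award_pts_max

    return max_pts_by_yr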

mvp_max_pts_by_yr = get_yearly_max_pts(mvp_data)
dpoy_max_pts_by_yr = get_yearly_max_pts(dpoy_data)
roy_max_pts_by_yr = get_yearly_max_pts(roy_data)
mip_max_pts_by_yr = get_yearly_max_pts(mip_data)
smoy_max_pts_by_yr = get_yearly_max_pts(smoy_data)

mvp_data['award_pts_max'] = mvp_data['season'].map(mvp_max_pts_by_yr)
dpoy_data['award_pts_max'] = dpoy_data['season'].map(dpoy_max_pts_by_yr)
roy_data['award_pts_max'] = roy_data['season'].map(roy_max_pts_by_yr)
mip_data['award_pts_max'] = mip_data['season'].map(mip_max_pts_by_yr)
smoy_data['award_pts_max'] = smoy_data['season'].map(smoy_max_pts_by_yr)
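For reference, the same per-season maximum can also be computed without an explicit loop. This is a sketch of an alternative, not the code used in this notebook, and it runs on a toy frame with made-up values: `groupby(...).transform("max")` broadcasts each season's largest `award_pts_max` back onto every row of that season.

```python
import pandas as pd

# Toy data: rows with 0 stand for players who received no votes
toy = pd.DataFrame({
    "season": [1956, 1956, 1956, 2019, 2019],
    "award_pts_max": [80.0, 80.0, 0.0, 1010.0, 0.0],
})

# Replace each row's value with the largest value seen in its season
toy["award_pts_max"] = toy.groupby("season")["award_pts_max"].transform("max")
print(toy)
```

On a frame with tens of thousands of rows, this is usually much faster than iterating with `iterrows`.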

Next, I filter out non-rookies from my rookie of the year data.

In [13]:
# filter out non-rookies for roy data
roy_data = roy_data[roy_data['season_num'] == 1]

Then, in the next two cells, I add a net-change column for each stat in the Most Improved Player dataset. These features matter for predicting the award because they show how much a player improved in each statistic compared to their previous season.

In [14]:
# add net change compared to previous season for mip data
stat_cols = ['g', 'gs', 'mp', 'fg', 'fga', 'fg_pct', 'three_p', 'three_pa',
                   'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft', 'fta',
                   'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov',
                   'pf', 'pts', 'gs_pct']
net_change_cols = [f'net_{stat}' for stat in stat_cols]
# initialize every net-change column to 0; the loop in the next cell fills in the real values
mip_data = mip_data.assign(**{col: 0.0 for col in net_change_cols})
mip_data = mip_data.sort_values(by=['player', 'season_num'])
mip_data = mip_data.reset_index()

In [15]:
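# start at 1 because the first row has no previous row to compare against
for row_num in range(1, len(mip_data)):
    cur_row = mip_data.iloc[row_num]
    prev_row = mip_data.iloc[row_num - 1]
    # the season filter above can drop a player's early seasons, so also check
    # that the previous row belongs to the same player before comparing
    if cur_row['season_num'] > 1 and prev_row['player'] == cur_row['player']:
        for stat in stat_cols:
            cur_yr_stat = cur_row[stat]
            prev_yr_stat = prev_row[stat]
            # missing stats were filled with 'N/A' earlier; treat them as no change
            if cur_yr_stat == 'N/A' or prev_yr_stat == 'N/A':
                stat_change = 0
            else:
                stat_change = cur_yr_stat - prev_yr_stat
            cur_row[f'net_{stat}'] = stat_change
        mip_data.iloc[row_num] = cur_row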

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
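The row-by-row loop above can also be written as a grouped difference. The following is a sketch of that alternative on toy data (hypothetical players and values), assuming missing stats are stored as the string 'N/A' as in this notebook. It converts the stats to numbers, takes `diff()` within each player, and treats any missing difference as 0.

```python
import pandas as pd

# Toy data: player B's only remaining season is season_num 4
toy = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "season_num": [1.0, 2.0, 3.0, 4.0],
    "pts": [100, 250, 200, 500],
    "stl": ["N/A", 30.0, 45.0, 10.0],
})
stat_cols = ["pts", "stl"]

toy = toy.sort_values(["player", "season_num"]).reset_index(drop=True)
# Turn the 'N/A' placeholders back into NaN so subtraction works
numeric = toy[stat_cols].apply(pd.to_numeric, errors="coerce")
# Difference from the previous row of the same player; NaN becomes 0 (no change)
net = numeric.groupby(toy["player"]).diff().fillna(0)
toy = toy.join(net.add_prefix("net_"))
print(toy)
```

Because `diff()` runs inside each player group, a player's first remaining row always gets 0 and is never compared against a different player's row.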
  


I also need to filter out other players who are ineligible for specific awards. I filter out rookies from the Most Improved Player data, and I filter out starters (players who started at least 50% of the games they played in) from the Sixth Man of the Year data.

In [16]:
# filter out rookies for mip data
mip_data = mip_data[mip_data['season_num'] != 1]
# filter out starters from smoy data
smoy_data = smoy_data[smoy_data['gs_pct'] < 0.5]
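One caveat with the starter filter: missing values were replaced with the string 'N/A' earlier in this notebook, and comparing a column that mixes strings and floats against `0.5` would typically raise a `TypeError`. Whether any 'N/A' values remain in `gs_pct` after the season filter depends on the scraped data. The sketch below, on toy data with hypothetical players, shows a defensive version that converts the column to numbers first, so that rows with an unknown `gs_pct` are dropped rather than causing an error.

```python
import pandas as pd

# Toy data: player C has no recorded games-started percentage
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "gs_pct": [0.2, 0.9, "N/A"],
})

# 'N/A' becomes NaN, and NaN < 0.5 evaluates to False, so that row is excluded
gs_pct = pd.to_numeric(toy["gs_pct"], errors="coerce")
bench = toy[gs_pct < 0.5]
print(bench)
```

Dropping rows with an unknown `gs_pct` is a design choice in this sketch; keeping them would require a different rule for deciding who counts as a bench player.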

<h2>Save data to csv files</h2>
Finally, let's save the data to csv files that will be used by the next notebook ('Preprocessing').

In [17]:
mvp_data.to_csv('data/mvp_data.csv')
dpoy_data.to_csv('data/dpoy_data.csv')
roy_data.to_csv('data/roy_data.csv')
mip_data.to_csv('data/mip_data.csv')
smoy_data.to_csv('data/smoy_data.csv')