# Who will be the next NBA MVP?
### Modeling MVP vote share using player and team stats 

Of all the awards in the NBA, the MVP award is the most coveted. Being the "Most Valuable Player" is a sign of a player's standing in the league; this distinction often goes to the best player, and almost always goes to a top 5 player. While "most valuable" can mean different things to voters, coaches, and players, there are some distinct commonalities between all players in the history of the award. 

The goal of this analysis is to use machine learning to predict the MVP vote share of a player given their season stats. The model will be tested on a random subset of all seasons, as well as the current 2020 season.

### Table of Contents
* [Data Collection](#datacollection)
    * [Importing](#import)
    * [Cleaning](#clean)
* [Creating the Pipeline](#pipeline)
    * [Team Stats](#teamstats)
    * [Adding MVP Share](#mvpshare)
* [Training and Testing](#ttsplit)
    * [Building the Design Matrix](#designmatrix)
    * [Training-Test Split](#split)
    * [2019-2020 Season Evaluation Set](#2020stats)
* [Cross-Validation](#cv)
* [Conclusion](#conclusion)

<a id='datacollection'></a>
# Data Collection

The first step is to collect two main categories of data: the MVP vote shares by season for each player who received a vote, and each player's season stats. All of this data comes from www.basketball-reference.com, a free-to-use platform that compiles a wide range of NBA data, including award voting results, play-by-play data, and most importantly, team and player stats per season. 

I was able to find a spreadsheet compiled by a user of the website that contains MVP voting share by season and player for all seasons. This greatly reduced the time needed to collect data, as there is no automated way to do this from the BballRef website. The source for this spreadsheet is linked in [this post](https://www.reddit.com/r/nba/comments/cqvjsi/octhe_10_players_with_the_highest_mvp_voting/).

The features of the model will be player stats, which will be obtained through a web scraping package made for BballRef. The package's documentation is linked here: [basketball_reference_scraper](https://github.com/vishaalagartha/basketball_reference_scraper/blob/master/API.md).

Below, the necessary Python packages are imported.

In [1]:
#%pip install --upgrade pip

# data wrangling packages
import pandas as pd
import numpy as np

# web scraper
#%pip install basketball-reference-scraper
from basketball_reference_scraper.players import get_stats, get_game_logs, get_player_headshot
from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc

# data visualization packages
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# data modeling packages
from sklearn import linear_model as lm
from sklearn.model_selection import train_test_split, KFold

<a id='import'></a>
## Importing

The MVP share dataset is imported below as a pandas DataFrame. Additionally, the web scraper is tested.

In [2]:
mvp_share = pd.read_csv('mvp_share_by_season.csv')
mvp_share.head()

Unnamed: 0,Player,Total,1956,1957,1958,1959,1960,1961,1962,1963,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Bob Pettit,3.419,0.413,0.263,0.243,0.773,0.268,0.329,0.073,0.166,...,,,,,,,,,,
1,Paul Arizin,0.494,0.263,0.113,0.02,0.095,,0.003,,,...,,,,,,,,,,
2,Bob Cousy,0.873,0.138,0.288,0.045,0.173,0.167,0.045,0.007,0.01,...,,,,,,,,,,
3,Mel Hutchins,0.126,0.113,0.013,,,,,,,...,,,,,,,,,,
4,Dolph Schayes,0.73,0.025,0.1,0.495,0.063,0.015,0.032,,,...,,,,,,,,,,


In [3]:
kd_stats = get_stats('Kevin Durant', stat_type='PER_GAME', playoffs=False, career=False)
kd_stats

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2007-08,19.0,SEA,NBA,SG,80.0,80.0,34.6,7.3,17.1,...,0.873,0.9,3.5,4.4,2.4,1.0,0.9,2.9,1.5,20.3
1,2008-09,20.0,OKC,NBA,SF,74.0,74.0,39.0,8.9,18.8,...,0.863,1.0,5.5,6.5,2.8,1.3,0.7,3.0,1.8,25.3
2,2009-10,21.0,OKC,NBA,SF,82.0,82.0,39.5,9.7,20.3,...,0.9,1.3,6.3,7.6,2.8,1.4,1.0,3.3,2.1,30.1
3,2010-11,22.0,OKC,NBA,SF,78.0,78.0,38.9,9.1,19.7,...,0.88,0.7,6.1,6.8,2.7,1.1,1.0,2.8,2.0,27.7
4,2011-12,23.0,OKC,NBA,SF,66.0,66.0,38.6,9.7,19.7,...,0.86,0.6,7.4,8.0,3.5,1.3,1.2,3.8,2.0,28.0
5,2012-13,24.0,OKC,NBA,SF,81.0,81.0,38.5,9.0,17.7,...,0.905,0.6,7.3,7.9,4.6,1.4,1.3,3.5,1.8,28.1
6,2013-14,25.0,OKC,NBA,SF,81.0,81.0,38.5,10.5,20.8,...,0.873,0.7,6.7,7.4,5.5,1.3,0.7,3.5,2.1,32.0
7,2014-15,26.0,OKC,NBA,SF,27.0,27.0,33.8,8.8,17.3,...,0.854,0.6,6.0,6.6,4.1,0.9,0.9,2.7,1.5,25.4
8,2015-16,27.0,OKC,NBA,SF,72.0,72.0,35.8,9.7,19.2,...,0.898,0.6,7.6,8.2,5.0,1.0,1.2,3.5,1.9,28.2
9,2016-17,28.0,GSW,NBA,PF,62.0,62.0,33.4,8.9,16.5,...,0.875,0.6,7.6,8.3,4.8,1.1,1.6,2.2,1.9,25.1


<a id='clean'></a>
## Cleaning

I'll start with some preliminary EDA to confirm that the values in the MVP share DataFrame are in fact all numbers.

In [4]:
mvp_share = mvp_share.set_index('Player')
mvp_share.head()

Player,Total,1956,1957,1958,1959,1960,1961,1962,1963,1964,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Bob Pettit,3.419,0.413,0.263,0.243,0.773,0.268,0.329,0.073,0.166,0.144,...,,,,,,,,,,
Paul Arizin,0.494,0.263,0.113,0.02,0.095,,0.003,,,,...,,,,,,,,,,
Bob Cousy,0.873,0.138,0.288,0.045,0.173,0.167,0.045,0.007,0.01,,...,,,,,,,,,,
Mel Hutchins,0.126,0.113,0.013,,,,,,,,...,,,,,,,,,,
Dolph Schayes,0.73,0.025,0.1,0.495,0.063,0.015,0.032,,,,...,,,,,,,,,,


The index is changed to player names to make indexing easier.

In [5]:
print(mvp_share.loc['Kevin Durant', 'Total'], type(mvp_share.loc['Kevin Durant', 'Total']))
print(mvp_share.loc['Kevin Durant', '2014'], type(mvp_share.loc['Kevin Durant', '2014']))

3.2089999999999996 <class 'numpy.float64'>
0.986 <class 'numpy.float64'>


It looks like there are mostly NaN values in the DataFrame. A NaN here means the player received no votes that season, so to make comparisons easier later, we'll fill these values with 0.

In [6]:
mvp_share = mvp_share.fillna(0)
mvp_share.head()

Player,Total,1956,1957,1958,1959,1960,1961,1962,1963,1964,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Bob Pettit,3.419,0.413,0.263,0.243,0.773,0.268,0.329,0.073,0.166,0.144,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Paul Arizin,0.494,0.263,0.113,0.02,0.095,0.0,0.003,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bob Cousy,0.873,0.138,0.288,0.045,0.173,0.167,0.045,0.007,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mel Hutchins,0.126,0.113,0.013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dolph Schayes,0.73,0.025,0.1,0.495,0.063,0.015,0.032,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Additionally, I'll drop all columns for the years 1956 through 1980, for two reasons. First, the MVP award was voted on by players until the 1980-1981 season, at which point voting moved to a panel of sportswriters. Second, the three-point line was added in the 1979-1980 season. This means I'll also have to drop all players who did not receive any MVP votes in the 1980-1981 season or later. After looking at the original data, this is all players up to and including Billy Cunningham (spelled 'Billy Cuningham' in the dataset). This is confirmed in the cell directly below.

In [7]:
test = mvp_share.loc[:'Billy Cuningham', '1981':]
a = [test[i].value_counts() for i in test.columns]
a

[0.0    47
 Name: 1981, dtype: int64,
 0.0    47
 Name: 1982, dtype: int64,
 0.0    47
 Name: 1983, dtype: int64,
 0.0    47
 Name: 1984, dtype: int64,
 0.0    47
 Name: 1985, dtype: int64,
 0.0    47
 Name: 1986, dtype: int64,
 0.0    47
 Name: 1987, dtype: int64,
 0.0    47
 Name: 1988, dtype: int64,
 0.0    47
 Name: 1989, dtype: int64,
 0.0    47
 Name: 1990, dtype: int64,
 0.0    47
 Name: 1991, dtype: int64,
 0.0    47
 Name: 1992, dtype: int64,
 0.0    47
 Name: 1993, dtype: int64,
 0.0    47
 Name: 1994, dtype: int64,
 0.0    47
 Name: 1995, dtype: int64,
 0.0    47
 Name: 1996, dtype: int64,
 0.0    47
 Name: 1997, dtype: int64,
 0.0    47
 Name: 1998, dtype: int64,
 0.0    47
 Name: 1999, dtype: int64,
 0.0    47
 Name: 2000, dtype: int64,
 0.0    47
 Name: 2001, dtype: int64,
 0.0    47
 Name: 2002, dtype: int64,
 0.0    47
 Name: 2003, dtype: int64,
 0.0    47
 Name: 2004, dtype: int64,
 0.0    47
 Name: 2005, dtype: int64,
 0.0    47
 Name: 2006, dtype: int64,
 0.0    47
 

In [8]:
mvp_share = mvp_share.drop(columns=[str(i) for i in np.arange(1956, 1981)]).drop(columns=['Total'])
mvp_share['Total'] = mvp_share.apply(sum, axis=1)
mvp_share = mvp_share[mvp_share['Total'] > 0]
mvp_share

Player,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,Total
Kareem Abdul-Jabbar,0.414,0.045,0.02,0.201,0.264,0.173,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,1.117
Bob Lanier,0.006,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.006
Tiny Archibald,0.046,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.046
Jamaal Wilkes,0.028,0.001,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.029
Truck Robinson,0.009,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.009
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Jimmy Butler,0.000,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005,0.000,0.005
Joel Embiid,0.000,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004,0.049,0.053
Victor Oladipo,0.000,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002,0.000,0.002
Nikola Jokić,0.000,0.000,0.00,0.000,0.000,0.000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000,0.210,0.210
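
The filtering steps above (drop the early seasons, recompute the total, keep only players with a nonzero total) can be sketched on a toy table. The player names and vote shares here are made up purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy vote-share table (made-up numbers): one row per player, one column per season
toy = pd.DataFrame(
    {'1980': [0.5, 0.0, 0.2], '1981': [0.0, 0.3, 0.0], '1982': [0.0, 0.1, 0.0]},
    index=['Player A', 'Player B', 'Player C'],
)

# Keep only the seasons from 1981 onward
recent = toy.drop(columns=['1980'])

# Recompute each player's total over the remaining seasons
recent['Total'] = recent.sum(axis=1)

# Drop players with no vote share left after the cutoff
recent = recent[recent['Total'] > 0]
print(recent)
```

Only `Player B` survives, since the other two have no vote share after the cutoff. The same `sum(axis=1)` call could stand in for `apply(sum, axis=1)` in the cell above.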


There are a few inconsistencies between how some players' names are spelled in the dataset and how they are actually spelled. The next cell was used to find these discrepancies so they could be fixed manually. Since they have already been fixed, running it now only flags the 4 players discussed below.

In [10]:
for i in mvp_share.index.tolist():
    try:
        get_stats(i)
    except AttributeError:
        print(i, '[ERROR]')

Peja Stojaković [ERROR]
Manu Ginóbili [ERROR]
Goran Dragić [ERROR]
Nikola Jokić [ERROR]


Unfortunately, the scraper currently does not work with players that have accents in their names. For now, I am dropping these players from the dataframe (Peja Stojaković, Manu Ginóbili, Goran Dragić, Nikola Jokić) but hope to see this functionality in a later update! Additionally, George Johnson is the name of three NBA players, and the scraper was having trouble with this, so I dropped him as well.

In [14]:
mvp_share = mvp_share.drop(index=['George Johnson', 'Peja Stojaković', 'Manu Ginóbili', 'Goran Dragić', 'Nikola Jokić'])
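
One possible workaround, which I have not tested against the scraper, would be to strip the accents before looking a player up. The sketch below uses only Python's standard `unicodedata` module; `strip_accents` is a hypothetical helper, and whether `get_stats` accepts the unaccented spellings is an assumption.

```python
import unicodedata

def strip_accents(name):
    """Return `name` with accented letters replaced by their unaccented base letters."""
    # NFKD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    # Drop the combining marks and keep everything else
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

for name in ['Peja Stojaković', 'Manu Ginóbili', 'Goran Dragić', 'Nikola Jokić']:
    print(name, '->', strip_accents(name))
```

If the unaccented names did work, the original spelling would still be needed to index into `mvp_share`, so both versions of the name would have to be kept around.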

<a id='pipeline'></a>
# Creating the Pipeline

Now that I have the cleaned data and know how the scraper works, it's time to start building the data processing pipeline. This will allow me to generalize a set of functions that can work on any season/player and build the design matrix for linear regression.

<a id='teamstats'></a>
## Team Stats

The get_stats function of the web scraper is very useful for individual player stats, which will comprise the vast majority of the necessary features. However, team record plays a huge role, as only two MVPs since the 1980-1981 season have played on teams with fewer than 50 wins. To account for this, I'll create a function that is essentially a copy of get_stats, but adds a column with the team's win percentage. 

I'll also drop columns such as TEAM, LEAGUE, and GS, as they will have little to no bearing on the prediction and may lead to overfitting.
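
The win percentage itself is just wins divided by total games played. As a minimal sketch (`win_pct` is a hypothetical helper, not part of the scraper), using the 2007-08 SuperSonics' 20-62 record and the 2011-12 Thunder's 47-19 record from the lockout-shortened season:

```python
def win_pct(wins, losses):
    """Return the fraction of games won, given a team's win and loss counts."""
    return wins / (wins + losses)

# 2007-08 Seattle SuperSonics: 20 wins, 62 losses
print(round(win_pct(20, 62), 6))

# 2011-12 Oklahoma City Thunder: 47 wins, 19 losses (66-game season)
print(round(win_pct(47, 19), 6))
```

Using a percentage rather than a raw win count keeps shortened seasons like 2011-12 comparable with full 82-game seasons.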

In [15]:
kd_stats.columns

Index(['SEASON', 'AGE', 'TEAM', 'LEAGUE', 'POS', 'G', 'GS', 'MP', 'FG', 'FGA',
       'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA',
       'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [16]:
kd_stats.head()

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2007-08,19.0,SEA,NBA,SG,80.0,80.0,34.6,7.3,17.1,...,0.873,0.9,3.5,4.4,2.4,1.0,0.9,2.9,1.5,20.3
1,2008-09,20.0,OKC,NBA,SF,74.0,74.0,39.0,8.9,18.8,...,0.863,1.0,5.5,6.5,2.8,1.3,0.7,3.0,1.8,25.3
2,2009-10,21.0,OKC,NBA,SF,82.0,82.0,39.5,9.7,20.3,...,0.9,1.3,6.3,7.6,2.8,1.4,1.0,3.3,2.1,30.1
3,2010-11,22.0,OKC,NBA,SF,78.0,78.0,38.9,9.1,19.7,...,0.88,0.7,6.1,6.8,2.7,1.1,1.0,2.8,2.0,27.7
4,2011-12,23.0,OKC,NBA,SF,66.0,66.0,38.6,9.7,19.7,...,0.86,0.6,7.4,8.0,3.5,1.3,1.2,3.8,2.0,28.0


In [116]:
def get_stats_cleaned(name):
    ''' This function will be used to return player career stats with a cleaned SEASON column,
    added W/L percentage column, and removed TEAM, LEAGUE, GS, ORB, DRB, and PF columns.
    '''
    
    # get initial dataframe
    original = get_stats(name, stat_type='PER_GAME', playoffs=False, career=False)
    
    # clean seasons: label each season by the year it ends, e.g. '2007-08' -> 2008
    cleaned_season = []
    for i in original['SEASON']:
        if i == '1999-00':
            cleaned_season.append('2000')
        elif int(i.split('-')[0]) < 1999:
            cleaned_season.append('19' + i.split('-')[1])
        else:
            cleaned_season.append('20' + i.split('-')[1])
    cleaned_season = [int(i) for i in cleaned_season]
    original['SEASON'] = cleaned_season
    
    # remove combined rows for traded seasons ('TOT') and seasons out of range 
    original = original[original['TEAM'] != 'TOT']
    original = original[original['SEASON'] > 1980].copy()
    
    # convert seasons back to string for indexing
    cleaned_season = [str(i) for i in original['SEASON']]
    original['SEASON'] = cleaned_season
    
    # replace old team acronyms with new
    original['TEAM'] = original['TEAM'].replace('WSB', 'WAS').replace('CHH', 'CHO')
    
    # drop years with messed up data (DNP injury or other leagues)
    original = original[original['LEAGUE'] == 'NBA'].copy()
    
    # add team W/L, requesting each team-season from the scraper only once
    win_percentage = []
    for team, season in zip(original['TEAM'], original['SEASON']):
        team_misc = get_team_misc(team, int(season))
        wins = team_misc.loc['W']
        losses = team_misc.loc['L']
        win_percentage.append(wins / (wins + losses))
    original['Win %'] = win_percentage
    
    # drop columns that will not be used as features
    original = original.drop(columns=['TEAM', 'LEAGUE', 'GS', 'ORB', 'DRB', 'PF'])
    
    # rename 2P to FG
    # note: FG, FGA, and FG% already exist, so this leaves duplicate column names
    original.rename(columns={'2P':'FG', '2PA':'FGA', '2P%':'FG%'}, inplace=True)
    
    # set index to season for next function to work
    original.set_index('SEASON', inplace=True)
    
    # drop 2020 data
    if '2020' in original.index.to_list():
        original = original.drop('2020')
    
    # fill missing values with column means (numeric_only skips the text POS column)
    return original.fillna(original.mean(numeric_only=True))
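
The season-cleaning step in the function above labels each season by the year in which it ends. As a sketch, the same mapping can be written more compactly by adding one to the starting year; `season_end_year` is a hypothetical helper and assumes every label has the form `'YYYY-YY'`.

```python
def season_end_year(season):
    """Convert a season label like '2007-08' to the year it ends, as a string."""
    start_year = int(season.split('-')[0])
    return str(start_year + 1)

for label in ['1980-81', '1999-00', '2007-08']:
    print(label, '->', season_end_year(label))
```

This handles the `'1999-00'` season without a special case, since the result depends only on the starting year.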

I'll test the function below.

In [18]:
get_stats_cleaned('Kevin Durant')

SEASON,AGE,POS,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT,FTA,FT%,TRB,AST,STL,BLK,TOV,PTS,Win %
2008,19.0,SG,80.0,34.6,7.3,17.1,0.43,0.7,2.6,0.288,...,4.9,5.6,0.873,4.4,2.4,1.0,0.9,2.9,20.3,0.243902
2009,20.0,SF,74.0,39.0,8.9,18.8,0.476,1.3,3.1,0.422,...,6.1,7.1,0.863,6.5,2.8,1.3,0.7,3.0,25.3,0.280488
2010,21.0,SF,82.0,39.5,9.7,20.3,0.476,1.6,4.3,0.365,...,9.2,10.2,0.9,7.6,2.8,1.4,1.0,3.3,30.1,0.609756
2011,22.0,SF,78.0,38.9,9.1,19.7,0.462,1.9,5.3,0.35,...,7.6,8.7,0.88,6.8,2.7,1.1,1.0,2.8,27.7,0.670732
2012,23.0,SF,66.0,38.6,9.7,19.7,0.496,2.0,5.2,0.387,...,6.5,7.6,0.86,8.0,3.5,1.3,1.2,3.8,28.0,0.712121
2013,24.0,SF,81.0,38.5,9.0,17.7,0.51,1.7,4.1,0.416,...,8.4,9.3,0.905,7.9,4.6,1.4,1.3,3.5,28.1,0.731707
2014,25.0,SF,81.0,38.5,10.5,20.8,0.503,2.4,6.1,0.391,...,8.7,9.9,0.873,7.4,5.5,1.3,0.7,3.5,32.0,0.719512
2015,26.0,SF,27.0,33.8,8.8,17.3,0.51,2.4,5.9,0.403,...,5.4,6.3,0.854,6.6,4.1,0.9,0.9,2.7,25.4,0.54878
2016,27.0,SF,72.0,35.8,9.7,19.2,0.505,2.6,6.7,0.387,...,6.2,6.9,0.898,8.2,5.0,1.0,1.2,3.5,28.2,0.670732
2017,28.0,PF,62.0,33.4,8.9,16.5,0.537,1.9,5.0,0.375,...,5.4,6.2,0.875,8.3,4.8,1.1,1.6,2.2,25.1,0.817073


<a id='mvpshare'></a>
## Adding MVP Share

Now that we have our cleaned data with all the necessary features for the design matrix, it's time to add the corresponding MVP share for each season. These values are the true vote shares that the model will be trained to predict.

In [117]:
def add_mvp_share(player):
    ''' This function retrieves a player's seasonal stats and adds the MVP vote share they received for each season.
    '''
    
    # get cleaned dataframe
    original = get_stats_cleaned(player)
    
    # MVP voting share for each season
    mvp_shares = []
    # seasons before 1981, which are flagged with -1 and dropped
    no_votes = []
    
    for i in original.index:
        if int(i) < 1981:
            vote = -1
        else:
            vote = mvp_share.loc[player, i]
        
        mvp_shares.append(vote)
        if vote < 0:
            no_votes.append(i)
            
    original['MVP Share'] = mvp_shares    
    original = original.drop(index=no_votes).reset_index()
    
    return original
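
The core of this function is a label-based lookup: the player's name selects the row of the vote-share table and the season string selects the column. A minimal sketch on a toy table, using vote shares that appear in the outputs in this notebook (the `PTS` values are Durant's from the same seasons):

```python
import pandas as pd

# Toy version of the wide vote-share table: players as rows, seasons as columns
shares = pd.DataFrame(
    {'2012': [0.735, 0.888], '2013': [0.632, 0.998]},
    index=['Kevin Durant', 'LeBron James'],
)

# Toy per-season stats for one player, indexed by season string
stats = pd.DataFrame({'PTS': [28.0, 28.1]}, index=['2012', '2013'])

# Look up the share for each season in the stats index
stats['MVP Share'] = [shares.loc['Kevin Durant', season] for season in stats.index]
print(stats)
```

This is why the seasons are converted back to strings in get_stats_cleaned: the column labels in the vote-share table are strings, so the lookup keys have to match.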

I'll test the function below with a few notable players.

In [20]:
add_mvp_share('Kevin Durant')

Unnamed: 0,SEASON,AGE,POS,G,MP,FG,FGA,FG%,3P,3PA,...,FTA,FT%,TRB,AST,STL,BLK,TOV,PTS,Win %,MVP Share
0,2008,19.0,SG,80.0,34.6,7.3,17.1,0.43,0.7,2.6,...,5.6,0.873,4.4,2.4,1.0,0.9,2.9,20.3,0.243902,0.0
1,2009,20.0,SF,74.0,39.0,8.9,18.8,0.476,1.3,3.1,...,7.1,0.863,6.5,2.8,1.3,0.7,3.0,25.3,0.280488,0.0
2,2010,21.0,SF,82.0,39.5,9.7,20.3,0.476,1.6,4.3,...,10.2,0.9,7.6,2.8,1.4,1.0,3.3,30.1,0.609756,0.495
3,2011,22.0,SF,78.0,38.9,9.1,19.7,0.462,1.9,5.3,...,8.7,0.88,6.8,2.7,1.1,1.0,2.8,27.7,0.670732,0.157
4,2012,23.0,SF,66.0,38.6,9.7,19.7,0.496,2.0,5.2,...,7.6,0.86,8.0,3.5,1.3,1.2,3.8,28.0,0.712121,0.735
5,2013,24.0,SF,81.0,38.5,9.0,17.7,0.51,1.7,4.1,...,9.3,0.905,7.9,4.6,1.4,1.3,3.5,28.1,0.731707,0.632
6,2014,25.0,SF,81.0,38.5,10.5,20.8,0.503,2.4,6.1,...,9.9,0.873,7.4,5.5,1.3,0.7,3.5,32.0,0.719512,0.986
7,2015,26.0,SF,27.0,33.8,8.8,17.3,0.51,2.4,5.9,...,6.3,0.854,6.6,4.1,0.9,0.9,2.7,25.4,0.54878,0.0
8,2016,27.0,SF,72.0,35.8,9.7,19.2,0.505,2.6,6.7,...,6.9,0.898,8.2,5.0,1.0,1.2,3.5,28.2,0.670732,0.112
9,2017,28.0,PF,62.0,33.4,8.9,16.5,0.537,1.9,5.0,...,6.2,0.875,8.3,4.8,1.1,1.6,2.2,25.1,0.817073,0.002


In [21]:
add_mvp_share('LeBron James')

Unnamed: 0,SEASON,AGE,POS,G,MP,FG,FGA,FG%,3P,3PA,...,FTA,FT%,TRB,AST,STL,BLK,TOV,PTS,Win %,MVP Share
0,2004,19.0,SG,79.0,39.5,7.9,18.9,0.417,0.8,2.7,...,5.8,0.754,5.5,5.9,1.6,0.7,3.5,20.9,0.426829,0.009
1,2005,20.0,SF,80.0,42.4,9.9,21.1,0.472,1.4,3.9,...,8.0,0.75,7.4,7.2,2.2,0.7,3.3,27.2,0.512195,0.073
2,2006,21.0,SF,79.0,42.5,11.1,23.1,0.48,1.6,4.8,...,10.3,0.738,7.0,6.6,1.6,0.8,3.3,31.4,0.609756,0.55
3,2007,22.0,SF,78.0,40.9,9.9,20.8,0.476,1.3,4.0,...,9.0,0.698,6.7,6.0,1.6,0.7,3.2,27.3,0.609756,0.142
4,2008,23.0,SF,75.0,40.4,10.6,21.9,0.484,1.5,4.8,...,10.3,0.712,7.9,7.2,1.8,1.1,3.4,30.0,0.54878,0.348
5,2009,24.0,SF,81.0,37.7,9.7,19.9,0.489,1.6,4.7,...,9.4,0.78,7.6,7.2,1.7,1.1,3.0,28.4,0.804878,0.969
6,2010,25.0,SF,76.0,39.0,10.1,20.1,0.503,1.7,5.1,...,10.2,0.767,7.3,8.6,1.6,1.0,3.4,29.7,0.743902,0.98
7,2011,26.0,SF,79.0,38.8,9.6,18.8,0.51,1.2,3.5,...,8.4,0.759,7.5,7.0,1.6,0.6,3.6,26.7,0.707317,0.431
8,2012,27.0,SF,62.0,37.5,10.0,18.9,0.531,0.9,2.4,...,8.1,0.771,7.9,6.2,1.9,0.8,3.4,27.1,0.69697,0.888
9,2013,28.0,PF,76.0,37.9,10.1,17.8,0.565,1.4,3.3,...,7.0,0.753,8.0,7.3,1.7,0.9,3.0,26.8,0.804878,0.998


It looks as though the function is working properly: we now have a dataframe with all relevant player stats and the player's MVP share for each season, with 0 for seasons in which they received no votes.

<a id='ttsplit'></a>
# Training and Testing

Now that we have a pipeline of functions to source our training data, it's time to build the design matrix, split the data into training and test sets, and create an evaluation set consisting of the current season's stats.

<a id='designmatrix'></a>
## Building the Design Matrix

The design matrix will essentially be the dataframe returned by add_mvp_share, but with the SEASON column dropped. Additionally, the categorical variable POS (player position) will need to be one-hot encoded.
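
As a quick illustration of the encoding step, here is a sketch of `pd.get_dummies` with `drop_first=True` on a made-up list of positions. The first category in sorted order (`'C'` here) is dropped and serves as the baseline, so a center is represented by zeros in every remaining position column.

```python
import pandas as pd

# Made-up positions for four player-seasons
positions = pd.Series(['C', 'SG', 'PF', 'C'], name='POS')

# One indicator column per position, minus the first category ('C')
encoded = pd.get_dummies(positions, drop_first=True)
print(encoded)
```

Dropping one category avoids a redundant column, since in a linear model with an intercept the full set of indicator columns would be perfectly collinear.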

In [124]:
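def design_matrix(players):
    ''' Build the design matrix from the first 10 players in the index of `players`.
    '''
    # get data for the first player
    temp = add_mvp_share(players.reset_index().iloc[0, 0])
    # add the next nine players (rows 1 through 9 only)
    for player in players.iloc[1:10, :].index.to_list():
        print(player)
        new = add_mvp_share(player)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        temp = pd.concat([temp, new.fillna(0)])
    
    # one-hot encode position
    encoded = pd.get_dummies(temp['POS'], drop_first=True)
    
    # add encoded variables
    final = pd.concat([temp, encoded], axis=1).drop(columns=['SEASON', 'POS'])
    
    return final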

In [125]:
first_10 = design_matrix(mvp_share)
first_10

Bob Lanier
Tiny Archibald
Jamaal Wilkes
Truck Robinson
Julius Erving
Moses Malone
Maurice Lucas
Artis Gilmore
George Gervin


Unnamed: 0,AGE,G,MP,FG,FGA,FG%,3P,3PA,3P%,FG.1,...,STL,BLK,TOV,PTS,Win %,MVP Share,PF,PG,SF,SG
0,33.0,80,37.2,10.5,18.2,0.574,0,0,0,10.5,...,0.7,2.9,3.1,26.2,0.658537,0.414,0,0,0,0
1,34.0,76,35.2,9.9,17.1,0.579,0,0,0,9.9,...,0.8,2.7,3,23.9,0.695122,0.045,0,0,0,0
2,35.0,79,32.3,9.1,15.5,0.588,0,0,0,9.1,...,0.8,2.2,2.5,21.8,0.707317,0.020,0,0,0,0
3,36.0,80,32.8,9,15.5,0.578,0,0,0,9,...,0.7,1.8,2.8,21.5,0.658537,0.201,0,0,0,0
4,37.0,79,33.3,9.2,15.3,0.599,0,0,0,9.2,...,0.8,2.1,2.5,22,0.756098,0.264,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,29.0,79,35.7,12.6,25.2,.500,0.1,0.5,.278,12.4,...,1.0,0.6,2.7,32.3,0.585366,0.159,0,0,0,1
2,30.0,78,36.3,9.7,19.9,.487,0.2,0.4,.364,9.6,...,1.1,0.9,3.2,26.2,0.646341,0.040,0,0,0,1
3,31.0,76,34.0,10.1,20.5,.490,0.1,0.3,.417,9.9,...,1.0,0.6,2.9,25.9,0.451220,0.000,0,0,0,1
4,32.0,72,29.0,8.3,16.4,.508,0.0,0.1,.000,8.3,...,0.9,0.7,2.8,21.2,0.500000,0.000,0,0,0,1


<a id='split'></a>
## Training-Test Split

<a id='2020stats'></a>
## 2019-2020 Season Evaluation Set

<a id='cv'></a>
# Cross-Validation

<a id='conclusion'></a>
# Conclusion