# MVP Machine Learning

This is part 3 of 3 for Machine Learning NBA MVP Prediction. In this final notebook, I will be performing a statistical analysis of the dataframes we scraped and cleaned in the previous parts. The goal is to use various Machine Learning techniques to build a model that predicts who should win MVP this year based on various predictors. Then I will diagnose the performance of the model with an average precision error metrics and see how confident I am with the results.

#### Model Building

In [1]:
import pandas as pd

In [2]:
#Df from MVP Cleaning notebook
stats = pd.read_csv("player_mvp_stats.csv", index_col = 0)
stats

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,Pts Won,Pts Max,Share,W,L,W/L%,GB,PS/G,PA/G,SRS
0,A.C. Green,SF,31.0,Phoenix Suns,82.0,52.0,32.8,3.8,7.5,0.504,...,0.0,0.0,0.000,59.0,23.0,0.720,0.0,110.6,106.8,3.86
1,Aaron Swinson,SF,24.0,Phoenix Suns,9.0,0.0,5.7,1.1,2.0,0.556,...,0.0,0.0,0.000,59.0,23.0,0.720,0.0,110.6,106.8,3.86
2,Antonio Lang,SF,22.0,Phoenix Suns,12.0,0.0,4.4,0.3,0.8,0.400,...,0.0,0.0,0.000,59.0,23.0,0.720,0.0,110.6,106.8,3.86
3,Charles Barkley,PF,31.0,Phoenix Suns,68.0,66.0,35.0,8.1,16.8,0.486,...,96.0,1050.0,0.091,59.0,23.0,0.720,0.0,110.6,106.8,3.86
4,Dan Majerle,SF,29.0,Phoenix Suns,82.0,46.0,37.7,5.3,12.6,0.425,...,0.0,0.0,0.000,59.0,23.0,0.720,0.0,110.6,106.8,3.86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13697,,,,Memphis Grizzlies (2),,,,,,,...,,,,51.0,31.0,0.622,0.0,116.9,113.0,3.60
13698,,,,New Orleans Pelicans (9),,,,,,,...,,,,42.0,40.0,0.512,9.0,114.4,112.5,1.63
13699,,,,Dallas Mavericks (11),,,,,,,...,,,,38.0,44.0,0.463,13.0,114.2,114.1,-0.14
13700,,,,Houston Rockets (14),,,,,,,...,,,,22.0,60.0,0.268,29.0,110.7,118.6,-7.62


In [3]:
stats = stats.fillna(0) #Replace Nan cells with 0

In [4]:
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G',
       'SRS'],
      dtype='object')

In [5]:
#Only numeric predictors
predictors = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
                    '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
                    'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
                    'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']

In [6]:
#Split data into train and test sets
#Train with all data until previous season, test with current season
train = stats[stats['Year'] < 2023]
test = stats[stats['Year']== 2023]

I will be using Ridge Regression for this model because of the high amount of predictor variables

In [7]:
from sklearn.linear_model import Ridge

#Build model, alpha = .1
reg = Ridge(alpha=.1)
reg.fit(train[predictors], train['Share']) #Use predictors to try and predict Share



Ridge(alpha=0.1)

In [8]:
#Model predictions
predictions = reg.predict(test[predictors])
predictions

array([0.06227042, 0.08957381, 0.09573687, 0.28209301, 0.05859741,
       0.05100784, 0.05956029, 0.04942866, 0.05655439, 0.11491003,
       0.09672365, 0.06709113, 0.05002883, 0.04832838, 0.05614313,
       0.04693411, 0.0585244 , 0.04651862, 0.06441561, 0.02513577,
       0.1226423 , 0.05180961, 0.06818716, 0.07020785, 0.04923785,
       0.06156266, 0.06524614, 0.04220506, 0.04967255, 0.07605618,
       0.14582261, 0.26026606, 0.06159601, 0.03713855, 0.05821214,
       0.02940861, 0.06561919, 0.06167003, 0.06037353, 0.0528606 ,
       0.047066  , 0.06530882, 0.05937179, 0.0960837 , 0.06033525,
       0.10123909, 0.07433943, 0.06197443, 0.07132032, 0.05413998,
       0.07645657, 0.07074222, 0.07013704, 0.16107301, 0.04081159,
       0.05120152, 0.05933613, 0.11365525, 0.06607978, 0.04916629,
       0.04659031, 0.05125835, 0.06145797, 0.09971805, 0.05801689,
       0.04997174, 0.08618675, 0.23176301, 0.06141713, 0.05060202,
       0.08782129, 0.05580355, 0.05087565, 0.04878785, 0.04348

In [9]:
predictions = pd.DataFrame(predictions, columns=['predictions'], index = test.index)
predictions

Unnamed: 0,predictions
125,0.062270
126,0.089574
127,0.095737
128,0.282093
129,0.058597
...,...
13697,0.016933
13698,0.015022
13699,0.013883
13700,0.011212


In [10]:
#Concat test and predictions dfs on columns
combination = pd.concat([test[['Player','Share']], predictions], axis = 1) 
combination

Unnamed: 0,Player,Share,predictions
125,A.J. Green,0.000,0.062270
126,Bobby Portis,0.000,0.089574
127,Brook Lopez,0.000,0.095737
128,Giannis Antetokounmpo,0.606,0.282093
129,Goran Dragić,0.000,0.058597
...,...,...,...
13697,0,0.000,0.016933
13698,0,0.000,0.015022
13699,0,0.000,0.013883
13700,0,0.000,0.011212


In [11]:
from sklearn.metrics import mean_squared_error

#Find MSE error term
mean_squared_error(combination['Share'], combination['predictions'])

0.006748063124983466

The model built by the ridge regression analysis has an MSE of 0.00675, which is a low value indicating the model is a good fit at the 0.1 level.

#### Model Analysis

In [12]:
#Sort by MVP race finish and add a Rank column
combination = combination.sort_values('Share',ascending=False)
combination['Rk'] = list(range(1,combination.shape[0]+1))
combination.head(13)

Unnamed: 0,Player,Share,predictions,Rk
13369,Joel Embiid,0.915,0.256031,1
636,Nikola Jokić,0.674,0.231763,2
128,Giannis Antetokounmpo,0.606,0.282093,3
2695,Jayson Tatum,0.28,0.196493,4
1222,Shai Gilgeous-Alexander,0.046,0.213335,5
12338,Donovan Mitchell,0.03,0.143678,6
3943,Domantas Sabonis,0.027,0.162008,7
220,Luka Dončić,0.01,0.260266,8
6202,Stephen Curry,0.005,0.1745,9
9860,Jimmy Butler,0.003,0.171981,10


In [13]:
#Add predicted ranking column
combination = combination.sort_values('predictions', ascending=False)
combination['Pred Rk'] = list(range(1, combination.shape[0]+1))
combination.head(10)

Unnamed: 0,Player,Share,predictions,Rk,Pred Rk
128,Giannis Antetokounmpo,0.606,0.282093,3,1
220,Luka Dončić,0.01,0.260266,8,2
13369,Joel Embiid,0.915,0.256031,1,3
636,Nikola Jokić,0.674,0.231763,2,4
1222,Shai Gilgeous-Alexander,0.046,0.213335,5,5
7464,Damian Lillard,0.0,0.204954,100,6
10432,Kevin Durant,0.0,0.201954,69,7
7796,Anthony Davis,0.0,0.201328,83,8
7803,LeBron James,0.0,0.197522,90,9
2695,Jayson Tatum,0.28,0.196493,4,10


In [14]:
#Create average precision error metric to count whether we correctly predicted a player to finish top 5 in MVP voting
def find_ap(combination):
    actual = combination.sort_values('Share',ascending=False).head(5)
    predicted = combination.sort_values('predictions', ascending=False)
    ps = []
    found = 0
    seen = 1
    for index, row in predicted.iterrows():
        if row['Player'] in actual['Player'].values:
            found += 1
            ps.append(found/seen)
        seen += 1
    return sum(ps)/len(ps)

In [15]:
find_ap(combination)

0.7433333333333334

My model was able to predict the top 5 MVP vote finishers of 2023 with an average precision of 0.743. Since this number is relatively close to 1, I can assume the model provided a good fit for the season.

Here are some conclusions drawns from the model output for the 2023 season:
- Giannis Antetokounmpo is the predicted MVP while finishing 3rd in actual voting, while Joel Embiid was the true MVP while finishing 3rd in predicted rank.
- Luka Doncic finished 8th in MVP voting but was ranked 2nd in the predicted ranking
- Players ranked 6-10 in my predicted ranking did not receive a single MVP vote (Damian Lillard, Kevin Durant, Anthony Davis, LeBron James)

#### Backtesting and Validation

Here I will backtest the algorithm with data since 1995 to see if we can feel confident with the model.

In [16]:
years = list(range(1995,2023))

In [17]:
aps = [] #average presicion score
all_predictions = [] #list of predictions from every year

#Gather at least 5 years worth of data to train
for year in years[5:]:
    train = stats[stats['Year'] < year]
    test = stats[stats['Year'] == year]
    
    reg.fit(train[predictors], train['Share'])
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions, columns=['predictions'],index=test.index)
    combination = pd.concat([test[['Player','Share']],predictions],axis=1)
    all_predictions.append(combination)
    aps.append(find_ap(combination))

In [18]:
#Calculate average precision score
sum(aps)/len(aps)

0.7363349607222165

The average precision score of all the years is 0.736, which is very close to the AP score for 2023 of 0.743.

In [19]:
all_predictions

[                 Player  Share  predictions
 77           A.C. Green    0.0     0.000824
 78           Brian Shaw    0.0    -0.000194
 79         Derek Fisher    0.0    -0.000739
 80        Devean George    0.0     0.007768
 81            Glen Rice    0.0     0.049231
 ...                 ...    ...          ...
 13650  Mark Hendrickson    0.0    -0.022199
 13651      Michael Cage    0.0    -0.023841
 13652     Scott Burrell    0.0     0.002568
 13653   Sherman Douglas    0.0    -0.011696
 13654   Stephon Marbury    0.0     0.053573
 
 [439 rows x 3 columns],
                 Player  Share  predictions
 91          A.C. Green  0.000     0.010041
 92     Alonzo Mourning  0.000     0.070188
 93      Anthony Carter  0.000    -0.011787
 94       Anthony Mason  0.001     0.052682
 95         Brian Grant  0.000     0.028953
 ...                ...    ...          ...
 13454     Rafer Alston  0.000     0.006986
 13455        Ray Allen  0.006     0.041920
 13456      Sam Cassell  0.000     0.

In [20]:
#Add in predicted rank and difference in rank to all players
def add_ranks(combination):
    combination = combination.sort_values('Share', ascending = False)
    combination['Rk'] = list(range(1,combination.shape[0]+1))
    combination = combination.sort_values('predictions', ascending=False)
    combination['Pred Rk'] = list(range(1, combination.shape[0]+1))
    combination['Diff'] = combination['Rk']-combination['Pred Rk']
    return combination

In [21]:
ranking = add_ranks(all_predictions[1])
#ranking[ranking['Rk'] <6].sort_values('Diff',ascending = False)
ranking.head()

Unnamed: 0,Player,Share,predictions,Rk,Pred Rk,Diff
11969,Shaquille O'Neal,0.466,0.272036,3,1,2
9348,Chris Webber,0.42,0.157423,4,2,2
8817,Tim Duncan,0.569,0.137012,2,3,-1
12212,Karl Malone,0.017,0.134758,7,4,3
1003,Allen Iverson,0.904,0.123538,1,5,-4


In [22]:
#Function to backtest all years
def backtest(stats, model, year, predictors):
    aps = []
    all_predictions = []

    for year in years[5:]:
        train = stats[stats['Year'] < year]
        test = stats[stats['Year'] == year]
        
        model.fit(train[predictors], train['Share'])
        predictions = reg.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns=['predictions'],index=test.index)
        combination = pd.concat([test[['Player','Share']],predictions],axis=1)
        combinations = add_ranks(combination) #Add ranks as column to df
        aps.append(find_ap(combinations))
        all_predictions.append(combinations)
        all_predictions_df = pd.concat(all_predictions)
    return sum(aps)/len(aps), aps, all_predictions_df

In [23]:
mean_ap, aps, all_predictions_df = backtest(stats, reg, years[5:], predictors)

In [24]:
mean_ap

0.7363349607222165

In [25]:
#View biggest differences in predicted vs actual ranking
all_predictions_df[all_predictions_df['Rk'] <= 5].sort_values('Diff').head(10)

Unnamed: 0,Player,Share,predictions,Rk,Pred Rk,Diff
1324,Jason Kidd,0.712,0.015312,2,100,-98
5190,Steve Nash,0.839,0.041184,1,41,-40
12428,Joakim Noah,0.258,0.044454,4,40,-36
8453,Peja Stojaković,0.228,0.034305,4,39,-35
5208,Steve Nash,0.739,0.060247,1,31,-30
1489,Chris Paul,0.138,0.071483,5,33,-28
3694,Chauncey Billups,0.344,0.056507,5,33,-28
880,Devin Booker,0.216,0.088936,4,17,-13
5223,Steve Nash,0.785,0.085181,2,14,-12
6815,Kobe Bryant,0.291,0.076329,4,15,-11


In [26]:
#View predictors with highest coefficients
pd.concat([pd.Series(reg.coef_), pd.Series(predictors)], axis=1).sort_values(0, ascending = False)

Unnamed: 0,0,1
13,0.083801,eFG%
18,0.033488,DRB
17,0.021747,ORB
29,0.019209,W/L%
10,0.016285,2P
15,0.011597,FTA
21,0.00969,STL
22,0.008639,BLK
25,0.007895,PTS
20,0.007162,AST


In the dataframe above we can see which statistics contribute the most to MVP vote share, with eFG% (effective field goal percentage), DRB (defensive rebounds), and ORB (offensive rebounds). Overall the stats at the top are all the major NBA statistical categories, which makes sense because that is generally how we view the general success and skill of a player in the league.

#### Model Improvement

Here I am testing to see if adding predictors for stats ratios will increase the average precision score of the model. Stats ratio will compare the season averages of all the players to the season average of a single player. If the season average of the individual player is much greater than the league average, then the model will reward that player with greater predicted MVP vote share.

In [27]:
#Create stat ratios columns for the major statical categories
stat_ratios = stats[['PTS','AST','STL','BLK','3P','Year']].groupby('Year').apply(lambda x: x/x.mean())
stat_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,Year
0,1.319103,0.758048,0.991605,0.931570,1.107162,1.0
1,0.317998,0.151610,0.141658,0.000000,0.000000,1.0
2,0.105999,0.050537,0.000000,0.465785,0.000000,1.0
3,2.708873,2.071999,2.266527,1.630247,2.435757,1.0
4,1.837322,2.071999,1.699895,1.164462,5.314379,1.0
...,...,...,...,...,...,...
13697,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
13698,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
13699,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
13700,0.000000,0.000000,0.000000,0.000000,0.000000,1.0


In [28]:
#Add the stat ratio columns to the main dataframe
stats[['PTS_T','AST_R','STL_R','BLK_R','3P_R']] = stat_ratios[['PTS','AST','STL','BLK','3P']]
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,W/L%,GB,PS/G,PA/G,SRS,PTS_T,AST_R,STL_R,BLK_R,3P_R
0,A.C. Green,SF,31.0,Phoenix Suns,82.0,52.0,32.8,3.8,7.5,0.504,...,0.72,0.0,110.6,106.8,3.86,1.319103,0.758048,0.991605,0.93157,1.107162
1,Aaron Swinson,SF,24.0,Phoenix Suns,9.0,0.0,5.7,1.1,2.0,0.556,...,0.72,0.0,110.6,106.8,3.86,0.317998,0.15161,0.141658,0.0,0.0
2,Antonio Lang,SF,22.0,Phoenix Suns,12.0,0.0,4.4,0.3,0.8,0.4,...,0.72,0.0,110.6,106.8,3.86,0.105999,0.050537,0.0,0.465785,0.0
3,Charles Barkley,PF,31.0,Phoenix Suns,68.0,66.0,35.0,8.1,16.8,0.486,...,0.72,0.0,110.6,106.8,3.86,2.708873,2.071999,2.266527,1.630247,2.435757
4,Dan Majerle,SF,29.0,Phoenix Suns,82.0,46.0,37.7,5.3,12.6,0.425,...,0.72,0.0,110.6,106.8,3.86,1.837322,2.071999,1.699895,1.164462,5.314379


In [29]:
#Add the stat ratios to the predictors variable
predictors += ['PTS_T','AST_R','STL_R','BLK_R','3P_R']

In [30]:
mean_ap, aps, all_predictions_df = backtest(stats, reg, years[5:], predictors)

mean_ap

0.7249875876057059

The average precision score of 0.725 does not create a model that is much weaker than the original.

Here I am testing to see if a players' team and position will increase average precision score. In the dataframe that displayed the largest differences between predicted vs. actual MVP ranking, a majority of the players listed are guards and play for the Suns, but do these variables actually affect the rankings?

In [31]:
#Create numerical columns for position and team
stats['NPos'] = stats['Pos'].astype('category').cat.codes
stats['NTm'] = stats['Tm'].astype('category').cat.codes

In [32]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,PS/G,PA/G,SRS,PTS_T,AST_R,STL_R,BLK_R,3P_R,NPos,NTm
0,A.C. Green,SF,31.0,Phoenix Suns,82.0,52.0,32.8,3.8,7.5,0.504,...,110.6,106.8,3.86,1.319103,0.758048,0.991605,0.93157,1.107162,9,50
1,Aaron Swinson,SF,24.0,Phoenix Suns,9.0,0.0,5.7,1.1,2.0,0.556,...,110.6,106.8,3.86,0.317998,0.15161,0.141658,0.0,0.0,9,50
2,Antonio Lang,SF,22.0,Phoenix Suns,12.0,0.0,4.4,0.3,0.8,0.4,...,110.6,106.8,3.86,0.105999,0.050537,0.0,0.465785,0.0,9,50
3,Charles Barkley,PF,31.0,Phoenix Suns,68.0,66.0,35.0,8.1,16.8,0.486,...,110.6,106.8,3.86,2.708873,2.071999,2.266527,1.630247,2.435757,3,50
4,Dan Majerle,SF,29.0,Phoenix Suns,82.0,46.0,37.7,5.3,12.6,0.425,...,110.6,106.8,3.86,1.837322,2.071999,1.699895,1.164462,5.314379,9,50


In [33]:
#Use random forest regression to find relationship with position and team 
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=1, min_samples_split=5)

#Only running random forest from 2020-present because the random forest runs very slowly
mean_ap, aps, all_predictions_df = backtest(stats, rf, years[25:], predictors)

In [34]:
#Mean ap for random forest set
mean_ap

0.7796968899842771

In [35]:
mean_ap, aps, all_predictions_df = backtest(stats, reg, years[25:], predictors)

In [36]:
#Mean ap for regular set
mean_ap

0.7249875876057059

The average precision score for the random forest was 0.780, while the score for the regular model was 0.725. This shows the random forest model that includes the player's team and position are good predictors that should be included in the full model.

#### Conclusion

In this project, I was able to create a Machine Learning model that predicts a player's MVP vote share for the season. Some additional variables were tested and added that are not typically looked at when analyzing NBA stats, such as stat ratios, position, and team that a player is on. Although the player that is predicted to rank first does not always win MVP, the model is able to accurately predict the top 5 vote receivers with some degrees of variance.

There are also some outside variable I was not able to test that could affect MVP voting, 3 of which I will describe and give examples for below.
1. Recency bias: Players typically go through highs and lows for during the season, but MVP voters tend to judge a player more so on games that occurred after the All-Star break and near the end of the season. For the 2022-23 season, Luka Doncic was a near favorite to win the award at the beginning of the season, but he and his team did not close the year out well (Dallas Mavericks didn't even make the play-in tournament) which caused him to finish 8th in MVP voting. Our model however does not take recency bias into account and ended up ranking him 2nd overall. Adding a variable for recency bias would give more weight to games at the end of the season, and less weight to games at the start.
2. Strength of team: MVP voters tend to vote more favorably for a player that does not have a great team around him. In the 2016-17 season, Russell Westbrook averaged a triple double for the entire year but his team finished 6th in the Western Conference. During the same season, James Harden nearly averaged a triple double himself (2 rebounds shy) and Steph Curry led the Warriors to the best record in the NBA. Despite finishing 6th seed Russell Westbrook ended up winning the MVP award, becoming the lowest seeded player in history to win. One of the narratives in his favor was that his entire team was bad (Kevin Durant just left) and he had to carry his team of scrubs the whole year. A variable for strength of team would increase or decrease a player's MVP vote share based on the number of All-Star teammates he has.
3. Voter fatigue: It is rare to see a back-to-back MVP in the league, and even more rare for someone to win it 3 times in a row. Even if a player deserves to win the award consecutive times, MVP voters tend to want to vote for another player rather than giving it to the same person every year. This was evident for this 2022-2023 season as Nikola Jokic seemed to be the most deserving player to win MVP (1st seed and highest efficiency in the league), however he did not win the award since he had already won it back-to-back years. Adding a variable for voter fatigue would decrease a player's MVP vote share if he won the MVP the previous year.

In the future I would like to find a way to test the 3 variables mentioned above to see if they are strong variables that can be included in the model. Overall, this was a great study to see which variables affect MVP vote share and I will continue to use it to see how it holds up in the upcoming seasons.