# NBA MVP Prediction

### Here we use the data we scraped and cleaned from the NBA website to train a machine learning algorithm to predict the next MVP
We begin by importing the pandas library to handle dataframes.

In [1]:
import pandas as pd    #importing pandas with alias pd

In [2]:
stats = pd.read_csv("player_mvp_stats.csv")   #reading player_mvp_stats.csv into dataframe

In [3]:
stats    #viewing dataframe

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
1,1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
2,2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
3,3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
4,4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14087,14087,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14088,14088,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14089,14089,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14090,14090,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45


Inspecting the dataframe, we can see a column 'Unnamed: 0' that carries no useful information, so we delete it. We also check how many columns contain null values or have an inappropriate data type.

In [4]:
del stats["Unnamed: 0"]  # deleting column

In [5]:
pd.isnull(stats).sum()   #checking for columns with null values

Player        0
Pos           0
Age           0
Tm            0
G             0
GS            0
MP            0
FG            0
FGA           0
FG%          50
3P            0
3PA           0
3P%        2042
2P            0
2PA           0
2P%          84
eFG%         50
FT            0
FTA           0
FT%         462
ORB           0
DRB           0
TRB           0
AST           0
STL           0
BLK           0
TOV           0
PF            0
PTS           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
W             0
L             0
W/L%          0
GB            0
PS/G          0
PA/G          0
SRS           0
dtype: int64

Looking at the results above, we can see that the columns with null values all hold percentages. Take the '3P%' column, which shows the percentage of three-point attempts a player made: a null value suggests that the player attempted no three-pointers, so the percentage is undefined. To confirm our guess we do some more exploration.

In [6]:
stats[pd.isnull(stats['3P%'])][['Player','3PA']]   #showing the 'Player' and '3PA' columns for rows where '3P%' is null

Unnamed: 0,Player,3PA
2,Elden Campbell,0.0
3,Irving Thomas,0.0
18,Jack Haley,0.0
20,Keith Owens,0.0
30,Benoit Benjamin,0.0
...,...,...
14061,Evan Eschmeyer,0.0
14062,Gheorghe Mureșan,0.0
14064,Jim McIlvaine,0.0
14070,Mark Hendrickson,0.0


We can see that the players with null values in '3P%' also attempted no three-pointers. Let's explore another column with null values.

In [7]:
stats[pd.isnull(stats['FT%'])][['Player','FTA']]   #showing the 'Player' and 'FTA' columns for rows where 'FT%' is null

Unnamed: 0,Player,FTA
77,John Coker,0.0
92,Jason Sasser,0.0
103,Adrian Caldwell,0.0
119,Bruno Šundov,0.0
158,Jamal Robinson,0.0
...,...,...
13951,Mark McNamara,0.0
13979,Luke Zeller,0.0
14032,Myron Brown,0.0
14054,Malcolm Lee,0.0


We can see that our guess was accurate, so we go ahead and fill the null values with zero so that the algorithm can use these rows.

In [8]:
stats = stats.fillna(0)   #use .fillna method to fill in null values
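
The reasoning behind the null percentages can be reproduced on a toy frame. This is a minimal sketch with made-up players and numbers (the names `toy`, `null_rows` and `filled` are not part of the notebook): a percentage computed as makes divided by attempts is undefined when there are no attempts, and `fillna(0)` replaces it with zero.

```python
import pandas as pd

# Made-up per-game numbers: player B never attempts a three-pointer.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [1.5, 0.0, 0.0],
    "3PA": [3.0, 0.0, 2.0],
})

# 0 makes / 0 attempts is undefined, so pandas stores NaN for player B.
toy["3P%"] = toy["3P"] / toy["3PA"]

null_rows = toy[toy["3P%"].isnull()]   # rows where the percentage is missing
filled = toy.fillna(0)                 # same fix as in the notebook

print(null_rows[["Player", "3PA"]])
print(filled)
```

Player C attempted shots and missed them all, so the percentage there is a genuine 0 rather than a null. After filling, the two cases look the same to the model.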

Now we look into the columns in the dataframe to see which will be useful for our ML model. For this model we only use numeric values.

In [9]:
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS'],
      dtype='object')

We won't use the 'Player', 'Tm', 'Team' and 'Pos' columns because they are strings. We also leave out the 'Pts Won', 'Pts Max' and 'Share' columns: 'Share' is the value we want to predict, and the other two are so closely tied to it that using them would leak the answer into the model.

Then we create a list containing the columns we want to use in our model. This is helpful so we don't forget any of them when it's time to use the model.

In [10]:
predictors = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS']

Next we begin to train our prediction model. As the data is a time series, we want to make sure that our test data does not precede our training data. Otherwise the model would learn from seasons that come after the one it is predicting, and its accuracy would look better than it really is.

We will be using Ridge, a form of linear regression with a penalty on large coefficients, to train and test the data so as to reduce overfitting.
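
To illustrate what the penalty does, here is a minimal NumPy sketch of the closed-form solution for least squares with an L2 penalty. It runs on synthetic data, the helper `fit_linear` is hypothetical, and it leaves out the intercept that scikit-learn's `Ridge` fits by default, so it is not a reimplementation of the model used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic predictors
true_w = np.array([1.0, -2.0, 0.5])          # made-up coefficients
y = X @ true_w + rng.normal(scale=0.1, size=50)

def fit_linear(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; alpha=0 is ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = fit_linear(X, y, alpha=0.0)
w_ridge = fit_linear(X, y, alpha=10.0)

# The penalised coefficients are pulled towards zero.
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

A larger `alpha` shrinks the coefficients more, which trades a little bias for less sensitivity to noise in the training data.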

In [11]:
train = stats[stats["Year"] < 2021]    #assigning data to the `train` variable

In [12]:
test = stats[stats["Year"] == 2021]    #assigning data to the `test` variable

In [13]:
from sklearn.linear_model import Ridge   #importing Ridge model from scikit learn

reg = Ridge(alpha=.1)      #initializing the Ridge model with regularization strength alpha=0.1

In [14]:
reg.fit(train[predictors], train["Share"])    #fitting the model

Ridge(alpha=0.1)

In [15]:
predictions = reg.predict(test[predictors])        #making predictions using the 'predictors'

In [16]:
predictions = pd.DataFrame(predictions, columns=["predictions"], index=test.index)   # converting predictions to dataframe

In [17]:
predictions

Unnamed: 0,predictions
630,0.013567
631,-0.013756
632,0.002414
633,-0.004421
634,0.010734
...,...
13897,-0.012571
13898,-0.011575
13899,0.016424
13900,-0.020434


Let us compare our predictions to the actual values. From the test data we only need the 'Player' column and the 'Share' column, which holds the values we are trying to predict.

In [18]:
combination = pd.concat([test[["Player", "Share"]], predictions], axis=1)
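
The concatenation works because the predictions dataframe was built with `index=test.index`, and `pd.concat(..., axis=1)` matches rows by index label rather than by position. A small sketch with made-up values (the names `actual`, `preds` and `combined` are illustrative only):

```python
import pandas as pd

actual = pd.DataFrame(
    {"Player": ["A", "B", "C"], "Share": [0.9, 0.0, 0.1]},
    index=[630, 631, 632],
)
# Predictions deliberately listed in a different row order, same index labels.
preds = pd.DataFrame({"predictions": [0.05, 0.30, 0.01]}, index=[632, 630, 631])

# Rows are matched on their index labels, not on their position.
combined = pd.concat([actual, preds], axis=1)
print(combined)
```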

In [19]:
combination

Unnamed: 0,Player,Share,predictions
630,Aaron Gordon,0.0,0.013567
631,Austin Rivers,0.0,-0.013756
632,Bol Bol,0.0,0.002414
633,Facundo Campazzo,0.0,-0.004421
634,Greg Whittington,0.0,0.010734
...,...,...,...
13897,Patty Mills,0.0,-0.012571
13898,Quinndary Weatherspoon,0.0,-0.011575
13899,Rudy Gay,0.0,0.016424
13900,Tre Jones,0.0,-0.020434


Let's sort the dataframe by 'Share' so we can see the actual winners of the MVP

In [20]:
combination.sort_values("Share", ascending=False).head(10)

Unnamed: 0,Player,Share,predictions
641,Nikola Jokić,0.961,0.154306
8624,Joel Embiid,0.58,0.162713
3651,Stephen Curry,0.449,0.142386
9907,Giannis Antetokounmpo,0.345,0.207436
1389,Chris Paul,0.138,0.072293
10997,Luka Dončić,0.042,0.15143
7464,Damian Lillard,0.038,0.116303
3536,Julius Randle,0.02,0.088877
3531,Derrick Rose,0.01,0.033001
11358,Rudy Gobert,0.008,0.09535


We can see that there are variations between the actual values and predicted values. This is expected, but we have to figure out what level of error is acceptable in order to determine whether the algorithm works.

Next we try to identify a suitable error metric. We start with a standard error metric from scikit-learn called mean_squared_error.

In [21]:
from sklearn.metrics import mean_squared_error

mean_squared_error(combination["Share"], combination["predictions"])

0.0026668960013828723

In [22]:
combination["Share"].value_counts()

0.000    525
0.001      3
0.961      1
0.138      1
0.010      1
0.020      1
0.449      1
0.005      1
0.038      1
0.003      1
0.580      1
0.345      1
0.042      1
0.008      1
Name: Share, dtype: int64

The mean squared error looks very small, but the value counts of the 'Share' column show why that is misleading: 525 of the 540 players in the 2021 test set have a share of exactly 0. This is normal, as the vast majority of NBA players do not get any MVP votes. A model could therefore score a low mean squared error without ranking the top candidates correctly, and ranking is what we are trying to predict.
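
To make this concrete, the sketch below rebuilds the 2021 vote shares from the value counts above and scores a baseline that predicts 0 for every player. The helper `mse` and the variable names are illustrative and not part of the notebook.

```python
# Vote shares of the 2021 test set, taken from the value counts above.
shares = [0.0] * 525 + [0.001] * 3 + [
    0.961, 0.138, 0.010, 0.020, 0.449, 0.005,
    0.038, 0.003, 0.580, 0.345, 0.042, 0.008,
]

def mse(actual, predicted):
    """Mean of the squared differences between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

all_zero = [0.0] * len(shares)        # "nobody gets a vote" baseline
baseline_mse = mse(shares, all_zero)
print(round(baseline_mse, 4))         # about 0.003
```

The baseline lands at roughly 0.003, only slightly worse than the 0.0027 obtained by the Ridge model, even though it says nothing about who the MVP candidates are.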

Since we are only interested in how the players who received votes are ranked, this error metric does not work well.
To find a suitable error metric, let us first rank the players by sorting the dataframe on the 'Share' column.

In [23]:
combination = combination.sort_values("Share", ascending=False)  #sorting the dataframe by Share
combination["Rk"] = list(range(1,combination.shape[0]+1)) #creating column to store rank

In [24]:
combination.head(10)

Unnamed: 0,Player,Share,predictions,Rk
641,Nikola Jokić,0.961,0.154306,1
8624,Joel Embiid,0.58,0.162713,2
3651,Stephen Curry,0.449,0.142386,3
9907,Giannis Antetokounmpo,0.345,0.207436,4
1389,Chris Paul,0.138,0.072293,5
10997,Luka Dončić,0.042,0.15143,6
7464,Damian Lillard,0.038,0.116303,7
3536,Julius Randle,0.02,0.088877,8
3531,Derrick Rose,0.01,0.033001,9
11358,Rudy Gobert,0.008,0.09535,10


Now we sort the dataframe by our predictions and save the resulting rank to a 'Prediction Rk' column.

In [25]:
combination = combination.sort_values("predictions", ascending=False)  #sorting dataframe
combination["Prediction Rk"] = list(range(1, combination.shape[0]+1))  #creating column to store prediction rank

In [26]:
combination.head(10)

Unnamed: 0,Player,Share,predictions,Rk,Prediction Rk
9907,Giannis Antetokounmpo,0.345,0.207436,4,1
8624,Joel Embiid,0.58,0.162713,2,2
641,Nikola Jokić,0.961,0.154306,1,3
10997,Luka Dončić,0.042,0.15143,6,4
3736,LeBron James,0.001,0.147512,15,5
3651,Stephen Curry,0.449,0.142386,3,6
4177,Kevin Durant,0.0,0.14135,531,7
4174,James Harden,0.001,0.140598,13,8
11784,Zion Williamson,0.0,0.127926,251,9
3876,Russell Westbrook,0.005,0.120227,11,10


Looking at the result above, we can see that some predictions were close while others were far off. One way to define the error metric is to ask: of the players in the top 5 of MVP voting, how many did we correctly predict to be in that category?

Let's sort the dataframe to show the actual rank.

In [27]:
combination.sort_values("Share", ascending=False).head(10)

Unnamed: 0,Player,Share,predictions,Rk,Prediction Rk
641,Nikola Jokić,0.961,0.154306,1,3
8624,Joel Embiid,0.58,0.162713,2,2
3651,Stephen Curry,0.449,0.142386,3,6
9907,Giannis Antetokounmpo,0.345,0.207436,4,1
1389,Chris Paul,0.138,0.072293,5,33
10997,Luka Dončić,0.042,0.15143,6,4
7464,Damian Lillard,0.038,0.116303,7,12
3536,Julius Randle,0.02,0.088877,8,24
3531,Derrick Rose,0.01,0.033001,9,76
11358,Rudy Gobert,0.008,0.09535,10,19


The metric we are going to use for this task is called Average Precision. We use it because the task involves ranking.
We create a function to calculate the average precision of a set of predictions.

In [28]:
def find_ap(combination):
    actual = combination.sort_values("Share", ascending=False).head(5)
    predicted = combination.sort_values("predictions", ascending=False)
    ps = []
    found = 0
    seen = 1
    for index, row in predicted.iterrows():
        if row["Player"] in actual["Player"].values:
            found += 1
            ps.append(found/seen)
        seen += 1
    return sum(ps) / len(ps)

In [29]:
print('Error metric: {}'.format(find_ap(combination)))

Error metric: 0.7636363636363636
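
The value can be checked by hand from the 'Prediction Rk' column shown earlier. The sketch below uses a hypothetical helper, `average_precision`, that takes only the predicted positions of the actual top 5. Each time one of them is reached while walking down the predicted ranking, it records the fraction of players seen so far that belong to the top 5.

```python
def average_precision(hit_positions):
    """Average precision given the 1-based predicted positions of the relevant items."""
    precisions = []
    for found, position in enumerate(sorted(hit_positions), start=1):
        # precision at this hit = relevant items found so far / items seen so far
        precisions.append(found / position)
    return sum(precisions) / len(precisions)

# 'Prediction Rk' of the actual top 5 in 2021:
# Jokić 3, Embiid 2, Curry 6, Giannis 1, Paul 33
ap_2021 = average_precision([3, 2, 6, 1, 33])
perfect = average_precision([1, 2, 3, 4, 5])
print(ap_2021, perfect)
```

The five precisions are 1/1, 2/2, 3/3, 4/6 and 5/33, and their mean is 42/55, which matches the value printed by `find_ap`. Chris Paul at predicted rank 33 is what pulls the score down the most.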


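The value above is a reasonable result for our model: average precision ranges up to 1, and a value of 1 would mean the actual top 5 were ranked first by the predictions. Note that higher is better, so this is a score rather than an error.
But we don't want to judge this algorithm by testing it on only one year, so we apply back testing to make the metric more robust.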
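
Back testing here means an expanding window: each test season is predicted by a model trained on every earlier season. The sketch below only enumerates the splits and fits no model; `splits` and the other names are illustrative, and `years` is defined the same way as in the notebook.

```python
years = list(range(1991, 2022))

splits = []
for test_year in years[5:]:              # skip the first five seasons as test years
    train_years = [y for y in years if y < test_year]
    splits.append((train_years, test_year))

first_train, first_test = splits[0]
last_train, last_test = splits[-1]
print(len(splits), first_test, first_train, last_test, len(last_train))
```

Starting at `years[5:]` guarantees that even the first test season, 1996, has five seasons of training data behind it.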


In [30]:
years = list(range(1991,2022))
aps = []  # list of average precision for each test year
all_predictions = []    # list of predictions for each test year
for year in years[5:]:  # back-testing loop; the first five seasons are used for training only
    train = stats[stats["Year"] < year]     # training data: every season before the test year
    test = stats[stats["Year"] == year]     # test data: the test year itself
    reg.fit(train[predictors], train["Share"])    # fitting the model
    predictions = reg.predict(test[predictors])   # making predictions
    predictions = pd.DataFrame(predictions, columns=['predictions'], index=test.index)    # converting predictions to dataframe
    combination = pd.concat([test[["Player","Share"]], predictions], axis=1)
    all_predictions.append(combination)
    aps.append(find_ap(combination))

Next we find the Mean Average Precision of all the tests carried out.

In [31]:
sum(aps)/len(aps)

0.7112884360789578

We can go further to diagnose our model. We can create columns that help us see where it goes wrong. To do that we create a function that adds two ranking columns and the difference between them. One rank is by actual vote share and the other is by predictions.

In [32]:
def add_ranks(combination):
    combination = combination.sort_values("Share", ascending=False)
    combination["Rk"] = list(range(1,combination.shape[0]+1))
    combination = combination.sort_values("predictions", ascending=False)
    combination["Predicted_Rk"] = list(range(1,combination.shape[0]+1))
    combination["Diff"] = combination["Rk"] - combination["Predicted_Rk"]
    return combination
    

In [33]:
add_ranks(all_predictions[1]).sort_values("Share", ascending=False) # ranking the second back-test year (1997) and sorting by actual vote share

Unnamed: 0,Player,Share,predictions,Rk,Predicted_Rk,Diff
1600,Karl Malone,0.857,0.192318,1,2,-1
10524,Michael Jordan,0.832,0.167629,2,3,-1
908,Grant Hill,0.327,0.128646,3,6,-3
4682,Tim Hardaway,0.207,0.059984,4,20,-16
8248,Glen Rice,0.117,0.033110,5,53,-48
...,...,...,...,...,...,...
10136,Horacio Llamas,0.000,0.010171,62,156,-94
3576,Ennis Whatley,0.000,0.010250,259,155,104
10594,Kevin Salvadori,0.000,0.010553,215,154,61
1138,Aaron Williams,0.000,0.010594,332,153,179
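
The same three columns can also be produced without re-sorting the dataframe. This is an alternative sketch on made-up data, not the notebook's method: it uses `Series.rank` with `method="first"`, which breaks ties by order of appearance, so the ranks of the many players tied at a share of 0 may differ from those produced by `add_ranks`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Share": [0.9, 0.5, 0.1, 0.0],
    "predictions": [0.20, 0.30, 0.05, 0.10],
})

# Rank 1 goes to the largest value; ties are broken by order of appearance.
toy["Rk"] = toy["Share"].rank(ascending=False, method="first").astype(int)
toy["Predicted_Rk"] = toy["predictions"].rank(ascending=False, method="first").astype(int)

# Negative Diff: the model ranked the player lower than the voters did.
toy["Diff"] = toy["Rk"] - toy["Predicted_Rk"]
print(toy)
```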


Let's create a function that handles all these steps so we can continue diagnosing our model.

In [34]:
def backtest(stats, model, years, predictors):
    aps = []  # list of average precision for each test year
    all_predictions = []    # list of predictions for each test year
    for year in years:  # loop over the test years passed in by the caller
        train = stats[stats["Year"] < year]     # training data: every season before the test year
        test = stats[stats["Year"] == year]     # test data: the test year itself
        model.fit(train[predictors], train["Share"])    # fitting the model
        predictions = model.predict(test[predictors])   # making predictions
        predictions = pd.DataFrame(predictions, columns=['predictions'], index=test.index)    # converting predictions to dataframe
        combination = pd.concat([test[["Player","Share"]], predictions], axis=1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    return sum(aps)/len(aps), aps, pd.concat(all_predictions)

In [35]:
mean_ap_reg, aps, all_predictions = backtest(stats, reg, years[5:], predictors)

In [36]:
mean_ap_reg

0.7112884360789578

In [37]:
all_predictions[all_predictions["Rk"] <=5].sort_values("Diff").head(10)

Unnamed: 0,Player,Share,predictions,Rk,Predicted_Rk,Diff
1224,Jason Kidd,0.712,0.02821,2,52,-50
8248,Glen Rice,0.117,0.03311,5,53,-48
5175,Steve Nash,0.839,0.0341,1,45,-44
8516,Peja Stojaković,0.228,0.03627,4,38,-34
5193,Steve Nash,0.739,0.054129,1,34,-33
12726,Joakim Noah,0.258,0.046968,4,37,-33
3657,Chauncey Billups,0.344,0.052696,5,35,-30
1389,Chris Paul,0.138,0.072293,5,33,-28
5208,Steve Nash,0.785,0.074421,2,21,-19
4682,Tim Hardaway,0.207,0.059984,4,20,-16


Let's find the columns that are most important to our algorithm by checking the coefficients.

In [38]:
pd.concat([pd.Series(reg.coef_), pd.Series(predictors)], axis=1).sort_values(0,ascending=False)

Unnamed: 0,0,1
13,0.070001,eFG%
18,0.035041,DRB
29,0.027125,W/L%
17,0.02161,ORB
10,0.016945,2P
21,0.011635,STL
15,0.011351,FTA
22,0.011234,BLK
20,0.007455,AST
25,0.005894,PTS


Let's add more insight to our data by comparing some player attributes to the season's average.

In [39]:
stat_ratios = stats[["PTS","AST","STL","BLK","3P","Year"]].groupby("Year").apply(lambda x: x/x.mean())

In [40]:
stat_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,Year
0,1.013334,0.420714,0.961127,0.673469,0.508587,1.0
1,1.614653,1.028412,1.647646,0.673469,4.577279,1.0
2,0.311795,0.093492,0.274608,1.571429,0.000000,1.0
3,0.200440,0.186984,0.274608,0.000000,0.000000,1.0
4,2.383005,1.636110,1.784950,0.897959,1.525760,1.0
...,...,...,...,...,...,...
14087,0.735752,0.819562,0.479763,1.528302,0.650951,1.0
14088,0.071202,0.000000,0.000000,0.000000,0.130190,1.0
14089,1.281633,0.601012,1.119447,2.547170,0.520761,1.0
14090,0.474679,0.218550,0.319842,1.273585,0.650951,1.0
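
The ratio divides each player's value by the mean of that statistic in the same season, so 1.0 means exactly average for that year. The sketch below shows the idea on made-up numbers using `groupby(...).transform`, an alternative formulation that keeps the original row index and is less sensitive to how different pandas versions index the result of `groupby.apply`. It is not the call used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1991, 1991, 2021, 2021],
    "PTS": [10.0, 30.0, 5.0, 15.0],
})

# Divide each value by the mean of its own season.
toy["PTS_R"] = toy.groupby("Year")["PTS"].transform(lambda s: s / s.mean())
print(toy)
```

Both seasons end up with ratios of 0.5 and 1.5 even though the raw point totals differ, which is the point of the feature: it compares a player to his own season.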


In [41]:
stats[["PTS_R","AST_R","STL_R","BLK_R","3P_R",]] = stat_ratios[["PTS","AST","STL","BLK","3P"]]

In [42]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,W/L%,GB,PS/G,PA/G,SRS,PTS_R,AST_R,STL_R,BLK_R,3P_R
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,0.707,5.0,106.3,99.6,6.73,1.013334,0.420714,0.961127,0.673469,0.508587
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,0.707,5.0,106.3,99.6,6.73,1.614653,1.028412,1.647646,0.673469,4.577279
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,0.707,5.0,106.3,99.6,6.73,0.311795,0.093492,0.274608,1.571429,0.0
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.34,...,0.707,5.0,106.3,99.6,6.73,0.20044,0.186984,0.274608,0.0,0.0
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,0.707,5.0,106.3,99.6,6.73,2.383005,1.63611,1.78495,0.897959,1.52576


Let's add our new columns to our list of predictors.

In [43]:
predictors += ["PTS_R","AST_R","STL_R","BLK_R","3P_R",]

Let's run the back test again with our new predictors.

In [44]:
mean_ap_reg, aps, all_predictions = backtest(stats, reg, years[5:], predictors)

In [45]:
mean_ap_reg

0.7208380973034985

There are other columns in the dataframe that can help our model even though they are not numeric. Examples are "Pos" and "Tm". We can treat them as categories and assign a number to each category.

In [46]:
stats["Pos"].unique()

array(['PF', 'SG', 'SF', 'PG', 'C', 'PG-SG', 'PF-SF', 'SG-PG', 'PF-C',
       'SG-SF', 'SF-PF', 'SF-SG', 'C-PF', 'SG-PF', 'PG-SF', 'SF-C'],
      dtype=object)

In [47]:
stats["NPos"] = stats["Pos"].astype("category").cat.codes

In [48]:
stats["NTm"] = stats["Tm"].astype("category").cat.codes

In [49]:
stats.head(10)

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,PS/G,PA/G,SRS,PTS_R,AST_R,STL_R,BLK_R,3P_R,NPos,NTm
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,106.3,99.6,6.73,1.013334,0.420714,0.961127,0.673469,0.508587,2,15
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,106.3,99.6,6.73,1.614653,1.028412,1.647646,0.673469,4.577279,12,15
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,106.3,99.6,6.73,0.311795,0.093492,0.274608,1.571429,0.0,2,15
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.34,...,106.3,99.6,6.73,0.20044,0.186984,0.274608,0.0,0.0,2,15
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,106.3,99.6,6.73,2.383005,1.63611,1.78495,0.897959,1.52576,8,15
5,Larry Drew,PG,32,LAL,48,2,10.3,1.1,2.6,0.432,...,106.3,99.6,6.73,0.322931,1.16865,0.411912,0.0,1.52576,5,15
6,Magic Johnson,PG,31,LAL,79,79,37.1,5.9,12.4,0.477,...,106.3,99.6,6.73,2.160294,5.843249,1.78495,0.44898,5.085865,5,15
7,Mychal Thompson,C,36,LAL,72,4,15.0,1.6,3.2,0.496,...,106.3,99.6,6.73,0.445421,0.140238,0.411912,0.673469,0.0,0,15
8,Sam Perkins,PF,29,LAL,73,66,34.3,5.0,10.2,0.495,...,106.3,99.6,6.73,1.503297,0.70119,1.235735,2.469388,1.017173,2,15
9,Terry Teagle,SG,30,LAL,82,0,18.3,4.1,9.2,0.443,...,106.3,99.6,6.73,1.102418,0.46746,0.549215,0.22449,0.0,12,15


The numbers assigned to the categories follow the alphabetical order of the labels, which carries no basketball meaning, so a linear regression model would misread them as a scale. In order to get value from these columns we use a RandomForestRegressor model.
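
The sketch below shows how the codes are assigned, using four positions from the dataset. It also shows one-hot encoding with `pd.get_dummies`, a different technique from the one used in this notebook, as one way to present categories to a linear model without implying an order.

```python
import pandas as pd

pos = pd.Series(["PF", "SG", "C", "PG"])

as_cat = pos.astype("category")
labels = list(as_cat.cat.categories)   # categories are stored in sorted order
codes = as_cat.cat.codes.tolist()      # each code is the position of the label in that order

# One-hot alternative: one indicator column per category, no implied order.
one_hot = pd.get_dummies(pos, prefix="Pos")

print(labels, codes)
print(one_hot)
```

Here 'C' gets code 0 and 'SG' gets code 3 only because of spelling. A tree-based model can still split on such codes to isolate groups of positions.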

In [50]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=1, min_samples_split=5)
mean_ap_rf, aps, all_predictions = backtest(stats, rf, years[5:], predictors)

In [57]:
mean_ap_rf

0.7160867351429451

In [52]:
def back_test(stats, model, years, predictors):
    aps = []  # list of average precision for each test year
    all_predictions = []    # list of predictions for each test year
    for year in years:  # loop over the test years passed in by the caller
        train = stats[stats["Year"] < year]     # training data: every season before the test year
        test = stats[stats["Year"] == year]     # test data: the test year itself
        model.fit(train[predictors], train["Share"])    # fitting the model
        predictions = model.predict(test[predictors])   # predicting with the model that was just fitted
        predictions = pd.DataFrame(predictions, columns=['predictions'], index=test.index)    # converting predictions to dataframe
        combination = pd.concat([test[["Player","Share"]], predictions], axis=1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    return sum(aps)/len(aps), aps, pd.concat(all_predictions)

In [53]:
mean_ap_reg, aps, all_predictions = back_test(stats, reg, years[5:], predictors)

In [54]:
mean_ap_reg

0.7208380973034985

In [58]:
pd.concat([pd.Series(reg.coef_), pd.Series(predictors)], axis=1).sort_values(0,ascending=False)

Unnamed: 0,0,1
22,0.133732,BLK
21,0.053428,STL
34,0.049903,PTS_R
13,0.047623,eFG%
18,0.03606,DRB
29,0.029259,W/L%
35,0.028367,AST_R
17,0.022343,ORB
10,0.019677,2P
15,0.011051,FTA
