#Rookie of the Year Predictor for the NBA

Every year, the NBA hands out end-of-season awards. One of the biggest is Rookie of the Year (ROTY), which recognizes the most outstanding first-year player for their success throughout the season.

This award has been decided by the media in recent years, and although media sentiment may play into the vote, the race for ROTY is competitive more often than not. This algorithm is trained on past candidates and winners and uses that data to predict the winner from a list of candidates.

##Importing packages

requests, BeautifulSoup and StringIO are used to scrape the data from the reference website (basketball-reference.com).

pandas and numpy are used to extract and construct the datasets for training and testing.

datetime and time are used for additional data in the analysis and for slowing down the scraping process to avoid being blocked by the website.

sklearn is used to build the prediction algorithm.

In [21]:
#Import packages

import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
from datetime import datetime
import time
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


##Create a data frame of ROTY candidates and their stats by scraping data from Basketball Reference.

In [22]:
def wait5secs():
    time.sleep(5) #pause for 5 seconds between requests to avoid getting blocked by the website

df = pd.DataFrame()
for i in range(1975, 2023): #scrape ROTY data for 1975-2022 as the training set (range excludes 2023, which is held out as test data)
    url = f"https://www.basketball-reference.com/awards/awards_{i}.html"
    response = requests.get(url) #GET response from the URL
    response.raise_for_status() #check response to ensure it was successful
    soup = BeautifulSoup(response.content, "html.parser") #parser for html
    table = soup.find_all("table", {"class": "sortable stats_table", "id": "roy"}) #extract data from the ROTY table
    if len(table) == 0: #fall back to the alternate table id when "roy" is not found
        table = soup.find_all("table", {"class": "sortable stats_table", "id": "nba_roy"})
    df_temp = pd.read_html(StringIO(str(table)), header=1)[0] #create temp table to append to main table
    df_temp['Year'] = i #create column for year to distinguish which year the data is sourced from
    df = pd.concat([df, df_temp]) #append to main table
    wait5secs()


df = df.select_dtypes(exclude=['object']) #exclude categorical variables such as name and team, as they are not used as predictors
df = df.drop(["First","Pts Won", "Pts Max"], axis = 1 ) #drop variables directly derived from the voting, since they would leak the target
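The scraping loop above tries the `roy` table id first and falls back to `nba_roy`. If a page had neither, the empty result would most likely only surface as an error later, inside `pd.read_html`. The sketch below shows the same fallback with an explicit "not found" result. It swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without the notebook's dependencies; `TableIdCollector`, `find_roy_table_id` and the three tiny pages are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> tag in a page."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table_id = dict(attrs).get("id")
            if table_id:
                self.table_ids.append(table_id)


def find_roy_table_id(html, candidates=("roy", "nba_roy")):
    """Return the first candidate id present in the page, or None if absent."""
    collector = TableIdCollector()
    collector.feed(html)
    for candidate in candidates:
        if candidate in collector.table_ids:
            return candidate
    return None


# Hand-written stand-ins for real responses (assumption: not actual site markup).
page_a = '<table class="sortable stats_table" id="nba_roy"></table>'
page_b = '<table class="sortable stats_table" id="roy"></table>'
page_c = "<p>no award tables here</p>"

for page in (page_a, page_b, page_c):
    print(find_roy_table_id(page))
```

Returning `None` lets the caller decide whether to skip that year or stop with a clear message, rather than failing while parsing an empty table list.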

##Choosing Feature Variables
We used the correlation matrix to see which stats correlate with the ROTY vote share (`Share`) and to select features for training.

In [23]:
correlation_matrix = df.corr()['Share'] #correlation of every numeric column with the target variable

stats = correlation_matrix[correlation_matrix > 0.2].index.to_list() #keep variables whose correlation with Share exceeds 0.2
stats

['Share', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'WS', 'WS/48']
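To make the selection rule concrete, here is a minimal sketch of the same idea on made-up numbers: compute the Pearson correlation of each stat with `Share` and keep those above 0.2. It uses plain Python rather than `DataFrame.corr` so it runs on its own; the `pearson` helper and the `toy` values are assumptions for illustration, not scraped data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for four hypothetical candidates.
toy = {
    "Share": [0.90, 0.60, 0.30, 0.10],
    "PTS": [22.0, 18.0, 12.0, 9.0],   # rises together with Share
    "Age": [19.0, 22.0, 21.0, 23.0],  # tends to fall as Share rises
}

threshold = 0.2
selected = [
    name
    for name, values in toy.items()
    if name != "Share" and pearson(values, toy["Share"]) > threshold
]
print(selected)
```

Note that a one-sided filter like `> 0.2` keeps only positively correlated stats; a stat with a strong negative correlation (here the toy `Age` column) is dropped even though it might carry signal.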

##Preprocessing Training Data

In [24]:
feature_cols = stats 
feature_cols.remove("Share") #create feature variables list

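x = df[feature_cols] #create training data set
x = x.fillna(0) #fill in missing stats with 0
y = df.Share #target variable: share of the ROTY vote
d = preprocessing.normalize(x) #scale each row (player) to unit L2 norm
scaled_df = pd.DataFrame(d, columns=feature_cols) #normalized data, ready for training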
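One detail worth knowing about `preprocessing.normalize`: with its default settings it rescales each row (one player's stat line) to unit Euclidean length, rather than rescaling each column. The sketch below reproduces that row-wise behaviour in plain Python on made-up numbers; `l2_normalize_rows` is a hypothetical helper written for illustration, not the sklearn implementation.

```python
from math import sqrt


def l2_normalize_rows(rows):
    """Scale every row to unit Euclidean length; all-zero rows stay zero."""
    out = []
    for row in rows:
        norm = sqrt(sum(v * v for v in row))
        out.append([v / norm if norm else 0.0 for v in row])
    return out


# Made-up stat lines (MP, PTS, AST) for two hypothetical players.
rows = [
    [2000.0, 20.0, 5.0],
    [1000.0, 10.0, 2.5],  # exactly half of the first player's numbers
]
normalized = l2_normalize_rows(rows)

# After row-wise normalization the two players become indistinguishable.
same = all(abs(a - b) < 1e-12 for a, b in zip(normalized[0], normalized[1]))
print(same)
```

Because only the proportions within a row survive, the column with the largest raw values (minutes played, in this toy case) dominates each normalized row. If per-feature scaling was the intent, a column-wise scaler such as `StandardScaler` would be the usual choice, though tree-based models like random forests generally do not require feature scaling at all.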

##Repeat the same scraping process for the test data, which is the 2023 ROTY candidates.

In [25]:
url = f"https://www.basketball-reference.com/awards/awards_{2023}.html"
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find_all("table", {"class": "sortable stats_table", "id": "roy"})
if len(table) == 0: #fall back to the alternate table id when "roy" is not found
    table = soup.find_all("table", {"class": "sortable stats_table", "id": "nba_roy"})
df_temp = pd.read_html(StringIO(str(table)), header=1)[0]
df_temp['Year'] = 2023

player = df_temp["Player"] #keep player names to label the predictions later

##Train the model using the RandomForestRegressor.

In [26]:
rf = RandomForestRegressor(n_estimators=1000, random_state=30, oob_score=True) #configure the random forest
rf.fit(scaled_df, y) #fit the model on the scaled training data

##Create test data set and predictions

In [27]:
x_test = df_temp[feature_cols] #create dataset using feature columns
x_test = x_test.fillna(0) #fill in columns that have null with 0

d_test = preprocessing.normalize(x_test) #normalize the test data the same way as the training data before applying the model
scaled_df_test = pd.DataFrame(d_test, columns=feature_cols) #dataframe from test dataset

y_pred = rf.predict(scaled_df_test) #apply test data to model

##Prep Final Dataset with prediction

In [28]:
outcome = pd.DataFrame({
    'Player': player,
    'Share_pred':y_pred})
outcome['Ranking'] = outcome['Share_pred'].rank(ascending = False).astype("int")
outcome = outcome.sort_values("Ranking")
outcome["Original_rank"]=outcome.index+1 #prep final dataframe with rankings based on prediction


In [29]:
actual_y = outcome["Original_rank"] 
y_pred = outcome["Ranking"]
mse = mean_squared_error(actual_y, y_pred)
print(mse)
r2 = r2_score(actual_y, y_pred)
print(r2) #compare predicted and actual rankings using mean squared error and r2

3.0
-0.02857142857142847
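The two scores can be checked by hand from the rankings in the final results table. The sketch below redoes the arithmetic in plain Python and, as an optional extra that is not part of the original analysis, also computes the Spearman rank correlation, which is a common way to compare two orderings.

```python
# Rankings copied from the final results table in this notebook.
original_rank = [1, 4, 3, 6, 2, 5]   # actual finishing order
predicted_rank = [1, 2, 3, 4, 5, 6]  # order predicted by the model

n = len(original_rank)
squared_errors = [(a - p) ** 2 for a, p in zip(original_rank, predicted_rank)]

# Mean squared error between the two rankings.
mse = sum(squared_errors) / n

# R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean actual rank.
mean_actual = sum(original_rank) / n
ss_tot = sum((a - mean_actual) ** 2 for a in original_rank)
r2 = 1 - sum(squared_errors) / ss_tot

# Spearman rank correlation; this shortcut formula assumes there are no ties.
spearman = 1 - 6 * sum(squared_errors) / (n * (n ** 2 - 1))

print(mse, r2, spearman)
```

The squared rank differences sum to 18, giving an MSE of 18 / 6 = 3.0, and the total sum of squares is 17.5, giving an r2 of 1 - 18 / 17.5, about -0.029. An r2 slightly below zero means the predicted ranks fit the actual ranks no better than guessing the mean rank (3.5) for every player, even though the model did place the winner first.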


##Final Results

In [30]:
outcome #print results

Index,Player,Share_pred,Ranking,Original_rank
0,Paolo Banchero,0.553899,1,1
3,Bennedict Mathurin,0.409872,2,4
2,Walker Kessler,0.336074,3,3
5,Jaden Ivey,0.283399,4,6
1,Jalen Williams,0.155926,5,2
4,Keegan Murray,0.155718,6,5
