## NBA Regular Season MVP Predictor

### This program will:
Predict the regular season MVP using previous winners as well as player stats from past NBA seasons

Making a list of years that we will be able to iterate over to generate data for each season:

In [1]:
years = list(range(1991,2023))

In [2]:
url_start = 'https://www.basketball-reference.com/awards/awards_{}.html'

Downloading data

In [3]:
import requests
import numpy as np
import time

for year in years:
    url = url_start.format(year)
    data = requests.get(url)

    # lag = np.random.uniform(low=5,high=25)
    lag = 5

    with open('mvp/{}.html'.format(year),'w+') as f:
        f.write(data.text)
    
    print(f'Data from the year {year} has been downloaded')
    print(f'Waiting {lag} seconds until next download attempt')

    time.sleep(lag)

Data from the year 1991 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1992 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1993 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1994 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1995 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1996 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1997 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1998 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 1999 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 2000 has been downloaded
Waiting 5 seconds until next download attempt
Data from the year 2001 has been downloaded
Waiting 5 seconds until next download attempt
Data from 
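The download loop above writes whatever the server returns, even if the request failed. Below is a minimal sketch of a more defensive version, assuming we want to skip years that are already on disk and ignore non-200 responses. The helper name `download_pages` and its `fetch` parameter are hypothetical and not part of the original notebook; `fetch` is expected to behave like `requests.get`.

```python
import time
from pathlib import Path


def download_pages(years, url_template, out_dir, fetch, lag=5.0):
    """Save one HTML page per year and return the years that were written.

    `fetch` is any callable that takes a URL and returns an object with
    `status_code` and `text` attributes (for example `requests.get`).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved = []
    for year in years:
        target = out_dir / f"{year}.html"
        if target.exists():
            # already downloaded on an earlier run, so no request is made
            continue
        response = fetch(url_template.format(year))
        if response.status_code == 200:
            target.write_text(response.text, encoding="utf-8")
            saved.append(year)
        else:
            print(f"Skipping {year}: HTTP status {response.status_code}")
        time.sleep(lag)  # wait between requests
    return saved
```

With the variables from this notebook, the call would look like `download_pages(years, url_start, 'mvp', requests.get)`.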

Cleaning data

In [4]:
from bs4 import BeautifulSoup # imports parser class

In [5]:
with open ('mvp/1991.html') as f:
    page = f.read()

In [6]:
soup = BeautifulSoup(page, 'html.parser') # creates parser class where we can extract the table from the html file

In [7]:
soup.find('tr', class_='over_header').decompose() # this is getting rid of the header on the table that we are trying to analyze

In [8]:
mvp_table = soup.find(id='mvp')
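To see what the `decompose` step does, here is a small sketch on a made-up table. The HTML and player names are invented for illustration and are much simpler than the real page; only the `over_header` class and the `mvp` id are taken from the cells above.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the awards page: one grouping row plus the real header
sample_html = """
<table id="mvp">
  <thead>
    <tr class="over_header"><th colspan="2"></th><th>Voting</th></tr>
    <tr><th>Rank</th><th>Player</th><th>Share</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>0.9</td></tr>
    <tr><td>2</td><td>Player B</td><td>0.5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find returns None when nothing matches, so check before calling decompose
over_header = soup.find("tr", class_="over_header")
if over_header is not None:
    over_header.decompose()  # removes the row from the parsed tree

table = soup.find(id="mvp")
header = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

print(header)
print(rows)
```

After the grouping row is removed, the first `tr` left in the table is the real column header, which is what lets the table be read with a single header level.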

Now we are combining all of the above steps and creating dataframes for our data

In [35]:
import pandas as pd

In [10]:
mvp_1991 = pd.read_html(str(mvp_table))[0]

In [55]:
dfs = []
for year in years:
    with open (f'mvp/{year}.html') as f:
        page = f.read()
    soup = BeautifulSoup(page, 'html.parser') # creates parser class where we can extract the table from the html file
    soup.find('tr', class_='over_header').decompose() # this is getting rid of the header on the table that we are trying to analyze
    mvp_table = soup.find(id='mvp')
    mvp = pd.read_html(str(mvp_table))[0]
    mvp['Year'] = year # add a column showing which season each player's stats are from
    dfs.append(mvp)

The for loop above builds a list containing one dataframe per season

In order to view all of our MVP data together, we need to concatenate these dataframes into one big dataframe

In [56]:
mvps = pd.concat(dfs)
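One detail of `pd.concat` worth knowing: by default each dataframe keeps its own row labels, so the combined index restarts at 0 for every season. The sketch below uses two made-up seasons to show this, along with the `ignore_index=True` option that renumbers the rows instead. The player names are placeholders.

```python
import pandas as pd

# Two tiny made-up seasons standing in for the per-year MVP tables
season_a = pd.DataFrame({"Player": ["Player A", "Player B"], "Year": 1991})
season_b = pd.DataFrame({"Player": ["Player C", "Player D"], "Year": 1992})

# Default behaviour: each frame keeps its original row labels
stacked = pd.concat([season_a, season_b])

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
renumbered = pd.concat([season_a, season_b], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```

Either form works for the steps that follow; the repeated labels only matter if rows are later looked up by index.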

In [57]:
mvps.tail()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
7,8,Stephen Curry,33,GSW,0.0,4.0,1000,0.004,64,34.5,...,5.2,6.3,1.3,0.4,0.437,0.38,0.923,8.0,0.173,2022
8,9,Chris Paul,36,PHO,0.0,2.0,1000,0.002,65,32.9,...,4.4,10.8,1.9,0.3,0.493,0.317,0.837,9.4,0.21,2022
9,10T,DeMar DeRozan,32,CHI,0.0,1.0,1000,0.001,76,36.1,...,5.2,4.9,0.9,0.3,0.504,0.352,0.877,8.8,0.154,2022
10,10T,Kevin Durant,33,BRK,0.0,1.0,1000,0.001,55,37.2,...,7.4,6.4,0.9,0.9,0.518,0.383,0.91,8.4,0.198,2022
11,10T,LeBron James,37,LAL,0.0,1.0,1000,0.001,56,37.2,...,8.2,6.2,1.3,1.1,0.524,0.359,0.756,7.5,0.172,2022


In [58]:
mvps.to_csv('mvps.csv')

We now have all of the MVP voting stats from previous years

In order to make predictions for a future MVP winner, we need stats on all players

Now it is time to gather stats for all players from 1991 to 2022. We will then map the votes to the player data and train a machine learning model
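As a preview of what mapping the votes to the player data could look like, here is a minimal sketch using a left merge on player name and season. The two small dataframes and their numbers are made up for illustration, and the choice of `Player` and `Year` as join keys is an assumption rather than the notebook's final method.

```python
import pandas as pd

# Made-up per-game stats for three players in one season
player_stats = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Year": [1991, 1991, 1991],
    "PTS": [31.5, 25.6, 3.1],
})

# Made-up voting results: only two of the players received votes
votes = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Year": [1991, 1991],
    "Share": [0.928, 0.518],
})

# A left merge keeps every player and attaches a vote share where one exists
combined = player_stats.merge(votes, how="left", on=["Player", "Year"])

# Players without votes get NaN from the merge; treat that as a share of 0
combined["Share"] = combined["Share"].fillna(0)

print(combined)
```

Joining on the name alone would mix up seasons, which is why the `Year` column added earlier is part of the key.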

In [18]:
player_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

url = player_stats_url.format(1991)
data = requests.get(url)
with open('mvp/player/1991.html','w+') as f:
    f.write(data.text)

The code above downloads the HTML file containing the stats for each player from a given season, with one caveat...

**The HTML file only contains the stats for the first 20 players in that season, sorted alphabetically.**

This is because the page uses JavaScript to load the next 20 players, and so on, as you scroll down in a web browser (client side)

We have to come up with a way around this

To get the page to load all of the necessary player data, we have to use Selenium

Run `pip install selenium` in the terminal

We also have to download a Chrome driver in order to use this package

Selenium lets us control a web browser from Python, meaning we can tell the browser to scroll down the page in order to load more player data

In [24]:
from selenium import webdriver

In [25]:
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver location through a Service object; the executable_path argument is deprecated
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))


After running the line of code above, a new Chrome window should appear

This window will appear as a separate app from the normal Chrome app

The new Chrome window is being controlled by Selenium

We will now write code that tells the browser to render the full player stats page so that we get all of the data we need

In [53]:
year = 1991
url = player_stats_url.format(year)


driver.get(url)
driver.execute_script('window.scrollTo(1,10000)')

time.sleep(2)

html = driver.page_source

In the code above, we tell the driver to open the Basketball Reference page that shows player stats for a given year, in this case 1991

We then tell the driver to run a piece of JavaScript

This is done with the `.execute_script` method

The script we want the driver to run is the JavaScript function `window.scrollTo`, which scrolls the page down (here to a vertical offset of 10000 pixels) so that all of the data we need to collect is rendered

The `sleep` call pauses the program to give the browser time to finish running the JavaScript and loading the data

In this case, we sleep for 2 seconds

The last line saves the rendered HTML to a variable to be used later
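A fixed offset of 10000 pixels and a fixed 2 second wait may not be enough on a longer page. One common alternative, sketched below, is to keep scrolling until the page height stops growing. The helper name `scroll_to_bottom` and its parameters are hypothetical; it only relies on the driver's `execute_script` method and on the standard browser property `document.body.scrollHeight`.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops changing and return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to load more rows
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
    return last_height
```

It would be called as `scroll_to_bottom(driver)` right after `driver.get(url)`, with `html = driver.page_source` read afterwards as before. The `max_rounds` limit is there so the loop cannot run forever on a page that keeps growing.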

In [54]:
with open('mvp/player/{}.html'.format(year), 'w+') as f:
    f.write(html)

We just saved the new HTML over the 1991 file from before, but this time it contains all of the player data that we need

Now we will write a for loop that combines everything we have done so far

In [31]:
for year in years:
    url = player_stats_url.format(year)

    driver.get(url)
    driver.execute_script('window.scrollTo(1,10000)')

    time.sleep(2)

    html = driver.page_source
    
    with open('mvp/player/{}.html'.format(year), 'w+') as f:
        f.write(html)

We have now collected all player data from years 1991 to 2022

Now it is time to clean up our data like last time by removing headers that we don't need, especially the header rows that Basketball Reference repeats within the table

We are going to reuse the parsing code that we wrote earlier for the MVP voting pages, with some minor changes

One change is the class that we want our soup to find

We want it to find the class `'thead'` rather than `'over_header'`, because that is the class of the header row we are trying to remove

We will also change the id that we want our soup to look for from `'mvp'` to `'per_game_stats'`

In [76]:
dfs = []
for year in years:
    with open('mvp/player/{}.html'.format(year)) as f: # same folder the Selenium loop saved to
        page = f.read()
    
    soup = BeautifulSoup(page, 'html.parser') # creates parser class where we can extract the table from the html file
    soup.find('tr', class_="thead").decompose() # this is getting rid of the header on the table that we are trying to analyze
    player_table = soup.find(id='per_game_stats')
    player = pd.read_html(str(player_table))[0]
    player['Year'] = year # add a column showing which season each player's stats are from
    dfs.append(player)
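Note that `soup.find` returns only the first matching row, so a single `decompose` call removes one header row. If a page repeats the header several times, a loop over `find_all` is one way to remove every copy. The sketch below shows this on a made-up table; the HTML is invented and far smaller than the real page.

```python
from bs4 import BeautifulSoup

# Invented table in which the header row is repeated between data rows
sample_html = """
<table id="per_game_stats">
  <tr class="thead"><th>Rk</th><th>Player</th></tr>
  <tr><td>1</td><td>Player A</td></tr>
  <tr class="thead"><th>Rk</th><th>Player</th></tr>
  <tr><td>2</td><td>Player B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching row, so each repeated header is removed
for header_row in soup.find_all("tr", class_="thead"):
    header_row.decompose()

table = soup.find(id="per_game_stats")
remaining = [tr.get_text(" ", strip=True) for tr in table.find_all("tr")]

print(remaining)
```

An empty result from `find_all` simply means the loop body never runs, so this form also avoids an error on a page with no repeated headers.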

In [82]:
players = pd.concat(dfs)

In [85]:
players

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0,6.7,1.3,2.7,...,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1991
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19,22.5,6.2,15.1,...,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1991
2,3,Mark Acres,C,28,ORL,68,0,19.3,1.6,3.1,...,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1991
3,4,Michael Adams,PG,28,DEN,66,66,35.5,8.5,21.5,...,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1991
4,5,Mark Aguirre,SF,31,DET,78,13,25.7,5.4,11.7,...,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1991
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
836,601,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,...,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3,2022
837,602,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,...,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4,2022
838,603,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,...,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3,2022
839,604,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,...,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2,2022


In [86]:
players.to_csv('players.csv') # save the combined dataframe, not just the last season

We now have a single csv file where each row is a single player's stats for one season

All players from 1991 to 2022 are listed in the csv file

Another important factor in determining the MVP is the team's record

We need to get the record of each team for each season using a similar process

In [87]:
team_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_standings.html'

In [91]:
for year in years:
    url = team_stats_url.format(year)

    data = requests.get(url)

    with open('mvp/team/{}.html'.format(year), 'w+') as f: # same folder the parsing loop reads from
        f.write(data.text)

We now have the standings page for each year saved to its own file

Now we have to extract only the divisional standings
* We use the divisional standings tables because the other tables on the page are not easy to manipulate using Python

In [113]:
dfs = []

for year in years:
    with open('mvp/team/{}.html'.format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, 'html.parser') # creates parser class where we can extract the table from the html file
    soup.find('tr', class_="thead").decompose() # this is getting rid of the header on the table that we are trying to analyze
    team_table = soup.find(id='divs_standings_E')
    team = pd.read_html(str(team_table))[0]
    team = team[~team['W'].str.contains('Division')] # '~' negates the mask, dropping any row that contains the word 'Division'
    team['Year'] = year # add a column showing which season each team's record is from
    team['Team'] = team['Eastern Conference']
    del team['Eastern Conference'] # this is getting rid of the column for eastern conference so that all of the teams are listed in a single column
    dfs.append(team)

    # this is doing the same thing but for the western conference teams since the tables are broken up
    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr', class_="thead").decompose() # the parentheses are needed to actually call decompose
    team_table = soup.find(id='divs_standings_W')
    team = pd.read_html(str(team_table))[0]
    team = team[~team['W'].str.contains('Division')] # '~' negates the mask, dropping any row that contains the word 'Division'
    team['Year'] = year
    team['Team'] = team['Western Conference']
    del team['Western Conference']
    dfs.append(team)
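The Eastern and Western blocks above differ only in the table id and the conference column name, so the cleaning steps could be shared. Below is a minimal sketch of such a helper, tested on a made-up standings table. The function name `clean_standings` is hypothetical, and the sample rows are placeholders rather than real records.

```python
import pandas as pd


def clean_standings(table, conference_col, year):
    """Drop division label rows, rename the conference column to Team, add Year."""
    # Division label rows repeat the division name in every column, including W
    is_division_row = table["W"].astype(str).str.contains("Division")
    # .copy() gives an independent frame, so adding columns raises no warning
    cleaned = table[~is_division_row].copy()
    cleaned = cleaned.rename(columns={conference_col: "Team"})
    cleaned["Year"] = year
    return cleaned


# Made-up table shaped like the output of pd.read_html for one conference
raw = pd.DataFrame({
    "Eastern Conference": ["Atlantic Division", "Team A", "Team B"],
    "W": ["Atlantic Division", "56", "44"],
    "L": ["Atlantic Division", "26", "38"],
})

east = clean_standings(raw, "Eastern Conference", 1991)
print(east)
```

Inside the loop, the same function would be called once with `'Eastern Conference'` and once with `'Western Conference'`, each time on the dataframe read from the matching table.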


In [114]:
teams = pd.concat(dfs)
teams[:10]

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
0,56,26,0.683,—,111.5,105.7,5.22,1991,Boston Celtics*
1,44,38,0.537,12.0,105.4,105.6,-0.39,1991,Philadelphia 76ers*
2,39,43,0.476,17.0,103.1,103.3,-0.43,1991,New York Knicks*
3,30,52,0.366,26.0,101.4,106.4,-4.84,1991,Washington Bullets
4,26,56,0.317,30.0,102.9,107.5,-4.53,1991,New Jersey Nets
5,24,58,0.293,32.0,101.8,107.8,-5.91,1991,Miami Heat
7,61,21,0.744,—,110.0,101.0,8.57,1991,Chicago Bulls*
8,50,32,0.61,11.0,100.1,96.8,3.08,1991,Detroit Pistons*
9,48,34,0.585,13.0,106.4,104.0,2.33,1991,Milwaukee Bucks*
10,43,39,0.524,18.0,109.8,109.0,0.72,1991,Atlanta Hawks*


In [115]:
teams.to_csv('teams.csv')

We have now successfully downloaded MVP voting data, player stats and team standings

We have all of the data we need to build our machine learning model