<h1>NBA MVP Prediction Model</h1>
Webscraping

![](logo.png)

## Webscraping for MVP rankings 
I will use requests and BeautifulSoup to get the MVP rankings table from basketball-reference.com.

In [4]:
years = list(range(1991,2023))

In [5]:
url_start = 'https://www.basketball-reference.com/awards/awards_{}.html'

In [6]:
import requests

for year in years:
    url = url_start.format(year)
    data = requests.get(url)
    
    #Save the response to a local folder.
    with open('mvp/{}.html'.format(year), 'w+') as f:
        f.write(data.text)
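
The loop above writes whatever the server returns, so an error page would be saved as if it were data, and every rerun downloads all years again. A minimal sketch of a more defensive version follows. The names `download_pages` and `fetch_with_requests` are hypothetical, and the 3-second pause is an assumption rather than a documented limit of the site.

```python
import time
from pathlib import Path


def fetch_with_requests(url):
    """Default fetcher: download a page and fail loudly on HTTP errors."""
    import requests  # imported lazily so the helper can be exercised offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises instead of saving an error page
    return response.text


def download_pages(years, url_template, out_dir, fetch=fetch_with_requests, delay=3.0):
    """Save one HTML file per year, skipping years that are already on disk."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        target = out_path / "{}.html".format(year)
        if target.exists():
            continue  # already saved on an earlier run
        target.write_text(fetch(url_template.format(year)), encoding="utf-8")
        downloaded.append(year)
        time.sleep(delay)  # pause between requests; the value is an assumption
    return downloaded
```

With the notebook's variables this would be called as `download_pages(years, url_start, 'mvp')`. Passing a different `fetch` function makes it possible to try the helper without touching the network.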

In [27]:
from bs4 import BeautifulSoup 

In [8]:
with open('mvp/1991.html') as f:
    page = f.read()

In [9]:
soup = BeautifulSoup(page, 'html.parser')

In [10]:
#Remove the extra header row on top of the table.
soup.find('tr',  class_='over_header').decompose()

In [11]:
mvp_table = soup.find(id = 'mvp')

In [12]:
import pandas as pd

In [13]:
mvp_1991 = pd.read_html(str(mvp_table))[0]

In [14]:
dfs = []

for year in years:
    with open('mvp/{}.html'.format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr',  class_='over_header').decompose()
    mvp_table = soup.find(id = 'mvp')
    mvp = pd.read_html(str(mvp_table))[0]
    mvp['Year'] = year
    dfs.append(mvp)
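
The per-file steps in the loop above can also be wrapped in one function, which makes them easy to check on a small input. This is a sketch, not the notebook's original code: `parse_mvp_page` is a hypothetical name, and the sample page is made up to mimic the shape of the saved files. It wraps the HTML in `StringIO` because newer pandas versions warn when `read_html` is given a raw HTML string.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def parse_mvp_page(page, year):
    """Return the MVP voting table from one saved page as a DataFrame."""
    soup = BeautifulSoup(page, "html.parser")
    over_header = soup.find("tr", class_="over_header")
    if over_header is not None:
        over_header.decompose()  # drop the grouping row so columns stay single-level
    mvp_table = soup.find(id="mvp")
    # Pass a file-like object rather than a raw HTML string.
    mvp = pd.read_html(StringIO(str(mvp_table)))[0]
    mvp["Year"] = year
    return mvp


# Tiny made-up page with the same shape as the saved files (not real voting data).
sample_page = """
<table id="mvp">
  <thead>
    <tr class="over_header"><th></th><th colspan="2">Voting</th></tr>
    <tr><th>Rank</th><th>Player</th><th>Share</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>0.9</td></tr>
    <tr><td>2</td><td>Player B</td><td>0.5</td></tr>
  </tbody>
</table>
"""

sample = parse_mvp_page(sample_page, 1991)
print(list(sample.columns))
```

Without the `decompose()` call, pandas would see two header rows and build two-level column names, which is why the extra row is removed first.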

In [15]:
mvps = pd.concat(dfs)

In [16]:
mvps.to_csv('mvps.csv')
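
One thing to watch when saving: `pd.concat` keeps each year's own row index, and `to_csv` writes that index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the effect on made-up frames and one way to avoid it. The player names are placeholders, and an in-memory buffer stands in for the CSV file.

```python
from io import StringIO

import pandas as pd

# Two made-up yearly frames standing in for the scraped tables.
frames = [
    pd.DataFrame({"Player": ["Player A", "Player B"], "Year": 1991}),
    pd.DataFrame({"Player": ["Player C"], "Year": 1992}),
]

# Default behaviour: the per-year index is kept and written to the CSV.
combined = pd.concat(frames)
with_index = StringIO()
combined.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Alternative: renumber rows on concat and leave the index out of the file.
clean = pd.concat(frames, ignore_index=True)
without_index = StringIO()
clean.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Whether to drop the index is a design choice. Keeping it preserves each player's position within a year, but it has to be handled when the file is loaded for modelling.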

In [23]:
player_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

url = player_stats_url.format(1991)
data = requests.get(url)
with open('players/1991.html', 'w+') as f:
    f.write(data.text)

## Webscraping for All Player Stats 

For this part of the scraping I installed Selenium, because Basketball Reference renders parts of the page with JavaScript, and those parts do not show up completely in tables scraped with requests alone.

In [3]:
from selenium import webdriver

In [31]:
driver = webdriver.Chrome(executable_path='/Users/macuser/Downloads/chromedriver_mac64/chromedriver')


In [24]:
import time

year = 1991
url = player_stats_url.format(year)

driver.get(url)
driver.execute_script('window.scrollTo(1,1000)')
time.sleep(2)

html = driver.page_source 

In [27]:
with open('players/{}.html'.format(year), 'w+') as f:
    f.write(html)

In [32]:
for year in years:
    url = player_stats_url.format(year)

    driver.get(url)
    driver.execute_script('window.scrollTo(1,1000)')
    time.sleep(2)

    html = driver.page_source
    with open('players/{}.html'.format(year), 'w+') as f:
        f.write(html)

In [38]:
dfs = []
for year in years:
    
    with open('players/{}.html'.format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr',  class_='thead').decompose()
    player_table = soup.find(id = 'per_game_stats')
    player = pd.read_html(str(player_table))[0]
    player['Year'] = year
    dfs.append(player)

In [44]:
players= pd.concat(dfs)

In [45]:
players.tail()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
836,601,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,...,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3,2022
837,602,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,...,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4,2022
838,603,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,...,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3,2022
839,604,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,...,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2,2022
840,605,Ivica Zubac,C,24,LAC,76,76,24.4,4.1,6.5,...,2.9,5.6,8.5,1.6,0.5,1.0,1.5,2.7,10.3,2022


In [46]:
players.to_csv('players.csv')

## Team Record Per Year Scraping

In [1]:
team_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_standings.html'

In [18]:
for year in years:
    url = team_stats_url.format(year)
    data = requests.get(url)

    with open('team/{}.html'.format(year), 'w+') as f:
        f.write(data.text)

Each standings page has two tables, one for the Eastern Conference and one for the Western Conference, which need to be scraped and combined.

In [42]:
dfs = []
for year in years:
    with open("team/{}.html".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr', class_="thead").decompose()
    e_table = soup.find_all(id="divs_standings_E")[0]
    e_df = pd.read_html(str(e_table))[0]
    e_df["Year"] = year
    e_df["Team"] = e_df["Eastern Conference"]
    del e_df["Eastern Conference"]
    dfs.append(e_df)
    
    w_table = soup.find_all(id="divs_standings_W")[0]
    w_df = pd.read_html(str(w_table))[0]
    w_df["Year"] = year
    w_df["Team"] = w_df["Western Conference"]
    del w_df["Western Conference"]
    dfs.append(w_df)
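
The Eastern and Western blocks in the loop above differ only in the name of the column that holds the team, so the renaming step can be shared. A minimal sketch, assuming a hypothetical helper `normalize_conference`. The frames here are stand-ins built by hand, and the Western row is made up.

```python
import pandas as pd


def normalize_conference(df, year):
    """Rename the conference-named column to 'Team' and tag rows with the season."""
    out = df.copy()
    for conference_col in ("Eastern Conference", "Western Conference"):
        if conference_col in out.columns:
            out = out.rename(columns={conference_col: "Team"})
    out["Year"] = year
    return out


# Stand-in frames; in the notebook these would come from pd.read_html.
east = pd.DataFrame({"Eastern Conference": ["Boston Celtics*"], "W": [56], "L": [26]})
west = pd.DataFrame({"Western Conference": ["Example Team"], "W": [50], "L": [32]})  # made-up row

teams_sample = pd.concat(
    [normalize_conference(east, 1991), normalize_conference(west, 1991)],
    ignore_index=True,
)
print(teams_sample)
```

Inside the loop, the same helper could be applied to `e_df` and `w_df` before appending them to `dfs`. Unlike the original cells, it leaves `Team` as the first column rather than the last.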

In [43]:
teams = pd.concat(dfs)

In [45]:
teams.head()

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
0,56,26,0.683,—,111.5,105.7,5.22,1991,Boston Celtics*
1,44,38,0.537,12.0,105.4,105.6,-0.39,1991,Philadelphia 76ers*
2,39,43,0.476,17.0,103.1,103.3,-0.43,1991,New York Knicks*
3,30,52,0.366,26.0,101.4,106.4,-4.84,1991,Washington Bullets
4,26,56,0.317,30.0,102.9,107.5,-4.53,1991,New Jersey Nets
