<h1>Scraping the Data</h1>

In [1]:
import numpy as np
import pandas as pd
import urllib.request as urllib
from bs4 import BeautifulSoup, Comment
from selenium import webdriver
from datetime import datetime
import time
import random
import os
import string

The first step in building the award models is to run this notebook, which scrapes all of the data used later. The data comes from basketball-reference.com and is collected with Selenium through geckodriver, which automatically opens a Firefox window. The driver visits each player profile on the site to collect player season data, and each award voting page to collect award data. In total, the notebook runs for a few hours before all of the data has been scraped.

First, set the path to your geckodriver executable. I put my geckodriver file in the same directory as this project.

In [2]:
PATH = os.path.join(os.path.abspath(os.getcwd()), 'geckodriver')
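
Since every scraping cell depends on this path, it can help to confirm that the file exists before launching the browser. The following is a minimal sketch and not part of the original notebook: the helper name `find_driver` is hypothetical, and it assumes the driver file is called `geckodriver` and sits in the directory you pass in.

```python
import os

def find_driver(directory, name="geckodriver"):
    """Return the absolute path to the driver file, or raise if it is missing."""
    path = os.path.join(os.path.abspath(directory), name)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No driver found at {path}")
    return path
```

With this helper, the path could be set with `PATH = find_driver(os.getcwd())`, which fails early with a readable message instead of failing when Firefox is started.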

I will create two dataframes that will be used to store all of the data that I am scraping.

In [3]:
# empty dataframes that define the columns used for player season totals and award voting
player_seasons = pd.DataFrame(columns=['player', 'season', 'age', 'team', 'position', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct',
                                       'three_p', 'three_pa', 'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft',
                                       'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov','pf', 'pts', 'trp_dbl'])
player_seasons.set_index(['player', 'season'], inplace = True)

award_data = pd.DataFrame(columns=['player', 'season', 'award', 'first_place_votes', 'award_pts_won', 'award_pts_max'])
award_data.set_index(['player', 'season'], inplace = True)

<h2>Scraping Award Data</h2>
This code scrapes the award voting data for each year. After filling in the award_data dataframe, we save it to a CSV file.

In [60]:
# get url of award voting results for a given year
def get_award_url(year):
    return f"https://www.basketball-reference.com/awards/awards_{year}.html"

# extract award data from a BeautifulSoup object and append it to award_rows_list
# (relies on the notebook-level variables `year` and `award_rows_list` set in the loop below)
def scrape_award_data(award_name, soup):
    # get the award voting table; the id is either the award name or the name prefixed with "nba_"
    awardTable = soup.find("table", {"id": award_name})
    if awardTable is None:
        awardTable = soup.find("table", {"id": f"nba_{award_name}"})
    if awardTable is None:
        # no voting table for this award on this page, so there is nothing to add
        return

    awardRows = awardTable.find("tbody").find_all("tr")
    print(f"Got rows of {award_name} players from table, starting to iterate through rows")

    # iterate through votes on page, appending one dict per player to award_rows_list
    for row in awardRows:
        # only rows without a class attribute are treated as player rows
        if row.get('class') is None:
            player_name = row.find("td", {"data-stat":"player"}).find("a").get_text()
            first_place_votes = row.find("td", {"data-stat":"votes_first"}).get_text()
            award_pts_won = row.find("td", {"data-stat":"points_won"}).get_text()
            award_pts_max = row.find("td", {"data-stat":"points_max"}).get_text()
            award_rows_list.append({'player': player_name, 'season': year, 'award': award_name,
                                    'first_place_votes': first_place_votes, 'award_pts_won': award_pts_won,
                                    'award_pts_max': award_pts_max})

                
# newer Selenium releases take the driver path through a Service object instead of executable_path
browser = webdriver.Firefox(executable_path = PATH)

award_rows_list = []
# award voting data is available on bbref from 1956
# (range() stops before the current calendar year)
years = range(1956, datetime.now().year)

for year in years:
    browser.get(get_award_url(year))
    # give the page time to load before reading its source
    time.sleep(3)
    source = browser.page_source
    soup = BeautifulSoup(source, 'html.parser')
    print(f"Year: {year}")
    
    scrape_award_data('mvp', soup)
    scrape_award_data('roy', soup)
    scrape_award_data('dpoy', soup)
    scrape_award_data('smoy', soup)
    scrape_award_data('mip', soup)
   
    # short random pause (0 or 1 second) between requests
    time.sleep(random.randint(0,1))

# quit() closes the window and also ends the driver session
browser.quit()
award_data = pd.DataFrame(award_rows_list, columns=['player', 'season', 'award', 'first_place_votes', 'award_pts_won', 'award_pts_max'])
award_data.set_index(['player', 'season'], inplace = True)

Year: 1956
Got rows of mvp players from table, starting to iterate through rows
Year: 1957
Got rows of mvp players from table, starting to iterate through rows
Year: 1958
Got rows of mvp players from table, starting to iterate through rows
Year: 1959
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Year: 1960
Got rows of mvp players from table, starting to iterate through rows
Year: 1961
Got rows of mvp players from table, starting to iterate through rows
Year: 1962
Got rows of mvp players from table, starting to iterate through rows
Year: 1963
Got rows of mvp players from table, starting to iterate through rows
Year: 1964
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Year: 1965
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
...

Year: 1997
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Got rows of dpoy players from table, starting to iterate through rows
Got rows of smoy players from table, starting to iterate through rows
Got rows of mip players from table, starting to iterate through rows
Year: 1998
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Got rows of dpoy players from table, starting to iterate through rows
Got rows of smoy players from table, starting to iterate through rows
Got rows of mip players from table, starting to iterate through rows
Year: 1999
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Got rows of dpoy players from table, starting to iterate through rows
Got rows of smoy players from table, starting to iterate through rows
...

Year: 2020
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Got rows of dpoy players from table, starting to iterate through rows
Got rows of smoy players from table, starting to iterate through rows
Got rows of mip players from table, starting to iterate through rows


In [62]:
# make sure the output directory exists before writing
os.makedirs('data', exist_ok=True)
award_data.to_csv('data/award_data.csv')
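
As a variation on the award scraping cell, the row extraction can be written so that it returns rows instead of appending to a notebook-level list, which makes it possible to try it on a small HTML snippet without opening a browser. This is a sketch under assumptions: the names `parse_award_rows` and `cell_text` and the sample markup are made up for illustration, while the `data-stat` attribute names are the ones used in the scraping code.

```python
from bs4 import BeautifulSoup

def cell_text(row, stat):
    """Return the text of the <td> in `row` whose data-stat equals `stat`."""
    return row.find("td", {"data-stat": stat}).get_text()

def parse_award_rows(html, award_name, year):
    """Return one dict per player row in the award table, or [] if the table is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": award_name})
    if table is None:
        table = soup.find("table", {"id": f"nba_{award_name}"})
    if table is None:
        return []

    rows = []
    for tr in table.find("tbody").find_all("tr"):
        # skip rows that carry a class attribute, as the scraping code does
        if tr.get("class") is not None:
            continue
        rows.append({
            "player": tr.find("td", {"data-stat": "player"}).find("a").get_text(),
            "season": year,
            "award": award_name,
            "first_place_votes": cell_text(tr, "votes_first"),
            "award_pts_won": cell_text(tr, "points_won"),
            "award_pts_max": cell_text(tr, "points_max"),
        })
    return rows

# Made-up sample markup, for illustration only
sample = """
<table id="mvp"><tbody>
<tr><td data-stat="player"><a href="#">Player A</a></td>
    <td data-stat="votes_first">10</td>
    <td data-stat="points_won">100.0</td>
    <td data-stat="points_max">200</td></tr>
<tr class="thead"><td>header</td></tr>
</tbody></table>
"""
rows = parse_award_rows(sample, "mvp", 2000)
```

Passing `year` as an argument, rather than reading it from the enclosing loop, means the function gives the same result no matter where it is called from.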

<h2>Scraping Player Data</h2>
This code scrapes the statistics each player recorded in each season. It loops through the player directory on basketball-reference.com and scrapes data from each player's profile page. After filling in the player_seasons dataframe, we save it to a CSV file. WARNING: This takes a few hours to run, because there are thousands of players to scrape and the script waits 0-1 seconds before visiting each player's profile page (to avoid overwhelming basketball-reference.com).
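
The pause between requests can be sketched as a small helper. Note that `random.randint(0, 1)` returns either 0 or 1, so some requests are sent with no wait at all; `random.uniform` is one alternative that always waits at least a chosen minimum. The helper name `polite_sleep` and its default bounds are assumptions for illustration and are not part of the original notebook.

```python
import random
import time

def polite_sleep(min_seconds=0.5, max_seconds=1.5, sleep=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds and return it.

    The `sleep` argument exists so the function can be exercised without real waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` before each `browser.get(...)` would spread the requests out slightly more evenly, at the cost of a longer total run time.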

In [None]:
# scrape all of the player basic stats data for each season

# get url of all players whose last name starts with the given letter
def get_letter_url(letter):
    return f"https://www.basketball-reference.com/players/{letter}/"


browser = webdriver.Firefox(executable_path = PATH)
player_season_list = []

for letter in string.ascii_lowercase[::-1]:
    print(f"Letter: {letter}")
    time.sleep(random.randint(0,1))
    html = urllib.urlopen(get_letter_url(letter))
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(html.read(), 'html.parser')
    html.close()
    
    #get rows of players from table
    playerTable = soup.find("table", {"id":"players"})
    playerRows = playerTable.find("tbody").find_all("tr")
    
    #iterate through players on page, filling data into player_season_list
    for row in playerRows:
        if row.get('class') is None:
            player_anchor = row.find("th", {"data-stat":"player"}).find("a")
            player_name = player_anchor.get_text()
            player_link = player_anchor['href']
            full_player_link = f"https://www.basketball-reference.com{player_link}"
            print(player_name)
            
            time.sleep(random.randint(0,1))
            browser.get(full_player_link)
            source = browser.page_source
            player_soup = BeautifulSoup(source, 'html.parser')
            
            totalsTable = player_soup.find("table", {"id":"totals"})
            if totalsTable is None:
                # no totals table on this profile page, so there is nothing to scrape
                continue
            totalsRows = totalsTable.find("tbody").find_all("tr")
            time.sleep(random.randint(0,1))
            
            prev_yr = 0
            for row_t in totalsRows:
                league_soup = row_t.find("td", {"data-stat":"lg_id"})
                if league_soup is not None and league_soup.find("a") is not None:
                    league = league_soup.find("a").get_text()
                else:
                    league = 'N/A'
                if league == "NBA":
                    season_str = row_t.find("th", {"data-stat":"season"}).find("a").get_text()[0:4]
                    year = int(season_str) + 1
                    if year == prev_yr:  
                        team = row_t.find("td", {"data-stat":"team_id"}).find("a").get_text() + " "
                        update_team = player_season_list[-1]
                        update_team['team'] = update_team['team'] + team
                        player_season_list[-1] = update_team
                    else:
                        team_soup = row_t.find("td", {"data-stat":"team_id"})
                        if team_soup.find("a") is not None:
                            team = row_t.find("td", {"data-stat":"team_id"}).find("a").get_text() + " "
                        else:
                            team = ""

                        age = row_t.find("td", {"data-stat":"age"}).get_text()
                        position = row_t.find("td", {"data-stat":"pos"}).get_text()
                        g = row_t.find("td", {"data-stat":"g"}).get_text()
                        gs = row_t.find("td", {"data-stat":"gs"}).get_text()
                        mp = row_t.find("td", {"data-stat":"mp"}).get_text()
                        fg = row_t.find("td", {"data-stat":"fg"}).get_text()
                        fga = row_t.find("td", {"data-stat":"fga"}).get_text()
                        fg_pct = row_t.find("td", {"data-stat":"fg_pct"}).get_text()
                        if row_t.find("td", {"data-stat":"fg3"}) is not None:
                            three_p = row_t.find("td", {"data-stat":"fg3"}).get_text()
                        else:
                            three_p = 0
                        if row_t.find("td", {"data-stat":"fg3a"}) is not None:
                            three_pa = row_t.find("td", {"data-stat":"fg3a"}).get_text()
                        else:
                            three_pa = 0
                        if row_t.find("td", {"data-stat":"fg3_pct"}) is not None:
                            three_pct = row_t.find("td", {"data-stat":"fg3_pct"}).get_text()
                        else:
                            three_pct = 0
                        if row_t.find("td", {"data-stat":"fg2"}) is not None:
                            two_p = row_t.find("td", {"data-stat":"fg2"}).get_text()
                        else:
                            two_p = fg
                        if row_t.find("td", {"data-stat":"fg2a"}) is not None:
                            two_pa = row_t.find("td", {"data-stat":"fg2a"}).get_text()
                        else:
                            two_pa = fga
                        if row_t.find("td", {"data-stat":"fg2_pct"}) is not None:
                            two_pct = row_t.find("td", {"data-stat":"fg2_pct"}).get_text()
                        else:
                            two_pct = fg_pct
                        if row_t.find("td", {"data-stat":"efg_pct"}) is not None:
                            efg = row_t.find("td", {"data-stat":"efg_pct"}).get_text()
                        else:
                            efg = fg_pct
                        ft = row_t.find("td", {"data-stat":"ft"}).get_text()
                        fta = row_t.find("td", {"data-stat":"fta"}).get_text()
                        ft_pct = row_t.find("td", {"data-stat":"ft_pct"}).get_text()
                        if row_t.find("td", {"data-stat":"orb"}) is not None:
                            orb = row_t.find("td", {"data-stat":"orb"}).get_text()
                        else:
                            orb = ''
                        if row_t.find("td", {"data-stat":"drb"}) is not None:
                            drb = row_t.find("td", {"data-stat":"drb"}).get_text()
                        else:
                            drb = ''
                        trb = row_t.find("td", {"data-stat":"trb"}).get_text()
                        ast = row_t.find("td", {"data-stat":"ast"}).get_text()
                        if row_t.find("td", {"data-stat":"stl"}) is not None:
                            stl = row_t.find("td", {"data-stat":"stl"}).get_text()
                        else:
                            stl = ''
                        if row_t.find("td", {"data-stat":"blk"}) is not None:
                            blk = row_t.find("td", {"data-stat":"blk"}).get_text()
                        else:
                            blk = ''
                        if row_t.find("td", {"data-stat":"tov"}) is not None:
                            tov = row_t.find("td", {"data-stat":"tov"}).get_text()
                        else:
                            tov = ''
                        pf = row_t.find("td", {"data-stat":"pf"}).get_text()
                        pts = row_t.find("td", {"data-stat":"pts"}).get_text()
                        trp_dbl_soup = row_t.find("td", {"data-stat":"trp_dbl"})
                        if trp_dbl_soup is None:
                            trp_dbl = ''
                        else:
                            trp_dbl = trp_dbl_soup.get_text()

                        player_season_list.append({'player': player_name, 'season': year, 'age': age, 'team': team.strip(),
                                                   'position': position, 'g': g, 'gs': gs, 'mp': mp, 'fg': fg, 'fga': fga,
                                                   'fg_pct': fg_pct, 'three_p': three_p, 'three_pa': three_pa,
                                                   'three_pct': three_pct, 'two_p': two_p, 'two_pa': two_pa, 'two_pct': two_pct,
                                                   'efg': efg, 'ft': ft, 'fta': fta, 'ft_pct': ft_pct, 'orb': orb, 'drb': drb,
                                                   'trb': trb, 'ast': ast, 'stl': stl, 'blk': blk, 'tov': tov,'pf': pf,
                                                   'pts': pts, 'trp_dbl': trp_dbl})
                    prev_yr = year


browser.quit()

player_seasons = pd.DataFrame(player_season_list, columns=['player', 'season', 'age', 'team', 'position', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct',
                                       'three_p', 'three_pa', 'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft',
                                       'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov','pf', 'pts', 'trp_dbl'])
player_seasons.set_index(['player', 'season'], inplace = True)
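
The season and team handling inside the loop can be isolated to make the rule explicit: a season label such as `1996-97` is stored under its ending year, and when a player has several rows for the same season the team abbreviations are merged into one space-separated string. The sketch below works on plain tuples rather than parsed HTML, and it strips the trailing space that the loop leaves on merged team strings; the function name `merge_team_rows` and the sample rows are hypothetical.

```python
def merge_team_rows(rows):
    """Collapse (season_label, team) pairs into one record per season.

    season_label looks like "1996-97"; team is "" when the row has no team link.
    """
    seasons = []
    prev_year = 0
    for season_label, team in rows:
        # the first four characters are the starting year, so add 1 for the ending year
        year = int(season_label[:4]) + 1
        if year == prev_year:
            # same season as the previous row: append this team to that record
            seasons[-1]["team"] = (seasons[-1]["team"] + " " + team).strip()
        else:
            seasons.append({"season": year, "team": team})
        prev_year = year
    return seasons

# Made-up rows for illustration: one single-team season, then a season split across two teams
sample = [("1996-97", "BOS"), ("1997-98", ""), ("1997-98", "BOS"), ("1997-98", "DEN")]
merged = merge_team_rows(sample)
```

Here `merged` holds two records, one for 1997 with team `BOS` and one for 1998 with team `BOS DEN`.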

In [None]:
player_seasons

In [None]:
player_seasons.to_csv('data/player_seasons.csv')
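
Looking back at the player scraping cell, most of its length comes from repeating the same "find the cell, fall back if it is missing" check for each statistic. One way to express that pattern once is a small helper, sketched here on a made-up table row. The name `stat_text` is hypothetical, and the defaults mirror the ones in the cell (0 for missing three-point columns, the field goal value for missing two-point columns).

```python
from bs4 import BeautifulSoup

def stat_text(row, stat, default=""):
    """Return the text of the <td> whose data-stat equals `stat`, or `default` if it is absent."""
    cell = row.find("td", {"data-stat": stat})
    return default if cell is None else cell.get_text()

# Made-up row for illustration: it has field goals but no three-point or two-point columns
sample = '<table><tbody><tr><td data-stat="fg">310</td><td data-stat="fga">700</td></tr></tbody></table>'
row = BeautifulSoup(sample, "html.parser").find("tr")

fg = stat_text(row, "fg")
three_p = stat_text(row, "fg3", default=0)
two_p = stat_text(row, "fg2", default=fg)
```

With a helper like this, each statistic in the loop would take one line instead of four, and the fallback for every column is visible at the call site.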