To tackle the task, it was first necessary to collect all the data that might be relevant to the problem while remaining relatively easy to obtain. It was clear from the start that a single season’s data would not be enough, as such a sample would be too small to build effective predictive models on. At the same time, looking too far back could introduce significant bias, as basketball has changed over the years and so have the factors that most influence players’ desirability in the draft (e.g. three-point shooting is a fundamental skill in a basketball player’s toolkit nowadays, whereas 20 years ago it was not nearly as important). When collecting data, it was therefore necessary to balance the sample size against its “expiration date”.

As it turned out, a reasonable solution to this problem was forced upon the author of this project by data availability itself. Because the college basketball section of the Sports Reference portal provides an exhaustive dataset of all the players who played in NCAA Division I, reaching back many decades, it was selected as the sole data source for this project. However, advanced performance measures (such as rebound percentage or win shares per 40 minutes) were not collected before the 2010/11 season. Therefore, as of January 2021, there were 10 full seasons of complete data available, and it was decided to download data from all of them.

The Sports Reference page was not equipped with an API that would facilitate data collection, so a web crawler had to be developed to serve that function. It was built in Python using third-party libraries for HTTP requests, HTML parsing, and browser automation: requests, BeautifulSoup, and selenium. A headless browser (Chromium) was used to fetch content that is generated when the page loads. The data was downloaded in batches of 50 players at a time and saved in csv files to be merged later. Progress (i.e. the players and schools whose data had already been downloaded) was tracked in the checked_players.txt and checked_schools.txt text files.
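
The batching and progress tracking described above can be illustrated with a minimal, standard-library-only sketch. It is not the crawler itself: `process_in_batches`, `fetch`, and `flush` are hypothetical names introduced here for illustration, and the "download" is replaced by a trivial string operation.

```python
def process_in_batches(items, already_done, batch_size, fetch, flush):
    """Process items not yet done and flush results every `batch_size` items."""
    done = list(already_done)   # progress restored from an earlier run
    seen = set(done)
    buffer = []
    for item in items:
        if item in seen:
            continue            # skip work finished in an earlier run
        buffer.append(fetch(item))
        done.append(item)
        seen.add(item)
        # A batch is closed whenever the total number of processed items
        # reaches a multiple of the batch size.
        if len(done) % batch_size == 0:
            flush(buffer, len(done) // batch_size)
            buffer = []
    if buffer:
        flush(buffer, 'final')  # leftover rows that did not fill a batch
    return done


# Hypothetical stand-ins: upper-casing replaces the real download,
# and a dictionary replaces the csv files.
batches = {}
done = process_in_batches(
    items=['p1', 'p2', 'p3', 'p4', 'p5'],
    already_done=['p1'],
    batch_size=2,
    fetch=str.upper,
    flush=lambda rows, label: batches.update({label: list(rows)}),
)
```

Because batch boundaries depend on the total count of processed items, including those restored from an earlier run, the first batch after a restart may hold fewer rows than the batch size.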

The whole process was carried out with the following code.

In [1]:
import requests
import glob
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [2]:
def initialize_browser():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=chrome_options)
    return driver

In [3]:
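def get_contents(url, js=False, driver=None):
    # Render JavaScript-generated pages in the browser; fetch static pages directly.
    if js:
        driver.get(url)
        contents = driver.page_source
    else:
        r = requests.get(url)
        contents = r.content
    # Name the parser explicitly so results do not depend on which parsers are installed.
    return BeautifulSoup(contents, 'html.parser')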

In [4]:
def strip_name(name):
    if 'NCAA' in name:
        return name[:-4].strip()
    return name
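
The helper above removes the last four characters whenever "NCAA" appears anywhere in the name. As a hedged alternative, the sketch below strips the marker only when it is an actual suffix. The function name `strip_tournament_marker` and the sample names are hypothetical and were not taken from the scraped tables.

```python
def strip_tournament_marker(name, marker='NCAA'):
    """Remove a trailing tournament marker, leaving other names untouched."""
    name = name.strip()
    if name.endswith(marker):
        return name[:-len(marker)].strip()
    return name


# Hypothetical table entries used only for illustration.
raw_names = ['Duke NCAA', 'Yale', 'NCAA Prep']
cleaned = [strip_tournament_marker(n) for n in raw_names]
```

A name that merely contains the marker elsewhere (the third entry) is left unchanged, whereas the slice-based helper would cut off its last four characters.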

In [5]:
def find_rows(soup, _id):
    table = soup.find('table', {'id': _id}).find('tbody')
    rows = table.find_all('tr', {'class': None})
    return rows

In [6]:
def get_school_year(years_played):
    if years_played >= 4:
        return 'senior'
    elif years_played == 3:
        return 'junior'
    elif years_played == 2:
        return 'sophomore'
    return 'freshman'

In [7]:
def clear_dictionary(_dict):
    return {key: [] for key in _dict.keys()}

In [8]:
def read_checked(text_file):
    with open(text_file, 'r') as file:
        lines = file.readlines()
        return [line.strip() for line in lines]

In [9]:
def save_checked(item_list, text_file):
    with open(text_file, 'w') as file:
        for item in item_list:
            file.write(item + "\n")
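
A short round trip shows how the two progress helpers are meant to work together. The helpers are repeated so the snippet runs on its own, and the temporary directory and the `url_*` entries are placeholders rather than real progress data.

```python
import os
import tempfile


def read_checked(text_file):
    with open(text_file, 'r') as file:
        return [line.strip() for line in file]


def save_checked(item_list, text_file):
    with open(text_file, 'w') as file:
        for item in item_list:
            file.write(item + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checked_players.txt')
    save_checked(['url_a', 'url_b'], path)
    # Mode 'w' rewrites the whole file, so each checkpoint stores the full list.
    save_checked(['url_a', 'url_b', 'url_c'], path)
    restored = read_checked(path)
```

Note that `read_checked` raises `FileNotFoundError` if the file does not exist yet, so the two text files presumably have to be created (possibly empty) before the first run.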

In [10]:
def make_checkpoint(player_stats_dict, checked_players, checked_schools, filename='final_batch'):
    temp_df = pd.DataFrame(player_stats_dict)
    temp_df.to_csv(f'raw_data/players_data/{filename}.csv')
    save_checked(checked_players, text_file='web scraper files/checked_players.txt')
    save_checked(checked_schools, text_file='web scraper files/checked_schools.txt')
    return clear_dictionary(player_stats_dict)

In [11]:
START = 2011
END = 2021
BASE_URL = 'https://www.sports-reference.com/'
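
Since `BASE_URL` ends with a slash, concatenating it with an href that itself starts with a slash produces a double slash in the path. The sketch below compares plain concatenation with `urllib.parse.urljoin` from the standard library. The href is a hypothetical example, and it is only an assumption that the links on the site are root-relative.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.sports-reference.com/'

# Season pages are built from a relative path without a leading slash.
season = 2019
season_url = BASE_URL + f'cbb/seasons/{season}-school-stats.html'

# Hypothetical root-relative href, as it might appear in a table link.
href = '/cbb/schools/example-school/2019.html'
concatenated = BASE_URL + href        # keeps both slashes
joined = urljoin(BASE_URL, href)      # resolves to a single slash
```

If the crawler were switched to `urljoin`, URLs already stored in the progress files would no longer match the newly built ones, so such a change is best made before the first run.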

In [39]:
school_stats_dict = {'season': [], 'school_url': [], 'school_name': [], 'games_played': [],
                     'win_pct': [], 'ncaa_tournament': [], 'pace': [], 'eff_fg_pct': []}

for season in (range(START, END+1)):
    school_stats_url = f'cbb/seasons/{season}-school-stats.html'
    advanced_school_stats_url = f'cbb/seasons/{season}-advanced-school-stats.html'

    basic_school_stats_soup = get_contents(BASE_URL + school_stats_url)
    advanced_school_stats_soup = get_contents(BASE_URL + advanced_school_stats_url)

    basic_school_stats_rows = find_rows(basic_school_stats_soup, 'basic_school_stats')
    advanced_school_stats_rows = find_rows(advanced_school_stats_soup, 'adv_school_stats')

    for b_row, a_row in zip(basic_school_stats_rows, advanced_school_stats_rows):
        school_url = BASE_URL + b_row.find('a')['href']

        school_name = b_row.find('td', {'data-stat': 'school_name'}).text
        games_played = b_row.find('td', {'data-stat': 'g'}).text
        win_pct = b_row.find('td', {'data-stat': 'win_loss_pct'}).text or 0
        ncaa_tournament = ('NCAA' in school_name)

        pace = a_row.find('td', {'data-stat': 'pace'}).text or 0
        eff_fg_pct = a_row.find('td', {'data-stat': 'efg_pct'}).text or 0

        school_stats_dict['season'].append(season)
        school_stats_dict['school_url'].append(school_url)
        school_stats_dict['school_name'].append(strip_name(school_name))
        school_stats_dict['games_played'].append(int(games_played))
        school_stats_dict['win_pct'].append(float(win_pct))
        school_stats_dict['ncaa_tournament'].append(int(ncaa_tournament))
        school_stats_dict['pace'].append(float(pace))
        school_stats_dict['eff_fg_pct'].append(float(eff_fg_pct))

In [41]:
school_ratings_dict = {'season': [], 'school_name': [], 'srs_off': [], 'srs_def': [], 'sos': [],
                       'team_ppg': [], 'opp_ppg': [], 'off_rating': [], 'def_rating': []}

for season in (range(START, END+1)):
    school_ratings_url = f'cbb/seasons/{season}-ratings.html'
    school_ratings_soup = get_contents(BASE_URL + school_ratings_url)
    school_ratings_rows = find_rows(school_ratings_soup, 'ratings')

    for row in school_ratings_rows:
        school_name = row.find('td', {'data-stat': 'school_name'}).text
        srs_off = row.find('td', {'data-stat': 'srs_off'}).text or 0
        srs_def = row.find('td', {'data-stat': 'srs_def'}).text or 0
        sos = row.find('td', {'data-stat': 'sos'}).text or 0
        team_ppg = row.find('td', {'data-stat': 'pts_per_g'}).text or 0
        opp_ppg = row.find('td', {'data-stat': 'opp_pts_per_g'}).text or 0
        off_rating = row.find('td', {'data-stat': 'off_rtg'}).text or 0
        def_rating = row.find('td', {'data-stat': 'def_rtg'}).text or 0

        school_ratings_dict['season'].append(season)
        school_ratings_dict['school_name'].append(school_name)
        school_ratings_dict['srs_off'].append(float(srs_off))
        school_ratings_dict['srs_def'].append(float(srs_def))
        school_ratings_dict['sos'].append(float(sos))
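        # Convert to float like the other ratings, so the columns do not mix strings and numbers.
        school_ratings_dict['team_ppg'].append(float(team_ppg))
        school_ratings_dict['opp_ppg'].append(float(opp_ppg))
        school_ratings_dict['off_rating'].append(float(off_rating))
        school_ratings_dict['def_rating'].append(float(def_rating))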

In [42]:
school_poll_dict = {'season': [], 'school_name': [], 'rank': []}

for season in (range(START, END+1)):
    school_poll_url = f'cbb/seasons/{season}-polls.html'
    school_poll_soup = get_contents(BASE_URL + school_poll_url)
    school_poll_rows = find_rows(school_poll_soup, 'ap-polls')

    for row in school_poll_rows:
        school_name = row.find('th', {'data-stat': 'school'}).text
        rank = row.find_all('td')[-1].text

        school_poll_dict['season'].append(season)
        school_poll_dict['school_name'].append(school_name)
        school_poll_dict['rank'].append(rank)

In [43]:
school_stats_df = pd.DataFrame(school_stats_dict)
school_ratings_df = pd.DataFrame(school_ratings_dict)
school_polls_df = pd.DataFrame(school_poll_dict)
schools_df = pd.merge(school_stats_df, school_ratings_df, on=['season', 'school_name'])
schools_df = pd.merge(schools_df, school_polls_df, on=['season', 'school_name'], how='left')
schools_df.to_csv('raw_data/schools_data.csv')

In [12]:
schools_df = pd.read_csv('raw_data/schools_data.csv', index_col=0)

In [13]:
player_stats_dict = {'name': [], 'season': [], 'school_url': [], 'position': [], 'height': [], 'weight': [], 'school_year': [],
                     'games_played': [], 'games_started': [], 'games_won': [], 'mpg': [], 'fg_pct': [], 'fg3_pct': [],
                     'fg3a': [],'ft_pct': [], 'fta': [], 'off_rpg': [], 'def_rpg': [], 'apg': [], 'spg': [], 'bpg': [],
                     'tpg': [], 'fpg': [], 'ppg': [], 'per': [], 'ts_pct': [], 'eff_fg_pct': [], 'fg3a_rate': [],
                     'off_reb_pct': [], 'def_reb_pct': [], 'ast_pct': [], 'usg_pct': [], 'win_shares_40_min': [],
                     'plus_minus': [], 'max_points': [], 'max_assists': [], 'max_steals': [], 'max_blocks': [],
                     'max_rebounds': [], 'std_points': [], 'std_assists': [], 'std_steals': [], 'std_blocks': [],
                     'std_rebounds': [], 'draft': []}

In [14]:
driver = initialize_browser()

In [17]:
checked_players = read_checked(text_file='web scraper files/checked_players.txt')
checked_schools = read_checked(text_file='web scraper files/checked_schools.txt')

n_files = len(list(glob.iglob('raw_data/players_data/*.csv')))
n_iterations = 50

for season, school_url in zip(schools_df['season'], schools_df['school_url']):
    if school_url not in checked_schools:
        school_page = get_contents(school_url)

        try:
            roster_table = school_page.find('table', {'id': 'roster'}).find('tbody')
            players = roster_table.find_all('tr')
        except AttributeError:
            continue

        for player in players:
            player_url = BASE_URL + player.find('a')['href']

            if player_url not in checked_players:
                name = player.find('a').text
                position = player.find('td', {'data-stat': 'pos'}).text
                height = player.find('td', {'data-stat': 'height'}).text
                weight = player.find('td', {'data-stat': 'weight'}).text

                player_page = get_contents(player_url, js=True, driver=driver)

                per_game_rows = find_rows(player_page, 'players_per_game')

                years_played = len(per_game_rows)

                per_game_last_season = per_game_rows[-1]

                season = '20' + per_game_last_season.find('th', {'data-stat': 'season'}).find('a').text[-2:]
                try:
                    season_school_url = BASE_URL + per_game_last_season.find('td', {'data-stat': 'school_name'}).find('a')['href']
                except TypeError:
                    continue
                    
                try:
                    games_played = per_game_last_season.find('td', {'data-stat': 'g'}).text
                    games_started = per_game_last_season.find('td', {'data-stat': 'gs'}).text
                    mpg = per_game_last_season.find('td', {'data-stat': 'mp_per_g'}).text
                    fg_pct = per_game_last_season.find('td', {'data-stat': 'fg_pct'}).text
                    fg3_pct = per_game_last_season.find('td', {'data-stat': 'fg3_pct'}).text
                    fg3a = per_game_last_season.find('td', {'data-stat': 'fg3a_per_g'}).text
                    ft_pct = per_game_last_season.find('td', {'data-stat': 'ft_pct'}).text
                    fta = per_game_last_season.find('td', {'data-stat': 'fta_per_g'}).text
                    off_rpg = per_game_last_season.find('td', {'data-stat': 'orb_per_g'}).text
                    def_rpg = per_game_last_season.find('td', {'data-stat': 'drb_per_g'}).text
                    apg = per_game_last_season.find('td', {'data-stat': 'ast_per_g'}).text
                    spg = per_game_last_season.find('td', {'data-stat': 'stl_per_g'}).text
                    bpg = per_game_last_season.find('td', {'data-stat': 'blk_per_g'}).text
                    tpg = per_game_last_season.find('td', {'data-stat': 'tov_per_g'}).text
                    fpg = per_game_last_season.find('td', {'data-stat': 'pf_per_g'}).text
                    ppg = per_game_last_season.find('td', {'data-stat': 'pts_per_g'}).text
                except AttributeError:
                    continue
                    
                advanced_rows = find_rows(player_page, 'players_advanced')
                advanced_last_season = advanced_rows[-1]

                per = advanced_last_season.find('td', {'data-stat': 'per'}).text
                ts_pct = advanced_last_season.find('td', {'data-stat': 'ts_pct'}).text
                eff_fg_pct = advanced_last_season.find('td', {'data-stat': 'efg_pct'}).text
                fg3a_rate = advanced_last_season.find('td', {'data-stat': 'fg3a_per_fga_pct'}).text
                off_reb_pct = advanced_last_season.find('td', {'data-stat': 'orb_pct'}).text
                def_reb_pct = advanced_last_season.find('td', {'data-stat': 'drb_pct'}).text
                ast_pct = advanced_last_season.find('td', {'data-stat': 'ast_pct'}).text
                usg_pct = advanced_last_season.find('td', {'data-stat': 'usg_pct'}).text
                win_shares_40_min = advanced_last_season.find('td', {'data-stat': 'ws_per_40'}).text
                plus_minus = advanced_last_season.find('td', {'data-stat': 'bpm'}).text
                
                try:
                    all_links = player_page.find_all('a', href=True)
                    for link in all_links:
                        if link.text == 'Basketball-Reference.com':
                            external_url = link['href']
                            external_soup = get_contents(external_url)
                            external_info = external_soup.find('div', {'id': 'info'}).find_all('p')
                            for item in external_info:
                                if 'NBA Draft' in item.text:
                                    draft = int(item.text[-30:-28].strip())
                                    break
                            else:
                                draft = 'undrafted (NBA)'
                            break
                    else:
                        draft = 'undrafted'
                except AttributeError:
                    continue


                gamelog_url = player_url[:-5] + f'/gamelog/{season}/'
                gamelog_page = get_contents(gamelog_url, js=True, driver=driver)
                
                try:
                    gamelog_table = gamelog_page.find('table', {'id': 'gamelog'}).find('tbody')
                    games = gamelog_table.find_all('tr', {'class': None})
                except AttributeError:
                    continue 
                    
                points = []
                assists = []
                rebounds = []
                steals = []
                blocks = []
                games_won = 0 

                for game in games:
                    pts = game.find('td', {'data-stat': 'pts'}).text
                    ast = game.find('td', {'data-stat': 'ast'}).text
                    rebs = game.find('td', {'data-stat': 'trb'}).text
                    stl = game.find('td', {'data-stat': 'stl'}).text
                    blk = game.find('td', {'data-stat': 'blk'}).text
                    win = game.find('td', {'data-stat': 'game_result'}).text == 'W'

                    points.append(int(pts))
                    assists.append(int(ast))
                    rebounds.append(int(rebs))
                    steals.append(int(stl))
                    blocks.append(int(blk))
                    games_won += int(win)

                player_stats_dict['name'].append(name)
                player_stats_dict['season'].append(int(season))
                player_stats_dict['school_url'].append(season_school_url)
                player_stats_dict['position'].append(position)
                player_stats_dict['height'].append(height)
                player_stats_dict['weight'].append(weight)
                player_stats_dict['school_year'].append(get_school_year(years_played))
                player_stats_dict['games_played'].append(games_played)
                player_stats_dict['games_started'].append(games_started)
                player_stats_dict['games_won'].append(games_won)
                player_stats_dict['mpg'].append(mpg)
                player_stats_dict['fg_pct'].append(fg_pct)
                player_stats_dict['fg3_pct'].append(fg3_pct)
                player_stats_dict['fg3a'].append(fg3a)
                player_stats_dict['ft_pct'].append(ft_pct)
                player_stats_dict['fta'].append(fta)
                player_stats_dict['off_rpg'].append(off_rpg)
                player_stats_dict['def_rpg'].append(def_rpg)
                player_stats_dict['apg'].append(apg)
                player_stats_dict['spg'].append(spg)
                player_stats_dict['bpg'].append(bpg)
                player_stats_dict['tpg'].append(tpg)
                player_stats_dict['fpg'].append(fpg)
                player_stats_dict['ppg'].append(ppg)
                player_stats_dict['per'].append(per)
                player_stats_dict['ts_pct'].append(ts_pct)
                player_stats_dict['eff_fg_pct'].append(eff_fg_pct)
                player_stats_dict['fg3a_rate'].append(fg3a_rate)
                player_stats_dict['off_reb_pct'].append(off_reb_pct)
                player_stats_dict['def_reb_pct'].append(def_reb_pct)
                player_stats_dict['ast_pct'].append(ast_pct)
                player_stats_dict['usg_pct'].append(usg_pct)
                player_stats_dict['win_shares_40_min'].append(win_shares_40_min)
                player_stats_dict['plus_minus'].append(plus_minus)
                player_stats_dict['max_points'].append(max(points))
                player_stats_dict['max_assists'].append(max(assists))
                player_stats_dict['max_steals'].append(max(steals))
                player_stats_dict['max_blocks'].append(max(blocks))
                player_stats_dict['max_rebounds'].append(max(rebounds))
                player_stats_dict['std_points'].append(np.std(points))
                player_stats_dict['std_assists'].append(np.std(assists))
                player_stats_dict['std_steals'].append(np.std(steals))
                player_stats_dict['std_blocks'].append(np.std(blocks))
                player_stats_dict['std_rebounds'].append(np.std(rebounds))
                player_stats_dict['draft'].append(draft)
                
                checked_players.append(player_url)
                if len(checked_players) % n_iterations == 0:
                    # make_checkpoint appends the .csv extension itself
                    filename = f'batch_{len(checked_players) // n_iterations}'
                    player_stats_dict = make_checkpoint(player_stats_dict,
                                                        checked_players,
                                                        checked_schools,
                                                        filename=filename)
        
        checked_schools.append(school_url)
        
player_stats_dict = make_checkpoint(player_stats_dict, checked_players, checked_schools)
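
The draft lookup in the loop above reads the pick number from fixed character offsets (`item.text[-30:-28]`), which depends on the exact length of the draft line. A regular expression is one possible, less position-dependent alternative. The sketch below assumes the line contains a phrase such as "14th overall". The sample string, the `OVERALL_PICK` pattern, and the `parse_overall_pick` helper are hypothetical and were not checked against the live page.

```python
import re

# Matches an ordinal followed by "overall", e.g. "1st overall" or "14th overall".
OVERALL_PICK = re.compile(r'(\d+)(?:st|nd|rd|th) overall')


def parse_overall_pick(text):
    """Return the overall pick number, or None if the text has no such phrase."""
    match = OVERALL_PICK.search(text)
    return int(match.group(1)) if match else None


# Hypothetical draft line used only for illustration.
sample = 'Draft: Example Team, 1st round (14th pick, 14th overall), 2019 NBA Draft'
pick = parse_overall_pick(sample)
missing = parse_overall_pick('No draft information')
```

Returning `None` for unmatched text leaves it to the caller to decide how to label such players, for example with the 'undrafted (NBA)' value used in the loop.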