The aim of this notebook is to scrape data from Basketball Reference (NBA), RealGM (NCAA and Draft) and Wikipedia (All-Star selections). All three are comprehensive and useful sources of basketball data.

The first one is scraped with basketball_reference_web_scraper, a very complete scraping library with many more utilities than the ones I have used in this project. I suggest checking it out on your own.

During the process, all paths and directories are created automatically and divided into two main folders: original_data (data obtained from this scraping task) and derived_data (data obtained while processing and combining the original data).

Because each website limits the number of requests, you may need to relaunch the process from time to time if your query is too big. In that case, I suggest checking the last year you were scraping before the warning appeared and running it again from that year.
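
One way to make relaunching less manual is to record the last year that finished and skip completed years on the next run. The sketch below is only an illustration: `run_with_resume`, `scrape_year` and the `progress.txt` file are hypothetical names that do not appear in this notebook, the years are assumed to be passed in ascending order, and the pause between years is an assumed courtesy delay, not a documented limit of any of these sites.

```python
import os
import time


def run_with_resume(years, scrape_year, progress_path='progress.txt', pause_seconds=0.0):
    """Call scrape_year(year) for each year, skipping years finished in a previous run.

    Hypothetical helper: assumes `years` is in ascending order.
    """
    # Read the last year that completed successfully, if a previous run left one.
    last_done = None
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            content = f.read().strip()
        if content:
            last_done = int(content)

    completed = []
    for year in years:
        if last_done is not None and year <= last_done:
            continue  # already scraped in an earlier run
        scrape_year(year)  # may raise if the site starts rejecting requests
        # Only record the year once it has finished without errors.
        with open(progress_path, 'w') as f:
            f.write(str(year))
        completed.append(year)
        time.sleep(pause_seconds)
    return completed
```

If `scrape_year` raises halfway through, calling `run_with_resume` again with the same arguments continues from the first year that was not recorded as finished.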

Note: in this project, I label a season by the year in which its NBA playoffs are played. E.g., a season that starts in 2010 and ends in 2011 is named season_2011.
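
The naming convention can be written as a small helper. This is a sketch under an assumption that is not stated in the notebook: a season is taken to start in October, so `first_month_of_season=10` is a guess that fits a typical schedule but not unusual ones (the 2019-20 season, for instance, finished in October 2020). `season_end_year` and `season_folder` are hypothetical names.

```python
import datetime


def season_end_year(date, first_month_of_season=10):
    """Return the year used to label the season a game date belongs to.

    Assumption: games played from `first_month_of_season` onwards belong
    to the season that ends in the following calendar year.
    """
    if date.month >= first_month_of_season:
        return date.year + 1
    return date.year


def season_folder(date):
    """Build the folder name used in this project, e.g. 'season_2011'."""
    return 'season_{}'.format(season_end_year(date))


print(season_folder(datetime.date(2010, 11, 5)))   # a game early in the 2010-11 season
print(season_folder(datetime.date(2011, 4, 20)))   # a playoff game of the same season
```

Both calls map to the same folder, `season_2011`.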

#### NBA data

In [1]:
import pandas as pd
import os
import datetime 

from basketball_reference_web_scraper import client
from basketball_reference_web_scraper.data import OutputType
from basketball_reference_web_scraper.data import Team

# We define a range of dates to take boxscore data from. In this case, I am interested in this particular range.
date_range = pd.date_range(datetime.date(2000,1,1),datetime.date(2021,3,11),freq='d')

original_data_folder = 'original_data/'
if not os.path.exists(original_data_folder):
    os.mkdir(original_data_folder)

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)

In [7]:
# First, we create all the subdirectories inside original_data/.
# os.makedirs creates intermediate folders and, with exist_ok=True, does not fail on a relaunch.
for season in range(1985, 2020):
    os.makedirs(original_data_folder + 'season_{}/box_scores/'.format(season), exist_ok=True)
    for date in date_range:
        if date.year in [season, season-1]:
            os.makedirs(original_data_folder + 'season_{}/box_scores/{}_{}_{}/'.format(season, date.day, date.month, date.year), exist_ok=True)

# For each season, we scrape the totals stats and advanced stats, as well as the boxscores for the dates we defined above.
for season in range(1985, 2020):
    print('Downloading ', season)
    client.players_season_totals(
        season_end_year=season, 
        output_type=OutputType.CSV, 
        output_file_path=original_data_folder + "season_{}/{}_{}_player_season_totals.csv".format(season, season-1, season)
    )
    client.players_advanced_season_totals(
        season_end_year=season,
        output_type=OutputType.CSV,
        output_file_path=original_data_folder + "season_{}/{}_{}_advanced_player_season_totals.csv".format(season, season-1, season)
    )

    for date in date_range:
        if date.year in [season, season-1]:
            client.player_box_scores(
                day=date.day, month=date.month, year=date.year, 
                output_type=OutputType.CSV, 
                output_file_path=original_data_folder + "season_{}/box_scores/{}_{}_{}/box_scores.csv".format(season, date.day, date.month, date.year)
            )



Downloading  2004
Downloading  2005
Downloading  2006
Downloading  2007
Downloading  2008
Downloading  2009
Downloading  2010
Downloading  2011
Downloading  2012
Downloading  2013
Downloading  2014
Downloading  2015
Downloading  2016
Downloading  2017
Downloading  2018
Downloading  2019


#### NCAA data

In [22]:
from bs4 import BeautifulSoup
from requests import get
from csv import writer

# Create the output folder inside original_data/; the loading code further down reads from 'ncaa/'.
ncaa_folder = original_data_folder + 'ncaa/'
os.makedirs(ncaa_folder, exist_ok=True)

# We define a base URL that we complete iteratively for every query. In this case we scrape pages 1 to 19
# of 'Averages', 'Totals', 'Per_48', 'Misc_Stats' and 'Advanced_Stats', ordered by 'minutes', 'minutes', 'minutes', 'high_game' 
# and 'usg_pct' respectively, from 2003 to 2021.
base_url = 'https://basketball.realgm.com/ncaa/stats/'

for year in range(2003, 2022):
    print('Downloading {}'.format(year))
    for stat_type, ordering_by in zip(['Averages', 'Totals', 'Per_48', 'Misc_Stats', 'Advanced_Stats'], 
                                      ['minutes', 'minutes', 'minutes', 'high_game', 'usg_pct']):
        for page in range(1,20):

            url = base_url + '{}/{}/All/All/Season/All/{}/desc/{}/'.format(year, stat_type, ordering_by, page)
            r = get(url)
            soup = BeautifulSoup(r.text, 'lxml')

            # get all tables
            tables = soup.find_all('table')

            # loop over each table
            for num, table in enumerate(tables, start=1):

                # create filename (it does not depend on num, so if a page holds several
                # tables, the last one overwrites the previous ones)
                filename = ncaa_folder + '{}_{}_{}.csv'.format(year, stat_type, page)

                # open file for writing; newline='' is what the csv module recommends
                with open(filename, 'w', newline='', encoding='utf-8') as f:

                    # create csv writer object
                    csv_writer = writer(f)

                    # go through each row
                    rows = table.find_all('tr')
                    for row in rows:

                        # write headers if any
                        headers = row.find_all('th')
                        if headers:
                            csv_writer.writerow([header.text.strip() for header in headers])

                        # write column items, skipping rows without data cells
                        columns = row.find_all('td')
                        if columns:
                            csv_writer.writerow([column.text.strip() for column in columns])

2016
2017
2018
2019
2020
2021
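
The row-writing logic of the scraping cell can be checked without touching the network. The sketch below mimics it on hand-written rows: `rows_to_csv_text` and the `(header_cells, data_cells)` tuples are hypothetical stand-ins for what `row.find_all('th')` and `row.find_all('td')` yield after `.text.strip()`. It compares writing the data cells unconditionally with guarding the write, since a header-only row has an empty list of `td` cells.

```python
import csv
import io


def rows_to_csv_text(parsed_rows, skip_empty=True):
    """Write (header_cells, data_cells) pairs the way the scraping loop does.

    Hypothetical helper: each pair stands for one <tr> of a scraped table.
    """
    buffer = io.StringIO()
    csv_writer = csv.writer(buffer)
    for header_cells, data_cells in parsed_rows:
        if header_cells:
            csv_writer.writerow(header_cells)
        # An empty list of cells is still written as a (blank) line unless guarded.
        if data_cells or not skip_empty:
            csv_writer.writerow(data_cells)
    return buffer.getvalue()


parsed_rows = [
    (['Player', 'Team'], []),        # header row: only <th> cells
    ([], ['Luis Flores', 'MAN']),    # data row: only <td> cells
]

print(repr(rows_to_csv_text(parsed_rows)))
print(repr(rows_to_csv_text(parsed_rows, skip_empty=False)))
```

The unguarded variant emits an extra blank line after the header. `pd.read_csv` skips blank lines by default, so the files still load either way, but the guarded output is cleaner to inspect by hand.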


In the following blocks, we load all the scraped files and join the different stat types into a single table. Players who are missing from at least one category, that is, who fall outside the pages scraped for it (roughly the top 1,900 by that category's ordering, which is uncommon for good players), are removed, and the result is saved as a single CSV.

In [39]:
stat_types = ['Averages', 'Totals', 'Per_48', 'Misc_Stats', 'Advanced_Stats']
years = range(2003,2022)
pages = range(1,20)


def load_files_year_stat_type(year, stat_type):
    file_name = original_data_folder + 'ncaa/{}_{}_{}.csv'
    return pd.concat([pd.read_csv(file_name.format(year, stat_type, page)) for page in pages], ignore_index=True)

def load_and_join_files_from_year(year):
    df = load_files_year_stat_type(year, stat_types[0])
    previous_stat_type = stat_types[0]
    for stat_type in stat_types[1:]:
        df = df.merge(load_files_year_stat_type(year, stat_type), 
                      on=['Player', 'Team'], how='outer', suffixes=('_'+previous_stat_type, '_'+stat_type))
        previous_stat_type = stat_type
    df['year'] = year
    return df

def load_and_join_ncaa_files(years=years, save=False):
    df = pd.concat([load_and_join_files_from_year(year) for year in years], ignore_index=False)
    df = df[pd.notna(df).all(axis=1)]    
    if save:
        df.to_csv(original_data_folder + 'ncaa/full_data.csv', index=False)
    return df
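
The join and the filter can be tried on a toy example first. The two small frames below are made up for illustration (the players `A`, `B`, `C` and their numbers are not scraped data); the merge keys, the `how='outer'` join, the suffixes and the `pd.notna(...).all(axis=1)` filter follow the functions above.

```python
import pandas as pd

# Made-up stand-ins for two stat types of the same year.
averages = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Team': ['X', 'Y', 'Z'],
    'GP': [30, 28, 25],
    'PPG': [20.1, 15.3, 9.8],
})
totals = pd.DataFrame({
    'Player': ['A', 'B'],
    'Team': ['X', 'Y'],
    'GP': [30, 28],
    'PTS': [603, 428],
})

# Outer join on player and team; only overlapping column names receive a suffix.
merged = averages.merge(totals, on=['Player', 'Team'], how='outer',
                        suffixes=('_Averages', '_Totals'))

# Keep only players present in every table, i.e. rows without missing values.
complete = merged[pd.notna(merged).all(axis=1)]

print(list(merged.columns))
print(list(complete['Player']))
```

Player `C` is missing from `totals`, so the outer join fills its `GP_Totals` and `PTS` with NaN and the filter drops the row. Because suffixes are applied only to column names that collide, columns such as `PPG` keep their original name, which is also why some columns in the full table carry no stat-type suffix.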


In [40]:
df = load_and_join_ncaa_files(save=True)
df

Unnamed: 0,#_Averages,Player,Team,GP_Averages,MPG,FGM_Averages,FGA_Averages,FG%_Averages,3PM_Averages,3PA_Averages,3P%_Averages,FTM_Averages,FTA_Averages,FT%_Averages,TOV_Averages,PF_Averages,ORB_Averages,DRB_Averages,RPG,APG,SPG,BPG,PPG,#_Totals,GP_Totals,MIN_Totals,FGM_Totals,FGA_Totals,FG%_Totals,3PM_Totals,3PA_Totals,3P%_Totals,FTM_Totals,FTA_Totals,FT%_Totals,TOV_Totals,PF_Totals,ORB_Totals,DRB_Totals,REB_Totals,AST_Totals,STL_Totals,BLK_Totals,PTS_Totals,#_Per_48,GP,MIN_Per_48,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,TOV,PF,ORB,DRB,REB_Per_48,AST_Per_48,STL_Per_48,BLK_Per_48,PTS_Per_48,#_Misc_Stats,Dbl Dbl,Tpl Dbl,40 Pts,20 Reb,20 Ast,5 Stl,5 Blk,High Game,Techs,HOB,Ast/TO,Stl/TO,FT/FGA,W's,L's,Win %,OWS,DWS,WS,#,TS%,eFG%,Total S %,ORB%,DRB%,TRB%,AST%,TOV%,STL%,BLK%,USG%,PPR,PPS,ORtg,DRtg,eDiff,FIC,PER,year
1,2.0,Luis Flores,MAN,30.0,38.9,7.7,16.9,0.455,1.9,4.8,0.386,7.4,8.2,0.902,3.3,2.3,1.6,4.0,5.6,2.9,1.9,0.4,24.6,22.0,30.0,1166.0,231.0,508.0,0.455,56.0,145.0,0.386,221.0,245.0,0.902,98.0,69.0,49.0,120.0,169.0,87.0,58.0,11.0,739.0,2.0,30.0,38.9,9.5,20.9,0.455,2.3,6.0,0.386,9.1,10.1,0.902,4.0,2.8,2.0,4.9,7.0,3.6,2.4,0.5,30.4,12.0,3.0,0.0,1.0,0.0,0.0,2.0,0.0,44.0,0.0,0.411,0.9,0.6,0.5,23.0,7.0,0.767,5.0,1.9,6.8,160.0,0.592,0.510,174.3,4.1,10.0,7.0,17.2,13.6,3.0,0.9,30.6,-3.5,1.5,118.0,98.5,19.5,428.6,26.4,2003
2,3.0,Rick Apodaca,HOF,15.0,38.8,5.5,15.6,0.355,3.0,9.1,0.331,4.0,5.7,0.698,4.1,1.7,1.1,3.7,4.8,4.3,1.0,0.5,18.1,1630.0,15.0,582.0,83.0,234.0,0.355,45.0,136.0,0.331,60.0,86.0,0.698,61.0,26.0,17.0,55.0,72.0,64.0,15.0,8.0,271.0,3.0,15.0,38.8,6.8,19.3,0.355,3.7,11.2,0.331,4.9,7.1,0.698,5.0,2.1,1.4,4.5,5.9,5.3,1.2,0.7,22.4,105.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,34.0,0.0,0.407,1.0,0.2,0.4,6.0,9.0,0.400,0.9,0.5,1.4,286.0,0.493,0.451,138.3,3.2,10.4,6.8,24.0,18.2,1.5,1.5,28.0,-3.2,1.2,100.8,106.0,-5.2,134.5,14.9,2003
3,4.0,Michael Watson,UMKC,29.0,38.8,8.5,22.6,0.377,4.1,11.6,0.350,4.4,5.9,0.753,3.7,2.4,0.8,2.9,3.7,3.8,1.4,0.2,25.5,41.0,29.0,1124.0,247.0,656.0,0.377,118.0,337.0,0.350,128.0,170.0,0.753,106.0,70.0,23.0,85.0,108.0,109.0,41.0,6.0,740.0,4.0,29.0,38.8,10.5,28.0,0.377,5.0,14.4,0.350,5.5,7.3,0.753,4.5,3.0,1.0,3.6,4.6,4.7,1.8,0.3,31.6,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,54.0,0.0,0.525,1.0,0.4,0.3,9.0,20.0,0.310,2.8,0.2,3.0,40.0,0.502,0.466,148.0,2.3,9.4,5.7,27.1,12.6,2.2,0.6,38.7,-3.1,1.1,102.6,112.7,-10.1,286.0,21.0,2003
4,5.0,Troy Bell,BC,31.0,38.6,7.2,16.4,0.441,3.4,8.5,0.402,7.3,8.6,0.847,2.5,2.1,1.5,3.0,4.6,3.7,2.3,0.2,25.2,14.0,31.0,1198.0,224.0,508.0,0.441,106.0,264.0,0.402,227.0,268.0,0.847,79.0,66.0,48.0,94.0,142.0,115.0,70.0,7.0,781.0,5.0,31.0,38.6,9.0,20.4,0.441,4.2,10.6,0.402,9.1,10.7,0.847,3.2,2.6,1.9,3.8,5.7,4.6,2.8,0.3,31.3,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.0,0.0,0.390,1.5,0.9,0.5,19.0,12.0,0.613,6.3,1.0,7.3,251.0,0.615,0.545,168.9,5.0,9.5,7.2,19.1,11.1,3.3,0.6,28.6,-0.2,1.5,128.7,105.9,22.8,498.0,26.8,2003
6,7.0,Edward Scott,CLEM,28.0,38.5,5.9,15.3,0.386,1.6,4.3,0.367,4.4,5.9,0.739,2.8,1.6,0.8,2.9,3.7,5.6,1.4,0.1,17.7,87.0,28.0,1077.0,165.0,428.0,0.386,44.0,120.0,0.367,122.0,165.0,0.739,79.0,44.0,22.0,82.0,104.0,158.0,38.0,3.0,496.0,7.0,28.0,38.5,7.4,19.1,0.386,2.0,5.3,0.367,5.4,7.4,0.739,3.5,2.0,1.0,3.7,4.6,7.0,1.7,0.1,22.1,147.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,0.0,0.472,2.0,0.5,0.4,15.0,13.0,0.536,2.9,1.1,4.0,342.0,0.490,0.437,149.2,2.6,9.5,6.0,32.0,13.5,2.2,0.3,27.4,2.5,1.2,108.6,103.7,4.9,294.6,18.6,2003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1888,1889.0,Collin Warren,MSU,22.0,21.2,3.0,7.1,0.429,0.7,2.1,0.340,1.4,2.6,0.544,1.6,1.8,0.5,2.0,2.5,1.8,1.3,0.1,8.2,1594.0,22.0,467.0,67.0,156.0,0.429,16.0,47.0,0.340,31.0,57.0,0.544,35.0,39.0,11.0,43.0,54.0,40.0,28.0,2.0,181.0,1890.0,22.0,21.2,6.9,16.0,0.429,1.6,4.8,0.340,3.2,5.9,0.544,3.6,4.0,1.1,4.4,5.6,4.1,2.9,0.2,18.6,1647.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0,0.0,0.161,1.1,0.8,0.4,10.0,12.0,0.455,0.4,0.8,1.3,1304.0,0.494,0.481,131.4,2.8,10.2,6.6,14.0,16.0,3.2,0.5,22.2,-1.7,1.2,97.4,95.7,1.7,101.4,13.9,2021
1890,1891.0,Kalu Ezikpe,ODU,23.0,21.2,4.2,7.3,0.581,0.1,0.6,0.214,1.8,2.7,0.672,1.7,2.8,1.9,4.9,6.7,0.5,1.1,1.5,10.3,1503.0,23.0,487.0,97.0,167.0,0.581,3.0,14.0,0.214,41.0,61.0,0.672,40.0,65.0,43.0,112.0,155.0,12.0,26.0,34.0,238.0,1892.0,23.0,21.2,9.6,16.5,0.581,0.3,1.4,0.214,4.0,6.0,0.672,3.9,6.4,4.2,11.0,15.3,1.2,2.6,3.4,23.5,915.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,0.0,0.186,0.3,0.7,0.4,15.0,8.0,0.652,1.2,1.2,2.4,718.0,0.607,0.590,146.7,10.2,26.3,18.3,5.7,17.0,3.1,8.7,25.0,-6.7,1.4,109.4,88.9,20.5,216.4,27.7,2021
1892,1893.0,Greg Brown III,UT,25.0,21.2,3.2,7.7,0.417,1.2,3.6,0.322,2.0,2.9,0.708,2.4,3.0,1.2,5.2,6.4,0.4,0.6,1.0,9.6,1298.0,25.0,529.0,80.0,192.0,0.417,29.0,90.0,0.322,51.0,72.0,0.708,60.0,76.0,31.0,129.0,160.0,10.0,15.0,26.0,240.0,1894.0,25.0,21.2,7.3,17.4,0.417,2.6,8.2,0.322,4.6,6.5,0.708,5.4,6.9,2.8,11.7,14.5,0.9,1.4,2.4,21.8,727.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,0.0,0.138,0.2,0.2,0.4,18.0,7.0,0.720,0.3,1.2,1.5,429.0,0.531,0.492,144.7,6.8,26.2,16.8,3.8,21.0,1.6,5.2,26.9,-10.1,1.2,93.5,91.3,2.2,149.8,16.9,2021
1894,1895.0,Cody Carlson,WSU,23.0,21.1,3.9,6.4,0.612,0.7,1.4,0.469,1.5,2.5,0.614,1.2,2.2,1.0,3.9,4.8,0.6,0.4,0.5,10.0,1509.0,23.0,486.0,90.0,147.0,0.612,15.0,32.0,0.469,35.0,57.0,0.614,27.0,51.0,22.0,89.0,111.0,14.0,10.0,11.0,230.0,1896.0,23.0,21.1,8.9,14.5,0.612,1.5,3.2,0.469,3.5,5.6,0.614,2.7,5.0,2.2,8.8,11.0,1.4,1.0,1.1,22.7,609.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,0.0,0.159,0.5,0.4,0.4,17.0,6.0,0.739,1.4,0.9,2.3,1621.0,0.661,0.663,169.5,6.2,21.0,14.3,5.4,13.4,1.1,2.4,20.9,-3.5,1.6,119.8,95.4,24.4,169.6,21.6,2021


#### Drafts

In [37]:
from bs4 import BeautifulSoup
from requests import get
from csv import writer
import pandas as pd

def create_csv_from_scraped_table(table, filename):
    # open file for writing; newline='' is what the csv module recommends
    with open(filename, 'w', newline='', encoding='utf-8') as f:

        # create csv writer object
        csv_writer = writer(f)

        # go through each row
        rows = table.find_all('tr')
        for row in rows:

            # write headers if any
            headers = row.find_all('th')
            if headers:
                csv_writer.writerow([header.text.strip() for header in headers])

            # write column items, skipping rows without data cells
            columns = row.find_all('td')
            if columns:
                csv_writer.writerow([column.text.strip() for column in columns])

url = 'https://basketball.realgm.com/nba/draft/past-drafts/{}'

# Make sure the output folder exists before writing to it (os is imported in the first cell).
os.makedirs(original_data_folder + 'drafts/', exist_ok=True)

for year in range(1985, 2021):
    r = get(url.format(year))
    soup = BeautifulSoup(r.text, 'lxml')

    # get all tables and keep the last three: first round, second round and the remaining one (saved as 'out')
    tables = soup.find_all('table')
    tables = tables[-3:]

    create_csv_from_scraped_table(tables[0], original_data_folder + 'drafts/first_round_{}.csv'.format(year))
    create_csv_from_scraped_table(tables[1], original_data_folder + 'drafts/second_round_{}.csv'.format(year))
    create_csv_from_scraped_table(tables[2], original_data_folder + 'drafts/out_{}.csv'.format(year))
        

#### All-Stars

In [None]:
from bs4 import BeautifulSoup
from requests import get
from csv import writer

url = 'https://en.wikipedia.org/wiki/List_of_NBA_All-Stars'

r = get(url)
soup = BeautifulSoup(r.text, 'lxml')


# get all tables
tables = soup.find_all('table')

# loop over each table
for num, table in enumerate(tables, start=1):

    # create filename
    filename = original_data_folder + 'all_star_data_{}.csv'.format(num) # only table 2 or 3 is actually useful

    # open file for writing; newline='' is what the csv module recommends
    with open(filename, 'w', newline='', encoding='utf-8') as f:

        # create csv writer object
        csv_writer = writer(f)

        # go through each row
        rows = table.find_all('tr')
        for row in rows:

            # write headers if any
            headers = row.find_all('th')
            if headers:
                csv_writer.writerow([header.text.strip() for header in headers])

            # write column items, skipping rows without data cells
            columns = row.find_all('td')
            if columns:
                csv_writer.writerow([column.text.strip() for column in columns])