# Basketball-Reference Scraper Overview
The following code walks through how to scrape NBA player and game data from www.basketball-reference.com and load it into pandas DataFrames. Other scripts in this directory use the data captured here to run analyses that help the average user ask both broad and specific questions about the NBA. We will probe which statistics and criteria matter for winning an NBA championship, how the league has evolved year over year, touch on the GOAT debate, and ultimately build algorithms that can (hopefully) help us all beat Vegas lines consistently so that we can retire from our day jobs and gamble on the NBA for the rest of our careers.

None of this could have been done without the tireless and comprehensive effort of those who work at [Basketball Reference](http://www.basketball-reference.com) providing an open-source, API-friendly database containing millions of datapoints from which the entirety of this codebase is built. 

For any questions/concerns, feel free to reach out to me directly at rahim.hashim@columbia.edu. And in the case that this is useful to anyone for future projects, please give credit where credit is due, both to [Basketball Reference](http://www.basketball-reference.com) and myself. Enjoy!

***
## The Basics
__Jupyter Notebook__: All of the following code is hosted in a Python 3 Jupyter Notebook. Using Anaconda to open the Notebook is recommended, so that all of the Python libraries used in the rest of the code are available at once.

To execute the code in the notebook, select the desired code cell and press _Shift_ + _Enter_. The cells below should be run in order, from top to bottom.

__Python Libraries__: Python is a beautiful language for a number of reasons, one of which is its immense collection of pre-built libraries that do much of the heavy lifting in any web-scraping or data-analysis project. When getting familiar with Python and starting a new project, be sure to search for a Python library that may help. A comprehensive list that I often refer to before starting a project is here: [https://github.com/vinta/awesome-python](https://github.com/vinta/awesome-python)

__Installing Libraries__: If you receive an error when executing the following cell, such as _ModuleNotFoundError: No module named 'numpy'_, open a new terminal tab and install the missing library using pip: _pip install numpy_
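
Before running the import cell, it can help to check which libraries are missing in one pass. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing` is hypothetical, and the module list is taken from the imports used below (`bs4` is the import name of the `beautifulsoup4` package on PyPI).

```python
import importlib.util

# Third-party modules imported by this notebook.
# Note: 'bs4' is installed with `pip install beautifulsoup4`.
REQUIRED_MODULES = ['numpy', 'pandas', 'bs4', 'matplotlib']

def find_missing(modules):
    """Return the module names that cannot be found in this environment.

    Hypothetical helper for illustration only.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print('Missing modules:', ', '.join(missing))
else:
    print('All required modules are available.')
```

Any name reported as missing can then be installed with pip as described above.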

In [1]:
%reload_ext autoreload
import re
import os
import sys
import numpy as np
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from collections import Counter, OrderedDict, defaultdict
pd.options.mode.chained_assignment = None  # default='warn'

ROOT = '/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/'

# add (non-Python) helper functions
def add_helpers():
  '''
  add_helpers mounts Google Drive (when running on Colab)
  and adds the helper directory to sys.path
  '''

  # if running on Google Colab, mount Google Drive
  if 'google.colab' in str(get_ipython()): 
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    %cd drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/

  helper_dir_path = ROOT + 'helper/'
  print('\nHelpers:')
  pprint(sorted(os.listdir(helper_dir_path)))
  sys.path.append(helper_dir_path) # set to path of notebook

add_helpers()

Mounted at /content/drive
/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms

Helpers:
['Regions.py',
 'TeamNames.py',
 '__pycache__',
 'bettingLinesScraper.py',
 'gameLogScraper.py',
 'meta_info_scraper.py',
 'player_info_scraper.py',
 'player_table_scraper.py',
 'seasonScraper.py',
 'teamsScraper.py']


***
## Creating Databases
pandas DataFrames are a powerful tool for querying large amounts of data, as we will be doing here. For that reason, all of the data scraped in this notebook is stored in pandas DataFrames. Player overview data from playerHash is inserted into player_df.<br>
>For documentation on pandas: https://pypi.org/project/pandas/
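
As a rough illustration of moving scraped records into a DataFrame, here is a minimal sketch. It assumes that playerHash is a dictionary mapping each player name to a dictionary of attributes; that structure, the lowercase `player_hash` stand-in, and the placeholder values are assumptions made for the example, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for playerHash: player name -> dict of attributes.
# The values are placeholders, not scraped data.
player_hash = {
    'Player A': {'height': 80, 'weight': 220, 'shootingHand': 'Right'},
    'Player B': {'height': 75, 'weight': 190, 'shootingHand': 'Left'},
}

# orient='index' turns each outer key into one row of the DataFrame
player_df = pd.DataFrame.from_dict(player_hash, orient='index')

# Move the player names from the index into a regular column
player_df = player_df.rename_axis('player_name').reset_index()

# Once in a DataFrame, filtering is a one-line boolean mask
tall = player_df[player_df['height'] > 78]
print(tall)
```

The same boolean-mask pattern is what the example queries later in this notebook rely on.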

***
## Scraping Player Data
### Biometrics and season + career statistics

playerScraper and metaDataScraper do most of the work of scraping each player's background and physical attributes.<br>
> Example overview source (last names starting with "a"): https://www.basketball-reference.com/players/a/<br>
> Example meta-data (Kareem Abdul-Jabbar): https://www.basketball-reference.com/players/a/abdulka01.html<br>
> For documentation on requests: https://realpython.com/python-requests/<br>
> For documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

___Time Estimates:___ This is the most time-consuming step in the program because it requires many URL requests to complete.<br>
>*Without threading:* ~1hr<br>
>*With threading:* ~15min<br>
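
The gap between the two estimates comes from waiting on the network: while one request is pending, a thread can start another. The sketch below illustrates the idea with `concurrent.futures.ThreadPoolExecutor` from the standard library, which is a swapped-in technique for illustration rather than the helper scripts' own threading code. `fetch_letter_page` is a hypothetical stand-in that sleeps instead of making a real request, so the example runs offline.

```python
import string
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_letter_page(letter):
    """Hypothetical stand-in for one URL request; sleeps instead of fetching."""
    time.sleep(0.01)  # simulated network latency
    return letter, f'https://www.basketball-reference.com/players/{letter}/'

def fetch_all(letters, threaded=True, max_workers=8):
    """Fetch every letter page, either sequentially or with a thread pool."""
    if threaded:
        # max_workers caps how many requests are in flight at once
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(fetch_letter_page, letters))
    else:
        results = [fetch_letter_page(letter) for letter in letters]
    return dict(results)

letters = string.ascii_lowercase

start = time.perf_counter()
sequential = fetch_all(letters, threaded=False)
print(f'Sequential: {time.perf_counter() - start:.2f} s')

start = time.perf_counter()
parallel = fetch_all(letters, threaded=True)
print(f'Threaded:   {time.perf_counter() - start:.2f} s')
```

Capping the pool size also limits how many simultaneous requests reach the site, which is worth considering when scraping thousands of pages.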

In [2]:
from player_scraper import scrape_all_players

ROOT = '/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/'
df_players_meta, df_players_data = scrape_all_players(ROOT, THREAD_FLAG=True)

Running meta_info_scraper.py
  Start Time: 23:48:24.28
  Threading activated...
	  x' Players Captured:  0
	  q' Players Captured:  6
	  u' Players Captured:  11
	  z' Players Captured:  20
	  y' Players Captured:  19
	  i' Players Captured:  26
	  v' Players Captured:  59
	  o' Players Captured:  95


Exception in thread Thread-23:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/helper/meta_info_scraper.py", line 73, in meta_info_scraper
    df_meta, df_player_data = player_info_scraper(player_name, playerURL)
  File "/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/helper/player_info_scraper.py", line 69, in player_info_scraper
    df_player = player_table_scraper(playerName, soup)
  File "/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/helper/player_table_scraper.py", line 76, in player_table_scraper
    df_player = pd.concat(statHashList, ignore_index=True)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/concat.py", line 284, in

	  n' Players Captured:  105
	  e' Players Captured:  106
	  f' Players Captured:  148
	  k' Players Captured:  170
	  a' Players Captured:  172
	  t' Players Captured:  193
	  p' Players Captured:  217
	  d' Players Captured:  242
	  g' Players Captured:  246
	  r' Players Captured:  253
	  j' Players Captured:  238
	  c' Players Captured:  301
	  h' Players Captured:  351
	  w' Players Captured:  373
	  s' Players Captured:  420
	  b' Players Captured:  468


  0%|          | 13/4798 [00:00<00:38, 125.77it/s]

	  m' Players Captured:  463
  Run Time: 45.218 min
  Concatenating DataFrames


100%|██████████| 4798/4798 [32:51<00:00,  2.43it/s]


  Concatenating complete
Saving players_df_meta.pkl
  Path: /content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/data/players_df_meta.pkl
Saving players_df_data.pkl
  Path: /content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/data/players_df_data.pkl
  Size (meta info): 4.5MB
  Size (player data): 328.6MB
Complete.


***
## Example Queries (Simple)

The following are example queries we can run across the generated tables. As shown below, the DataFrame structure allows far more flexibility and speed than browsing the website itself. We will use this structure for more specific trend-, team-, and era-related investigations.
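
Because the full scrape takes a long time, it may be convenient to reload the saved pickle files (such as players_df_meta.pkl in the data directory) rather than scrape again before querying. A minimal sketch of that pattern follows; the helper `load_or_build` is hypothetical, and the demo uses a temporary directory and a placeholder DataFrame instead of the real files.

```python
import os
import tempfile

import pandas as pd

def load_or_build(path, build_fn):
    """Load a pickled DataFrame if it exists; otherwise build it and save it.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build_fn()
    df.to_pickle(path)
    return df

# Demo with a temporary directory and a tiny placeholder frame
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'players_df_meta.pkl')
    # First call: the file does not exist yet, so build_fn runs and the result is saved
    first = load_or_build(demo_path, lambda: pd.DataFrame({'height': [80], 'weight': [220]}))
    # Second call: the file exists, so build_fn is never called
    second = load_or_build(demo_path, lambda: pd.DataFrame())
    print(second.equals(first))
```

In this notebook, `build_fn` would correspond to the expensive scraping call, and `path` to a file under the data directory.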

In [11]:
df_players_meta.head(0)

Unnamed: 0,player_name,draft_year,retire_year,height,weight,birth_date,college,birthCountry,birthCity,birthState,shootingHand,highSchool,highSchoolCity,highSchoolState,highSchoolCountry,draftTeam,draftRound,draftRoundPick,draftOverallPick


In [3]:
# Player Meta Query
df_large = df_players_meta.loc[(df_players_meta['height']>80) & 
                   (df_players_meta['weight']>30)]

df_large = df_large.dropna(how='all', axis='columns')
display(df_large[df_large['weight'] == df_large['weight'].max()])

Unnamed: 0,player_name,draft_year,retire_year,height,weight,birth_date,college,birthCountry,birthCity,birthState,shootingHand,highSchool,highSchoolCity,highSchoolState,highSchoolCountry,draftTeam,draftRound,draftRoundPick,draftOverallPick
0,Sim Bhullar,2015,2015,89,360,"December 2, 1992",New Mexico State,Canada,Ontario,,Right,Huntington Prep,Huntington,West Virginia,United States of America,,,,


In [12]:
df_players_data.head(0)

Unnamed: 0,data_type,season_playoffs,player_name,season,age,team_id,lg_id,pos,g,gs,mp_per_g,fg_per_g,fga_per_g,fg_pct,ft_per_g,fta_per_g,ft_pct,trb_per_g,ast_per_g,pf_per_g,pts_per_g,mp,fg,fga,ft,fta,trb,ast,pf,pts,per,ts_pct,fta_per_fga_pct,orb_pct,drb_pct,trb_pct,ast_pct,DUMMY,ows,dws,ws,ws_per_48,fg3_per_g,fg3a_per_g,fg3_pct,fg2_per_g,fg2a_per_g,fg2_pct,efg_pct,orb_per_g,drb_per_g,stl_per_g,blk_per_g,tov_per_g,fg3,fg3a,fg2,fg2a,orb,drb,stl,blk,tov,fg3a_per_fga_pct,stl_pct,blk_pct,tov_pct,usg_pct,obpm,dbpm,bpm,vorp,trp_dbl


In [4]:
# Player Data Query
df_career = df_players_data.loc[(df_players_data['season']=='Career') & 
                   (df_players_data['season_playoffs']=='season') &
                   (df_players_data['data_type']=='advanced')]

df_career = df_career.dropna(how='all', axis='columns')
display(df_career.sort_values('ws_per_48', ascending=False))

Unnamed: 0,data_type,season_playoffs,player_name,season,lg_id,g,mp,per,ts_pct,fta_per_fga_pct,orb_pct,drb_pct,trb_pct,ast_pct,ows,dws,ws,ws_per_48,fg3a_per_fga_pct,stl_pct,blk_pct,tov_pct,usg_pct,obpm,dbpm,bpm,vorp
5,advanced,season,Chad Gallagher,Career,NBA,2.0,3.0,66.8,1.000,.000,0.0,0.0,0.0,0.0,0.1,0.0,0.1,1.442,.000,0.0,0.0,0.0,44.2,20.4,9.6,30.1,0.0
5,advanced,season,Tyson Wheeler,Career,NBA,1.0,3.0,76.1,1.064,2.000,0.0,0.0,0.0,100.0,0.1,0.0,0.1,1.367,1.000,0.0,0.0,0.0,28.4,42.5,8.5,51.1,0.0
5,advanced,season,Dave Scholz,Career,NBA,1.0,1.0,67.6,1.000,.000,,,,0.0,0.0,0.0,0.0,1.316,,,,,,,,,
5,advanced,season,Steven Hill,Career,NBA,1.0,2.0,88.3,1.000,.000,100.0,58.6,86.4,0.0,0.0,0.0,0.0,.873,.000,0.0,0.0,0.0,22.1,30.1,-0.9,29.2,0.0
5,advanced,season,Dennis Van Zant,Career,ABA,1.0,2.0,39.3,1.136,,0.0,48.0,24.9,0.0,0.0,0.0,0.0,.856,,0.0,0.0,0.0,16.9,-4.5,0.1,-4.4,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,advanced,season,Dick Murphy,Career,BAA,31.0,,,.215,.120,,,,,-0.6,0.2,-0.5,,,,,,,,,,
11,advanced,season,John Murphy,Career,BAA,20.0,,,.343,.375,,,,,0.2,0.1,0.3,,,,,,,,,,
19,advanced,season,Angelo Musi,Career,BAA,161.0,,,.330,.164,,,,,1.0,5.0,6.0,,,,,,,,,,
38,advanced,season,Tommy Byrnes,Career,TOT,264.0,,,.348,.284,,,,,1.5,5.2,6.7,,,,,,,,,,


***
## Single Player Queries

If you don't need to scrape the entire basketball-reference website but instead just want information for one player, you can use the following command.

In [None]:
from player_info_scraper import player_info_scraper
from player_scraper import scrape_single_player

pd.set_option('display.max_rows', None)
single_player_info, single_player_df = scrape_single_player()
single_player_df

***
## Scraping Game Data
### Game-logs and team statistics
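
Season pages follow the pattern https://www.basketball-reference.com/leagues/NBA_2020.html. The sketch below shows one possible way to build those URLs per league and to recover the league and year from a URL without fixed-width slicing. The helper names are hypothetical, the NBA span mirrors the `range(1947, 2021)` loop used in this notebook, and the ABA span (seasons ending 1968 through 1976) is an assumption based on the years the league operated.

```python
STEM = 'https://www.basketball-reference.com/leagues/'

# Season ranges keyed by the year each season ended (inclusive).
# NBA span mirrors this notebook's loop; ABA span is an assumption.
LEAGUE_YEARS = {
    'NBA': (1947, 2020),
    'ABA': (1968, 1976),
}

def build_season_urls(league_years, stem=STEM):
    """Build one URL per league and season, skipping years outside each range."""
    urls = []
    for league, (first, last) in league_years.items():
        for year in range(first, last + 1):
            urls.append(f'{stem}{league}_{year}.html')
    return urls

def parse_season_url(url):
    """Recover (league, year) from a season URL."""
    page = url.rsplit('/', 1)[-1]                 # e.g. 'NBA_2020.html'
    league, year = page[:-len('.html')].split('_')
    return league, int(year)

urls_seasons = build_season_urls(LEAGUE_YEARS)
print(len(urls_seasons), 'season URLs')
print(parse_season_url(urls_seasons[-1]))
```

Restricting each league to its own range avoids requesting pages for seasons in which that league did not exist.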

In [None]:
import time
import datetime
import threading

import seasonScraper
from teamsScraper import teamsScraper

seasonsHash = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

YEAR_START = 1947
YEAR_CURRENT = 2021
LEAGUES = ['NBA', 'ABA']

urls_seasons = []
for year in range(YEAR_START, YEAR_CURRENT):
    # Easiest solution for exception years in which both the NBA and ABA existed (i.e. 1967-1976)
    stem = 'https://www.basketball-reference.com/leagues/'
    for league in LEAGUES:
        # Example url = https://www.basketball-reference.com/leagues/NBA_2020.html
        url = stem + league + '_'+ str(year) + '.html'
        urls_seasons.append(url)

start_datetime = datetime.datetime.now()
start_time = time.time()
print ('seasonScraper')
print ('   Start Time:', str(start_datetime.time())[:11])

'''
Thread flag decides whether you want to
use parallel processing or standard
'''
thread_flag = False

'''Dictionary of all NBA teams'''
teamsHash = teamsScraper() 

# Sequential-Processing
if not thread_flag:
    print('    Threading inactivated')
    for url in urls_seasons:
        league = url[-13:-10]
        year = url[-9:-5]
        seasonsHash[league][year] = seasonScraper.seasonInfoScraper(url, seasonsHash)
        print(f'      Scraping {league} Season: {year}\r', end="")
    print()
# Parallel-Processing
else:
    print('    Threading activated')
    threads = []
    for url in urls_seasons:
        # seasonInfoScraper lives in the seasonScraper module imported above
        thread = threading.Thread(target=seasonScraper.seasonInfoScraper, args=(url, seasonsHash))
        threads += [thread]
        thread.start()
    for thread in threads:
        thread.join() # makes sure that the main program waits until all threads have terminated
end_time = time.time()
print ('   Run Time:', str((end_time - start_time)/60)[:6], 'min')

***
## Data Organization
To help us understand how the data is organized, the query below selects each player's career regular-season advanced statistics:

In [None]:
df_career = df_players_data.loc[(df_players_data['season']=='Career') & 
                   (df_players_data['season_playoffs']=='season') &
                   (df_players_data['data_type']=='advanced')]

df_career.dropna(axis='columns')

***
## Meta-Data Analysis
Now that we've scraped all the meta-info on each player, we can start running analyses.

Below, a few simple analyses are included to help you get started. The first set of graphs examines height distribution (left) and weight distribution (right).
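
One detail worth knowing when labeling a histogram axis as a percentage: with `density=True`, bar areas (not bar heights) sum to 1, so multiplying the tick values by 100 gives a true percentage only when each bin is one unit wide. A minimal sketch of an alternative is shown below, weighting every observation by 1/N so that bar heights sum to 100% for any bin width. The height values are placeholders rather than scraped data, and `inches_to_label` is a hypothetical helper.

```python
import numpy as np

# Placeholder heights in inches, standing in for df_players_meta['height']
heights = np.array([72, 75, 75, 78, 78, 78, 80, 81, 84, 86], dtype=float)

# Each observation contributes 1/N, so bar heights sum to 1 for any bin width
weights = np.ones_like(heights) / heights.size
fractions, bin_edges = np.histogram(heights, bins=5, weights=weights)
percents = fractions * 100

def inches_to_label(inches):
    """Format a height in inches as feet-inches, e.g. 78 becomes '6-6'."""
    feet, remainder = divmod(int(inches), 12)
    return f'{feet}-{remainder}'

for left_edge, pct in zip(bin_edges[:-1], percents):
    print(f'{inches_to_label(left_edge)}: {pct:.1f}%')
```

The same `weights` array can be passed to Matplotlib's `ax.hist(..., weights=weights)` in place of `density=True`.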

In [None]:
from Regions import stateDict #stateDict is a Dictionary to help with geography-based analyses
def metaPlot():
  
  height_list = df_players_meta['height'].tolist()
  weight_list = df_players_meta['weight'].tolist()
  #rightCount = 0; leftCount = 0; noHandCount = 0

  #Plot Height Distribution (1, Left)
  f, ax = plt.subplots(1,2)
  #Sets default plot size
  plt.rcParams['figure.figsize'] = (10,8)
  n1, bins1, patches1 = ax[0].hist(height_list, bins=20, density=True, histtype='bar', ec='black')
  #Converting y-axis labels from decimals to percents
  y_vals = ax[0].get_yticks()
  ax[0].set_yticklabels(['{:3.1f}%'.format(y*100) for y in y_vals])
  #Converting x-axis labels from inches back to feet
  xticks1 = ['5-0', '5-6', '6-0', '6-6', '7-0', '7-6', '8-0']
  ax[0].set_xticks([60, 66, 72, 78, 84, 90, 96])
  ax[0].set_xticklabels(xticks1)
  ax[0].set_xlim([56,100])
  ax[0].set_xlabel('Height', fontweight='bold', labelpad=10)
  ax[0].set_ylabel('Percent of Players', fontweight='bold', labelpad=10)

  #Plot Weight Distribution (1, Right)
  ax[1].hist(weight_list, bins='auto', density=True, histtype='bar', ec='black')
  y_vals = ax[1].get_yticks()
  ax[1].set_yticklabels(['{:3.1f}%'.format(y*100) for y in y_vals])
  xticks2 = ['150', '180', '210', '240', '270', '300', '330']
  ax[1].set_xticks([150, 180, 210, 240, 270, 300, 330])
  ax[1].set_xticklabels(xticks2)
  ax[1].set_xlim([120,360])
  ax[1].set_xlabel('Weight', fontweight='bold', labelpad=10)
  ax[1].set_ylabel('Percent of Players', fontweight='bold', labelpad=10)
  
  plt.tight_layout(pad=0.05, w_pad=4, h_pad=1.0)
  f.set_size_inches(18.5, 10.5, forward=True)
  plt.show()
        
metaPlot()

In [None]:
def geographyPlot():
  countryList = df_players_meta['birthCountry'].tolist()
  #countryList contains all players born outside the US
  countryList = filter(lambda x: x != 'United States of America', countryList)
  countryList = filter(lambda x: x != '', countryList)
  countryHash = dict(Counter(countryList))
  countryHash = OrderedDict(sorted(countryHash.items(), reverse=True, key=lambda t: t[1]))

  pprint(countryHash.keys())

  #Plot Birth Countries of non-US-Born Players (3)
  f, ax = plt.subplots(1)
  #Convert dict views to lists so they can be passed to Matplotlib directly
  countryList = list(countryHash.keys())
  countryVals = list(countryHash.values())
  ax.bar(np.arange(len(countryList)), countryVals, ec='black')
  ax.set_xticks(np.arange(len(countryList)))
  ax.set_xticklabels(countryList, rotation=90, ha='right', fontsize=7)
  ax.set_xlabel('Country of Birth', fontweight='bold', labelpad=10)
  ax.set_ylabel('Number of Players', fontweight='bold', labelpad=10)
  
  f.set_size_inches(18.5, 10.5, forward=True)
  plt.show()
    
geographyPlot()