# Basketball-Reference Scraper Overview
The following code walks you through how to scrape NBA player and game data from www.basketball-reference.com and load it into pandas DataFrames. Other scripts in this directory use the data captured here to run analyses that help the average user ask both broad and specific questions about the NBA. We will probe which statistics and criteria matter for a team to win an NBA championship, how the league has evolved year over year, touch on the GOAT debate, and ultimately build algorithms that can (hopefully) help us all beat Vegas lines consistently so that we can retire from our day jobs and gamble on the NBA for the rest of our careers.

None of this could have been done without the tireless and comprehensive effort of those who work at [Basketball Reference](http://www.basketball-reference.com) providing an open-source, API-friendly database containing millions of datapoints from which the entirety of this codebase is built. 

For any questions/concerns, feel free to reach out to me directly at rahim.hashim@columbia.edu. And in the case that this is useful to anyone for future projects, please give credit where credit is due, both to [Basketball Reference](http://www.basketball-reference.com) and myself. Enjoy!

***
## The Basics
__Jupyter Notebook__: All of the following code is hosted in a Python 3 Jupyter Notebook. Opening the Notebook through Anaconda is recommended, since that gives you access to most of the Python libraries used in the rest of the code.

To run a code cell, select it and press _Shift_ + _Enter_. The cells below are meant to be run in order, from top to bottom.

__Python Libraries__: Python is a beautiful language for a number of reasons, one of which is its immense
number of pre-built libraries that do much of the heavy lifting in any web-scraping /
data analysis project. When getting familiar with Python and starting a new project, be
sure to search for a Python library that may help. A comprehensive list
that I often refer to before starting a project is here: [https://github.com/vinta/awesome-python](https://github.com/vinta/awesome-python)

__Installing Libraries__: In case you receive an error upon trying to execute the following box, such as _ModuleNotFoundError: No module named 'numpy'_, go back to your terminal and open a new tab, and install the library using pip: _pip install numpy_
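
As a quick pre-flight check, the sketch below lists which third-party packages from the import cell are missing before you run it. The `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of this repository.

```python
import importlib.util

# Third-party packages imported by the first code cell (assumed list).
REQUIRED = ["numpy", "matplotlib", "requests"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    # Suggest the pip command for whatever is not installed yet.
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported missing, install it from the terminal and re-run the cell.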

In [1]:
%reload_ext autoreload
import re
import os
import sys
import requests
import datetime
import time
import threading
import importlib
import numpy as np
from timeit import timeit
import matplotlib.pyplot as plt
from collections import Counter, OrderedDict, defaultdict
from string import ascii_lowercase

#(Non-Python Library)
#stateDict is a Dictionary that I created to help with geography-based analyses
from Regions import stateDict

***
## Class Instantiation
Another reason why Python is awesome is its easy-to-use object-oriented programming.
In case you aren't familiar with object-oriented programming, _classes_ and
_objects_ are its two main building blocks. A class defines a
new, customizable type (much like the built-in int, string, and list types) with user-designated attributes. Objects are simply instances of the class.

Here, the __Player__ class is imported (from PlayerObject), with defined attributes (e.g. name, draftYear...).
Once we scrape www.basketball-reference.com, we will create one Player object per player, each with these attributes.
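
To make the class/object distinction concrete, here is a minimal sketch of what such a class could look like. `PlayerSketch` is a hypothetical stand-in: the real `Player` class lives in `PlayerObject` and may differ, and the attribute list is only inferred from the names used elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlayerSketch:
    """Hypothetical stand-in for Player; not the class from PlayerObject."""
    name: str
    draftYear: Optional[int] = None
    height: Optional[int] = None    # assumed to be stored in inches
    weight: Optional[int] = None    # assumed to be stored in pounds
    shootingHand: str = ""

# The class is the blueprint; each object is one concrete instance of it.
# The values below are made up for illustration.
player_a = PlayerSketch(name="Example Player A", draftYear=2000, shootingHand="Right")
player_b = PlayerSketch(name="Example Player B")

print(player_a.name, player_a.draftYear)
print(player_b.height)  # attributes that were not supplied keep their defaults
```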

In [2]:
from PlayerObject import Player

***
## Scraping Player Data
### Biometrics and season + career statistics

metaInfoScraper (imported from metaScraper below) does most of the work of scraping each player's background and physical attributes.<br>
> Example Overview Source (last name starting with a): https://www.basketball-reference.com/players/a/<br>
> Example meta-data (Kareem Abdul-Jabbar): https://www.basketball-reference.com/players/a/abdulka01.html<br>
> For documentation on requests: https://realpython.com/python-requests/<br>
> For documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

___Time Estimates:___ This is the most time-consuming function in the program, because it has to make many URL requests in order to complete.<br>
>*Without threading:* ~1hr<br>
>*With threading:* ~15min<br>
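
The speed-up comes from overlapping the time spent waiting on the network. The sketch below imitates that with a stand-in function that sleeps instead of making a request. `fake_scrape`, the `example.com` URLs, and the lock are assumptions for illustration and are not taken from `metaScraper`.

```python
import threading
import time

def fake_scrape(url, results, lock):
    """Stand-in for a scraper: sleep to mimic network latency, then store a result."""
    time.sleep(0.1)              # pretend this is the HTTP round trip
    with lock:                   # guard the shared dict against concurrent writes
        results[url] = len(url)

urls = [f"https://example.com/players/{letter}" for letter in "abcde"]
results = {}
lock = threading.Lock()

start = time.perf_counter()
threads = [threading.Thread(target=fake_scrape, args=(u, results, lock)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # wait for every worker before reading results
elapsed = time.perf_counter() - start

print(f"{len(results)} pages in {elapsed:.2f}s")
```

Because the workers wait in parallel, the total time should be close to one request's latency rather than the sum of all of them. When scraping a real site, keep the number of simultaneous requests modest.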

In [3]:
from metaScraper import metaInfoScraper

playersHash = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

urls_players = []
for letter in ['y']:#ascii_lowercase:
    url = 'https://www.basketball-reference.com/players/' + letter
    urls_players.append(url)

start_datetime = datetime.datetime.now()
start_time = time.time()
print ('metaInfoScraper')
print ('\tStart Time:', str(start_datetime.time())[:11])

# thread_flag selects parallel (True) or sequential (False) scraping
thread_flag = False

# Sequential-Processing
if not thread_flag:
    print('\tThreading inactivated...')
    for url in urls_players:
        playersHash = metaInfoScraper(url, playersHash)
        print(playersHash)
    print(playersHash['regular_season']['shooting'].keys())
# Parallel-Processing
else:
    print('\tThreading activated...')
    threads = []
    for url in urls_players:
        thread = threading.Thread(target=metaInfoScraper, args=(url,playersHash,))
        threads += [thread]
        thread.start()
    for thread in threads:
        thread.join() # makes sure that the main program waits until all threads have terminated
end_time = time.time()
print ('\tRun Time:', str((end_time - start_time)/60)[:6], 'min')

metaInfoScraper
	Start Time: 16:10:51.46
	Threading inactivated...
          player_name   season age team_id lg_id pos   g gs mp_per_g fg_per_g  \
0  Guerschon Yabusele  2017-18  22     BOS   NBA  PF  33  4      7.1      0.8   

   ... ft_pct orb_per_g drb_per_g trb_per_g ast_per_g stl_per_g blk_per_g  \
0  ...   .682       0.5       1.1       1.6       0.5       0.1       0.2   

  tov_per_g pf_per_g pts_per_g  
0       0.4      0.7       2.4  

[1 rows x 31 columns]
          player_name   season age team_id lg_id pos   g gs mp_per_g fg_per_g  \
0  Guerschon Yabusele  2018-19  23     BOS   NBA  PF  41  1      6.1      0.9   

   ... ft_pct orb_per_g drb_per_g trb_per_g ast_per_g stl_per_g blk_per_g  \
0  ...   .682       0.6       0.7       1.3       0.4       0.2       0.2   

  tov_per_g pf_per_g pts_per_g  
0       0.4      0.8       2.3  

[1 rows x 31 columns]
          player_name  season age team_id lg_id pos   g gs mp_per_g fg_per_g  \
0  Guerschon Yabusele  Career          

***
## Scraping Game Data
### Game-logs and team statistics

In [4]:
import seasonScraper
importlib.reload(seasonScraper)
from teamsScraper import teamsScraper

seasonsHash = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

YEAR_START = 1947
YEAR_CURRENT = 2021
LEAGUES = ['NBA', 'ABA']

urls_seasons = []
for year in range(YEAR_START, YEAR_CURRENT):
    # Easiest solution for exception years in which both the NBA and ABA existed (i.e. 1967-1976)
    stem = 'https://www.basketball-reference.com/leagues/'
    for league in LEAGUES:
        # Example url = https://www.basketball-reference.com/leagues/NBA_2020.html
        url = stem + league + '_'+ str(year) + '.html'
        urls_seasons.append(url)

start_datetime = datetime.datetime.now()
start_time = time.time()
print ('seasonScraper')
print ('   Start Time:', str(start_datetime.time())[:11])

# thread_flag selects parallel (True) or sequential (False) scraping
thread_flag = False

# Dictionary of all NBA teams
teamsHash = teamsScraper()

# Sequential-Processing
if not thread_flag:
    print('    Threading inactivated')
    for url in urls_seasons:
        league = url[-13:-10]
        year = url[-9:-5]
        seasonsHash[league][year] = seasonScraper.seasonInfoScraper(url, seasonsHash)
        print(f'      Scraping {league} Season: {year}\r', end="")
    print()
# Parallel-Processing
else:
    print('    Threading activated')
    threads = []
    for url in urls_seasons:
        # Only the seasonScraper module is imported, so the function must be qualified
        thread = threading.Thread(target=seasonScraper.seasonInfoScraper, args=(url,seasonsHash,))
        threads += [thread]
        thread.start()
    for thread in threads:
        thread.join() # makes sure that the main program waits until all threads have terminated
end_time = time.time()
print ('   Run Time:', str((end_time - start_time)/60)[:6], 'min')

seasonScraper
   Start Time: 00:44:34.03
    Threading inactivated
      Scraping NBA Season: 2020
   Run Time: 0.2901 min
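
The cell above recovers the league and year by slicing fixed offsets from each URL, which works only while every league code is three letters and every year is four digits. A regular expression is one alternative that fails loudly on unexpected input. `parse_season_url` is a hypothetical helper, not part of `seasonScraper`.

```python
import re

# Matches the tail of URLs such as .../leagues/NBA_2020.html
SEASON_URL = re.compile(r"/leagues/(?P<league>[A-Z]+)_(?P<year>\d{4})\.html$")

def parse_season_url(url):
    """Return (league, year) as strings, or raise ValueError for other URLs."""
    match = SEASON_URL.search(url)
    if match is None:
        raise ValueError(f"Not a season URL: {url}")
    return match.group("league"), match.group("year")

league, year = parse_season_url("https://www.basketball-reference.com/leagues/NBA_2020.html")
print(league, year)
```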


***
## Data Organization
To help us understand how all the data is organized, the cell below prints the keys stored under one player's regular-season stats:

In [5]:
print(playersHash['Kareem Abdul-Jabbar']['stats']['regular_season'].keys())

dict_keys([])
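
An empty result like this is what a nested `defaultdict` returns for a key that was never filled in: reading a missing key does not raise, it quietly creates an empty entry. The sketch below reproduces that behaviour with the same three-level structure used for `playersHash`. The player names and the `per_game` key are made up for illustration.

```python
from collections import defaultdict

def nested():
    """Same shape as playersHash: three defaultdict levels ending in a plain dict."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

players = nested()
players["Example Player"]["stats"]["regular_season"]["per_game"] = {"pts_per_g": 2.4}

# Reading a missing player does not raise; it inserts an empty entry instead.
before = len(players)
empty_keys = list(players["Nobody Scraped"]["stats"]["regular_season"].keys())
after = len(players)
print(empty_keys, before, after)  # [] 1 2

# A membership test checks for a player without creating anything.
has_player = "Another Missing Player" in players
print(has_player, len(players))   # False 2
```

Checking `name in playersHash` before indexing avoids adding phantom players to the structure.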


***
## Meta-Data Analysis
Now that we've scraped all the meta-info on each player, we can start running analyses.

Below, a few simple analyses are included to help you get started. The first set of graphs examines height distribution (left), weight distribution (middle), and shooting handedness (right).
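
The height histogram works in inches and relabels the axis in feet-inches notation. If heights arrive as strings such as `6-10`, a pair of small helpers can convert in both directions. Both function names are hypothetical, and the string format is an assumption about the scraped data.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

def inches_to_height(total_inches):
    """Convert total inches back to a 'feet-inches' label."""
    feet, inches = divmod(int(total_inches), 12)
    return f"{feet}-{inches}"

print(height_to_inches("6-6"))   # 78
print(inches_to_height(84))      # 7-0
```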

In [6]:
def metaPlot():
    height_list = []; weight_list = []
    rightCount = 0; leftCount = 0; noHandCount = 0
    for player in playersHash.keys():
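        try:
            height_list.append(int(playersHash[player]['meta_info'].height))
        except (AttributeError, TypeError, ValueError):
            pass  # skip players with a missing or non-numeric height
        try:
            weight_list.append(int(playersHash[player]['meta_info'].weight))
        except (AttributeError, TypeError, ValueError):
            pass  # skip players with a missing or non-numeric weight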
        if playersHash[player]['meta_info'].shootingHand == 'Right':
            rightCount+=1
        elif playersHash[player]['meta_info'].shootingHand == 'Left':
            leftCount+=1
        else:
            noHandCount+=1

    #Plot Height Distribution (1, Left)
    f, ax = plt.subplots(1,3)
    #Sets default plot size
    plt.rcParams['figure.figsize'] = (10,8)
    n1, bins1, patches1 = ax[0].hist(height_list, bins=20, density=True, histtype='bar', ec='black')
    #Converting y-axis labels from decimals to percents
    y_vals = ax[0].get_yticks(); ax[0].set_yticklabels(['{:3.1f}%'.format(y*100) for y in y_vals])
    #Converting x-axis labels from inches back to feet
    xticks1 = ['5-0', '5-6', '6-0', '6-6', '7-0', '7-6', '8-0']
    ax[0].set_xticks([60, 66, 72, 78, 84, 90, 96])
    ax[0].set_xticklabels(xticks1)
    ax[0].set_xlim([56,100])
    ax[0].set_xlabel('Height', fontweight='bold', labelpad=10)
    ax[0].set_ylabel('Percent of Players', fontweight='bold', labelpad=10)

    #Plot Weight Distribution (1, Middle)
    ax[1].hist(weight_list, bins='auto', density=True, histtype='bar', ec='black')
    y_vals = ax[1].get_yticks()
    ax[1].set_yticklabels(['{:3.1f}%'.format(y*100) for y in y_vals])
    xticks2 = ['150', '180', '210', '240', '270', '300', '330']
    ax[1].set_xticks([150, 180, 210, 240, 270, 300, 330])
    ax[1].set_xticklabels(xticks2)
    ax[1].set_xlim([120,360])
    ax[1].set_xlabel('Weight', fontweight='bold', labelpad=10)
    ax[1].set_ylabel('Percent of Players', fontweight='bold', labelpad=10)

    #Plot Shooting Handedness (1, Right)
    ax[2].bar([1,2,3], [rightCount,leftCount,noHandCount], ec='black')
    ax[2].set_xticks([1,2,3]); ax[2].set_xticklabels(['Right','Left', 'N/A'])
    ax[2].set_xlabel('Shooting Handedness', fontweight='bold', labelpad=10)
    ax[2].set_ylabel('Number of Players', fontweight='bold', labelpad=10)
    
    plt.tight_layout(pad=0.05, w_pad=4, h_pad=1.0)
    f.set_size_inches(18.5, 10.5, forward=True)
    plt.show()
        
metaPlot()

AttributeError: 'collections.defaultdict' object has no attribute 'shootingHand'

In [None]:
def geographyPlot():
    stateList = []; countryList = []
    for player in playersHash.keys():
        stateList.append(playersHash[player]['meta_info'].birthState)
        countryList.append(playersHash[player]['meta_info'].birthCountry)
    #stateList contains all players born in the US
    stateList = filter(lambda x: x != '', stateList)
    stateHash = dict(Counter(stateList))
    stateHash = OrderedDict(sorted(stateHash.items(), reverse=True, key=lambda t: t[1]))
    #countryList contains all players born in ex-US
    countryList = filter(lambda x: x != 'United States of America', countryList)
    countryList = filter(lambda x: x != '', countryList)
    countryHash = dict(Counter(countryList))
    countryHash = OrderedDict(sorted(countryHash.items(), reverse=True, key=lambda t: t[1]))


    #Plot Birth State of US-Born Players (2)
    f, ax = plt.subplots(1)
    # Materialize the dict views as lists before handing them to matplotlib
    stateList = list(stateHash.keys()); stateVals = list(stateHash.values())
    ax.bar(np.arange(len(stateList)), stateVals, ec='black')
    ax.set_xticks(np.arange(len(stateList)))
    ax.set_xticklabels(stateList, rotation=90, ha='right', fontsize=7)
    ax.set_xlabel('US State of Birth', fontweight='bold', labelpad=10)
    ax.set_ylabel('Number of Players', fontweight='bold', labelpad=10)
    plt.show();

    #Plot Birth Countries of non-US-Born Players (3)
    f, ax = plt.subplots(1)
    # Materialize the dict views as lists before handing them to matplotlib
    countryList = list(countryHash.keys()); countryVals = list(countryHash.values())
    ax.bar(np.arange(len(countryList)), countryVals, ec='black')
    ax.set_xticks(np.arange(len(countryList)))
    ax.set_xticklabels(countryList, rotation=90, ha='right', fontsize=7)
    ax.set_xlabel('Country of Birth', fontweight='bold', labelpad=10)
    ax.set_ylabel('Number of Players', fontweight='bold', labelpad=10)
    
    f.set_size_inches(18.5, 10.5, forward=True)
    plt.show()
    
geographyPlot()
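
The count-then-sort step in `geographyPlot` can also be written with `Counter.most_common()`, which returns (value, count) pairs already ordered from most to least frequent. The birth states below are made-up sample data, not scraped results.

```python
from collections import Counter

# Made-up sample; an empty string stands for a player with no US birth state.
birth_states = ["New York", "California", "", "New York", "Texas", "California", "New York"]

state_counts = Counter(state for state in birth_states if state)
ranked = state_counts.most_common()          # sorted by count, descending

labels = [state for state, _ in ranked]      # x-axis tick labels
values = [count for _, count in ranked]      # bar heights
print(labels, values)
```

The resulting `labels` and `values` are plain lists, so they can be passed straight to `ax.set_xticklabels` and `ax.bar`.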

***
## Creating Databases
pandas DataFrames are a powerful tool for querying large amounts of data, as we will be doing here. For that reason, we are going to insert all of the data scraped above into a pandas DataFrame. The code below takes player overview data from playersHash and inserts it into player_df.<br>
>For documentation on pandas: https://pypi.org/project/pandas/
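
One way to prepare nested scraped data for a DataFrame is to flatten it into one record per player-season. The sketch below does this in plain Python. The nested layout (`player -> "per_game" -> season -> stats`) and the `flatten_players` helper are assumptions for illustration, since the real layout of `playersHash` is not shown here. The sample values come from the scraper output above.

```python
def flatten_players(players):
    """Turn {player: {"per_game": {season: stats}}} into a list of flat row dicts."""
    rows = []
    for name, tables in players.items():
        for season, stats in tables.get("per_game", {}).items():
            row = {"player_name": name, "season": season}
            row.update(stats)            # one column per scraped statistic
            rows.append(row)
    return rows

# Assumed layout, filled with values printed by the scraper above.
sample = {
    "Guerschon Yabusele": {
        "per_game": {
            "2017-18": {"g": 33, "pts_per_g": 2.4},
            "2018-19": {"g": 41, "pts_per_g": 2.3},
        }
    }
}

rows = flatten_players(sample)
for row in rows:
    print(row)
```

With pandas installed, `pd.DataFrame(rows)` turns this list of dictionaries into a DataFrame with one column per key.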

***
## Scraping All Players Table Statistics (perGame, total, per36, etc)

### perGameScraper

## Table Generation

Just like with playerTable, we're going to generate SQLite tables for all of the other tables we've scraped, in order to quickly access information and generate immediate queries.
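
As a rough illustration of table generation, the sketch below builds a SQLite table with Python's built-in `sqlite3` module. The table name `per_game`, its columns, and the in-memory database are assumptions rather than the schema used in this project. The two rows reuse values from the scraper output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database; pass a file path to persist
conn.execute(
    """
    CREATE TABLE per_game (
        player_name TEXT,
        season      TEXT,
        g           INTEGER,
        pts_per_g   REAL,
        PRIMARY KEY (player_name, season)
    )
    """
)

rows = [
    {"player_name": "Guerschon Yabusele", "season": "2017-18", "g": 33, "pts_per_g": 2.4},
    {"player_name": "Guerschon Yabusele", "season": "2018-19", "g": 41, "pts_per_g": 2.3},
]

# Named placeholders let each dict map directly onto the table's columns.
conn.executemany(
    "INSERT INTO per_game VALUES (:player_name, :season, :g, :pts_per_g)", rows
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM per_game").fetchone()[0]
print(row_count)
```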

## Example Queries (Simple)

The following are example queries we can make across all of the generated tables. The structure of each SQLite table allows for much more flexibility and speed than browsing the website itself. We will use this structure for more specific trend-, team-, and era-related investigations.
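
A minimal sketch of two such queries, run against a small stand-in table: an aggregate per player and a parameterized filter. The `per_game` table and its columns are assumptions for illustration, not this project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE per_game (player_name TEXT, season TEXT, g INTEGER, pts_per_g REAL)")
conn.executemany(
    "INSERT INTO per_game VALUES (?, ?, ?, ?)",
    [
        ("Guerschon Yabusele", "2017-18", 33, 2.4),
        ("Guerschon Yabusele", "2018-19", 41, 2.3),
    ],
)

# Aggregate: number of seasons and total games per player.
totals = conn.execute(
    """
    SELECT player_name, COUNT(*) AS seasons, SUM(g) AS games
    FROM per_game
    GROUP BY player_name
    ORDER BY games DESC
    """
).fetchall()
print(totals)        # [('Guerschon Yabusele', 2, 74)]

# Filter: seasons in which a given player appeared in at least 40 games.
# The ? placeholders keep user-supplied values out of the SQL string.
busy_seasons = conn.execute(
    "SELECT season FROM per_game WHERE player_name = ? AND g >= ?",
    ("Guerschon Yabusele", 40),
).fetchall()
print(busy_seasons)  # [('2018-19',)]
```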