# Basketball-Reference Scraper Overview
The following code walks through how to scrape NBA player and game data from www.basketball-reference.com and load it into pandas DataFrames. Other scripts in this directory use the data captured here to run analyses that help the average user ask both broad and specific questions about the NBA. We will probe which statistics and criteria matter for winning an NBA championship, examine how the league has evolved year over year, touch on the GOAT debate, and ultimately build algorithms that can (hopefully) help us all beat Vegas lines consistently so that we can retire from our day jobs and gamble on the NBA for the rest of our careers.

None of this would have been possible without the tireless and comprehensive effort of those who work at [Basketball Reference](http://www.basketball-reference.com), who provide an open-source, API-friendly database containing millions of data points on which the entirety of this codebase is built.

For any questions/concerns, feel free to reach out to me directly at rahim.hashim@columbia.edu. And in the case that this is useful to anyone for future projects, please give credit where credit is due, both to [Basketball Reference](http://www.basketball-reference.com) and myself. Enjoy!

***
## The Basics
__Jupyter Notebook__: All of the following code is hosted in a Python 3 Jupyter Notebook. It is recommended to use Anaconda to run the Notebook so that all of the Python libraries used in the rest of the code are available in one environment.

To run the code in a cell, select the cell and press _Shift_ + _Enter_. It is recommended to run all cells below in order, from top to bottom.

__Python Libraries__: Python is a beautiful language for a number of reasons, one of which is its immense number of pre-built libraries that do much of the heavy lifting in any web-scraping or data-analysis project. When getting familiar with Python and starting a new project, be sure to search for a Python library that may help. A comprehensive list that I often refer to before starting a project is here: [https://github.com/vinta/awesome-python](https://github.com/vinta/awesome-python)

__Installing Libraries__: If you receive an error when executing the following cell, such as _ModuleNotFoundError: No module named 'numpy'_, open a new terminal tab and install the missing library using pip: _pip install numpy_
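
As a quick pre-flight check, the sketch below reports which libraries are missing before the import cell is run. The `REQUIRED` list and the `find_missing` helper are illustrative names, not part of this project; the list simply mirrors the third-party imports used in the next cell.

```python
import importlib.util

# Third-party libraries imported in the next cell (illustrative list)
REQUIRED = ["numpy", "pandas", "matplotlib"]

def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

Checking with `find_spec` only looks the module up and does not import it, so the check stays fast even for large libraries.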

In [1]:
%reload_ext autoreload
import os
import sys
import numpy as np
import pandas as pd
from pprint import pprint
import matplotlib.pyplot as plt
from collections import Counter, OrderedDict, defaultdict
pd.options.mode.chained_assignment = None  # default='warn'

ROOT = '/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/' #@param ['/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/']  

# add project helper modules to the import path
def add_helpers():
  '''
  add_helpers mounts Google Drive (when running on Colab)
  and adds the helper directory to sys.path
  '''

  # if running on Google Colab, mount Google Drive
  if 'google.colab' in str(get_ipython()): 
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    os.chdir(ROOT)

  helper_dir_path = ROOT + 'helper/'
  print('\nHelpers:')
  pprint(sorted(os.listdir(helper_dir_path)))
  sys.path.append(helper_dir_path) # make the helper modules importable

add_helpers()

Mounted at /content/drive

Helpers:
['Regions.py',
 'TeamNames.py',
 '__pycache__',
 'bettingLinesScraper.py',
 'fuzzy_lookup.py',
 'game_log_scraper.py',
 'gamelog_scraper_openai.ipynb',
 'gamelogscraper.ipynb',
 'meta_info_scraper.py',
 'player_info_scraper.py',
 'player_scraper.py',
 'player_table_scraper.py',
 'recordDateScraper.py',
 'teamsScraper.py']


***
## Creating Databases
pandas DataFrames are a powerful tool for querying large amounts of data, as we will be doing here. For that reason, we are going to store all of the scraped data in pandas DataFrames. The code below takes player overview data from playerHash and inserts it into player_df.<br>
>For documentation on pandas: https://pypi.org/project/pandas/

In [2]:
DATA_PATH = ROOT + 'data/'

***
## Scraping Player Data
### Biometrics and season + career statistics

playerScraper and metaDataScraper will be doing most of the work to scrape data on each player's background and physical attributes.<br>
> Example Overview Source (last name starting with a): https://www.basketball-reference.com/players/a/<br>
> Example meta-data (Kareem Abdul-Jabbar): https://www.basketball-reference.com/players/a/abdulka01.html<br>
> For documentation on requests: https://realpython.com/python-requests/<br>
> For documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
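
To illustrate the idea of pulling player links out of an index page, here is a minimal, offline sketch. It is not the code in `player_scraper`: it swaps in the standard library's `html.parser` in place of requests and BeautifulSoup so that it runs without a network connection or extra dependencies. `SAMPLE_HTML` is a hypothetical, simplified stand-in for a player index page (the real markup may differ), and `PlayerLinkParser` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for a player index page; real markup may differ.
SAMPLE_HTML = """
<table id="players">
  <tr><th><a href="/players/a/abdulka01.html">Kareem Abdul-Jabbar</a></th></tr>
  <tr><th><a href="/players/a/exampex01.html">Example Player</a></th></tr>
  <tr><th><a href="/teams/XYZ/">Not a player link</a></th></tr>
</table>
"""

class PlayerLinkParser(HTMLParser):
    """Collect (name, href) pairs for links that point at /players/ pages."""

    def __init__(self):
        super().__init__()
        self.players = []
        self._href = None   # href of the player link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("/players/") and href.endswith(".html"):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.players.append(("".join(self._text).strip(), self._href))
            self._href = None

parser = PlayerLinkParser()
parser.feed(SAMPLE_HTML)
for name, href in parser.players:
    print(name, "https://www.basketball-reference.com" + href)
```

With requests and BeautifulSoup the same two steps apply: download the page text, then select the anchor tags whose `href` points at a player page.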

___Time Estimates:___ This is the most time-consuming function in the program because it requires many URL requests to complete.<br>
>*Without threading:* ~1hr<br>
>*With threading:* ~15min<br>
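
The gap between the two estimates comes from the fact that most of the time is spent waiting on network responses, which is where threads help. The sketch below illustrates this under stated assumptions and is not the implementation behind `scrape_all_players`: `fetch_page` is a hypothetical stand-in that sleeps instead of making a real request, and the list of URLs is only an example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    """Hypothetical stand-in for an HTTP request; sleeps to mimic network latency."""
    time.sleep(0.05)
    return url, "<html>...</html>"

# One index page per letter (a short example list)
urls = ["https://www.basketball-reference.com/players/%s/" % letter
        for letter in "abcdefgh"]

def scrape_sequential(urls):
    # Each request starts only after the previous one has finished
    return [fetch_page(u) for u in urls]

def scrape_threaded(urls, max_workers=4):
    # Threads spend most of their time waiting, so several requests overlap
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))  # map preserves input order

start = time.perf_counter()
sequential = scrape_sequential(urls)
t_seq = time.perf_counter() - start

start = time.perf_counter()
threaded = scrape_threaded(urls)
t_thr = time.perf_counter() - start

print("sequential: %.2fs, threaded: %.2fs" % (t_seq, t_thr))
```

A real scraper would also need to respect the site's rate limits, so the number of workers should be kept modest.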

In [3]:
from player_scraper import scrape_all_players

list_players_meta, list_players_data, list_players_gamelogs = scrape_all_players(ROOT, THREAD_FLAG=True)

players_df_meta.pkl already exists
  Uploading...
players_df_data.pkl already exists
  Uploading...
players_df_gamelogs.pkl already exists
  Uploading...


TypeError: ignored

***
## Concatenating Player Data

Now that each individual player's bio data, season-wide stats, and gamelogs have been captured, they can be concatenated to generate three DataFrames holding all of the players.

In [None]:
from player_scraper import concatenate_dfs
df_players_meta, df_players_data, df_players_gamelogs = concatenate_dfs(ROOT, list_players_meta, list_players_data, list_players_gamelogs)
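
For readers new to pandas, the following sketch shows the general idea of stacking per-player tables into one DataFrame with `pd.concat`. It is an illustration, not the body of `concatenate_dfs`; the tables, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-player tables; the real column names may differ.
player_a = pd.DataFrame({
    "player": ["Player A", "Player A"],
    "season": ["2019-20", "2020-21"],
    "pts_per_g": [10.1, 12.3],
})
player_b = pd.DataFrame({
    "player": ["Player B"],
    "season": ["2020-21"],
    "pts_per_g": [20.5],
})

list_players_demo = [player_a, player_b]

# Stack the per-player tables row-wise; ignore_index gives a fresh 0..n-1 index
df_concat_demo = pd.concat(list_players_demo, ignore_index=True)
print(df_concat_demo)
```

Concatenating once at the end is generally cheaper than appending to a DataFrame inside the scraping loop, because each append copies the accumulated data.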

***
## Example Queries (Simple)

The following are example queries we can make across all of the generated tables. As can be seen below, the structure of the DataFrame allows for immense flexibility and speed gains compared to browsing the website itself. We will use this structure for more specific trend-, team-, and era-related investigations.
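
The queries in this section all follow the same pattern: build a boolean mask from one or more column conditions, then index the DataFrame with it. A minimal sketch on a hypothetical three-row table (the rows and values are made up, and the stats are stored as strings, as scraped text often is):

```python
import pandas as pd

# Hypothetical rows for illustration only
df_query_demo = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "season": ["2020-21", "2020-21", "2019-20"],
    "data_type": ["per_game", "per_game", "per_game"],
    "mp_per_g": ["36.5", "", "31.0"],
})

# Convert text to numbers; values that cannot be parsed become NaN
df_query_demo["mp_clean"] = pd.to_numeric(df_query_demo["mp_per_g"], errors="coerce")

# Combine conditions with & and wrap each one in parentheses
mask = (df_query_demo["season"] == "2020-21") & (df_query_demo["mp_clean"] > 30)
result = df_query_demo[mask]
print(result)
```

Comparisons against NaN evaluate to False, so rows with missing minutes drop out of the result without raising an error.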

In [None]:
df_players_meta.head(0)

In [None]:
# Player Meta Query
df_large = df_players_meta.loc[(df_players_meta['height']>80) & 
                   (df_players_meta['weight']>30)]

df_large = df_large.dropna(how='all', axis='columns') # drops all columns that are empty
display(df_large[df_large['weight'] == df_large['weight'].max()])

In [None]:
df_players_data.head(5)

In [None]:
# Player Data Query
'''
df_career = df_players_data.loc[(df_players_data['season']=='Career') & 
                   (df_players_data['season_playoffs']=='season') &
                   (df_players_data['data_type']=='advanced')]
'''
# convert minutes per game to numeric; unparseable values become NaN
df_players_data['mp_clean'] = pd.to_numeric(df_players_data['mp_per_g'], errors='coerce')
#df_career[df_career['ws_per_48'].notna()].sort_values('ws_per_48', ascending=False)
df_players_data[(df_players_data['season'] == '2020-21') & 
                (df_players_data['data_type'] == 'per_game') &
                (df_players_data['mp_clean'] > 40)]

In [None]:
df_players_gamelogs.head(0)

In [None]:
df_players_gamelogs.head(5)