# Basketball-Reference Scraper Overview
After scraping the data (see: [basketball-reference-scraper.ipynb](https://github.com/rahim-hashim/NBA-Prediction-Algorithms/blob/df_version/basketball-reference-scraper.ipynb)), you'll have three DataFrames, each saved as a [pickle file](https://docs.python.org/3/library/pickle.html) that you can upload:

1. **players_df_meta**
  * biodata (e.g. height, age, weight)
2. **players_df_data**
  * season data (per-game, total, per-possession)
3. **players_df_gamelogs**
  * gamelogs for all players

***
## Import and Path Assignment


In [None]:
%reload_ext autoreload
import os
import sys
import numpy as np
import pandas as pd
from pprint import pprint
import matplotlib.pyplot as plt
from collections import Counter, OrderedDict, defaultdict
pd.options.mode.chained_assignment = None  # default='warn'

ROOT = '/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/' #@param ['/content/drive/MyDrive/Projects/nba-prediction-algorithm/NBA-Prediction-Algorithms/']  

# add project helper modules to the import path
def add_helpers():
  '''
  add_helpers mounts Google Drive (on Colab only) and adds
  the helper directory to sys.path
  '''

  # if running on Google Colab, mount Google Drive
  if 'google.colab' in str(get_ipython()): 
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    os.chdir(ROOT)

  helper_dir_path = os.path.join(ROOT,'helper')
  print('\nHelpers:')
  pprint(sorted(os.listdir(helper_dir_path)))
  sys.path.append(helper_dir_path) # make the helper modules importable

add_helpers()

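As a minimal sketch of why the helper directory is appended to `sys.path`: Python only finds a module by name if its directory is on the import path. The module name `demo_helper_module` and its `greet` function are hypothetical and exist only for this illustration; they are not part of the repository.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory that plays the role of the helper folder
with tempfile.TemporaryDirectory() as helper_dir:
    # Write a tiny (hypothetical) helper module into it
    module_path = os.path.join(helper_dir, 'demo_helper_module.py')
    with open(module_path, 'w') as f:
        f.write("def greet():\n    return 'hello from helper'\n")

    # Without this line, the import below would raise ModuleNotFoundError
    sys.path.append(helper_dir)
    importlib.invalidate_caches()

    demo_helper = importlib.import_module('demo_helper_module')
    message = demo_helper.greet()

    # Clean up so the temporary directory does not linger on the path
    sys.path.remove(helper_dir)

print(message)
```

If an import of a helper module fails, checking that `ROOT` exists and that the helper directory appears in `sys.path` is a reasonable first step.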

***
## Pickle Loading
“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Here we "unpickle", i.e. reload, the data that we pickled during scraping.<br>
>For documentation on pickle: https://docs.python.org/3/library/pickle.html
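The round trip described above can be sketched with the standard library alone. The functions `pickle_save_sketch` and `pickle_load_sketch` are hypothetical stand-ins; the actual `pickle_load` in `player_scraper` may be implemented differently.

```python
import os
import pickle
import tempfile

def pickle_save_sketch(obj, path):
    """Serialize obj to a byte stream and write it to path (pickling)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def pickle_load_sketch(path):
    """Read a byte stream from path and rebuild the object (unpickling)."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# A small object hierarchy standing in for a scraped record
record = {'player_name': 'Sim Bhullar', 'height': 89, 'weight': 360}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'example.pkl')
    pickle_save_sketch(record, pkl_path)
    restored = pickle_load_sketch(pkl_path)

# The restored object is equal to the original but is a new object
print(restored == record, restored is record)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.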

In [None]:
from player_scraper import pickle_load

DATA_PATH = os.path.join(ROOT,'data')
sys.path.append(DATA_PATH)
pprint(sorted(os.listdir(DATA_PATH)))

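# join with os.path.join: DATA_PATH has no trailing separator
players_df_meta = pickle_load(os.path.join(DATA_PATH, 'players_df_meta.pkl'))
players_df_data = pickle_load(os.path.join(DATA_PATH, 'players_df_data.pkl'))
players_df_gamelogs = pickle_load(os.path.join(DATA_PATH, 'players_df_gamelogs.pkl'))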

['players_df_data.pkl', 'players_df_gamelogs.pkl', 'players_df_meta.pkl']


In [None]:
players_df_meta.columns

Index(['player_name', 'draft_year', 'retire_year', 'height', 'weight',
       'birth_date', 'college', 'shootingHand', 'highSchool', 'highSchoolCity',
       'highSchoolState', 'highSchoolCountry', 'draftTeam', 'draftRound',
       'draftRoundPick', 'draftOverallPick'],
      dtype='object')

***
## Biometrics Data

Basic analyses on biometrics data.<br>
> Example Overview Source (last name starting with a): https://www.basketball-reference.com/players/a/<br>

In [None]:
# Series.max() skips NaN values, unlike the built-in max()
players_df_meta[players_df_meta['weight'] == players_df_meta['weight'].max()]

Unnamed: 0,player_name,draft_year,retire_year,height,weight,birth_date,college,birthCountry,birthCity,birthState,shootingHand,highSchool,highSchoolCity,highSchoolState,highSchoolCountry,draftTeam,draftRound,draftRoundPick,draftOverallPick
0,Sim Bhullar,2015,2015,89,360,"December 2, 1992",New Mexico State,Canada,Ontario,,Right,Huntington Prep,Huntington,West Virginia,United States of America,,,,
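
The `height` value in the row above (89) appears to be stored in inches. Assuming that is the case, a small helper can convert it to the feet-and-inches form used on player pages. The function name `inches_to_feet_inches` is hypothetical and not part of the repository.

```python
def inches_to_feet_inches(total_inches):
    """Split a height given in inches into (feet, remaining inches)."""
    feet, inches = divmod(int(total_inches), 12)
    return feet, inches

# Height taken from the output row above
feet, inches = inches_to_feet_inches(89)
height_label = f"{feet}'{inches}\""
print(height_label)  # 7'5"
```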


## Draft Data

In [None]:
pd.set_option('display.max_rows', 999)
players_df_meta[players_df_meta['draftOverallPick'] == 1].sort_values(by='draft_year', ascending=True)

In [None]:
from metaAnalysis import metaPlot, geographyPlot
        
metaPlot(players_df_meta)
geographyPlot(players_df_meta)

***
## Season Data

Analyses of season-wide stats. The Colab form widgets below can be used to filter the data, for example:

In [None]:
#@title Table Select { run: "auto" }

#@markdown Per Game | Totals | Advanced | Per Minute | Per Possession | Adjusted Shooting | Play-By-Play | Shooting | All-Star | Salaries
table_type = "advanced" #@param ["all", "per_game", "totals", "advanced", "per_minute", "per_poss", "adjooting", "pbp", "shooting", "all_star", "all_salaries"] {allow-input: true}

#@markdown Season or Playoff Stats
season_playoffs = "season" #@param ["both", "season", "playoffs"]

#@markdown Include or Exclude Career Data
career_data = "exclude" #@param ["include", "exclude"]

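def table_select(df, table_type, season_playoffs, career_data):
  '''
  table_select filters the season data by table type and by
  season/playoffs, and optionally drops the 'Career' summary rows
  '''
  df_filtered = df.copy(deep=True)
  if table_type != 'all':
    df_filtered = df_filtered[df_filtered['data_type'] == table_type]
    # drop columns that are empty for the selected table type
    df_filtered = df_filtered.dropna(how='all', axis='columns')
  if season_playoffs != 'both':
    df_filtered = df_filtered[df_filtered['season_playoffs'] == season_playoffs]
  # career_data is passed in explicitly rather than read from a global
  if career_data == 'exclude':
    df_filtered = df_filtered[df_filtered['season'] != 'Career']
  return df_filtered

table_selected = table_select(players_df_data, table_type, season_playoffs, career_data)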
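
To make the filtering logic concrete, here is a minimal sketch on a toy DataFrame. The column names follow `players_df_data`, but the rows and `ws` values are invented for illustration and are not real statistics.

```python
import pandas as pd

# Toy rows; the 'ws' values are placeholders, not real data
toy = pd.DataFrame({
    'data_type': ['advanced', 'advanced', 'advanced', 'per_game'],
    'season_playoffs': ['season', 'playoffs', 'season', 'season'],
    'season': ['1987-88', '1987-88', 'Career', '1987-88'],
    'ws': [1.0, 2.0, 3.0, None],
})

# Combine the three widget choices into a single boolean mask
mask = (
    (toy['data_type'] == 'advanced')
    & (toy['season_playoffs'] == 'season')
    & (toy['season'] != 'Career')
)
filtered = toy[mask]
print(filtered)
```

Only the first row survives: it is the only one that is an advanced table, from the regular season, and not a career summary.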

In [None]:
table_selected.sort_values(by='ws', ascending=False).head(10)

Unnamed: 0,data_type,season_playoffs,player_name,season,age,team_id,lg_id,pos,g,mp,...,ws_per_48,fg3a_per_fga_pct,stl_pct,blk_pct,tov_pct,usg_pct,obpm,dbpm,bpm,vorp
82,advanced,season,Kareem Abdul-Jabbar,1971-72,24.0,MIL,NBA,C,81.0,3583.0,...,0.34,,,,,,,,,
66,advanced,season,Wilt Chamberlain,1963-64,27.0,SFW,NBA,C,80.0,3689.0,...,0.325,,,,,,,,,
34,advanced,season,George Mikan,1950-51,26.0,MNL,NBA,C,68.0,,...,,,,,,,,,,
64,advanced,season,Wilt Chamberlain,1961-62,25.0,PHW,NBA,C,80.0,3882.0,...,0.286,,,,,,,,,
81,advanced,season,Kareem Abdul-Jabbar,1970-71,23.0,MIL,NBA,C,82.0,3288.0,...,0.326,,,,,,,,,
83,advanced,season,Kareem Abdul-Jabbar,1972-73,25.0,MIL,NBA,C,76.0,3254.0,...,0.322,,,,,,,,,
71,advanced,season,Wilt Chamberlain,1966-67,30.0,PHI,NBA,C,81.0,3682.0,...,0.285,,,,,,,,,
70,advanced,season,Wilt Chamberlain,1965-66,29.0,PHI,NBA,C,79.0,3737.0,...,0.275,,,,,,,,,
63,advanced,season,Michael Jordan,1987-88,24.0,CHI,NBA,SG,82.0,3311.0,...,0.308,0.027,3.9,2.4,9.6,34.1,8.8,4.2,13.0,12.5
33,advanced,season,George Mikan,1949-50,25.0,MNL,NBA,C,68.0,,...,,,,,,,,,,


***
## Gamelogs

Analyses on gamelog stats.

In [None]:
players_df_gamelogs[players_df_gamelogs['pts']==np.nanmax(players_df_gamelogs['pts'])]

Unnamed: 0,player_name,season,season_playoffs,rk,g,date,age,tm,loc,opp,...,3p,3pa,3p%,orb,drb,stl,blk,tov,gmsc,plus_minus
75,Wilt Chamberlain,1961-62,season,76,76.0,1962-03-02,25-193,PHW,,NYK,...,,,,,,,,,,
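
The `age` column in the row above (`25-193`) appears to follow a years-days format. Assuming so, it can be split into its parts or converted to a decimal age for plotting. Both helper names below are hypothetical.

```python
def parse_age_sketch(age_str):
    """Split an age string like '25-193' into (years, days)."""
    years_str, days_str = age_str.split('-')
    return int(years_str), int(days_str)

def age_in_years_sketch(age_str, days_per_year=365.25):
    """Convert a years-days age string into a decimal number of years."""
    years, days = parse_age_sketch(age_str)
    return years + days / days_per_year

years, days = parse_age_sketch('25-193')
decimal_age = age_in_years_sketch('25-193')
print(years, days)            # 25 193
print(round(decimal_age, 2))  # 25.53
```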


In [None]:
from player_matchup import matchup_game_finder

print(players_df_gamelogs.columns)

player = 'Kobe Bryant'
opponent = 'Tony Allen'
df_overlap = matchup_game_finder(players_df_gamelogs, player, opponent)
print(np.nanmean(df_overlap['pts']))

47 games found
Index(['player_name', 'season', 'season_playoffs', 'rk', 'g', 'date', 'age',
       'tm', 'loc', 'opp', 'win', 'gs', 'mp', 'fg', 'fga', 'fg%', 'ft', 'fta',
       'ft%', 'trb', 'ast', 'pf', 'pts', 'note', '3p', '3pa', '3p%', 'orb',
       'drb', 'stl', 'blk', 'tov', 'gmsc', 'plus_minus'],
      dtype='object')
26.463414634146343
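
The internals of `matchup_game_finder` are not shown here. One plausible approach, sketched under that assumption, is to keep the player's games whose date and opposing team match a game the opponent played. The function `find_shared_games_sketch` and the toy gamelog rows are hypothetical.

```python
import pandas as pd

def find_shared_games_sketch(gamelogs, player, opponent):
    """Return the player's gamelog rows for games the opponent also played in."""
    player_games = gamelogs[gamelogs['player_name'] == player]
    opponent_games = gamelogs[gamelogs['player_name'] == opponent]
    # The opponent's team ('tm') must equal the player's opposing team ('opp')
    opponent_keys = opponent_games[['date', 'tm']].rename(columns={'tm': 'opp'})
    return player_games.merge(opponent_keys, on=['date', 'opp'], how='inner')

# Toy gamelogs with invented values
gamelogs = pd.DataFrame({
    'player_name': ['Player A', 'Player A', 'Player A', 'Player B', 'Player B'],
    'date': ['2010-01-01', '2010-01-05', '2010-01-09', '2010-01-01', '2010-01-09'],
    'tm': ['LAL', 'LAL', 'LAL', 'BOS', 'BOS'],
    'opp': ['BOS', 'NYK', 'BOS', 'LAL', 'LAL'],
    'pts': [30, 25, 40, 8, 12],
})

overlap = find_shared_games_sketch(gamelogs, 'Player A', 'Player B')
print(len(overlap), overlap['pts'].mean())
```

Note that `np.nanmean` in the cell above skips games where `pts` is missing, so the average may be taken over fewer games than the number found.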


***
# Betting Lines

In [None]:
from helper.bettingLinesScraper import scrape

start_year = 2010 #@param {type:"integer"}
end_year = 2022 #@param {type:"integer"}

betting_lines_dict = scrape(start_year, end_year)

Years:   0%|          | 0/12 [00:00<?, ?it/s]

Teams:   0%|          | 0/38 [00:00<?, ?it/s]

Initializing Records for 2010...
Getting conference lists ...


Initializing Records for 2011...
Getting conference lists ...

Initializing Records for 2012...
Getting conference lists ...

Initializing Records for 2013...
Getting conference lists ...

Initializing Records for 2014...
Getting conference lists ...

Initializing Records for 2015...
Getting conference lists ...

Initializing Records for 2016...
Getting conference lists ...

Initializing Records for 2017...
Getting conference lists ...

Initializing Records for 2018...
Getting conference lists ...

Initializing Records for 2019...
Getting conference lists ...

Initializing Records for 2020...
Getting conference lists ...

Initializing Records for 2021...
Getting conference lists ...


In [None]:
betting_lines_df = pd.DataFrame.from_dict(betting_lines_dict)
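
The log output above covers 2010 through 2021 and the progress bar counts 12 years, which suggests that `scrape` treats `end_year` as exclusive, much like Python's `range`. This is an inference from the output, not a statement about how `scrape` is implemented; the sketch below only shows the arithmetic.

```python
start_year = 2010
end_year = 2022

# range() excludes its upper bound, so 2022 itself is not included
years = list(range(start_year, end_year))
print(len(years), years[0], years[-1])  # 12 2010 2021
```

If the most recent season is needed, setting `end_year` one past the last desired year would be consistent with this behavior.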