In [None]:
EAST_TEAMS = ['MIL', 'BOS', 'PHI', 'CLE', 'NYK', 'BRK', 'MIA', 'ATL', 'TOR', 'CHI', 'IND', 'WAS', 'ORL', 'CHO', 'DET']
WEST_TEAMS = ['DEN', 'MEM', 'SAC', 'PHO', 'LAC', 'GSW', 'LAL', 'MIN', 'NOP', 'OKC', 'DAL', 'UTA', 'POR', 'HOU', 'SAS']

# The Game Plan

- Pages of the format https://www.basketball-reference.com/teams/{TEAM}/{YEAR}_games.html contain
an exhaustive schedule of a team's games during the regular season + playoffs

- From these pages, we can pull W/L as well as navigate to a page of 
the format https://www.basketball-reference.com/boxscores/{DATE}{HOME_TEAM}.html
where we can pull the officiating team. For this scraping step it's probably most straightforward
to find the path value from somewhere inside the main table element, but we could also recreate
the date serialization format basketball ref uses and use a dictionary from team names
to their shortened "tickers".

- In general, we may want to create a dictionary corresponding to each team that looks like this:
```python
{
    TEAM: {
        'YoY': {
            YEAR: {
                'W': XX,
                'L': XX,
                'WLBR': {
                    REF_NAME: {
                        'W': XX,
                        'L': XX
                    },
                    ...
                }
            },
            ...
        },

        # Can be calculated at a later step
        'Total': {
            'W': XX,
            'L': XX,
            'WLBR': {
                REF_NAME:{
                    'W': XX,
                    'L': XX
                },
                ...
            }
        }
    },
    ...
}
```
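
To make that shape concrete, here's a minimal sketch of how one team's entry could be filled in from a list of games. The `build_team_stats` and `empty_record` helpers and the `(year, won, ref_names)` tuple format are my own assumptions for illustration, and the sample games are made up. It also fills in `'Total'` in the same pass rather than at a later step, purely to keep the example short.

```python
from collections import defaultdict

def empty_record():
    # One W/L tally plus a per-referee breakdown ('WLBR').
    return {'W': 0, 'L': 0, 'WLBR': defaultdict(lambda: {'W': 0, 'L': 0})}

def build_team_stats(games):
    """games: iterable of (year, won, ref_names) tuples for ONE team (hypothetical format)."""
    stats = {'YoY': defaultdict(empty_record), 'Total': empty_record()}
    for year, won, ref_names in games:
        key = 'W' if won else 'L'
        # Update both the per-year record and the running total.
        for record in (stats['YoY'][year], stats['Total']):
            record[key] += 1
            for ref in ref_names:
                record['WLBR'][ref][key] += 1
    return stats

# Made-up games, for illustration only.
sample_games = [
    (2022, True, ['Ref A', 'Ref B']),
    (2022, False, ['Ref A']),
    (2023, True, ['Ref B']),
]
mia = build_team_stats(sample_games)
print(mia['Total']['W'], mia['Total']['L'])
print(dict(mia['Total']['WLBR']))
```

The full structure would then just be `{TEAM: build_team_stats(...)}` for each ticker.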

# More Gameplanning

The way this data is structured and the amount of variability in ways we may want to query it almost makes me want to do this all with SQL. I don't know if this makes me blind / an idiot or incredibly smart. In this case, maybe we spin up a local SQL driver? This makes our data collection and structuring process easier and more flexible...

In THIS case, we may need to define separate schemas for teams, games, and officials.
For now, let's avoid tracking data for teams and just run with the team "tickers".

```sql
Game:
    id: integer,
    good_guys: str,
    bad_guys: str,
    gg_points: integer,
    bg_points: integer,
    win: boolean,
    date: str

Game_Official: 
    game_id: integer,
    official_id: integer

Official:
    id: integer,
    name: str
```
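
As a rough sketch, the schema above could be turned into real tables with Python's built-in `sqlite3` module, assuming SQLite as the local database. The constraints (`NOT NULL`, `UNIQUE`, the composite primary key) and the choice to store `win` as a 0/1 integer are my assumptions, not decisions made above. SQLite has no dedicated boolean or string type, hence `INTEGER` and `TEXT`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Official (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS Game (
    id        INTEGER PRIMARY KEY,
    good_guys TEXT NOT NULL,
    bad_guys  TEXT NOT NULL,
    gg_points INTEGER NOT NULL,
    bg_points INTEGER NOT NULL,
    win       INTEGER NOT NULL CHECK (win IN (0, 1)),
    date      TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS Game_Official (
    game_id     INTEGER NOT NULL REFERENCES Game(id),
    official_id INTEGER NOT NULL REFERENCES Official(id),
    PRIMARY KEY (game_id, official_id)
);
"""

# An in-memory database keeps the example self-contained;
# a file path would be used for the real thing.
conn = sqlite3.connect(':memory:')
conn.executescript(SCHEMA)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```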

The other thing to note here is that we're going to scrape all games from the "perspective" 
of both teams. That means when we're writing queries we need to be careful to not double count.
On the other hand, this gives us a nice little invariant to test the robustness of
our scraping pipeline.
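
Here's one way that invariant could be checked, sketched against an in-memory copy of the `Game` table. The rows, scores, and ISO-style dates are made up, and the third game deliberately has only one perspective so the check has something to catch. Every matchup on a given date should show up exactly twice, with exactly one of the two rows marked as a win.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE Game (
    id INTEGER PRIMARY KEY, good_guys TEXT, bad_guys TEXT,
    gg_points INTEGER, bg_points INTEGER, win INTEGER, date TEXT)""")

# Made-up rows: the MIA/NYK game is missing the NYK perspective.
rows = [
    ('MIA', 'BOS', 110, 102, 1, '2022-01-01'),
    ('BOS', 'MIA', 102, 110, 0, '2022-01-01'),
    ('MIA', 'NYK', 95, 99, 0, '2022-01-03'),
]
conn.executemany(
    "INSERT INTO Game (good_guys, bad_guys, gg_points, bg_points, win, date) "
    "VALUES (?, ?, ?, ?, ?, ?)", rows)

# Group both perspectives of a game together by ordering the two tickers,
# then flag any group that is not exactly two rows with exactly one win.
unpaired = conn.execute("""
    SELECT date, MIN(good_guys, bad_guys), MAX(good_guys, bad_guys),
           COUNT(*), SUM(win)
    FROM Game
    GROUP BY date, MIN(good_guys, bad_guys), MAX(good_guys, bad_guys)
    HAVING COUNT(*) != 2 OR SUM(win) != 1
""").fetchall()
print(unpaired)
```

An empty result would mean the two perspectives agree everywhere. For queries that should count each game once, filtering on something like `good_guys < bad_guys` is one option, though it only works once the check above passes.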

# The Scraping

Okay, now we've set up our SQL driver and have all our tables in a row. We have a slick little DB class that makes use of the `__enter__` and `__exit__` methods and some fancy _encapsulation_ going on. Let's run a test scrape on all of the Miami Heat's 2021-22 NBA games and insert them into our local SQLite db.
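
The DB class itself isn't shown here, so the following is only a guess at its general shape: a hypothetical wrapper around `sqlite3` where `__enter__` opens the connection and `__exit__` commits or rolls back and then closes it. The method names `create_tables` and `insert_official` are invented for the example, and only the `Official` table is created to keep it short.

```python
import sqlite3

class DB:
    """Hypothetical wrapper; the notebook's real DB class is not shown."""

    def __init__(self, path):
        self._path = path
        self._conn = None  # kept private: callers go through the methods below

    def __enter__(self):
        self._conn = sqlite3.connect(self._path)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit on a clean exit, roll back if the block raised.
        if exc_type is None:
            self._conn.commit()
        else:
            self._conn.rollback()
        self._conn.close()
        self._conn = None
        return False  # never swallow exceptions

    def create_tables(self):
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS Official ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")

    def insert_official(self, name):
        cur = self._conn.execute(
            "INSERT INTO Official (name) VALUES (?)", (name,))
        return cur.lastrowid

with DB(':memory:') as db:
    db.create_tables()
    new_id = db.insert_official('Example Official')
    print(new_id)
```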

In [2]:
from bs4 import BeautifulSoup
import requests

TEAM = 'MIA'
YEAR = '2022'
url = f'https://www.basketball-reference.com/teams/{TEAM}/{YEAR}_games.html'

res = requests.get(url)
res.raise_for_status()  # fail loudly if the site rejects the request

In [None]:
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find(id='games')

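# Peek at the first game row. find('tr') skips the whitespace text nodes
# that iterating over table.tbody directly would also yield.
first_row = table.tbody.find('tr')
print(first_row)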
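
For the alternative mentioned in the game plan (recreating the box score URL instead of pulling the path out of the table), a sketch might look like the one below. The big ASSUMPTION is the date serialization: I'm treating `{DATE}` as `YYYYMMDD` followed by a literal `0`, which should be checked against a real link in the table before relying on it. `TEAM_TICKERS` is a partial, illustrative mapping and `boxscore_url` is a made-up helper name.

```python
from datetime import date

BASE_URL = 'https://www.basketball-reference.com'

# Partial, illustrative mapping from full team names to tickers.
TEAM_TICKERS = {
    'Miami Heat': 'MIA',
    'Boston Celtics': 'BOS',
    'Milwaukee Bucks': 'MIL',
}

def boxscore_url(game_date, home_team_name):
    # ASSUMPTION: {DATE} is YYYYMMDD plus a trailing '0'; verify against
    # an actual box score link before using this for scraping.
    ticker = TEAM_TICKERS[home_team_name]
    stamp = game_date.strftime('%Y%m%d')
    return f'{BASE_URL}/boxscores/{stamp}0{ticker}.html'

# Illustrative date only.
example_url = boxscore_url(date(2021, 10, 21), 'Miami Heat')
print(example_url)
```

Pulling the link straight from the table is still probably the more robust option, since it doesn't depend on guessing the URL format.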

# A Little Obstacle

Basketball Reference has a 20 requests/minute rate limit on their website to deter scrapers (like me). 
This is actually really annoying because we need to make a secondary request for every game on every
team's schedule to reach the page containing the officials' names. If we do some math here, scraping ~80 games 
for 30 teams for even just 10 years means 24,000 requests. Even if we're able to perfectly
execute 20 requests a minute, this takes 20 hours.
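
The arithmetic, plus a minimal sketch of a throttle that spaces requests out to stay under the limit. The `Throttle` class is a hypothetical helper, not something from a library; the clock and sleep functions are injectable only so it can be exercised without actually waiting.

```python
import time

GAMES_PER_SEASON = 80      # rough figure used above
TEAMS = 30
SEASONS = 10
REQUESTS_PER_MINUTE = 20

total_requests = GAMES_PER_SEASON * TEAMS * SEASONS
hours = total_requests / REQUESTS_PER_MINUTE / 60
print(total_requests, hours)

class Throttle:
    """Hypothetical helper: keeps calls to wait() at least min_interval seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The idea would be to create one `Throttle(REQUESTS_PER_MINUTE)` and call `wait()` right before each `requests.get(...)`. Since every game sits on two teams' schedules but shares a single box score page, caching box scores by URL could plausibly cut the secondary requests roughly in half.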

In [1]:
import pickle

# The context manager closes the file even if unpickling fails.
with open('pickles/MIA_2022.pickle', 'rb') as file:
    games, officials = pickle.load(file)