# Preparing RAPM Data with PBPStats

Ford Higgins

Calculating an all-in-one metric has long felt like a rite of passage in NBA analytics, at least to me. But gathering the data, parsing the play by play, and then tuning an RAPM model requires knowledge of every step in a data pipeline. On top of that, there are now several established metrics (DARKO, DRIP, EPM, LEBRON) that are more advanced than standard RAPM and seem about as good as possible given the publicly available data. It's still a good exercise to understand how the RAPM sausage is made, since RAPM is the basis for many of these metrics before they incorporate data from other sources such as box scores and height / weight measurements.

I'll cover how to get the data using Darryl Blackport's excellent `pbpstats` [package](https://github.com/dblackrun/pbpstats), transform the data so it fits the inputs for Ryan Davis's also excellent [RAPM tutorial](https://github.com/rd11490/NBA_Tutorials/tree/master/rapm), and then hand it off for the reader to finish. 

I'll note that Ryan also has a tutorial on [parsing play by play data](https://github.com/rd11490/NBA_Tutorials/tree/master/play_by_play_parser) that covers almost everything, but Darryl's package seems to cover even more and requires less code, though of course it means depending on the continued maintenance of another package. Hopefully this tutorial provides some insight into how the `pbpstats` package is structured and how it could be used for other analyses. The documentation for Darryl's package may be found [here](https://pbpstats.readthedocs.io/en/latest/index.html).

In [2]:
# some general Python imports
import numpy as np
import pandas as pd
import time
from tqdm import tqdm
from typing import List, Tuple

# pbpstats imports
import pbpstats
from requests.exceptions import ReadTimeout
from pbpstats.resources.enhanced_pbp.start_of_period import InvalidNumberOfStartersException
from pbpstats.client import Client
from pbpstats.resources.enhanced_pbp import enhanced_pbp_item
from pbpstats.data_loader.stats_nba.possessions.loader import TeamHasBackToBackPossessionsException
from pbpstats.resources.enhanced_pbp.rebound import EventOrderError

Next we set up the `pbpstats` client, which serves as our base for pulling all of the data from NBA.com via the package. We specify the directory where the data is saved, plus the source and data provider for each resource. The client automatically saves the files in the data directory if you have it set up, which allows future runs to use `file` as the `source` instead of `web`, limiting the hits to the NBA's API. Darryl has directions [here](https://pbpstats.readthedocs.io/en/latest/quickstart.html#setup-data-directory-optional-but-recommended) on setting up the data directory, or you can look at how I have it set up in the repo.

The data provider is another important choice, with options of `data_nba`, `live`, and `stats_nba`. Darryl has a good description of the differences between the providers [here](https://pbpstats.readthedocs.io/en/latest/quickstart.html#data-nba-com-vs-stats-nba-com). I initially tried using `stats_nba` for this project, but ran into several exceptions that I couldn't figure out how to fix (though Darryl [implied on Twitter](https://twitter.com/bballport/status/1465458943178723332) that `stats_nba` should have fewer issues) and ultimately went with `data_nba`. Therefore, if you are using `data_nba` as the data provider, you can remove all exceptions from the import list except (pun intended) `InvalidNumberOfStartersException`. If you do, also remove them from the `except` clause in `get_rapm_season_data()`, which references them by name.

However, `data_nba` only goes back to the 2016-17 season, so earlier seasons will likely need `stats_nba`, and you should expect some issues with the `stats_nba` play by play. Darryl put together a list of common errors and how to fix them manually [in the package wiki](https://github.com/dblackrun/pbpstats/wiki/Fixing-issues-with-raw-play-by-play).

This client is how the `pbpstats` package accesses different bits of data, so the initialization is important. We'll use it to pull season and game level objects.

In [9]:
data_provider = "data_nba"

settings = {
    "dir": "data/",
    "Games": {"source": "web", "data_provider": data_provider},
    "Possessions": {"source": "web", "data_provider": data_provider},
}
client = Client(settings)
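
Once a first run with `"source": "web"` has cached the responses in the data directory, later runs can read from disk instead. Here is a minimal sketch of switching between the two, assuming the same `dir`, `Games`, and `Possessions` keys used above. `build_settings` is a hypothetical helper for this tutorial, not part of `pbpstats`.

```python
def build_settings(data_dir: str, source: str, data_provider: str) -> dict:
    """Build a settings dict shaped like the one above (hypothetical helper)."""
    # "web" and "file" are the two sources discussed in this tutorial
    if source not in ("web", "file"):
        raise ValueError(f"source must be 'web' or 'file', got {source!r}")
    resource_settings = {"source": source, "data_provider": data_provider}
    return {
        "dir": data_dir,
        # copy the dict so each resource can be tweaked independently later
        "Games": dict(resource_settings),
        "Possessions": dict(resource_settings),
    }


# first run: pull from the web and let the client cache to data/
web_settings = build_settings("data/", "web", "data_nba")
# later runs: read the cached files instead of hitting the NBA's API
file_settings = build_settings("data/", "file", "data_nba")
```

Passing `file_settings` to `Client(...)` in place of `settings` should then reuse the cached files, provided the data directory is set up as described in the quickstart.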

This class defines the individual game's RAPM data, which is derived from the `pbpstats` `Game` object. It has one method for extracting the relevant possession level data from all of the game's possessions and another for transforming the data into an RAPM-structured dataframe.

If you ever want to do possession level analysis using `pbpstats`, you will need to loop through the possessions and their events as I do in the `gameDataRAPM.get_game_rapm_possessions()` method. You'll just need to adapt the section inside the second for loop to match your needs. Check out the documentation for [possession](https://pbpstats.readthedocs.io/en/latest/pbpstats.resources.possessions.html) and [enhanced PBP item objects](https://pbpstats.readthedocs.io/en/latest/pbpstats.resources.enhanced_pbp.html).
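
As a rough illustration of that loop pattern, the sketch below counts possession-ending events per offense team. It relies only on the attributes used in this tutorial (`possessions.items`, `events`, `is_possession_ending_event`, `offense_team_id`). The `SimpleNamespace` objects and team IDs are made-up stand-ins so the example runs without a network call. They are not real `pbpstats` objects.

```python
from collections import Counter
from types import SimpleNamespace


def count_possession_endings(game) -> Counter:
    """Count possession-ending events per offense team for a Game-like object."""
    counts = Counter()
    for possession in game.possessions.items:
        for event in possession.events:
            # swap this body out for whatever your analysis needs
            if getattr(event, "is_possession_ending_event", False):
                counts[possession.offense_team_id] += 1
    return counts


def fake_possession(team_id: int) -> SimpleNamespace:
    """Stand-in possession: one ordinary event, then one possession-ending event."""
    events = [
        SimpleNamespace(is_possession_ending_event=False),
        SimpleNamespace(is_possession_ending_event=True),
    ]
    return SimpleNamespace(offense_team_id=team_id, events=events)


fake_game = SimpleNamespace(
    possessions=SimpleNamespace(items=[fake_possession(t) for t in (100, 200, 100)])
)
ending_counts = count_possession_endings(fake_game)
```

With a real `Game` object from `client.Game(game_id)`, the same function should work as long as those attributes behave as they do in the class below.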

In [4]:
class gameDataRAPM:
    """
    Object for gathering and holding the RAPM-relevant data for each game.
    """
    def __init__(self):
        self.current_players = list()
        self.pbp_scores = list()
        self.pbp_offense_team = list()
        self.rapm_poss = list()
        self.rapm_df = pd.DataFrame()
    
    def get_game_rapm_possessions(self, game: 'pbpstats.objects.game.Game'):
        """
        Pulls the data necessary for building the RAPM matrix from the game's possessions.
        
        Args:
            game: A pbpstats Game object
        """
        # each game object has all of the possessions, which are accessed via `items`
        for possession in game.possessions.items:
            # we want the event data for each possession
            for possession_event in possession.events:
                # we only want the event data if it is the event that ended the possession
                if isinstance(possession_event, enhanced_pbp_item.EnhancedPbpItem) and possession_event.is_possession_ending_event:
                    # add all relevant data to our dataclass
                    self.current_players.append(possession_event.current_players)
                    self.pbp_scores.append(possession_event.score)
                    team_id = possession_event.get_offense_team_id()
                    if team_id in [0, 1]:
                        # happens with Excess Timeout Error
                        team_id = possession.offense_team_id
                    self.pbp_offense_team.append(team_id)

    def transform_rapm_possessions(self, game_id: str):
        """
        Uses the RAPM data to build the matrix that will be used in calculating RAPM.
        
        Args:
            game_id: The ID assigned to the game by the NBA
        """
        for idx, (lineups, offense_team, scores) in enumerate(zip(self.current_players, self.pbp_offense_team, self.pbp_scores)):
            prev_score = self.pbp_scores[idx-1][offense_team] if idx != 0 else 0
            pts_scored = scores[offense_team] - prev_score
            poss = list()
            defense_team = [team for team in lineups.keys() if team != offense_team].pop()
            try:
                # Occasionally the index is off. If error is thrown, need to manually
                # find the solution and add to the relevant overrides file.
                poss.extend(lineups[offense_team])
            except KeyError:
                print(game_id, idx, lineups, offense_team)
                # skip this possession so a short, malformed row isn't added
                continue
            poss.extend(lineups[defense_team])
            poss.append(1)
            poss.append(pts_scored)
            self.rapm_poss.append(poss)
        
        rapm_cols = [
            "offensePlayer1Id", "offensePlayer2Id", "offensePlayer3Id", "offensePlayer4Id", "offensePlayer5Id",
            "defensePlayer1Id", "defensePlayer2Id", "defensePlayer3Id", "defensePlayer4Id", "defensePlayer5Id",
            "possession",
            "points"
        ]
        self.rapm_df = pd.DataFrame(self.rapm_poss, columns=rapm_cols)
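
To make the row layout concrete, here is a small standalone sketch of the same transformation on made-up data. It assumes, as the class above does, that each possession's lineups and scores are dicts keyed by team ID. `build_rapm_rows`, the team IDs, and the player IDs are all hypothetical.

```python
from typing import Dict, List


def build_rapm_rows(
    lineups_by_poss: List[Dict[int, List[int]]],
    offense_teams: List[int],
    scores_by_poss: List[Dict[int, int]],
) -> List[List[int]]:
    """Return rows of 5 offense IDs + 5 defense IDs + possession flag + points."""
    rows = []
    for idx, (lineups, offense, scores) in enumerate(
        zip(lineups_by_poss, offense_teams, scores_by_poss)
    ):
        defense = next(team for team in lineups if team != offense)
        # points on this possession = offense's score now minus its score
        # at the end of the previous possession
        prev_score = scores_by_poss[idx - 1][offense] if idx else 0
        points = scores[offense] - prev_score
        rows.append(lineups[offense] + lineups[defense] + [1, points])
    return rows


home, away = 100, 200  # made-up team IDs
lineups = {home: [11, 12, 13, 14, 15], away: [21, 22, 23, 24, 25]}
rows = build_rapm_rows(
    [lineups, lineups, lineups],
    [home, away, home],
    [{home: 2, away: 0}, {home: 2, away: 3}, {home: 2, away: 3}],
)
# rows[0] -> home on offense, 2 points
# rows[1] -> away on offense, 3 points
# rows[2] -> home on offense, 0 points
```

Each row lines up with the `rapm_cols` list above: offense players first, then defense players, then the possession indicator and the points scored.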

These functions are the pipeline for pulling the data via `pbpstats` cleanly and without errors. The first function, `get_rapm_game()`, implements retry handling and backs off of requests by increasing the sleep time in between calls if a `ReadTimeout` is encountered. Other exceptions handled by `get_rapm_season_data()` include several that are thrown by `pbpstats` when an issue with the play by play is encountered. Although I didn't use formal logging to track these "bad games", they are saved in a `bad_games` dictionary with the game ID as the key and the error message as the value. As mentioned earlier, you can find ways to manually fix them in the [package wiki](https://github.com/dblackrun/pbpstats/wiki/Fixing-issues-with-raw-play-by-play).

In [7]:
def get_rapm_game(client: pbpstats.client.Client, game_id: str, sleep_time: float, num_retries: int) -> pbpstats.objects.game.Game:
    """Try getting the Game object, with some attempts at retry handling"""
    if num_retries >= 10:
        # give up without sleeping again; the caller skips the game on None
        return None
    time.sleep(sleep_time)
    try:
        return client.Game(game_id)
    except ReadTimeout:
        # retry pulling the game, tripling the wait before the next attempt
        return get_rapm_game(client, game_id, 3*sleep_time, num_retries + 1)

def get_rapm_season_data(client: pbpstats.client.Client, season: str) -> Tuple[pd.DataFrame, dict]:
    """
    Using a pbpstats client object, pull game data for the specified season and
    then get and transform possession level data so it can be fed into an RAPM
    model.
    
    Args:
      client: A pbpstats Client object
      season: The season to pull data, formatted as '2020-21' for the 2021 season
    """
    season = client.Season("nba", season, "Regular Season")
    rapm_games = list()
    bad_games = dict()
    
    for game in tqdm(season.games.final_games):
        sleep_time = 0.75
        num_retries = 0
        time.sleep(sleep_time)
        game_id = game['game_id']
        # reset so a failed pull can't reuse the previous game's data
        game_data = None
        try:
            game_data = client.Game(game_id)
        except (TeamHasBackToBackPossessionsException, EventOrderError, InvalidNumberOfStartersException) as e:
            # first two exceptions happen when using stats_nba as data provider
            # third exception happens when pulling new data with data_nba as data provider
            # and the event hasn't been added to the missing_starters.json file
            bad_games[game_id] = str(e)
        except ReadTimeout:
            # Only seems to occur with stats_nba as data provider (and web as source)
            print("Read timeout encountered, using retries and backoff.")
            game_data = get_rapm_game(client, game_id, sleep_time, num_retries)
        if game_data:
            rapm_game_data = gameDataRAPM()
            rapm_game_data.get_game_rapm_possessions(game_data)
            rapm_game_data.transform_rapm_possessions(game_id)
            rapm_games.append(rapm_game_data.rapm_df)
    
    season_rapm_data = pd.concat(rapm_games)
    return season_rapm_data, bad_games
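
The retry-and-backoff idea can also be written as a loop instead of recursion. The sketch below is one possible variant, not the approach used above. `fetch_with_backoff` and its parameters are hypothetical, and the built-in `TimeoutError` stands in for `ReadTimeout` so the example runs without `requests` or a network connection.

```python
import time
from typing import Callable, Optional, Type, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    retry_on: Type[Exception] = TimeoutError,
    initial_sleep: float = 0.75,
    multiplier: float = 3.0,
    max_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() until it succeeds, multiplying the wait after each failure."""
    wait = initial_sleep
    for _ in range(max_retries):
        try:
            return fetch()
        except retry_on:
            sleep(wait)
            wait *= multiplier
    # every attempt failed; the caller decides how to record the failure
    return None


# simulate a call that times out twice before succeeding
attempts = {"n": 0}


def flaky_fetch() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "game"


waits = []  # record the waits instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
```

In the pipeline above, this would presumably be called with `fetch=lambda: client.Game(game_id)` and `retry_on=ReadTimeout`. Note that with a multiplier of 3 the waits grow quickly (the tenth wait would be roughly four hours), so a smaller `max_retries` or a cap on the wait may be worth adding.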

That's all there is to it. Now just run the process from end to end with the function and you'll have RAPM data for the given year.

**Note that it takes 25-30 minutes to run for each season when using `data_nba` and about 1.5 hrs when using `stats_nba`.**

In [11]:
rapm_season_data, bad_games = get_rapm_season_data(client, "2020-21")

100%|██████████| 1080/1080 [21:37<00:00,  1.20s/it]

Testing to make sure no errors were encountered while pulling the games.

In [12]:
assert len(bad_games) == 0

If you want to calculate RAPM data for multiple seasons, all you have to do is run the same process for each year, add each year's dataframe to a list, and use `pd.concat()` on the list of dataframes to turn it into one large dataframe. I would recommend adding a `season` column first so you can more easily split the data by season if necessary.
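
A minimal sketch of that multi-season step, using tiny made-up dataframes in place of real season pulls. `combine_seasons` is a hypothetical helper, and the toy frames only carry two of the RAPM columns to keep the example short.

```python
from typing import Dict

import pandas as pd


def combine_seasons(season_frames: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Label each season's dataframe with a `season` column, then stack them."""
    labeled = [df.assign(season=season) for season, df in season_frames.items()]
    # ignore_index gives the combined frame a clean 0..n-1 index
    return pd.concat(labeled, ignore_index=True)


# toy stand-ins for the dataframes returned by get_rapm_season_data()
toy_frames = {
    "2019-20": pd.DataFrame({"possession": [1, 1], "points": [2, 0]}),
    "2020-21": pd.DataFrame({"possession": [1], "points": [3]}),
}
combined = combine_seasons(toy_frames)
```

With real data, the dictionary could be built by calling `get_rapm_season_data(client, season)` for each season string and keeping the first element of the returned tuple.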

Now off to Ryan's [RAPM tutorial](https://github.com/rd11490/NBA_Tutorials/tree/master/rapm)!