# How to Adjust NBA Team Rating for Strength of Schedule (SoS)
## Using an RAPM styled approach
This tutorial goes through my process of adjusting NBA Teams' Offensive and Defensive Ratings for strength of schedule (SoS).
First read my [**blog post**](https://blog.sradjoker.cc/posts/nba-sosadj/) on the same topic, before going any further. The blog post explains the details including the math necessary for understanding the code in this tutorial.  

The code for the RAPM-styled approach is adapted from [Ryan Davis' RAPM Tutorial](https://github.com/rd11490/NBA_Tutorials/tree/master/rapm), and I suggest you read that tutorial before continuing.

First let's import the necessary packages to run this notebook

In [1]:
import pandas as pd  # for processing data
import numpy as np  # for numerical operations on arrays
from tqdm import tqdm  # gives us a progress bar
import time  # for time related stuff
from sklearn.linear_model import RidgeCV

# don't raise warnings when chaining pandas operations
pd.options.mode.chained_assignment = None

Then we will load the team information into two variables. There are 30 teams in the NBA, and each team has a name and a team ID.
1. `teams_list` will have a list of all team IDs
2. `teams_dict` is a dictionary mapping the team IDs to the team names.

In [2]:
team_data = pd.read_csv("../data/NBA_teams_database.csv")
teams_list = team_data["TeamID"].tolist()
team_dict1 = team_data.to_dict(orient="records")
teams_dict = {team["TeamID"]: team["Team"] for team in team_dict1}

## Scraping the Data Required 
This section covers the scraping part of the tutorial. You can skip this section and go to the next one if you wish. The data has already been scraped and is available for the 2023-24 season in the [data folder](./data/NBA_BoxScores_Adv_2023.csv).  
We will be using the `nba_api` to get the necessary data. It should be installed already if you followed the instructions in [Readme](../README.md).
The team ratings, i.e. offensive, defensive and net ratings, can be found for each game by using the `boxscoreadvancedv3` endpoint. This endpoint needs the `GameID` to get the boxscores for both teams in that game. To get the `GameIDs` for all games played in the 2023-24 season, we will use the `leaguegamelog` endpoint.

In [3]:
from nba_api.stats.endpoints import leaguegamelog, boxscoreadvancedv3

# for 2023-24 season
season = "2023"
# get the information
stats = leaguegamelog.LeagueGameLog(
    player_or_team_abbreviation="T",
    season=season,
    season_type_all_star="Regular Season",
)
# output the information as pandas dataframe
df = stats.get_data_frames()[0]
# get the GameIDs as a list
game_ids = df["GAME_ID"].tolist()
# GameIDs are repeated twice, once for the home team and once for the away team
# We can use numpy unique to remove the duplicates
game_ids = np.unique(game_ids)
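
To see the de-duplication step in isolation, here is a small sketch with made-up game IDs (the ID strings are illustrative and were not pulled from the API). `np.unique` returns the distinct values in sorted order:

```python
import numpy as np

# Illustrative IDs: each game shows up once per team, so every ID appears twice
raw_ids = ["0022300002", "0022300001", "0022300001", "0022300002"]

# np.unique drops the repeats and sorts what is left
unique_ids = np.unique(raw_ids)
print(unique_ids.tolist())
```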

Now we have a list of `game_ids` to use in `boxscoreadvancedv3` endpoint. We just put the `game_ids` in a `for` loop to get the data for each game as a dataframe. We append the generated dataframe for each game to a list of dataframes `dfa`. Finally we can use `pandas.concat` to concatenate all the dataframes into a single dataframe for the season.  
This process might take a while (10-20 minutes, depending on the number of games played), so grab a coffee or a snack and come back after some time.
There is a small (maybe big) issue if you just run a vanilla `for` loop. The `stats.nba.com` endpoint we use to scrape the data times out when it receives too many requests in a short period of time, which results in an error:
```
HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
```
Any error will stop the `for` loop, and we would have to start over. To prevent this issue, we wrap the call to the endpoint in a `try` `except` block and retry the endpoint for that `gameId` until it succeeds.  
I found an elegant solution for this issue while creating this tutorial which is to use the [`tenacity`](https://github.com/jd/tenacity) package.
1. We import the necessary modules from tenacity:
   1. `retry`: decorator to enable retries on the function
   2. `stop_after_attempt`: to define the maximum number of attempts. I set it as `5`
   3. `wait_fixed`: to wait for a certain amount of fixed time before retrying. The number I use is `0.6` seconds [as recommended](https://github.com/swar/nba_api/issues/176) by the authors of the `nba_api`

In [4]:
from tenacity import retry
from tenacity.stop import stop_after_attempt
from tenacity.wait import wait_fixed

2. We add the `retry` decorator with the necessary options to the `get_boxscores` function, which has the `try` `except` block to handle errors

In [5]:
@retry(stop=stop_after_attempt(5), wait=wait_fixed(0.6))
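def get_boxscores(game_id):
    try:
        stats = boxscoreadvancedv3.BoxScoreAdvancedV3(game_id=game_id)
        # the second dataframe holds the team-level rows used in this tutorial
        df1 = stats.get_data_frames()[1]
    except Exception as error:
        print(error)
        # re-raise so that the retry decorator sees the failure and tries again
        raise
    return df1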

3. Now we run the `for` loop with the decorated `get_boxscores` function. Finally, we save the scraped data as a `csv` file in the data folder. 

In [6]:
dfa = []
for game_id in tqdm(game_ids):
    df1 = get_boxscores(game_id)
    dfa.append(df1)
df = pd.concat(dfa)
df.to_csv(f"./data/NBA_BoxScores_Adv_{season}.csv")

 13%|█▎        | 49/363 [00:59<06:26,  1.23s/it]

HTTPSConnectionPool(host='stats.nba.com', port=443): Max retries exceeded with url: /stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID=0022300050&RangeType=0&StartPeriod=0&StartRange=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000207782F5090>, 'Connection to stats.nba.com timed out. (connect timeout=30)'))


100%|██████████| 363/363 [05:27<00:00,  1.11it/s]
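
To make the retry behaviour concrete, here is a rough standard-library sketch of what a fixed-wait, limited-attempt retry does. It is not how `tenacity` is implemented, and `call_with_retries` and `flaky_request` are hypothetical names used only for illustration. One difference to keep in mind: this sketch re-raises the last error when it runs out of attempts, whereas `tenacity` by default wraps the final failure in its own `RetryError`.

```python
import time


def call_with_retries(func, attempts=5, wait_seconds=0.6):
    """Call func(); on any exception wait and retry, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed: {error}")
            time.sleep(wait_seconds)


# Hypothetical flaky call that times out twice before succeeding
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Read timed out.")
    return "boxscore"


# A short wait keeps the demo fast; the tutorial uses 0.6 seconds
result = call_with_retries(flaky_request, attempts=5, wait_seconds=0.01)
print(result, calls["count"])
```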


## Loading and Pre-Processing the Data
Now let's load the data. The data has a lot of columns we don't use, so we import only the columns we need by using the `usecols` option in `pandas.read_csv()`.

In [7]:
season = "2023"
cols = [
    "gameId",
    "teamName",
    "teamId",
    "offensiveRating",
    "defensiveRating",
    "netRating",
    "possessions",
]
df = pd.read_csv(f"./data/NBA_BoxScores_Adv_{season}.csv", usecols=cols)
cols = ["gameId", "tId", "team", "ORtg", "DRtg", "NRtg", "poss"]
df.columns = cols
df.head(4)

Unnamed: 0,gameId,tId,team,ORtg,DRtg,NRtg,poss
0,22300001,1610612754,Pacers,118.6,112.6,6.0,102.0
1,22300001,1610612739,Cavaliers,112.6,118.6,-6.0,103.0
2,22300002,1610612749,Bucks,110.0,104.0,6.0,100.0
3,22300002,1610612752,Knicks,104.0,110.0,-6.0,101.0
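
One detail worth knowing here: `pandas.read_csv` ignores the order of the names passed to `usecols` and returns the columns in file order. That appears to be why the renamed list puts `tId` before `team` even though `teamName` is listed before `teamId`. The sketch below uses a one-row toy CSV whose column order is an assumption, and shows how to reorder explicitly if you prefer not to rely on the file layout.

```python
import io

import pandas as pd

# Toy CSV; the column order here is assumed for illustration
csv_text = (
    "gameId,teamId,teamName,offensiveRating\n"
    "22300001,1610612754,Pacers,118.6\n"
)
cols = ["gameId", "teamName", "teamId", "offensiveRating"]

toy = pd.read_csv(io.StringIO(csv_text), usecols=cols)
file_order = list(toy.columns)  # follows the file, not `cols`

# Indexing with the list enforces the order we asked for
toy = toy[cols]
print(file_order)
print(list(toy.columns))
```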


As you can see in the printed table, each `gameId` has two entries, one for each team in the game. Each row has only the information for that team. But what we need is a combined row that also holds the opponent's information.  
We will use `pandas.groupby` to achieve that. The variable to apply the operation will be `gameId`. This operation will create a `groupby` object, on which further operations can be run.

In [8]:
df1 = df.groupby("gameId")
df1

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000207787B7710>

We then use the `nth` operation to get the 1st and 2nd rows of each game.


In [9]:
df1_1 = df1.nth(0)
df1_2 = df1.nth(1)
display(df1_1.head(2))
display(df1_2.head(2))

Unnamed: 0,gameId,tId,team,ORtg,DRtg,NRtg,poss
0,22300001,1610612754,Pacers,118.6,112.6,6.0,102.0
2,22300002,1610612749,Bucks,110.0,104.0,6.0,100.0


Unnamed: 0,gameId,tId,team,ORtg,DRtg,NRtg,poss
1,22300001,1610612739,Cavaliers,112.6,118.6,-6.0,103.0
3,22300002,1610612752,Knicks,104.0,110.0,-6.0,101.0


We can then rename the columns of the 1st dataframe, adding `1` to all its column names except the `gameId` column (which is needed for the merging operation later). For the 2nd dataframe, we similarly add `2` to the column names.

In [10]:
df1_1.columns = ["gameId"] + [s + "1" for s in df1_1.columns if s != "gameId"]
df1_2.columns = ["gameId"] + [s + "2" for s in df1_2.columns if s != "gameId"]
display(df1_1.head(2))
display(df1_2.head(2))

Unnamed: 0,gameId,tId1,team1,ORtg1,DRtg1,NRtg1,poss1
0,22300001,1610612754,Pacers,118.6,112.6,6.0,102.0
2,22300002,1610612749,Bucks,110.0,104.0,6.0,100.0


Unnamed: 0,gameId,tId2,team2,ORtg2,DRtg2,NRtg2,poss2
1,22300001,1610612739,Cavaliers,112.6,118.6,-6.0,103.0
3,22300002,1610612752,Knicks,104.0,110.0,-6.0,101.0


We then merge the two dataframes `df1_1` and `df1_2` on the column `gameId`, generating the dataframe we need.

In [11]:
df1_3 = pd.merge(df1_1, df1_2, on="gameId")
display(df1_3.head(2))

Unnamed: 0,gameId,tId1,team1,ORtg1,DRtg1,NRtg1,poss1,tId2,team2,ORtg2,DRtg2,NRtg2,poss2
0,22300001,1610612754,Pacers,118.6,112.6,6.0,102.0,1610612739,Cavaliers,112.6,118.6,-6.0,103.0
1,22300002,1610612749,Bucks,110.0,104.0,6.0,100.0,1610612752,Knicks,104.0,110.0,-6.0,101.0


One more step remains. What we have right now is one row for each game. But what we need is two rows for each game, as described in my [blog post](https://blog.sradjoker.cc/posts/nba-sosadj/#modified-srs-method). To get that dataframe, we repeat the process above with `0` and `1` flipped when performing the `nth` operation. Finally we concatenate the two dataframes `df1_3` and `df1_6` to get the combined dataframe with two rows for each game.

In [12]:
df1_4 = df1.nth(1)
df1_5 = df1.nth(0)
df1_4.columns = ["gameId"] + [s + "1" for s in df1_4.columns if s != "gameId"]
df1_5.columns = ["gameId"] + [s + "2" for s in df1_5.columns if s != "gameId"]
df1_6 = pd.merge(df1_4, df1_5, on="gameId")
df2 = pd.concat([df1_3, df1_6]).sort_values(by="gameId").reset_index(drop=True)
data = df2.copy()
data.head(4)

Unnamed: 0,gameId,tId1,team1,ORtg1,DRtg1,NRtg1,poss1,tId2,team2,ORtg2,DRtg2,NRtg2,poss2
0,22300001,1610612754,Pacers,118.6,112.6,6.0,102.0,1610612739,Cavaliers,112.6,118.6,-6.0,103.0
1,22300001,1610612739,Cavaliers,112.6,118.6,-6.0,103.0,1610612754,Pacers,118.6,112.6,6.0,102.0
2,22300002,1610612752,Knicks,104.0,110.0,-6.0,101.0,1610612749,Bucks,110.0,104.0,6.0,100.0
3,22300002,1610612749,Bucks,110.0,104.0,6.0,100.0,1610612752,Knicks,104.0,110.0,-6.0,101.0
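
As an aside, the same two-rows-per-game layout can probably be reached in one step by merging the dataframe with itself on `gameId` and dropping the rows where a team is paired with itself. This is an alternative sketch and not the method used above; the toy frame keeps only a few columns, with values copied from the printed tables.

```python
import pandas as pd

# Two games, one row per team (values taken from the output above)
games = pd.DataFrame(
    {
        "gameId": [22300001, 22300001, 22300002, 22300002],
        "tId": [1610612754, 1610612739, 1610612749, 1610612752],
        "team": ["Pacers", "Cavaliers", "Bucks", "Knicks"],
        "ORtg": [118.6, 112.6, 110.0, 104.0],
    }
)

# Self-merge pairs every row of a game with every row of the same game;
# the suffixes reproduce the tId1/tId2 naming used in this tutorial
paired = games.merge(games, on="gameId", suffixes=("1", "2"))

# Drop the rows where a team was matched with itself
paired = paired[paired["tId1"] != paired["tId2"]].reset_index(drop=True)
print(paired[["gameId", "team1", "ORtg1", "team2", "ORtg2"]])
```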


# Processing the Data
To process the data in a format required by the Ridge Regression algorithm `RidgeCV`, we define the following functions:
## map_teams()
1. Makes the matrix rows to be used in ridge regression
2. The weights for each team = 1/2
3. Equations per game are:  
$$\frac{1}{2}\hat{Team}^1_{OFF} + \frac{1}{2}\hat{Team}^2_{DEF} = Team^1_{OFF} $$
$$\frac{1}{2}\hat{Team}^2_{OFF} + \frac{1}{2}\hat{Team}^1_{DEF} = Team^2_{OFF} $$
4. The reason for doing this is that for unadjusted values of a game:
$$ Team^1_{OFF} = Team^2_{DEF} $$  
5. So,
$$ Team^1_{OFF} = 0.5\times Team^1_{OFF} + 0.5\times Team^2_{DEF} $$
6. Therefore I use a similar structure for estimating adjusted ratings


In [13]:
def map_teams(row_in, teams, scale):
    t1 = row_in[0]
    t2 = row_in[1]

    rowOut = np.zeros([len(teams) * 2])
    rowOut[teams.index(t1)] = scale
    rowOut[teams.index(t2) + len(teams)] = scale

    return rowOut
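
To see what one of these rows looks like, here is a small sketch for a made-up three-team league (the team IDs `101`, `102`, `103` are hypothetical). The first half of the row is the offensive slot of each team and the second half is the defensive slot; the function is repeated so the example runs on its own.

```python
import numpy as np


def map_teams(row_in, teams, scale):
    # Same logic as the function above
    t1, t2 = row_in[0], row_in[1]
    row_out = np.zeros(len(teams) * 2)
    row_out[teams.index(t1)] = scale  # offensive slot of team 1
    row_out[teams.index(t2) + len(teams)] = scale  # defensive slot of team 2
    return row_out


toy_teams = [101, 102, 103]  # hypothetical team IDs

# Team 103 on offense against team 101 on defense
row = map_teams(np.array([103, 101]), toy_teams, scale=0.5)
print(row)
```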

## convert_to_matrices()
1. Extracts the `tId1` and `tId2` columns from each row of the `data` dataframe.
2. Then maps those rows using `map_teams` function to get matrix X rows
3. Gets Y rows. Here Y is `ORtg1` i.e. we are trying to predict the offensive rating of the 1st team for every row

In [14]:
def convert_to_matricies(possessions, name, teams, scale=1):
    # extract only the columns we need
    # Convert the columns of team ids into a numpy matrix
    stints_x_base = possessions[["tId1", "tId2"]].to_numpy()
    # Apply our mapping function to the numpy matrix
    stint_X_rows = np.apply_along_axis(map_teams, 1, stints_x_base, teams, scale=scale)
    # Convert the column of target values into a numpy matrix
    stint_Y_rows = possessions[name].to_numpy()

    # return the X matrix and the Y vector
    return stint_X_rows, stint_Y_rows

## lambda_to_alpha()
- In the stats world (`R`), `glmnet()` is used for Ridge Regression and it uses the parameter $\lambda$. Most NBA stats people use $\lambda$ when discussing the regularization parameter. But `sklearn.linear_model.RidgeCV()` has a parameter $\alpha$, which isn't the same.
- So we need to convert $\lambda$ to $\alpha$ needed for Ridge CV. [More details here](https://stats.stackexchange.com/questions/160096/what-are-the-differences-between-ridge-regression-using-rs-glmnet-and-pythons)

In [15]:
def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0
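
As a quick worked example, the sketch below converts the three $\lambda$ values used later in this tutorial. The sample count of 726 is an assumption based on the 363 scraped games shown in the progress bar, with two rows per game.

```python
def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0


samples = 363 * 2  # assumed: 363 games, two rows per game
lambdas = [0.015, 0.075, 0.15]

alphas = [lambda_to_alpha(lam, samples) for lam in lambdas]
for lam, alpha in zip(lambdas, alphas):
    print(f"lambda={lam} -> alpha={alpha:.3f}")
```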

## calculate_netrtg()
1. Converts lambdas to alphas using `lambda_to_alpha` function
2. Defines the ridge regression problem using `scikit-learn`'s `RidgeCV` algorithm
3. `cv=5` is chosen i.e. k-fold cross-validation splitting strategy using `k=5`
4. `fit_intercept` is set to `True`. The fitted intercept is added later to the estimated coefficients to get the Offensive and Defensive ratings.
5. Gets coefficients and intercept
6. Adds the intercept to the coefficients to get the adjusted ratings. The adjusted net rating is the adjusted offensive rating minus the adjusted defensive rating.
7. Create and return adjusted ratings dataframe

In [16]:
def calculate_netrtg(train_x, train_y, lambdas, teams_list):
    alphas = [lambda_to_alpha(l, train_x.shape[0]) for l in lambdas]
    # create a 5 fold CV ridgeCV model. Our target data is not centered at 0, so we want to fit to an intercept.
    clf = RidgeCV(alphas=alphas, cv=5, fit_intercept=True)

    # fit our training data
    model = clf.fit(
        train_x,
        train_y,
    )

    # convert our list of teams into a mx1 matrix
    team_arr = np.transpose(np.array(teams_list).reshape(1, len(teams_list)))

    # extract our coefficients into the offensive and defensive parts
    coef_offensive_array = model.coef_[0 : len(teams_list)][np.newaxis].T
    coef_defensive_array = model.coef_[len(teams_list) : 2 * len(teams_list)][
        np.newaxis
    ].T
    # concatenate the offensive and defensive values with the team ids into a mx3 matrix
    team_id_with_coef = np.concatenate(
        [team_arr, coef_offensive_array, coef_defensive_array], axis=1
    )
    # build a dataframe from our matrix
    teams_coef = pd.DataFrame(team_id_with_coef)
    intercept = model.intercept_
    teams_coef.columns = ["tId", "aOFF", "aDEF"]
    teams_coef["aNET"] = teams_coef["aOFF"] - teams_coef["aDEF"]
    teams_coef["aOFF"] = teams_coef["aOFF"] + intercept
    teams_coef["aDEF"] = teams_coef["aDEF"] + intercept
    teams_coef["Team"] = teams_coef["tId"].map(teams_dict)
    results = teams_coef[["tId", "Team", "aOFF", "aDEF", "aNET"]]
    results = results.sort_values(by=["aNET"], ascending=False).reset_index(drop=True)
    return results, model, intercept
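
To check that "coefficient plus intercept" really behaves like a rating, here is a toy sketch on noise-free, made-up data. It swaps `RidgeCV` for an unregularized minimum-norm least-squares solve with `numpy`, so it illustrates the structure of the problem rather than the exact estimator used above. The three teams, their "true" ratings and the balanced schedule are all assumptions.

```python
import numpy as np

n_teams = 3
# Hypothetical "true" ratings for three made-up teams
true_off = np.array([118.0, 112.0, 106.0])
true_def = np.array([108.0, 112.0, 116.0])

# Balanced toy schedule: every team plays offense once against every other team
pairs = [(i, j) for i in range(n_teams) for j in range(n_teams) if i != j]
X = np.zeros((len(pairs), 2 * n_teams))
y = np.zeros(len(pairs))
for r, (t1, t2) in enumerate(pairs):
    X[r, t1] = 0.5  # offensive slot of team 1
    X[r, n_teams + t2] = 0.5  # defensive slot of team 2
    y[r] = 0.5 * true_off[t1] + 0.5 * true_def[t2]

# Fit an intercept by centering, then solve minimum-norm least squares
x_mean, y_mean = X.mean(axis=0), y.mean()
coef, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)
intercept = y_mean - x_mean @ coef

# Same post-processing as calculate_netrtg: add the intercept back
a_off = coef[:n_teams] + intercept
a_def = coef[n_teams:] + intercept
pred = np.array([0.5 * a_off[t1] + 0.5 * a_def[t2] for t1, t2 in pairs])
print(np.round(a_off, 2), np.round(a_def, 2))
```

With real, noisy and unbalanced schedules the ridge penalty shrinks the coefficients toward zero, so the recovered ratings are pulled toward the intercept.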

# Estimating Adjusted Ratings
Next, we run the functions defined above to generate the adjusted ratings.

In [17]:
train_x, train_y = convert_to_matricies(data, "ORtg1", teams_list, scale=0.5)
lambdas_net = [0.015, 0.075, 0.15]
results_adj, model, intercept = calculate_netrtg(
    train_x, train_y, lambdas_net, teams_list
)
print(f"Intercept = {intercept}")

Intercept = 114.2197043446658


The intercept here can be interpreted as the league average offensive/defensive rating.
Here are the adjusted ratings.

In [18]:
results_adj

Unnamed: 0,tId,Team,aOFF,aDEF,aNET
0,1610613000.0,Philadelphia 76ers,121.065207,110.772873,10.292335
1,1610613000.0,Boston Celtics,118.828331,108.764236,10.064095
2,1610613000.0,Oklahoma City Thunder,117.70269,110.814878,6.887812
3,1610613000.0,Minnesota Timberwolves,113.207243,106.62844,6.578803
4,1610613000.0,Denver Nuggets,118.395144,113.090987,5.304157
5,1610613000.0,LA Clippers,115.366218,111.06489,4.301329
6,1610613000.0,Orlando Magic,113.446035,109.345141,4.100894
7,1610613000.0,New York Knicks,117.214095,113.29121,3.922885
8,1610613000.0,Houston Rockets,111.781191,107.967128,3.814063
9,1610613000.0,Milwaukee Bucks,118.657846,115.338466,3.319381


# Finishing Touches
We're not done yet. Now we need to compare the adjusted ratings with the unadjusted ones. But, we haven't calculated the unadjusted ratings yet. Let's do it now.

For a single game:
$$ PTS_{OFF} \times 100 = ORtg^1 \times poss^1 $$
$$ PTS_{DEF} \times 100 = DRtg^1 \times poss^1 $$

Applying these operations on the `data` dataframe:

In [19]:
data["pts_off"] = data["ORtg1"] * data["poss1"]
data["pts_def"] = data["DRtg1"] * data["poss1"]

We have to use the `groupby` operation again, now on the `tId1` column. After the `groupby` operation, we chain an `agg` (aggregate) operation, which applies a function to all rows of the group. The function we chose here is `sum`, which adds up all the `pts` and `poss` for a team.

In [20]:
off_p = data.groupby(["tId1"])[["poss1", "pts_off"]].agg("sum").reset_index()
def_p = data.groupby(["tId1"])[["poss1", "pts_def"]].agg("sum").reset_index()

The unadjusted team ratings would then be:
$$ OFF = \frac{100 \times PTS_{OFF}^{Total}}{poss^{Total}} $$ 
$$ DEF = \frac{100 \times PTS_{DEF}^{Total}}{poss^{Total}} $$ 

In [21]:
off_p["OFF"] = off_p["pts_off"] / off_p["poss1"]
off_p = off_p[["tId1", "OFF"]]
def_p["DEF"] = def_p["pts_def"] / def_p["poss1"]
def_p = def_p[["tId1", "DEF"]]
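
Summing points and possessions before dividing gives a possession-weighted rating, which is not the same as averaging the per-game ratings. A minimal sketch with two made-up games for one hypothetical team shows the difference:

```python
# (ORtg, possessions) for one hypothetical team across two games
games = [(118.6, 102.0), (110.0, 90.0)]

# ORtg * poss is 100 x points scored, as in the pts_off column
pts_x100 = sum(ortg * poss for ortg, poss in games)
total_poss = sum(poss for _, poss in games)

weighted_off = pts_x100 / total_poss  # possession-weighted rating
simple_mean = sum(ortg for ortg, _ in games) / len(games)  # plain average
print(round(weighted_off, 2), round(simple_mean, 2))
```

The game with more possessions counts for more in the weighted version, so here it ends up above the plain average.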

We then merge these ratings with the `results_adj` dataframe.

In [22]:
results_net = pd.merge(off_p, def_p, on=["tId1"])
results_net["NET"] = results_net["OFF"] - results_net["DEF"]
results_net.rename(columns={"tId1": "tId"}, inplace=True)
results_net = results_net.astype(float).round(2)
results_net["tId"] = results_net["tId"].astype(int)
results_adj["tId"] = results_adj["tId"].astype(int)
results_comb = pd.merge(results_net, results_adj, on=["tId"])
results_comb["oSOS"] = results_comb["aOFF"] - results_comb["OFF"]
results_comb["dSOS"] = results_comb["DEF"] - results_comb["aDEF"]
results_comb["SOS"] = results_comb["oSOS"] + results_comb["dSOS"]
results_comb.iloc[:, 1:] = results_comb.iloc[:, 1:].round(1)
results = results_comb[
    ["Team", "OFF", "oSOS", "aOFF", "DEF", "dSOS", "aDEF", "NET", "SOS", "aNET"]
]
results = results.sort_values(by="aNET", ascending=False).reset_index(drop=True)
results.index = results.index + 1
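
The three SoS columns are simple differences, and together they account for the gap between adjusted and unadjusted net rating. As a sanity check, the sketch below redoes the arithmetic by hand with the rounded Denver Nuggets values from the final table below. A positive `oSOS` means the adjustment raised the offensive rating, and a positive `dSOS` means it lowered (improved) the defensive rating.

```python
# Rounded Denver Nuggets values from the final table in this tutorial
off, a_off = 117.3, 118.4
deff, a_def = 112.6, 113.1

o_sos = a_off - off  # offensive adjustment
d_sos = deff - a_def  # defensive adjustment (sign flipped: lower DEF is better)
sos = o_sos + d_sos

net = off - deff
a_net = a_off - a_def
print(round(o_sos, 1), round(d_sos, 1), round(sos, 1), round(a_net - net, 1))
```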

## Final Combined Data table:
You can save it as a `csv` file and then use some fancy visualization tool to create a [pretty looking table](https://twitter.com/SravanNBA/status/1725722980159045792) and/or an [efficiency landscape graph](https://twitter.com/SravanNBA/status/1727377558176661774).

In [23]:
results

Unnamed: 0,Team,OFF,oSOS,aOFF,DEF,dSOS,aDEF,NET,SOS,aNET
1,Philadelphia 76ers,121.2,-0.1,121.1,110.9,0.2,110.8,10.3,0.0,10.3
2,Boston Celtics,118.3,0.5,118.8,109.6,0.9,108.8,8.7,1.4,10.1
3,Oklahoma City Thunder,117.6,0.1,117.7,110.6,-0.2,110.8,7.0,-0.1,6.9
4,Minnesota Timberwolves,113.3,-0.1,113.2,106.6,-0.1,106.6,6.7,-0.2,6.6
5,Denver Nuggets,117.3,1.1,118.4,112.6,-0.5,113.1,4.7,0.6,5.3
6,LA Clippers,115.4,-0.1,115.4,110.6,-0.5,111.1,4.8,-0.5,4.3
7,Orlando Magic,113.8,-0.4,113.4,109.5,0.2,109.3,4.3,-0.2,4.1
8,New York Knicks,117.3,-0.1,117.2,113.3,0.0,113.3,4.0,-0.1,3.9
9,Houston Rockets,111.4,0.3,111.8,107.4,-0.5,108.0,4.0,-0.2,3.8
10,Milwaukee Bucks,119.3,-0.7,118.7,115.7,0.4,115.3,3.6,-0.3,3.3
