# Who guards the guards?

NBA teams value lockdown defenders -- players who can keep an opponent's star from controlling the game. Some defensive specialists are widely recognized around the league and play extensively despite limited offensive games, but others may be overlooked.

In this notebook, I'll use publicly-released data from the NBA to identify the players most frequently tasked with challenging defensive assignments -- limited to guards for now -- and look at some related questions, like:
- Does top-defender-dom persist from year to year?
- How are teams with two important offensive players defended? How do teams with two top defenders assign them?
- How do these matchups change in the playoffs?

## What's our universe of "important offensive players"?

We're going to start by using a Usage leaderboard, which reflects the proportion of a team's possessions in which that player was last to touch the ball (either by shooting it or turning it over). Because turnovers are most likely when a player is either dribbling or passing the ball, the turnover component is a reasonable proxy for how much time a player spends handling the ball, so Usage also captures players who handle the ball a lot even if they don't shoot as often themselves.
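As a rough illustration of that definition, here is a minimal sketch of a usage-style share computed from box-score totals. The function name `usage_share` and the toy numbers are hypothetical, and the 0.44 weight on free-throw attempts is a common convention for converting free throws into possessions, not something confirmed as the exact formula behind the USG_PCT column on stats.nba.com.

```python
def usage_share(fga, fta, tov, team_fga, team_fta, team_tov, ft_weight=0.44):
    """Share of team plays (shots, trips to the line, turnovers) ended by one player.

    All inputs are totals accumulated while the player is on the floor.
    ft_weight=0.44 is an assumed conversion from free-throw attempts to plays.
    """
    player_plays = fga + ft_weight * fta + tov
    team_plays = team_fga + ft_weight * team_fta + team_tov
    return player_plays / team_plays


# Toy numbers: one of five players who split the team's plays evenly.
even_share = usage_share(fga=10, fta=5, tov=2, team_fga=50, team_fta=25, team_tov=10)
print(round(even_share, 3))
```

With five players sharing the ball evenly, each ends up at one fifth of the team's plays, which is the baseline any usage threshold is measured against.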

We could consider incorporating Assist Ratio, which is the proportion of possessions for which that player receives credit for an assist (the last pass leading directly to a made shot), but assists are noisier than the components of usage. For one, assists are determined subjectively by official scorekeepers on the basis of whether that last pass was sufficiently proximate to the shot -- scorekeepers are tied to an arena and have been demonstrated to show a bias in awarding more assists to the home team. In addition, two passes of equal quality will not be treated identically because assists are only awarded if the shot is made, so a miss (or a shooting foul drawn) can't be assisted. As a result, Assist Ratio is dependent on whether the game is home or on the road, and on the shooting ability of a player's teammate (and to a smaller extent on the skill of the defender guarding that teammate). So we'll set it aside for now.

In addition, we'll focus on Guards for now -- we want a relatively homogeneous pool of offensive players so that a standout defender is likely to be matched up against most or all of them. In particular, a player who can match up against a point guard could also handle other perimeter players but not necessarily centers.

In [1]:
# Our first step will be to pull a leaderboard for Usage from stats.nba.com and turn it into a pandas dataframe.
# Here, I'm following the workflow helpfully laid out by Greg Reda (http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/)
# and Savvas Tjortjoglou (http://savvastjortjoglou.com/nba-shot-sharts.html) that they used to obtain other sets of stats from the same site.

import requests
import pandas as pd
import numpy as np
import seaborn as sns
from time import sleep
%matplotlib inline

In [2]:
# we'll save the URL as a string first

# this gets us regular-season data from 2018-19 in JSON format;
# the MeasureType=Advanced parameter gets us the Usage stat, among others
usage_url = 'https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country='+ \
                '&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height='+ \
                '&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0'+ \
                '&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience='+ \
                '&PlayerPosition=G&PlusMinus=N&Rank=N&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season'+ \
                '&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight='
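The long query string above can also be assembled from a dictionary, which makes it easier to swap the season or position later. This is only a sketch: the helper name `build_usage_url` is hypothetical, the parameter names and values are copied from the URL above, and I'm assuming the server does not care about parameter order.

```python
from urllib.parse import urlencode

BASE_URL = 'https://stats.nba.com/stats/leaguedashplayerstats'

# Parameters that are sent with an empty value in the URL above.
EMPTY_PARAMS = ['College', 'Conference', 'Country', 'DateFrom', 'DateTo', 'Division',
                'DraftPick', 'DraftYear', 'GameScope', 'GameSegment', 'Height',
                'Location', 'Outcome', 'PlayerExperience', 'SeasonSegment',
                'ShotClockRange', 'StarterBench', 'VsConference', 'VsDivision', 'Weight']


def build_usage_url(season='2018-19', season_type='Regular Season', position='G'):
    """Rebuild the leaderboard URL from a dict (hypothetical helper)."""
    params = dict.fromkeys(EMPTY_PARAMS, '')
    params.update({
        'LastNGames': 0, 'LeagueID': '00', 'MeasureType': 'Advanced', 'Month': 0,
        'OpponentTeamID': 0, 'PORound': 0, 'PaceAdjust': 'N', 'PerMode': 'PerGame',
        'Period': 0, 'PlayerPosition': position, 'PlusMinus': 'N', 'Rank': 'N',
        'Season': season, 'SeasonType': season_type, 'TeamID': 0, 'TwoWay': 0,
    })
    # urlencode turns the space in 'Regular Season' into '+', matching the URL above
    return BASE_URL + '?' + urlencode(params)


print(build_usage_url())
```

Calling `build_usage_url(season='2017-18')` would give the equivalent URL for the previous season without editing the string by hand.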

The server won't accept a request sent with the default headers from requests.get(), so we need to send the same headers it sees when I load the page manually in a browser.
I'm not sure how this conforms to the terms of service for the NBA Stats site, so I'm going to try to send as few GET requests as possible -- at least, no more than I would generate when playing around with the full site.

In [3]:
http_headers = {'Accept': 'application/json', 'x-nba-stats-token': 'true', 'X-NewRelic-ID': 'VQECWF5UChAHUlNTBwgBVw==',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131',
                'x-nba-stats-origin': 'stats', 'Referer': 'https://stats.nba.com/players/advanced/?sort=USG_PCT&dir=-1&CF=GP*G*5:MIN*G*20&Season=2018-19&SeasonType=Regular%20Season'}

usage_output = requests.get(usage_url, headers=http_headers)


In [26]:
# now take that JSON output and turn it into a dataframe
# (parse the response body once rather than calling .json() twice)
usage_result = usage_output.json()['resultSets'][0]
headers1 = usage_result['headers']
players1 = usage_result['rowSet']

usage_df = pd.DataFrame(players1, columns=headers1)

print(usage_df.shape)
print(list(usage_df))

(262, 73)
['PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION', 'AGE', 'GP', 'W', 'L', 'W_PCT', 'MIN', 'eOFF_RATING', 'OFF_RATING', 'sp_work_OFF_RATING', 'eDEF_RATING', 'DEF_RATING', 'sp_work_DEF_RATING', 'eNET_RATING', 'NET_RATING', 'sp_work_NET_RATING', 'AST_PCT', 'AST_TO', 'AST_RATIO', 'OREB_PCT', 'DREB_PCT', 'REB_PCT', 'TM_TOV_PCT', 'EFG_PCT', 'TS_PCT', 'USG_PCT', 'ePACE', 'PACE', 'sp_work_PACE', 'PIE', 'FGM', 'FGA', 'FGM_PG', 'FGA_PG', 'FG_PCT', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'eOFF_RATING_RANK', 'OFF_RATING_RANK', 'sp_work_OFF_RATING_RANK', 'eDEF_RATING_RANK', 'DEF_RATING_RANK', 'sp_work_DEF_RATING_RANK', 'eNET_RATING_RANK', 'NET_RATING_RANK', 'sp_work_NET_RATING_RANK', 'AST_PCT_RANK', 'AST_TO_RANK', 'AST_RATIO_RANK', 'OREB_PCT_RANK', 'DREB_PCT_RANK', 'REB_PCT_RANK', 'TM_TOV_PCT_RANK', 'EFG_PCT_RANK', 'TS_PCT_RANK', 'USG_PCT_RANK', 'ePACE_RANK', 'PACE_RANK', 'sp_work_PACE_RANK', 'PIE_RANK', 'FGM_RANK', 'FGA_RANK', 'FGM_PG_RANK', 'FGA_PG_RANK', 'FG_




This has the stats we want, but includes plenty of players we aren't interested in, so we'll apply a couple filters to eliminate guys who played a limited number of games (or minutes per game), and then pare down to the Usage leaders based on a threshold -- between 20% (if each team shared the ball perfectly) and 30% (the 10th-highest player in the league).
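One way to see how sensitive the pool size is to the cutoff is to count how many players clear each candidate threshold. The sketch below uses a made-up list of usage values (not the real leaderboard) and a hypothetical helper `count_at_or_above`; with the real data, the `USG_PCT` column of `usage_df` (after the minutes and games filters) would take the place of the toy list.

```python
def count_at_or_above(usage_values, thresholds):
    """Map each threshold to the number of players at or above it."""
    return {t: sum(u >= t for u in usage_values) for t in thresholds}


# Made-up usage values for illustration only.
toy_usage = [0.18, 0.21, 0.24, 0.26, 0.29, 0.31, 0.40]

counts = count_at_or_above(toy_usage, [0.20, 0.24, 0.30])
for threshold, n_players in counts.items():
    print(threshold, n_players)
```

Scanning a few thresholds like this makes it easy to pick the one that yields a pool of roughly the desired size.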

In [7]:
# now we'll implement our filters -- a threshold of 24% gives us 30 players, an average of one per team
usage_leaders = usage_df.loc[(usage_df['MIN'] >= 24.0) & (usage_df['GP'] >= 20) & (usage_df['USG_PCT'] >= .24)]

print(usage_leaders.shape)
print(usage_leaders.head())

(30, 73)
    PLAYER_ID       PLAYER_NAME     TEAM_ID TEAM_ABBREVIATION   AGE  GP   W  \
19     203078      Bradley Beal  1610612764               WAS  25.0  82  32   
25    1627741       Buddy Hield  1610612758               SAC  26.0  82  39   
27     203468       CJ McCollum  1610612757               POR  27.0  70  43   
35    1629012     Collin Sexton  1610612739               CLE  20.0  82  19   
39    1626156  D'Angelo Russell  1610612751               BKN  23.0  81  42   

     L  W_PCT   MIN         ...          PACE_RANK  sp_work_PACE_RANK  \
19  50  0.390  36.9         ...                127                127   
25  43  0.476  31.9         ...                 64                 64   
27  27  0.614  33.9         ...                171                171   
35  63  0.232  31.8         ...                246                246   
39  39  0.519  30.2         ...                114                114   

    PIE_RANK  FGM_RANK  FGA_RANK  FGM_PG_RANK  FGA_PG_RANK  FG_PCT_RANK  CFID

## Who guards those players?

In [63]:
# now we take the list of important offensive players' player IDs
offensive_list = usage_leaders['PLAYER_ID'].tolist()

print(offensive_list[0:5])
print(len(offensive_list))
# note that these appear as integers, and they're in first-name alpha order

[203078, 1627741, 203468, 1629012, 1626156]
30


Let's start with one important offensive player (possibly the most important offensive player, certainly as judged by Usage), James Harden, and see what we're looking at.

In [25]:
# and use it as the source for a new query to stats.nba.com to get the list of players they matched up against
# we'll start with a single example (James Harden) and then generalize to the list

matchup_url = 'https://stats.nba.com/stats/leagueseasonmatchups?DateFrom=&DateTo=&LeagueID=00&OffPlayerID=' + \
              str(201935) + '&Outcome=&PORound=0&PerMode=Totals&Season=2018-19&SeasonType=Regular+Season'

http_headers2 = {'Accept': 'application/json', 'x-nba-stats-token': 'true', 'X-NewRelic-ID': 'VQECWF5UChAHUlNTBwgBVw==',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131',
                'x-nba-stats-origin': 'stats', 'Referer': 'https://stats.nba.com/player/' + str(201935) + '/matchups/?Season=2018-19&SeasonType=Regular%20Season&PerMode=Totals'}

matchup_output = requests.get(matchup_url, headers=http_headers2)

headers2 = matchup_output.json()['resultSets'][0]['headers']
players2 = matchup_output.json()['resultSets'][0]['rowSet']

harden_df = pd.DataFrame(players2, columns=headers2)

print(harden_df.head())
harden_poss = harden_df['POSS'].sum()
print("James Harden played 78 games this year, and we have {} possessions worth of matchup data for him, or {:.1f} per game.".format(harden_poss, harden_poss / 78))

   OFF_TEAM_ID OFF_TEAM_ABBREVIATION OFF_TEAM_CITY OFF_TEAM_NICKNAME  \
0   1610612745                   HOU       Houston           Rockets   
1   1610612745                   HOU       Houston           Rockets   
2   1610612745                   HOU       Houston           Rockets   
3   1610612745                   HOU       Houston           Rockets   
4   1610612745                   HOU       Houston           Rockets   

   OFF_PLAYER_ID OFF_PLAYER_NAME  DEF_TEAM_ID DEF_TEAM_ABBREVIATION  \
0         201935    James Harden   1610612760                   OKC   
1         201935    James Harden   1610612740                   NOP   
2         201935    James Harden   1610612762                   UTA   
3         201935    James Harden   1610612742                   DAL   
4         201935    James Harden   1610612756                   PHX   

   DEF_TEAM_CITY DEF_TEAM_NICKNAME    ...      FGA  FGA_DIFF  FG_PCT  FG3M  \
0  Oklahoma City           Thunder    ...       39  0.795922  

This looks good! James Harden played 4 games against OKC, New Orleans, Dallas, and Utah, and 3 against Phoenix, so he likely faced those teams' primary defenders most often. In addition, OKC and New Orleans have individual defenders who are very highly-regarded in Paul George and Jrue Holiday (both of whom matched up the most against Harden). Utah has a top-rated team defense but their top defender plays at center, while Phoenix is near the bottom of the league -- the shooting percentage (FG_PCT) and 3-point percentage (FG3_PCT) fields seem to roughly agree with the NBA landscape.

In addition, we can look at the bottom line of the printout -- showing we have about 74 possessions per game worth of matchup data -- and see that this API output is **complete**. James Harden averaged a little more than 36 minutes per game (games are 48 minutes long, plus a few overtime games per season), and NBA teams play roughly 100 possessions each over the course of a game, with Harden's Rockets (97.9) a little slower than the league average. Three-quarters of a game at that pace works out to a bit more than 73 possessions, so this is about what we would expect for a dataset that includes every possession where the offensive player is on the floor, whether he touches the ball or not.
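The back-of-the-envelope check above can be written out explicitly. This is a sketch using the figures quoted in the text: 36.0 minutes stands in for "a little more than 36", 97.9 is the pace figure quoted for Houston, and overtime is ignored; `expected_possessions` is a hypothetical helper.

```python
def expected_possessions(minutes_per_game, team_pace, game_minutes=48.0):
    """Possessions a player should be on the floor for, per game.

    team_pace is the team's possessions per 48-minute game.
    """
    return minutes_per_game / game_minutes * team_pace


estimate = expected_possessions(minutes_per_game=36.0, team_pace=97.9)
print(round(estimate, 1))
```

Because 36.0 is a lower bound on his minutes, the true expectation is slightly higher, which lines up with the roughly 74 possessions per game in the matchup data.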

### Next:

Now, we generalize the above matchup-generating process over each offensive player in our list, to get a set of defenders who have faced at least one offensive player a non-trivial number of times.

In [10]:
# this definitely doesn't feel pythonic, so let's assume it'll get tinkered with over time
# we include some additional parameters for the season and regular vs playoffs so we can re-use the function later
def matchups(players, season='2018-19', seasontype='Regular+Season'):
    # collect one dataframe per offensive player and concatenate them once at the end,
    # which is much faster than growing a dataframe inside the loop
    frames = []

    # we iterate over the list of offensive player ids
    # (player_id rather than id, so we don't shadow the built-in)
    for player_id in players:
        req_url = 'https://stats.nba.com/stats/leagueseasonmatchups?DateFrom=&DateTo=&LeagueID=00&OffPlayerID=' + \
              str(player_id) + '&Outcome=&PORound=0&PerMode=Totals&Season='+ season +'&SeasonType='+ seasontype

        # the Referer mirrors the page a browser would load, so it follows the seasontype argument
        # (with %20 instead of + for the space) rather than always saying Regular Season
        req_headers = {'Accept': 'application/json', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.9',
                'Connection': 'keep-alive', 'x-nba-stats-token': 'true', 'X-NewRelic-ID': 'VQECWF5UChAHUlNTBwgBVw==',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131',
                'x-nba-stats-origin': 'stats', 'Referer': 'https://stats.nba.com/player/' + str(player_id) + '/matchups/?Season='+ season +'&SeasonType='+ seasontype.replace('+', '%20') +'&PerMode=Totals'}

        req_output = requests.get(req_url, headers=req_headers)

        # this prevents requests from pinging the server too quickly, which may cause it to reject the connection
        sleep(.5)

        # parse the JSON body once and build the dataframe for this offensive player
        result = req_output.json()['resultSets'][0]
        df = pd.DataFrame(result['rowSet'], columns=result['headers'])

        # this limits the list to defenders who faced that offensive player a non-trivial number of times
        frames.append(df.loc[df['POSS'] >= 20])

    # pd.concat raises on an empty list, so return an empty dataframe if there were no ids
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

In [11]:
# we'll call our function
matchup_df = matchups(offensive_list)

# sanity check: how many matchups were for exactly 20 possessions
matchup_df.loc[(matchup_df['POSS'] == 20)]

Unnamed: 0,OFF_TEAM_ID,OFF_TEAM_ABBREVIATION,OFF_TEAM_CITY,OFF_TEAM_NICKNAME,OFF_PLAYER_ID,OFF_PLAYER_NAME,DEF_TEAM_ID,DEF_TEAM_ABBREVIATION,DEF_TEAM_CITY,DEF_TEAM_NICKNAME,...,FGA,FGA_DIFF,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,SFL,DEF_FOULS,OFF_FOULS
92,1610612764,WAS,Washington,Wizards,203078,Bradley Beal,1610612766,CHA,Charlotte,Hornets,...,2,0.399743,0.500,0,0,0.000,2,0,0,0
178,1610612758,SAC,Sacramento,Kings,1627741,Buddy Hield,1610612744,GSW,Golden State,Warriors,...,7,1.448090,0.429,1,2,0.500,0,0,0,0
179,1610612758,SAC,Sacramento,Kings,1627741,Buddy Hield,1610612745,HOU,Houston,Rockets,...,8,1.654960,0.625,3,4,0.750,0,0,0,0
180,1610612758,SAC,Sacramento,Kings,1627741,Buddy Hield,1610612748,MIA,Miami,Heat,...,7,1.448090,0.143,1,6,0.167,3,1,1,0
251,1610612757,POR,Portland,Trail Blazers,203468,CJ McCollum,1610612740,NOP,New Orleans,Pelicans,...,4,0.794634,0.500,1,3,0.333,0,0,1,0
252,1610612757,POR,Portland,Trail Blazers,203468,CJ McCollum,1610612762,UTA,Utah,Jazz,...,3,0.595976,0.333,0,0,0.000,1,1,0,1
253,1610612757,POR,Portland,Trail Blazers,203468,CJ McCollum,1610612747,LAL,Los Angeles,Lakers,...,2,0.397317,0.500,1,2,0.500,0,0,0,0
254,1610612757,POR,Portland,Trail Blazers,203468,CJ McCollum,1610612747,LAL,Los Angeles,Lakers,...,6,1.191951,0.333,0,2,0.000,0,0,0,0
323,1610612739,CLE,Cleveland,Cavaliers,1629012,Collin Sexton,1610612759,SAS,San Antonio,Spurs,...,3,0.654350,0.667,0,0,0.000,0,0,0,0
324,1610612739,CLE,Cleveland,Cavaliers,1629012,Collin Sexton,1610612763,MEM,Memphis,Grizzlies,...,5,1.090583,0.600,0,1,0.000,0,0,0,0


OK, so now we have a (very) long list of matchups between our 30 offensive players and every defender who has matched up with them on at least 20 possessions this season. We can switch our focus to the individual defenders fairly easily in pandas, by creating a groupby object.

So let's start by looking at some summary stats for our defenders to get a better feel for the matchups dataset.

In [12]:
# group and aggregate by defender (as a pandas groupby object)
defenders = matchup_df.groupby(['DEF_PLAYER_NAME'])

# explore properties of this construction of the data
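# (select multiple columns with a list; the bare-tuple form is not supported by newer pandas versions)
defenders[['GP','POSS']].agg(['count', np.sum, np.max])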

Unnamed: 0_level_0,GP,GP,GP,POSS,POSS,POSS
Unnamed: 0_level_1,count,sum,amax,count,sum,amax
DEF_PLAYER_NAME,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Aaron Gordon,3,8,4,3,157,57
Aaron Holiday,2,5,3,2,57,35
Abdel Nader,1,2,2,1,20,20
Al-Farouq Aminu,3,10,4,3,100,49
Alec Burks,10,15,2,10,383,61
Alex Abrines,2,4,2,2,45,24
Alex Caruso,5,7,2,5,130,36
Allen Crabbe,6,15,4,6,225,90
Allonzo Trier,8,21,4,8,217,34
Andre Iguodala,9,24,4,9,274,47


Keeping in mind that our unit of analysis is an offensive player-defensive player matchup, both of the above COUNT columns refer to the number of offensive players (from our original list of 30) that a defender matched up against. So Avery Bradley faced just about every offensive standout in the league (28 of the 30, and it's possible one of the other two is his teammate), while Brandon Ingram (who plays small forward and often guards larger players, and who missed time this season with injuries) has matched up against about half of them.

The SUM columns tell you how many games (GP) and possessions (POSS) those matchups add up to across all of the offensive players a defender faced -- because the NBA has an unbalanced schedule (with teams playing divisional and conference rivals more often), and because most players don't play every game of the year, we expect some variation here. Bradley Beal played all 82 games this season, and accrued 57 game-matchups against offensive stars (although since some of those stars may be teammates of each other, this may not be 57 distinct games).

The MAX columns are really just a sanity check -- it's possible for two players to meet more than four times in a regular season, but it would require one or both of them to change teams (by being traded, for example). Similarly, since a star player will be on the court for 70-80 possessions per game, and defenders don't always end up matched up against their main assignment (because their rests don't align, defensive switches, or other reasons) the maximum possessions for a given matchup tops out at 40/game (e.g., for CJ McCollum), which seems reasonable.
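To make the three aggregations concrete, here is a small sketch on a made-up matchup table (the defender and star names and all numbers are invented). The aggregations are passed as strings here, so the last column is labeled `max`.

```python
import pandas as pd

# Toy matchup rows: one row per offensive player-defender pair (all values made up).
toy_matchups = pd.DataFrame({
    'DEF_PLAYER_NAME': ['Defender A', 'Defender A', 'Defender A', 'Defender B'],
    'OFF_PLAYER_NAME': ['Star 1', 'Star 2', 'Star 3', 'Star 1'],
    'GP': [4, 2, 3, 2],
    'POSS': [57, 35, 65, 20],
})

# count = number of stars faced, sum = total games/possessions across those stars,
# max = the single largest matchup
summary = toy_matchups.groupby('DEF_PLAYER_NAME')[['GP', 'POSS']].agg(['count', 'sum', 'max'])
print(summary)
```

Defender A faced three stars for 157 possessions in total, with the heaviest single matchup at 65 possessions; Defender B appears only once, so count, sum, and max all describe the same row.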

### Next

Let's filter this list down to generate a set of 'standout defenders' who frequently take the most difficult defensive assignments. We'll use both POSS.count, which indicates a given defender has been matched up against many of our list of 30 offensive leaders, and POSS.mean, so the defender has covered them lots of times over the course of the season.

This does introduce a little bit of bias, since the unequal schedule and non-uniform distribution of offensive leaders means **some players have a larger opportunity space than others**. In other words, it's hard to identify a standout defender if they never had a star to defend... which is a problem intrinsic to this source of data.

In [16]:
# identify a threshold value to create our list of standout defenders
some_defs = defenders.filter(lambda x: (x['POSS'].count() >= 10) and (x['POSS'].mean() >= 50))

# we create a new groupby object for the filtered list
top_defs = some_defs.groupby(['DEF_PLAYER_NAME'])

top_defs['POSS'].agg(['count', 'sum'])

Unnamed: 0_level_0,count,sum
DEF_PLAYER_NAME,Unnamed: 1_level_1,Unnamed: 2_level_1
Avery Bradley,28,1510
CJ McCollum,18,1092
D'Angelo Russell,11,624
D.J. Augustin,10,803
Damian Lillard,16,1027
Darren Collison,13,1011
De'Aaron Fox,20,1315
DeMar DeRozan,11,550
Derrick White,18,1060
Eric Bledsoe,19,1711


It turns out that the resulting list is fairly sensitive to the value we choose for POSS.mean, so we're going to start with a threshold that gives us slightly fewer than 30 defenders. Hopefully this list will line up closely with the players who have a reputation for being good defenders, which would validate this methodology (of using matchups). Then we can loosen some of the parameters and look for players who may have flown under the radar.

In [64]:
# is there any overlap between the lists (two-way players)?
defs_list = some_defs['DEF_PLAYER_ID'].unique().tolist()

two_way_list = set(offensive_list).intersection(set(defs_list))
print(two_way_list)

{202691, 203081, 1626156, 203468, 201942, 201950}


In [66]:
two_way_names = some_defs['DEF_PLAYER_NAME'].loc[some_defs['DEF_PLAYER_ID'].isin(two_way_list)].unique()
# display the names (an assignment alone produces no output)
two_way_names

array(['CJ McCollum', 'Klay Thompson', 'Jrue Holiday', "D'Angelo Russell",
       'Damian Lillard', 'DeMar DeRozan'], dtype=object)

### Is this consistent from year to year?

In [None]:
# we quickly repeat the same exercise but for the 2017-18 season (without using the cutoff for defenders,
# for either season, to make the output richer)
# pair the years against each other by defensive player
# plot pairwise in a scatterplot
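A minimal sketch of the pairing step described in the comments above, using made-up possession totals for two seasons. The frame names `poss_1718` and `poss_1819`, the player IDs, and the numbers are all hypothetical; in practice each frame would come from summing POSS by defender in the output of `matchups()` for that season.

```python
import pandas as pd

# Made-up totals of possessions defended against high-usage players, by defender.
poss_1718 = pd.DataFrame({'DEF_PLAYER_ID': [1, 2, 3, 4], 'POSS': [100, 200, 300, 150]})
poss_1819 = pd.DataFrame({'DEF_PLAYER_ID': [1, 2, 3, 5], 'POSS': [110, 190, 310, 80]})

# An inner merge keeps only defenders who appear in both seasons.
paired = poss_1718.merge(poss_1819, on='DEF_PLAYER_ID', suffixes=('_1718', '_1819'))

# Pearson correlation as one rough measure of year-to-year persistence.
persistence = paired['POSS_1718'].corr(paired['POSS_1819'])
print(paired)
print(round(persistence, 3))
```

The paired frame has one row per defender with a column for each season, which is also the shape a scatterplot of one season against the other would need.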

### Elite teammates

In [None]:
# identify cases in the single-season data where two players from the same team are both
# 1) important offensive players or 2) defensive standouts
# do their matchups look different from others?

### The Playoffs

In [None]:
# return to stats.nba.com to pull playoff data (probably 2017-18 for now)
# look to see if the following patterns hold:
# - proportion of high-usage players (since rotations shorten)
# - ability of defenders to retain their matchups (more switching)
# - new names (Iguodala)?