# NBA Data EDA

Matthew Chang

__Data Source__ \
https://www.reddit.com/r/nbadiscussion/comments/11ivrnb/recommended_websites_for_advanced_basketball/

- WOWY 
    - https://www.addmorefunds.com/nba/wowy/
    - UI to pull data
    - Can get the same data in PBP
- Crafted NBA
    - https://craftednba.com/player-roles
    - Cool UI to validate our archetype classification
- NBA RAPM
    - Mostly about player efficiency data
    - Data from 1996 to 2019
    - https://basketball-analytics.gitlab.io/rapm-data/
- Thinking Basketball
    - Costs $4
    - Could be useful, as it has data on NBA skills
- NBA Stuffer
    - Has good definitions of some statistics: https://www.nbastuffer.com/analytics-101/player-evaluation-metrics/
    - Doesn't appear to have any data beyond what already exists in Master Data or PBP
- BBall Index
    - Has offense archetypes that we can use for validation
    - https://www.bball-index.com/offensive-archetypes/
- Clean the Glass
    - Try free for a week ($5 per month afterwards)
    - Has clean data already organized by zone
    - Doesn't appear to have any data beyond what already exists in Master Data or PBP

In [3]:
import os
import pandas as pd
import pyspark

### Investigating Master Data
- Key column is built from PID (player ID) and season, in the format {pid}_{szn}
- Data covers the 2014-15 season through the 2019-20 season
- Data is aggregated to the season level
- Does not include playoff data
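
The `{pid}_{szn}` key can be split back into its two parts when a player ID is needed on its own, for example to join against the ID-to-name dictionary. This is a minimal sketch on a toy frame; the `pid` and `szn` column names are hypothetical and not part of the Master Data files.

```python
import pandas as pd

# Toy frame reusing two pidSzn keys from the MasterDefense preview.
df = pd.DataFrame({"pidSzn": ["201985_2014-15", "201166_2014-15"]})

# Split on the first underscore only: the season part contains a hyphen
# but no underscore, so n=1 yields exactly two columns.
parts = df["pidSzn"].str.split("_", n=1, expand=True)
df["pid"] = parts[0].astype(int)  # hypothetical column name
df["szn"] = parts[1]              # hypothetical column name

print(df)
```

Casting `pid` to an integer is a choice made here so it can match an integer `id` column; keep it as a string if the join target stores IDs as text.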

Some additional data we could add:
- shot zone data

In [4]:
master_clutch_df = pd.read_csv("./MasterData/MasterClutch.csv", index_col=0)
master_defense_df = pd.read_csv("./MasterData/MasterDefense.csv", index_col=0)
master_misc_df = pd.read_csv("./MasterData/MasterMisc.csv", index_col=0)
master_pass_df = pd.read_csv("./MasterData/MasterPass.csv", index_col=0)
master_rebound_df = pd.read_csv("./MasterData/MasterRebound.csv", index_col=0)
master_score_df = pd.read_csv("./MasterData/MasterScore.csv", index_col=0)

master = [master_clutch_df, master_defense_df, master_misc_df, master_pass_df, master_rebound_df, master_score_df]

pid_name_map_df = pd.read_csv("./TranslationDictionaries/allnba.csv")
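
Since every Master Data file shares the `pidSzn` key, the frames could be combined into one wide table. The sketch below is one possible approach, not something the notebook does: `merge_on_key` is a hypothetical helper, the toy frames hold made-up values, and it assumes that repeated columns such as `MIN` carry identical values in every file.

```python
from functools import reduce

import pandas as pd


def merge_on_key(frames, key="pidSzn"):
    """Outer-join frames on `key`, keeping the first copy of any repeated column."""

    def _merge(left, right):
        # Keep the key plus only the columns the left frame does not already
        # have, so pandas never has to add _x/_y suffixes.
        new_cols = [c for c in right.columns if c == key or c not in left.columns]
        return left.merge(right[new_cols], on=key, how="outer")

    return reduce(_merge, frames)


# Toy stand-ins for two master frames (values are illustrative only).
clutch = pd.DataFrame(
    {"pidSzn": ["1_2014-15", "2_2014-15"], "MIN": [10.0, 20.0], "PIE": [0.1, 0.2]}
)
defense = pd.DataFrame(
    {"pidSzn": ["1_2014-15", "2_2014-15"], "MIN": [10.0, 20.0], "STL": [0.5, 1.0]}
)

wide = merge_on_key([clutch, defense])
print(wide)
```

An outer join is used so that a player-season missing from one file is kept with NaN values rather than silently dropped; an inner join would be the stricter alternative.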

In [16]:
master_defense_df.head()

Unnamed: 0,pidSzn,MIN,GP,Season,STL,BLK,DEF_RIM_FGM,DEF_RIM_FGA,DEF_RIM_FG_PCT,FREQ,...,D_FGA,D_FG_PCT,PCT_PLUSMINUS,OPP_PTS_2ND_CHANCE,OPP_PTS_FB,OPP_PTS_PAINT,DEFLECTIONS,CHARGES_DRAWN,CONTESTED_SHOTS_2PT,CONTESTED_SHOTS_3PT
0,201985_2014-15,12.4,26,2014-15,0.3,0.0,0.3,0.4,0.7,1.0,...,4.23,0.462,0.043,2.8,2.6,11.1,0.0,0.0,0.0,0.0
1,201166_2014-15,23.0,82,2014-15,0.7,0.2,0.6,0.9,0.693,1.0,...,7.63,0.428,0.003,6.4,5.7,18.7,0.0,0.0,0.0,0.0
2,203932_2014-15,17.0,47,2014-15,0.4,0.5,0.9,1.5,0.577,1.0,...,6.2,0.407,-0.041,4.2,4.0,15.0,0.0,0.0,0.0,0.0
3,203940_2014-15,23.1,32,2014-15,0.6,0.3,2.8,3.9,0.72,1.0,...,9.0,0.563,0.094,7.8,7.2,25.0,0.0,0.0,0.0,0.0
4,201143_2014-15,30.5,76,2014-15,0.9,1.3,2.4,4.3,0.553,1.0,...,11.68,0.434,-0.028,9.3,7.1,26.2,0.0,0.0,0.0,0.0


In [17]:
pid_name_map_df

Unnamed: 0,id,name
0,1628408,doziepj01
1,202683,kanteen01
2,1627098,delanma01
3,201948,dayeau01
4,1626170,grantje02
...,...,...
1157,1630166,avdijde01
1158,203937,anderky01
1159,1626174,woodch01
1160,203318,ricegl02


In [28]:
master_clutch_df["Season"].max()

'2019-20'

In [5]:
len(master_clutch_df)

2672

In [65]:
for df in master:
    print(df.columns)

Index(['pidSzn', 'MIN', 'GP', 'Season', 'PIE', 'POSS', 'USG_PCT'], dtype='object')
Index(['pidSzn', 'MIN', 'GP', 'Season', 'STL', 'BLK', 'DEF_RIM_FGM',
       'DEF_RIM_FGA', 'DEF_RIM_FG_PCT', 'FREQ', 'D_FGM', 'D_FGA', 'D_FG_PCT',
       'PCT_PLUSMINUS', 'OPP_PTS_2ND_CHANCE', 'OPP_PTS_FB', 'OPP_PTS_PAINT',
       'DEFLECTIONS', 'CHARGES_DRAWN', 'CONTESTED_SHOTS_2PT',
       'CONTESTED_SHOTS_3PT'],
      dtype='object')
Index(['pidSzn', 'OFF_LOOSE_BALLS_RECOVERED', 'DEF_LOOSE_BALLS_RECOVERED',
       'LOOSE_BALLS_RECOVERED', 'MIN', 'GP', 'Season', 'PFD', 'DRIVE_PF',
       'DRIVE_PF_PCT', 'DIST_MILES', 'DIST_MILES_OFF', 'DIST_MILES_DEF',
       'AVG_SPEED', 'AVG_SPEED_OFF', 'AVG_SPEED_DEF'],
      dtype='object')
Index(['pidSzn', 'SCREEN_ASSISTS', 'SCREEN_AST_PTS', 'PAINT_TOUCH_PASSES',
       'PAINT_TOUCH_PASSES_PCT', 'PAINT_TOUCH_AST', 'PAINT_TOUCH_AST_PCT',
       'PAINT_TOUCH_TOV', 'PAINT_TOUCH_TOV_PCT', 'PAINT_TOUCHES',
       'POST_TOUCH_PASSES', 'POST_TOUCH_PASSES_PCT', 'POST_TOUC

### Play by Play (pbp) Data
https://api.pbpstats.com/docs#/

- Metrics by Zone (Totals, Accuracy, Frequency)
- Fouls
- Normalized metrics (per 100 possessions)
- Net Ratings
- Leverage Situation metrics

Thought: Should we include efficiency metrics when classifying player archetypes? Should an archetype depend on how good a player is?
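
One way to frame that question is to separate shot frequency by zone (where a player shoots, a style signal) from accuracy by zone (how well he shoots there, an efficiency signal). The sketch below computes both from the zone makes and attempts in Kevin Durant's 2007-08 regular-season row returned by the season-stats call in this notebook. The dictionary layout and variable names are illustrative assumptions.

```python
# Zone attempts and makes from the 2007-08 regular-season row for player 201142.
zone_attempts = {
    "AtRim": 369,
    "ShortMidRange": 265,
    "LongMidRange": 527,
    "Corner3": 16,
    "Arc3": 189,
}
zone_makes = {
    "AtRim": 222,
    "ShortMidRange": 90,
    "LongMidRange": 216,
    "Corner3": 6,
    "Arc3": 53,
}

total_attempts = sum(zone_attempts.values())

# Frequency: share of all field goal attempts taken from each zone (style).
frequency = {zone: fga / total_attempts for zone, fga in zone_attempts.items()}

# Accuracy: field goal percentage within each zone (efficiency).
accuracy = {zone: zone_makes[zone] / fga for zone, fga in zone_attempts.items()}

for zone in zone_attempts:
    print(f"{zone:14s} freq={frequency[zone]:.3f} acc={accuracy[zone]:.3f}")
```

Clustering on `frequency` alone would group players by shot profile regardless of skill, while adding `accuracy` would let quality influence the archetype.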

In [6]:
import requests

In [7]:
players_response = requests.get("https://api.pbpstats.com/get-all-players-for-league/nba")
players = players_response.json()
players["players"]["201142"]

'Kevin Durant'

In [8]:
# Season total stats from the PBP API
url = "https://api.pbpstats.com/get-all-season-stats/nba"
params = {
    "EntityType": "Player",
    "EntityId": "201142"
}
response = requests.get(url, params=params)
response_json = response.json()
response_json["results"]["Regular Season"][0]

{'Season': '2007-08',
 'SecondsPlayed': 166081.0,
 'GamesPlayed': 80,
 'Minutes': 2768,
 'PlusMinus': -653,
 'OffPoss': 5586,
 'DefPoss': 5595,
 'PenaltyOffPoss': 1382,
 'PenaltyDefPoss': 1284,
 'SecondChanceOffPoss': 705,
 'TotalPoss': 11181,
 'AtRimFGM': 222,
 'AtRimFGA': 369,
 'SecondChanceAtRimFGM': 30,
 'SecondChanceAtRimFGA': 59,
 'PenaltyAtRimFGM': 58,
 'PenaltyAtRimFGA': 87,
 'ShortMidRangeFGM': 90,
 'ShortMidRangeFGA': 265,
 'LongMidRangeFGM': 216,
 'LongMidRangeFGA': 527,
 'Corner3FGM': 6,
 'Corner3FGA': 16,
 'PenaltyCorner3FGA': 1,
 'Arc3FGM': 53,
 'Arc3FGA': 189,
 'SecondChanceArc3FGM': 3,
 'SecondChanceArc3FGA': 22,
 'PenaltyArc3FGM': 20,
 'PenaltyArc3FGA': 51,
 'FG2M': 528,
 'FG2A': 1161,
 'FG3M': 59,
 'FG3A': 205,
 'FtPoints': 391,
 'Points': 1624,
 'OpponentPoints': 6276,
 'SecondChanceFG2M': 58,
 'SecondChanceFG2A': 124,
 'SecondChanceFG3M': 3,
 'SecondChanceFG3A': 22,
 'SecondChanceFtPoints': 28,
 'SecondChancePoints': 153,
 'PenaltyFG2M': 122,
 'PenaltyFG2A': 250,
 '

In [8]:
# Season total stats from the PBP API, filtered by leverage situation
url = "https://api.pbpstats.com/get-totals/nba"
params = {
    "Season": "2021-22",
    "SeasonType": "Regular Season",
    "Type": "Player",
    "EntityId": "201142",
    "Leverage": "High, VeryHigh",  # keep only High and VeryHigh leverage situations
}
response = requests.get(url, params=params)
response_json = response.json()

In [10]:
len(response_json["multi_row_table_data"][0])

174

In [11]:
2672*174

464928
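
The product above reads as a size estimate: 2672 player-seasons in Master Data times 174 fields per leverage-filtered row gives 464,928 values. Collecting them would presumably take one `get-totals` request per player-season. The sketch below only builds the request parameters from a `pidSzn` key and makes no network call. `build_totals_params` is a hypothetical helper that reuses the parameter names and the leverage string from the request above.

```python
def build_totals_params(pid_szn, leverage="High, VeryHigh"):
    """Build get-totals query parameters from a '{pid}_{szn}' key (hypothetical helper)."""
    pid, season = pid_szn.split("_", 1)
    return {
        "Season": season,
        "SeasonType": "Regular Season",
        "Type": "Player",
        "EntityId": pid,
        "Leverage": leverage,
    }


# Size of the resulting table if every player-season gets every field.
n_player_seasons = 2672  # len(master_clutch_df)
n_fields = 174           # fields in one leverage-filtered row
n_values = n_player_seasons * n_fields

params = build_totals_params("201142_2019-20")
print(params)
print(n_values)
```

Whether the API accepts this many sequential requests without rate limiting is not established in the notebook, so any loop over these parameters should be tested on a small sample first.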