# NBA Player Data Analysis Project

## Project Outline

1. **Introduction**
   - Overview of the project
   - Goals and objectives

2. **Data Collection**
   - Import necessary libraries
   - Fetch player data from the NBA API

3. **Data Processing**
   - Load data into a Pandas DataFrame
   - Data cleaning and transformation

4. **Data Analysis**
   - Exploratory data analysis (EDA)
   - Visualizations

5. **Conclusion**
   - Summary of findings
   - Future work


## 0. Environment

### Python

This project is written in Python, so Python must be installed to run it. The minimum supported version is 3.10.

#### Windows

You can use the Windows package manager `winget`, or the [installer](https://www.python.org/downloads/windows/) from the website.
```powershell
# you can change the version in the package name to your desired version
winget install Python.Python.3.12
```

#### macOS
Python is already installed by default on recent versions of macOS. If you have an older version that is not supported, you can use the [Homebrew](https://brew.sh/) package manager to install it, or the [installer](https://www.python.org/downloads/macos/) from the website.
```zsh
brew install python
```

#### Linux
Python is already installed by default on most distributions of Linux. If it isn't, you can use your distribution's package manager to install Python.

### Virtual Environment

It's generally recommended that you use a virtual environment (or venv) for this project. That way, all dependencies can be installed for the project without affecting the rest of your system. You can create a venv with Python:

```bash
python -m venv .venv
```

To activate the virtual environment in your shell, you can use the following commands.

On Windows:

```powershell
.venv\Scripts\activate
```

On other operating systems:

```bash
source .venv/bin/activate
```

### Dependencies

This project uses [Poetry](https://python-poetry.org/) to manage its dependencies. You can install the dependencies with the `poetry` command:

`poetry install`

If you don't want to use Poetry, a `requirements.txt` is also provided. You can install this using `pip`:

`pip install -r requirements.txt`

### Imports

In [None]:
import requests
import json
import time
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
from typing import Dict

### Environment Variables

We will load all our environment variables from a `.env` file, if one is provided.

This project uses the [BALLDONTLIE](https://app.balldontlie.io/) API, which has a free key available [if you sign up for an account](https://app.balldontlie.io/signup).

If database information is provided, all dataframes used for analysis are uploaded to it. We use [Postgres](https://www.postgresql.org/about/) with the [Psycopg](https://www.psycopg.org/psycopg3/) driver by default, but any database that SQLAlchemy can connect to should work if you set `DB_TYPE` accordingly.

In [None]:
from dotenv import load_dotenv
load_dotenv()
BALLDONTLIE_API_KEY = os.getenv("BALLDONTLIE_API_KEY")
DB_TYPE = os.getenv("DB_TYPE", "postgresql+psycopg")
DB_USER = os.getenv("DB_USER", "postgres")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = os.getenv("DB_PORT", "5432")
DB_NAME = os.getenv("DB_NAME", "postgres")

### Presentation

By default, Pandas dataframes are truncated when they are printed. We want to be able to view all of the data at once, so we embed the dataframe in a scrollable element.

In [None]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell

def custom_scrollable_display(df: pd.DataFrame, max_height=400):
    """
    Custom display function to render DataFrames as scrollable elements.
    
    Parameters:
    - df: The DataFrame to display.
    - max_height: The maximum height of the scrollable area in pixels.
    """
    style = f"""
    <style>
    .scrollable-dataframe {{
        display: inline-block;
        white-space: nowrap;
        overflow-x: scroll;
        max-height: {max_height}px;
        overflow-y: scroll;
    }}
    </style>
    """
    display(HTML(style + f'<div class="scrollable-dataframe">{df.to_html()}</div>'))

def custom_display_hook(df):
    custom_scrollable_display(df)
    return ""

# hook up the custom display function to the automatic printer
InteractiveShell.instance().display_formatter.formatters['text/html'].for_type(pd.DataFrame, custom_display_hook);


### Pre-Commit Hooks (Developer Only)

This notebook uses `nbstripout` to strip notebook output from Git commits. If you are committing code, please run the following command to set up the Git filter.

Poetry is required for the pre-commit hooks, so make sure it is installed before you commit code. You will also need to add the plugin `poetry-plugin-export` in order to run the export hook.
```bash
poetry self add poetry-plugin-export
```

In [None]:
!nbstripout --install
!pre-commit install

## 1. Data Collection

### Fetch Player Data from NBA API

`nba_api` provides static player and team information, which we will download here so that we can reuse it without requesting the API unnecessarily.

In [None]:
from nba_api.stats.static import players, teams
PLAYERS_LIST_FILE = "../data/players_list.json"
TEAMS_LIST_FILE = "../data/teams_list.json"

if not os.path.exists(PLAYERS_LIST_FILE):
    players_list = players.get_players()
    with open(PLAYERS_LIST_FILE, "w") as f:
        json.dump(players_list, f)
else:
    with open(PLAYERS_LIST_FILE, "r") as f:
        players_list = json.load(f)
if not os.path.exists(TEAMS_LIST_FILE):
    teams_list = teams.get_teams()
    with open(TEAMS_LIST_FILE, "w") as f:
        json.dump(teams_list, f)
else:
    with open(TEAMS_LIST_FILE, "r") as f:
        teams_list = json.load(f)

### Fetch Game Data



We're only interested in games that are either in the regular season or in the playoffs. We'll add an enum to distinguish the type of game and use it to differentiate them.

In [None]:
from enum import Enum
class SeasonType(Enum):
    PRESEASON = 1
    REGULAR_SEASON = 2
    ALL_STAR = 3
    PLAYOFFS = 4
    PLAY_IN = 5
    NBA_CUP = 6
class Season():
    def __init__(self, season_id: int) -> None:
        season_id_str = str(season_id)
        self.season_type = SeasonType(int(season_id_str[0]))
        self.season_year = int(season_id_str[1:])
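
As a quick illustration of the decoding above: a `SEASON_ID` such as `22023` splits into a leading type digit (`2`, regular season) and the season's starting year (`2023`). The sketch below re-declares a trimmed copy of the enum so that it runs on its own; `parse_season_id` is a hypothetical helper for illustration and is not part of `nba_api`.

```python
from enum import Enum


class SeasonType(Enum):
    # same digit-to-type mapping as the notebook's enum, trimmed for brevity
    REGULAR_SEASON = 2
    PLAYOFFS = 4


def parse_season_id(season_id: int) -> tuple[SeasonType, int]:
    """Split a SEASON_ID into (season type, starting year)."""
    digits = str(season_id)
    return SeasonType(int(digits[0])), int(digits[1:])


season_type, season_year = parse_season_id(42023)
print(season_type.name, season_year)  # PLAYOFFS 2023
```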

In [None]:
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.library.parameters import LeagueIDNullable, LeagueID
START_SEASON = 2013
END_SEASON = 2023
GAMES_LIST_FILE = "../data/games_list.csv"
if os.path.exists(GAMES_LIST_FILE):
    games_list: pd.DataFrame = pd.read_csv(GAMES_LIST_FILE)
else:
    games_list = pd.DataFrame()
    for season in range(START_SEASON, END_SEASON + 1):
        # put season into the correct form e.g. 2023 -> 2023-24
        season_str = f"{season}-{str(season + 1)[2:]}"
        print(f"Fetching games for season: {season_str}", end="\r")
        gamefinder = leaguegamefinder.LeagueGameFinder(season_nullable=season_str, league_id_nullable=LeagueIDNullable.nba)
        games = gamefinder.get_data_frames()[0]
        games_list = pd.concat([games_list, games], ignore_index=True)
        time.sleep(0.6)
    games_list.to_csv(GAMES_LIST_FILE, index=False)
# games_list["SEASON_ID"].unique()
games_list.head()
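
The season string built inside the loop above follows the `YYYY-YY` pattern (2023 becomes `2023-24`). A minimal sketch of that conversion as a standalone function is shown below; `season_label` is a hypothetical name, and the modulo is an assumption added here so that a century rollover such as 1999 still yields a two-digit suffix.

```python
def season_label(start_year: int) -> str:
    """Format a season's starting year as e.g. 2023 -> '2023-24'."""
    # the modulo keeps the suffix at two digits, including the 1999 -> '00' rollover
    return f"{start_year}-{(start_year + 1) % 100:02d}"


labels = [season_label(year) for year in range(2013, 2016)]
print(labels)  # ['2013-14', '2014-15', '2015-16']
```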

### Fetch Play by Plays

In [None]:
from nba_api.stats.endpoints import playbyplayv3
from requests.exceptions import ReadTimeout
# 65502 games
PBP_LIST_FILE = "../data/pbp_list.csv"
if os.path.exists(PBP_LIST_FILE):
    pbp_list = pd.read_csv(PBP_LIST_FILE)
else:
    pbp_list = pd.DataFrame()
    for index, row in games_list.iterrows():
        err = False
        game_id = row["GAME_ID"]
        game_date = row["GAME_DATE"]
        season_id = row["SEASON_ID"]
        season = Season(season_id)
        if season.season_type != SeasonType.REGULAR_SEASON and season.season_type != SeasonType.PLAYOFFS:
            continue
        print(f"Fetching play by play for game on {game_date}", end="\r")
        while True:
            try:
                pbpfinder = playbyplayv3.PlayByPlayV3(f"{game_id:010}")
                break
            except ReadTimeout as e:
                print(f"{e}! Try again")
            except IndexError:
                print(f"{game_id} does not have a play by play")
                err = True
                break
        if err:
            continue
        pbp = pbpfinder.get_data_frames()[0]
        pbp_list = pd.concat([pbp_list, pbp], ignore_index=True)
        time.sleep(0.6)
    pbp_list.to_csv(PBP_LIST_FILE, index=False)
pbp_list.head()
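
The loop above retries a timed-out request indefinitely. One possible alternative is to cap the number of attempts and pause between them. The sketch below is only an illustration under stated assumptions: `call_with_retries` and `flaky_request` are hypothetical names, and the built-in `TimeoutError` stands in for `requests.exceptions.ReadTimeout` so that the example runs without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], max_attempts: int = 3, delay: float = 0.0) -> T:
    """Call func, retrying on TimeoutError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(delay)  # pause before trying again
    raise ValueError("max_attempts must be at least 1")


# stand-in for a flaky request: fails twice, then succeeds
calls = {"count": 0}


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries(flaky_request, max_attempts=5)
print(result, calls["count"])  # ok 3
```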


- 00: 30000
- 07: 10
- 08: 36
- 09: 119
- 10: 13098
- 12: 3094
- 13: 85
- 14: 320
- 15: 2170
- 16: 109
- 17: 91
- 18: 170
- 19: 174
- 20: 20142
- 22: 6053
- 25: 2

In [None]:
### ALTERNATIVE, USING NBA_API
from nba_api.stats.endpoints import playbyplayv3
# imported under an alias so it does not shadow the SeasonType enum defined earlier
from nba_api.stats.library.parameters import SeasonType as ApiSeasonType
from nba_api.stats.endpoints import leaguegamefinder

# Define the season (e.g., '2023-24')
season = '2023-24'
# season_type = ApiSeasonType.REGULAR  # or ApiSeasonType.PLAYOFFS

# Get game IDs
gamefinder = leaguegamefinder.LeagueGameFinder(season_nullable=season)
games: pd.DataFrame = gamefinder.get_data_frames()[0]
game_ids = games['GAME_ID'].tolist()

def fetch_play_by_play(game_id: str) -> pd.DataFrame:
    pbp = playbyplayv3.PlayByPlayV3(game_id)
    return pbp.get_data_frames()[0]
play_by_play_data = fetch_play_by_play(game_ids[0])
# play_by_play_data[play_by_play_data["EVENTMSGACTIONTYPE"] == 79]
game_ids[:10]

Before we analyze the statistics for each player, we need a list of all players who had minutes in the 2023 season.

In [None]:
BASE_URL = "https://api.balldontlie.io/v1/"
PLAYERS_ENDPOINT = "players"
PAGE_100 = "per_page=100"
HEADERS = {
    "Authorization": f"{BALLDONTLIE_API_KEY}"
}
ALL_PLAYERS_FILE = "../data/all_players.json"

if os.path.exists(ALL_PLAYERS_FILE):
    print(f"{ALL_PLAYERS_FILE} already exists.")
    with open(ALL_PLAYERS_FILE, "r") as f:
        all_players = json.load(f)
else:
    all_players = []
    next_cursor = None

    while True:
        if next_cursor:
            url = f"{BASE_URL}{PLAYERS_ENDPOINT}?{PAGE_100}&cursor={next_cursor}"
        else:
            url = f"{BASE_URL}{PLAYERS_ENDPOINT}?{PAGE_100}"
        
        response = requests.get(url, headers=HEADERS)
        
        if response.status_code == 200:
            data = response.json()
            all_players.extend(data["data"])
            next_cursor = data["meta"].get("next_cursor")
            
            if not next_cursor:
                break
            
            time.sleep(2)
        else:
            print(f"Request failed with status code {response.status_code}")
            break

    # Save all players data to a JSON file
    with open(ALL_PLAYERS_FILE, 'w') as f:
        json.dump(all_players, f, indent=4)

    print(f"All player data has been saved to {ALL_PLAYERS_FILE}")

print(all_players[0])
len(all_players)
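
The download loop above is an instance of cursor-based pagination: each response carries a `next_cursor`, and the loop stops once it is missing. The sketch below isolates that pattern as a generator. It is not the API client itself; `FAKE_PAGES`, `fetch_page`, and `iter_records` are hypothetical stand-ins so that the example runs offline.

```python
from typing import Callable, Iterator, Optional

# hypothetical in-memory "API": maps a cursor to (records, next_cursor)
FAKE_PAGES = {
    None: (["player_a", "player_b"], 2),
    2: (["player_c", "player_d"], 4),
    4: (["player_e"], None),
}


def fetch_page(cursor: Optional[int]) -> dict:
    """Stand-in for requests.get(...).json(); returns one page of results."""
    records, next_cursor = FAKE_PAGES[cursor]
    return {"data": records, "meta": {"next_cursor": next_cursor}}


def iter_records(fetch: Callable[[Optional[int]], dict]) -> Iterator[str]:
    """Yield every record, following next_cursor until it is absent."""
    cursor = None
    while True:
        page = fetch(cursor)
        yield from page["data"]
        cursor = page["meta"].get("next_cursor")
        if cursor is None:
            break


all_records = list(iter_records(fetch_page))
print(len(all_records))  # 5
```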


This JSON file lists every NBA player, past and present. We only care about players from the current season, so we need to filter out everyone else. One quick heuristic is draft year: a player who was drafted in 1986 will not be playing now. We want a cutoff year as close to the current year as possible so that it eliminates as many players as it can. The easiest way to pick one is to look up the oldest players still active in the NBA, mark them as exceptions, and use their draft years as a starting point.
A list of the oldest active players can be found at [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_oldest_and_youngest_NBA_players#Active). We chose 2010 as the cutoff.

In [None]:
OLD_PLAYERS = ["LeBron James", "Chris Paul", "Kyle Lowry", "PJ Tucker", "Kevin Durant", "Al Horford", "Mike Conley", "Jeff Green", "Derrick Rose", "Russell Westbrook", "Kevin Love", "Eric Gordon", "Brook Lopez", "Nicolas Batum", "DeAndre Jordan", "James Harden", "Stephen Curry", "DeMar DeRozan", "Jrue Holiday", "Taj Gibson", "Paul George"]
all_players_after_2010 = [player for player in all_players if player["draft_year"] is None or player["draft_year"] > 2010 or f"{player['first_name']} {player['last_name']}" in OLD_PLAYERS]
len(all_players_after_2010)
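
To make the heuristic concrete, here is a small sketch of the same three-way test (undrafted, drafted after the cutoff, or a named exception) on made-up records. The sample players and the `may_be_active` helper are hypothetical; only the field names `first_name`, `last_name`, and `draft_year` come from the data used above.

```python
CUTOFF_YEAR = 2010
EXCEPTIONS = {"LeBron James"}  # veterans drafted before the cutoff but still active

# hypothetical records shaped like the API's player objects
sample_players = [
    {"first_name": "LeBron", "last_name": "James", "draft_year": 2003},
    {"first_name": "Old", "last_name": "Timer", "draft_year": 1986},
    {"first_name": "New", "last_name": "Rookie", "draft_year": 2021},
    {"first_name": "Un", "last_name": "Drafted", "draft_year": None},
]


def may_be_active(player: dict) -> bool:
    """Keep undrafted players, post-cutoff draftees, and named exceptions."""
    full_name = f"{player['first_name']} {player['last_name']}"
    draft_year = player["draft_year"]
    return draft_year is None or draft_year > CUTOFF_YEAR or full_name in EXCEPTIONS


kept = [p["last_name"] for p in sample_players if may_be_active(p)]
print(kept)  # ['James', 'Rookie', 'Drafted']
```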

You'll notice that we included players that have `null` for their draft year. That's because those players are undrafted. There are some undrafted players currently in the NBA, so we can't exclude them purely based on that fact. The website [2KRatings](https://www.2kratings.com/) maintains [a list](https://www.2kratings.com/lists/undrafted-nba-players) of all active undrafted players. This includes players in the G League, however. We decided to take the top 35 players as, after that point, the players play so few minutes that their stats will have a negligible impact on analysis.

In [None]:
UNDRAFTED_PLAYERS = ["Fred VanVleet", "Austin Reaves", "Naz Reid", "T.J. McConnell", "Luguentz Dort", "Alex Caruso", "Derrick Jones Jr.", "Duncan Robinson", "Simone Fontecchio", "Gary Payton II", "Max Strus", "Luke Kornet", "Jock Landale", "Christian Wood", "Caleb Martin", "Chris Boucher", "Dorian Finney-Smith", "Robert Covington", "Jose Alvarado", "Javonte Green", "Sam Hauser", "Keon Ellis", "Duop Reath", "Royce O'Neale", "Naji Marshall", "Scotty Pippen Jr.", "Haywood Highsmith", "Drew Eubanks", "Gabe Vincent", "Daniel Theis", "Maxi Kleber", "Jordan McLaughlin", "Jordan Goodwin", "Damion Lee", "Lamar Stevens"]
all_players_after_2010_without_undrafted = [player for player in all_players_after_2010 if player["draft_year"] is not None or f"{player['first_name']} {player['last_name']}" in UNDRAFTED_PLAYERS]
len(all_players_after_2010_without_undrafted)

### Get Stats for each Player

Now that we've gotten our list of players, we can get the stats for each of them.

In [None]:
STATS_ENDPOINT = "stats"
# we only need stats from the latest season
CURRENT_SEASON = "seasons[]=2023"
ALL_PLAYERS_STATS_FILE = "../data/all_players_stats.json"

if os.path.exists(ALL_PLAYERS_STATS_FILE):
    print(f"{ALL_PLAYERS_STATS_FILE} already exists.")
    with open(ALL_PLAYERS_STATS_FILE, "r") as f:
        all_players_stats = json.load(f)
else:
    all_players_stats = {}
    
    for index, player in enumerate(all_players_after_2010_without_undrafted):
        # we only need to query this once because there are fewer than 100 games in a season
        url = f"{BASE_URL}{STATS_ENDPOINT}?{PAGE_100}&{CURRENT_SEASON}&player_ids[]={player['id']}"
        response = requests.get(url, headers=HEADERS)

        if response.status_code == 200:
            data = response.json()
            # use string keys so a fresh download matches the JSON-loaded dict (e.g. all_players_stats["15"])
            all_players_stats[str(player["id"])] = data["data"]
            print(f"Downloaded {index + 1} of {len(all_players_after_2010_without_undrafted)}", end="\r")
            time.sleep(2)
        else:
            print(f"Request failed with status code {response.status_code}")
            break
    
    with open(ALL_PLAYERS_STATS_FILE, "w") as f:
        json.dump(all_players_stats, f)
    
    print(f"All player data has been saved to {ALL_PLAYERS_STATS_FILE}")

print(all_players_stats["15"])

Although we've eliminated most of the players who aren't playing in the current season, some false positives remain. We can take care of the ones that didn't play at all by checking to see if their stats are an empty list.

In [None]:
all_players_stats_current_season = {player_id: stats for player_id, stats in all_players_stats.items() if stats}
len(all_players_stats_current_season)

We're also only really concerned with players that have significant play time. We generally define this as having an average of at least 16 minutes per game.

In [None]:
all_players_stats_current_season_with_16_mins = {player_id: stats for player_id, stats in all_players_stats_current_season.items() if sum(int(stat["min"]) for stat in stats) / len(stats) >= 16}
len(all_players_stats_current_season_with_16_mins)

Finally, we exclude games in which a player recorded 0 minutes, since games they did not play in would skew the results.

In [None]:
all_players_stats_current_season_with_16_mins_and_play_time = {player_id: [stat for stat in stats if int(stat["min"]) > 0] for player_id, stats in all_players_stats_current_season_with_16_mins.items()}
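
The two filters above can be seen working together on a tiny example. This sketch follows the same order as the cells above (the average is taken over all listed games, and zero-minute games are dropped afterwards). The sample data and the names `sample_stats`, `average_minutes`, and `regulars` are hypothetical.

```python
MIN_AVERAGE_MINUTES = 16

# hypothetical per-game stats keyed by player id; "min" arrives as a string
sample_stats = {
    "1": [{"min": "30"}, {"min": "0"}, {"min": "24"}],  # average 18 -> kept
    "2": [{"min": "10"}, {"min": "12"}],                # average 11 -> dropped
}


def average_minutes(games: list[dict]) -> float:
    """Mean minutes over every listed game, including zero-minute ones."""
    return sum(int(game["min"]) for game in games) / len(games)


regulars = {
    player_id: [game for game in games if int(game["min"]) > 0]
    for player_id, games in sample_stats.items()
    if average_minutes(games) >= MIN_AVERAGE_MINUTES
}
print(regulars)  # {'1': [{'min': '30'}, {'min': '24'}]}
```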

## 2. Data Processing

We use a Pandas dataframe to store these stats for each player. In addition to the directly recorded statistics like points and rebounds, we add an `is_home` indicator, which is true when the game was played at home.

In [None]:
df = pd.DataFrame()
for player_id, stats in all_players_stats_current_season_with_16_mins_and_play_time.items():
    desired_stats = ["fgm", "fga", "fg_pct", "fg3m", "fg3a", "fg3_pct", "ftm", "fta", "ft_pct", "oreb", "dreb", "reb", "ast", "stl", "blk", "turnover", "pf", "pts"]
    data = {
        "player_id": int(player_id),
        "player_name": f"{stats[0]['player']['first_name']} {stats[0]['player']['last_name']}",
    }
    data["min"] = [int(stat["min"]) for stat in stats]
    for desired_stat in desired_stats:
        data[desired_stat] = [stat[desired_stat] for stat in stats]
    data["is_home"] = [stat["game"]["home_team_id"] == stat["team"]["id"] for stat in stats]
    df_player = pd.DataFrame(data=data)
    df = pd.concat([df, df_player], ignore_index=True)
# df.to_excel("NBA_2023_Stats.xlsx")
df
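
The `is_home` flag above is derived by comparing the game's home team with the player's own team. A minimal sketch on made-up entries is shown below; the sample records and `is_home_game` are hypothetical, and only the `team.id` and `game.home_team_id` fields are taken from the cell above.

```python
# hypothetical game log entries shaped like the API's stat objects
sample_games = [
    {"team": {"id": 2}, "game": {"home_team_id": 2}},
    {"team": {"id": 2}, "game": {"home_team_id": 7}},
]


def is_home_game(stat: dict) -> bool:
    """True when the player's team is the game's home team."""
    return stat["game"]["home_team_id"] == stat["team"]["id"]


flags = [is_home_game(stat) for stat in sample_games]
print(flags)  # [True, False]
```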

As you can see in this data, the percentages for field goals, free throws, and 3-point shots are 0 when both attempts and makes are 0. A percentage is undefined without attempts, and counting it as 0 would skew our results. To fix this, we recompute each percentage from makes and attempts and use `NaN` when there were no attempts, so that aggregate analysis ignores those games.

In [None]:
df["fg_pct"] = df.apply(lambda row: row["fgm"] / row["fga"] if row["fga"] > 0 else np.nan, axis=1)
df["fg3_pct"] = df.apply(lambda row: row["fg3m"] / row["fg3a"] if row["fg3a"] > 0 else np.nan, axis=1)
df["ft_pct"] = df.apply(lambda row: row["ftm"] / row["fta"] if row["fta"] > 0 else np.nan, axis=1)
df
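
To see why `NaN` matters here, consider three hypothetical games where the last one has no 3-point attempts. The sketch below uses a vectorized alternative to the row-wise `apply` above (`Series.where`), which is a swapped-in technique rather than the notebook's own method. With a 0 in the last row the mean would be 0.25; with `NaN` it is the mean of the two real games.

```python
import pandas as pd

# hypothetical three-game sample; the last game has no 3-point attempts
sample = pd.DataFrame({"fg3m": [2, 1, 0], "fg3a": [4, 4, 0]})

# keep the ratio only where attempts exist; everything else becomes NaN
sample["fg3_pct"] = (sample["fg3m"] / sample["fg3a"]).where(sample["fg3a"] > 0)

print(sample["fg3_pct"].tolist())  # [0.5, 0.25, nan]
print(sample["fg3_pct"].mean())    # 0.375
```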

## 3. Data Analysis

To compare players, it is useful to aggregate the per-game data by player.

In [None]:
desired_stats = ["min", "fgm", "fga", "fg_pct", "fg3m", "fg3a", "fg3_pct", "ftm", "fta", "ft_pct", "oreb", "dreb", "reb", "ast", "stl", "blk", "turnover", "pf", "pts"]

df_grouped = df.groupby("player_id").agg(
    player_name=("player_name", "first"),
    # min=("min", "mean"),
    # fgm=("fgm", "mean"),
    # fga=("fga", "mean"),
    # fg_pct=("fg_pct", "mean"),
    fg3m=("fg3m", "mean"),
    fg3a=("fg3a", "mean"),
    fg3_pct=("fg3_pct", "mean"),
    # ftm=("ftm", "mean"),
    # fta=("fta", "mean"),
    # ft_pct=("ft_pct", "mean"),
    # oreb=("oreb", "mean"),
    # dreb=("dreb", "mean"),
    # reb=("reb", "std"),
    # ast=("ast", "mean"),
    # stl=("stl", "std"),
    # blk=("blk", "mean"),
    # turnover=("turnover", "mean"),
    # pf=("pf", "mean"),
    # pts=("pts", "var"),
    # is_home=("is_home", "mean")
)
# print(df.groupby("player_id")["pts"].std().reset_index().sort_values("pts"))
df_grouped_over_35 = df_grouped[df_grouped["fg3_pct"] > 0.35]
df_grouped_over_35_over_3 = df_grouped_over_35[df_grouped_over_35["fg3a"] > 3]
df_grouped_over_35_over_3.sort_values("fg3_pct", ascending=False)
# df_grouped_over_35_over_3.to_excel("3 Point Stats for over 0.35 percentage and 3 makes average.xlsx")

In [None]:
# distribution of a single player's per-game stat (player_id 70)
import scipy.stats

stat = "pts"
df_jaylen = df.loc[df.loc[:, "player_id"] == 70, :]

plt.figure(figsize=(10, 6))
sns.histplot(df_jaylen.loc[:, stat], bins=20, kde=True, stat="density", color="skyblue", label=f"{stat} histogram")


mu, std = scipy.stats.norm.fit(df_jaylen.loc[:, stat])

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = scipy.stats.norm.pdf(x, mu, std)
plt.plot(x, p, "k", linewidth=2, label="Normal Distribution Fit")

# plt.xlabel("Number of Steals")
# plt.ylabel("Density")
# plt.title("Histogram of Steals with Normal Distribution Fit")

print(scipy.stats.jarque_bera(df_jaylen.loc[:, stat]))

plt.show()

## 4. Data Reporting

To create our final report from our analysis, we use Power BI. Our Power BI report imports its tables from a Postgres database. If no database is available, the dataframes are instead exported as an Excel spreadsheet, which can be uploaded to Power BI manually.

In [None]:
from sqlalchemy import create_engine, inspect
DATABASE_URL = f"{DB_TYPE}://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(DATABASE_URL)

with engine.connect() as connection:
    display("Connection to the database was successful!")

def upload_dataframes(dfs: Dict[str, pd.DataFrame]) -> None:
    # sqlalchemy.inspect replaces the deprecated Inspector.from_engine
    inspector = inspect(engine)
    existing_tables = inspector.get_table_names()

    for table_name, df in dfs.items():
        # check if table already exists
        if table_name in existing_tables:
            # if there are no changes to the table, do not write to it
            existing_df = pd.read_sql_table(table_name, engine)
            if df.shape == existing_df.shape and df.equals(existing_df):
                print(f"No changes detected for table {table_name}. Skipping upload.")
                continue
        else:
            print(f"Table {table_name} does not exist. Creating a new one.")
        df.to_sql(table_name, engine, if_exists="replace", index=False)
        print(f"Uploaded DataFrame to table {table_name}.")
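
One caveat with assembling the connection URL from an f-string: a password containing characters such as `@` or `/` would be misread as part of the URL structure. A possible safeguard is to percent-encode the credentials first. The sketch below uses only the standard library; `build_database_url` is a hypothetical helper, and the sample password is made up.

```python
from urllib.parse import quote_plus


def build_database_url(db_type: str, user: str, password: str, host: str, port: str, name: str) -> str:
    """Assemble a connection URL, percent-encoding the credentials."""
    return f"{db_type}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{name}"


url = build_database_url("postgresql+psycopg", "postgres", "p@ss/word", "localhost", "5432", "postgres")
print(url)  # postgresql+psycopg://postgres:p%40ss%2Fword@localhost:5432/postgres
```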