# Does the "Hot-Hand" in the NBA Truly Exist?

In this analysis, I explore whether NBA players exhibit the "hot hand" phenomenon—an idea suggesting that a player's chance of making a shot increases after making previous ones. This belief is deeply ingrained in basketball culture, where players are often said to be “on fire” or “in a slump” depending on recent performance.

To test this hypothesis, I use shot-by-shot play-by-play data from the 2023–24 NBA regular season, pulled directly from the NBA API. Specifically, I examine whether the outcome of a player's previous field goal attempt (made or missed) has a statistically significant effect on the likelihood of making the next shot.

By applying statistical methods such as chi-square tests and logistic regression, this study aims to determine whether there is meaningful evidence supporting the existence of streak shooting—or if the “hot hand” is merely a cognitive illusion.

## Loading Dependencies

In [13]:
import json  # needed for json.JSONDecodeError in fetch_pbp
import time
import pandas as pd
from nba_api.stats.endpoints import LeagueGameLog, PlayByPlayV2
from requests.exceptions import ReadTimeout, ConnectionError
from scipy.stats import chi2_contingency
import statsmodels.api as sm
from tqdm import tqdm

## Extracting the Data from the NBA API

In [16]:
# Define headers to simulate a browser
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.nba.com",
    "Origin": "https://www.nba.com"
}

# Fetch all regular season game IDs for 2023–24
def get_game_ids():
    games = LeagueGameLog(season='2023-24', season_type_all_star='Regular Season', headers=headers, timeout=60)
    game_log_df = games.get_data_frames()[0]
    return game_log_df['GAME_ID'].unique()

# Fetch play-by-play for a single game with retries and error handling
def fetch_pbp(game_id, retries=3, sleep_sec=3):
    for attempt in range(retries):
        try:
            pbp = PlayByPlayV2(game_id=game_id, headers=headers, timeout=60)
            df = pbp.get_data_frames()[0]
            df['GAME_ID'] = game_id
            return df
        except (ReadTimeout, ConnectionError) as e:
            print(f"⚠️ Network error fetching Game ID {game_id} (attempt {attempt+1}): {e}")
            time.sleep(sleep_sec * (attempt + 1))
        except json.JSONDecodeError:
            print(f"⚠️ JSON error for Game ID {game_id} (attempt {attempt+1}) — likely a bad or blocked response.")
            time.sleep(sleep_sec * (attempt + 1))
        except Exception as e:
            print(f"⚠️ Unexpected error for Game ID {game_id} (attempt {attempt+1}): {e}")
            time.sleep(sleep_sec * (attempt + 1))
    print(f"❌ Skipped Game ID {game_id} after {retries} attempts")
    return None

# Main loop to fetch all data and write to a single CSV file
def scrape_season_play_by_play_single_file(output_filename="pbp_2023_24_season.csv"):
    game_ids = get_game_ids()
    all_dfs = []

    for gid in tqdm(game_ids, desc="Scraping games", unit="game"):
        pbp_df = fetch_pbp(gid)
        if pbp_df is not None:
            all_dfs.append(pbp_df)
        time.sleep(0.3)  # Sleep 0.3 seconds between games to avoid throttling

    if all_dfs:
        combined_df = pd.concat(all_dfs, ignore_index=True)
        combined_df.to_csv(output_filename, index=False)
        print(f"\n💾 Saved all {len(combined_df)} rows to {output_filename}")
    else:
        print("❌ No data was fetched.")

# Run the script
scrape_season_play_by_play_single_file()

Scraping games: 100%|██████████| 1230/1230 [07:45<00:00,  2.64game/s]



💾 Saved all 567672 rows to pbp_2023_24_season.csv


In [17]:
## Import CSV into a dataframe

# CSV file path
csv_file_path = "pbp_2023_24_season.csv"

# Import CSV to DataFrame
pbp_df = pd.read_csv(csv_file_path)

# Set pandas option to display all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 60)
# Display the first few rows

pbp_df.head(5)

Unnamed: 0,GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,SCORE,SCOREMARGIN,PERSON1TYPE,PLAYER1_ID,PLAYER1_NAME,PLAYER1_TEAM_ID,PLAYER1_TEAM_CITY,PLAYER1_TEAM_NICKNAME,PLAYER1_TEAM_ABBREVIATION,PERSON2TYPE,PLAYER2_ID,PLAYER2_NAME,PLAYER2_TEAM_ID,PLAYER2_TEAM_CITY,PLAYER2_TEAM_NICKNAME,PLAYER2_TEAM_ABBREVIATION,PERSON3TYPE,PLAYER3_ID,PLAYER3_NAME,PLAYER3_TEAM_ID,PLAYER3_TEAM_CITY,PLAYER3_TEAM_NICKNAME,PLAYER3_TEAM_ABBREVIATION,VIDEO_AVAILABLE_FLAG
0,22300061,2,12,0,1,7:36 PM,12:00,,Start of 1st Period (7:36 PM EST),,,,0,0,,,,,,0,0,,,,,,0,0,,,,,,0
1,22300061,4,10,0,1,7:36 PM,12:00,Jump Ball Jokic vs. Davis: Tip to James,,,,,4,203999,Nikola Jokić,1610613000.0,Denver,Nuggets,DEN,5,203076,Anthony Davis,1610613000.0,Los Angeles,Lakers,LAL,5,2544,LeBron James,1610613000.0,Los Angeles,Lakers,LAL,1
2,22300061,7,1,7,1,7:37 PM,11:42,,,Davis 1' Dunk (2 PTS) (Russell 1 AST),2 - 0,-2,5,203076,Anthony Davis,1610613000.0,Los Angeles,Lakers,LAL,5,1626156,D'Angelo Russell,1610613000.0,Los Angeles,Lakers,LAL,0,0,,,,,,1
3,22300061,10,1,101,1,7:37 PM,11:15,Jokic 7' Driving Floating Jump Shot (2 PTS) (M...,,,2 - 2,TIE,4,203999,Nikola Jokić,1610613000.0,Denver,Nuggets,DEN,4,1627750,Jamal Murray,1610613000.0,Denver,Nuggets,DEN,0,0,,,,,,1
4,22300061,11,1,1,1,7:37 PM,10:57,,,Prince 24' 3PT Jump Shot (3 PTS) (James 1 AST),5 - 2,-3,5,1627752,Taurean Prince,1610613000.0,Los Angeles,Lakers,LAL,5,2544,LeBron James,1610613000.0,Los Angeles,Lakers,LAL,0,0,,,,,,1


## Data Description

### Key Columns in the Play-by-Play Dataset

There are 34 total columns available in the dataset. Below are some of the **most relevant columns** for this project, along with short descriptions:

#### 📝 Basic Game and Event Info
- **`GAME_ID`**: Unique identifier for each NBA game.
- **`EVENTNUM`**: Sequential number of the event within the game.
- **`EVENTMSGTYPE`**: High-level category of the event (e.g., field goal, rebound, foul).
- **`EVENTMSGACTIONTYPE`**: More detailed classification of the action (e.g., pull-up jumper, driving dunk).

#### 🕒 Timing Information
- **`PERIOD`**: The quarter or overtime period (1–4 for regular periods; 5+ for OT).
- **`WCTIMESTRING`**: Wall clock time when the event occurred.
- **`PCTIMESTRING`**: Time remaining in the current period.

#### 🏀 Event Descriptions
- **`HOMEDESCRIPTION`**: Text description of the event as it involves the home team (empty when only the visiting team is involved).
- **`VISITORDESCRIPTION`**: Text description of the event as it involves the visiting team (empty when only the home team is involved).

#### 📊 Score Information
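- **`SCORE`**: The score at the time of the event (format: `visitor - home`; in the sample rows above, a Lakers basket logged under `VISITORDESCRIPTION` produces `2 - 0`).
- **`SCOREMARGIN`**: Point margin for the home team (positive = home team leading, negative = home team trailing, `TIE` when level).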

#### 👤 Player & Team Info
- **`PLAYER1_NAME`**: Name of the main player involved in the event.
- **`PLAYER1_TEAM_ID`** / **`PLAYER1_TEAM_CITY`**: ID and city of that player's team.
- **`PLAYER2_NAME`**, **`PLAYER2_TEAM_ID`**, **`PLAYER2_TEAM_CITY`**: Info for a second involved player, if applicable.

---

### `EVENTMSGTYPE` Values

To interpret the type of each event, here's a lookup table (based on the [NBA API docs](https://github.com/swar/nba_api/blob/master/docs/examples/PlayByPlay.ipynb)):

| `EVENTMSGTYPE` | Description        |
|----------------|--------------------|
| 1              | Field Goal Made     |
| 2              | Field Goal Missed   |
| 3              | Free Throw Attempt  |
| 4              | Rebound             |
| 5              | Turnover            |
| 6              | Foul                |
| 7              | Violation           |
| 8              | Substitution        |
| 9              | Timeout             |
| 10             | Jump Ball           |
| 11             | Ejection            |
| 12             | Period Begin        |
| 13             | Period End          |
| 18             | Instant Replay      |

> ℹ️ You'll notice the full list ranges from 1 to 18, though not all values may appear in a single game.
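
The lookup table above can be turned into a readable label column. The following is a minimal sketch, assuming a small toy frame rather than the real `pbp_df`; the names `EVENT_TYPE_LABELS` and `EVENT_LABEL` are illustrative and not part of the NBA API.

```python
import pandas as pd

# Labels copied from the EVENTMSGTYPE lookup table above.
# EVENT_TYPE_LABELS is a hypothetical helper name, not an nba_api object.
EVENT_TYPE_LABELS = {
    1: "Field Goal Made",
    2: "Field Goal Missed",
    3: "Free Throw Attempt",
    4: "Rebound",
    5: "Turnover",
    6: "Foul",
    7: "Violation",
    8: "Substitution",
    9: "Timeout",
    10: "Jump Ball",
    11: "Ejection",
    12: "Period Begin",
    13: "Period End",
    18: "Instant Replay",
}

# Toy frame using the EVENTMSGTYPE codes of the five sample rows shown earlier.
toy = pd.DataFrame({"EVENTMSGTYPE": [12, 10, 1, 1, 1]})

# Codes missing from the dictionary would map to NaN, which makes gaps easy to spot.
toy["EVENT_LABEL"] = toy["EVENTMSGTYPE"].map(EVENT_TYPE_LABELS)
print(toy)
```

Applying the same `.map` call to the full dataset would make it easier to sanity-check the filters in the cleaning section.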

---

### Understanding `EVENTMSGACTIONTYPE` Values

Below is a categorized breakdown of NBA shot types and their corresponding `EVENTMSGACTIONTYPE` codes. These help describe the specific kind of shot or move attempted.

---

#### 🏀 3-Point Shots

| Code | Description                                 |
|------|---------------------------------------------|
| 1    | 3PT Jump Shot                               |
| 2    | 3PT Running Jump Shot                       |
| 3    | Hook Shot                                   |
| 47   | 3PT Turnaround Jump Shot                    |
| 63   | 3PT Fadeaway Jumper                         |
| 66   | 3PT Jump Bank Shot                          |
| 78   | 3PT Floating Jump Shot                      |
| 79   | 3PT Pullup Jump Shot                        |
| 80   | 3PT Step Back Jump Shot                     |
| 83   | 3PT Fadeaway Bank Shot                      |
| 85   | 3PT Turnaround Bank Shot                    |
| 86   | 3PT Turnaround Fadeaway / Shot              |
| 101  | 3PT Driving Floating Jump Shot              |
| 102  | 3PT Driving Floating Bank Jump Shot         |
| 103  | 3PT Running Pull                            |
| 104  | 3PT Step Back Bank Jump Shot                |
| 105  | 3PT Turnaround Fadeaway Bank Jump Shot      |

---

#### 💥 Dunks

| Code | Description                     |
|------|---------------------------------|
| 7    | Dunk                            |
| 9    | Driving Dunk                    |
| 50   | Running Dunk                    |
| 51   | Reverse Dunk                    |
| 52   | Alley-Oop Dunk                  |
| 87   | Putback Dunk                    |
| 106  | Running Alley-Oop Dunk Shot     |
| 109  | Driving Reverse Dunk Shot       |
| 110  | Running Reverse Dunk Shot       |

---

#### 🖐️ Layups & Finger Rolls

| Code | Description                         |
|------|-------------------------------------|
| 5    | Layup                               |
| 6    | Driving Layup                       |
| 41   | Running Layup                       |
| 43   | Alley-Oop Layup                     |
| 44   | Reverse Layup                       |
| 71   | Finger Roll Layup                   |
| 72   | Putback Layup                       |
| 73   | Driving Reverse Layup               |
| 74   | Running Reverse Layup               |
| 75   | Driving Finger Roll Layup           |
| 76   | Running Finger Roll Layup           |
| 97   | Tip Layup Shot                      |
| 98   | Cutting Layup Shot                  |
| 99   | Cutting Finger Roll Layup Shot      |
| 100  | Running Alley-Oop Layup Shot        |

---

#### ⛹️‍♂️ Jumpers

| Code | Description                               |
|------|-------------------------------------------|
| 1    | Jump Shot                                 |
| 2    | Running Jump Shot                         |
| 47   | Turnaround Jump Shot                      |
| 63   | Fadeaway Jumper                           |
| 66   | Jump Bank Shot                            |
| 78   | Floating Jump Shot                        |
| 79   | Pullup Jump Shot                          |
| 80   | Step Back Jump Shot                       |
| 83   | Fadeaway Bank Shot                        |
| 85   | Turnaround Bank Shot                      |
| 86   | Turnaround Fadeaway                       |
| 104  | Step Back Bank Jump Shot                  |
| 105  | Turnaround Fadeaway Bank Jump Shot        |

---

#### 🌀 Hook Shots

| Code | Description                     |
|------|---------------------------------|
| 3    | Hook Shot                       |
| 57   | Driving Hook Shot               |
| 58   | Turnaround Hook Shot            |
| 67   | Hook Bank Shot                  |
| 93   | Driving Bank Hook Shot          |
| 96   | Turnaround Bank Hook Shot       |

---

#### 🏃‍♂️ Movement-Based Shots

| Code | Description                             |
|------|-----------------------------------------|
| 98   | Cutting Layup Shot                      |
| 99   | Cutting Finger Roll Layup Shot          |
| 103  | Running Pull                            |
| 106  | Running Alley-Oop Dunk Shot             |
| 100  | Running Alley-Oop Layup Shot            |

---

> 📌 Note: Many action types can be either 2-point or 3-point attempts, depending on court location. For example, `Jump Shot (1)` might be a 2-pointer or a 3-pointer based on distance.
> These codes will help filter out the shot types that we do not want in our analysis.
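
Because the action code alone does not say whether a shot was a 2-pointer or a 3-pointer, one option is to read it from the description text. In the sample rows shown earlier, a three-point attempt carries the token `3PT` (for example, `Prince 24' 3PT Jump Shot`). The sketch below assumes that convention holds across the dataset, which has not been verified here; the column name `IS_THREE` is hypothetical.

```python
import pandas as pd

# Toy rows based on the sample output above (descriptions shortened).
toy = pd.DataFrame({
    "HOMEDESCRIPTION": [None, "Jokic 7' Driving Floating Jump Shot (2 PTS)", None],
    "VISITORDESCRIPTION": [
        "Davis 1' Dunk (2 PTS) (Russell 1 AST)",
        None,
        "Prince 24' 3PT Jump Shot (3 PTS) (James 1 AST)",
    ],
})

# Combine both description columns so the check works for either team.
desc = toy["HOMEDESCRIPTION"].fillna("") + " " + toy["VISITORDESCRIPTION"].fillna("")

# Assumption: three-point attempts always contain the literal token "3PT".
toy["IS_THREE"] = desc.str.contains("3PT", regex=False)
print(toy["IS_THREE"].tolist())
```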

## Cleaning

### 🧹 Data Cleaning: Preparing the Dataset for Analysis

Before conducting any statistical modeling or hypothesis testing, it's essential to refine the dataset to ensure we are analyzing only the most relevant and comparable shot attempts. `EVENTMSGTYPE` & `EVENTMSGACTIONTYPE` will be critical columns for this cleaning process. Below is a summary of the cleaning steps we will take:

---

### 🎯 Key Cleaning Objectives

1. **Reduce the number of columns** to retain only those that are necessary for our analysis (e.g., player, shot type, time, result).
2. **Remove shot attempts that were blocked** — since these are often heavily defended and not representative of a typical jump shot.
3. **Exclude fouled shot attempts**, which may result in free throws or affect the outcome in non-random ways.
4. **Filter for only Field Goal attempts**:
   - Keep rows where `EVENTMSGTYPE` is either `1` (Field Goal Made) or `2` (Field Goal Missed).
   - Remove all other play types such as free throws, rebounds, and fouls.


---

### 🔍 What Counts as a "Shot" in This Study?

For the purposes of exploring patterns in shooting performance—especially relationships between previous and current shot outcomes—we focus exclusively on **jump shots**.

We will **exclude non-jump shot attempts** such as:
- Layups
- Dunks
- Hook shots
- Tip-ins

These shots differ significantly in terms of:
- **Shot mechanics**
- **Defensive pressure**
- **Distance from the basket**

As such, these are considered *outliers* relative to standard jump shots and are removed to improve the consistency and interpretability of the analysis.

---

By applying these cleaning steps, we aim to construct a dataset that reflects consistent and comparable jump shot attempts, giving us a more accurate basis for testing hot hand effects and other shooting dynamics.


### Remove Unnecessary Columns

In [18]:
columns_to_keep = [
    'GAME_ID',
    'EVENTNUM',
    'EVENTMSGTYPE',
    'EVENTMSGACTIONTYPE',
    'PERIOD',
    'SCORE',
    'SCOREMARGIN',
    'PLAYER1_NAME',   # Player who took the shot
    'PLAYER1_TEAM_ABBREVIATION',
    'HOMEDESCRIPTION', 
    'VISITORDESCRIPTION',
    'PLAYER2_NAME'    # Player2 might refer to the player involved in the event, like a block
]

# Filter the DataFrame to keep only necessary columns
pbp_df = pbp_df[columns_to_keep]

### Filter out Blocked Shot Attempts

In [19]:
# Drop rows where either 'HOMEDESCRIPTION' or 'VISITORDESCRIPTION' contains the word 'BLOCK'
pbp_df = pbp_df[
    ~pbp_df['HOMEDESCRIPTION'].str.contains('BLOCK', na=False)
    & ~pbp_df['VISITORDESCRIPTION'].str.contains('BLOCK', na=False)
]

### Filter out Fouled Shot Attempts

In [20]:
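# Drop rows where either 'HOMEDESCRIPTION' or 'VISITORDESCRIPTION' contains the word 'FOUL'.
# Caveat: fouls appear to be logged as their own rows (EVENTMSGTYPE 6) rather than on the
# shot row, so this step may not remove every shot taken through contact
# (for example, a made "and-one" basket).
pbp_df = pbp_df[
    ~pbp_df['HOMEDESCRIPTION'].str.contains('FOUL', na=False)
    & ~pbp_df['VISITORDESCRIPTION'].str.contains('FOUL', na=False)
]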

### Filter for only Field Goal Attempts

In [21]:
pbp_df = pbp_df[pbp_df['EVENTMSGTYPE'].isin([1, 2])]

### Filter for only Jump Shots

In [22]:
# Action-type codes treated as jump shots in this analysis
JUMP_SHOT_ACTION_TYPES = {
    1,    # JUMP_SHOT
    2,    # RUNNING_JUMP_SHOT
    47,   # TURNAROUND_JUMP_SHOT
    63,   # FADEAWAY_JUMPER
    66,   # JUMP_BANK_SHOT
    78,   # FLOATING_JUMP_SHOT
    79,   # PULLUP_JUMP_SHOT
    80,   # STEP_BACK_JUMP_SHOT
    83,   # FADEAWAY_BANK_SHOT
    85,   # TURNAROUND_BANK_SHOT
    86,   # TURNAROUND_FADEAWAY
    101,  # DRIVING_FLOATING_JUMP_SHOT
    102,  # DRIVING_FLOATING_BANK_JUMP_SHOT
    103,  # RUNNING_PULL
    104,  # STEP_BACK_BANK_JUMP_SHOT
    105,  # TURNAROUND_FADEAWAY_BANK_JUMP_SHOT
}

# Keep only jump shots
pbp_df = pbp_df[pbp_df['EVENTMSGACTIONTYPE'].isin(JUMP_SHOT_ACTION_TYPES)].copy()


## Creating our X and Y columns

In [23]:
df = pbp_df.copy()
# Step 1: Create the "Shot Made" column
df['Shot Made'] = (df['EVENTMSGTYPE'] == 1).astype(int)  # 1 = made, 0 = missed

# Step 2: Create "Prev Shot Made" column (Previous shot made for the same player in the same game)
df['Prev Shot Made'] = df.groupby(['GAME_ID', 'PLAYER1_NAME'])['Shot Made'].shift(1)

# Drop rows where "Prev Shot Made" is NaN
df = df.dropna(subset=['Prev Shot Made'])

# Check the results
df[['GAME_ID', 'EVENTNUM', 'PLAYER1_NAME', 'Shot Made', 'Prev Shot Made']]


Unnamed: 0,GAME_ID,EVENTNUM,PLAYER1_NAME,Shot Made,Prev Shot Made
6,22300061,14,Taurean Prince,1,1.0
30,22300061,43,D'Angelo Russell,0,0.0
40,22300061,58,Kentavious Caldwell-Pope,0,1.0
46,22300061,66,Jamal Murray,0,1.0
52,22300061,76,Taurean Prince,0,1.0
...,...,...,...,...,...
567658,22301198,657,Brice Sensabaugh,0,0.0
567662,22301198,661,Pat Spencer,1,0.0
567663,22301198,662,Kira Lewis Jr.,0,1.0
567667,22301198,667,Lester Quinones,0,0.0
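
To make the construction of `Prev Shot Made` concrete, here is a small sketch on made-up rows (the games, players, and outcomes are invented for illustration only). It relies on the rows already being in chronological order within each game, which the cell above also assumes.

```python
import pandas as pd

# Invented toy data: two games, two players, rows in chronological order.
toy = pd.DataFrame({
    "GAME_ID":      [1, 1, 1, 1, 2, 2],
    "PLAYER1_NAME": ["A", "B", "A", "A", "A", "A"],
    "Shot Made":    [1, 0, 0, 1, 1, 0],
})

# shift(1) within each (game, player) group looks back exactly one shot
# by the same player in the same game.
toy["Prev Shot Made"] = (
    toy.groupby(["GAME_ID", "PLAYER1_NAME"])["Shot Made"].shift(1)
)
print(toy)
```

Each player's first shot in a game has no predecessor, so it receives `NaN` and is removed by the `dropna` step. Streaks therefore never carry over from one game to the next, and because non-jump shots were filtered out earlier, "previous shot" here means the previous jump shot.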


## Statistical Tests

### Chi-Squared

In [24]:
df_clean = df.copy()

# Now create the contingency table for the chi-square test
contingency_table = pd.crosstab(df_clean['Prev Shot Made'], df_clean['Shot Made'])

# Perform the Chi-Square test for independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the result
print(f'Chi-Square Statistic: {chi2}')
print(f'p-value: {p}')
print(f'Degrees of Freedom: {dof}')
print(f'Expected Counts: \n{expected}')

# Interpret the result
alpha = 0.05  # Common significance level
if p < alpha:
    print("The test suggests that the previous shot made and the current shot made are dependent.")
else:
    print("The test suggests that the previous shot made and the current shot made are independent.")


Chi-Square Statistic: 7.086459274716587
p-value: 0.0077668547859172
Degrees of Freedom: 1
Expected Counts: 
[[39234.01864277 26791.98135723]
 [28338.98135723 19352.01864277]]
The test suggests that the previous shot made and the current shot made are dependent.


### Basic Logistic Model

In [25]:
# Create a logistic regression model
X = df_clean[['Prev Shot Made']]  # Predictor (previous shot outcome)
X = sm.add_constant(X)  # Add an intercept
y = df_clean['Shot Made']  # Target variable (current shot outcome)

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.675254
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              Shot Made   No. Observations:               113717
Model:                          Logit   Df Residuals:                   113715
Method:                           MLE   Df Model:                            1
Date:                Tue, 27 May 2025   Pseudo R-squ.:               4.637e-05
Time:                        17:01:53   Log-Likelihood:                -76788.
converged:                       True   LL-Null:                       -76791.
Covariance Type:            nonrobust   LLR p-value:                  0.007616
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -0.3678      0.008    -46.462      0.000      -0.383      -0.352
Prev Shot Made   

### Conditional Probabilities

In [26]:
cond_probs = df_clean.groupby('Prev Shot Made')['Shot Made'].mean()
print(cond_probs)


Prev Shot Made
0.0    0.409081
1.0    0.401208
Name: Shot Made, dtype: float64
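
These two conditional probabilities also pin down the logistic regression coefficients. For a logit model with an intercept and a single binary predictor, the fitted probabilities equal the observed group proportions, so the intercept is the log-odds after a miss and the slope is the difference in log-odds. The sketch below uses only the rounded probabilities printed above, so its results are approximate and are not a substitute for the full regression table.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Rounded values copied from the conditional-probability output above.
p_after_miss = 0.409081
p_after_make = 0.401208

intercept = logit(p_after_miss)          # log-odds of a make after a miss
slope = logit(p_after_make) - intercept  # change in log-odds after a make
odds_ratio = math.exp(slope)             # multiplicative change in the odds

print(f"intercept  ~ {intercept:.4f}")
print(f"slope      ~ {slope:.4f}")
print(f"odds ratio ~ {odds_ratio:.4f}")
```

The intercept computed this way agrees with the `const` value of -0.3678 in the regression summary, and the negative slope means the odds of a make are slightly lower after a made shot than after a miss.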


## Conclusion

### Results and Next Steps

The results of my initial analysis suggest that while the effect of the previous shot outcome on the current shot is statistically significant (chi-square p ≈ 0.008), the **magnitude of this effect is minimal** and runs opposite to the hot-hand prediction: players made about 40.1% of jump shots after a make versus 40.9% after a miss. The coefficient for the previous-shot variable in the logistic regression is negative and small, and the model's pseudo R² is virtually zero (about 4.6e-05). This indicates that although there may be some dependency between successive shots, the relationship **explains almost none of the variation** in shot outcomes.
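
One way to put a number on "statistically significant but tiny" is an effect size. The sketch below computes the phi coefficient, sqrt(chi2 / n), from the chi-square statistic and the observation count reported in the outputs above, together with the gap in make rates. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so phi computed from that statistic is a slight underestimate of the uncorrected value.

```python
import math

# Values copied from the outputs above.
chi2_stat = 7.086459274716587   # chi-square statistic (Yates-corrected by default)
n_obs = 113717                  # number of shots in the test
p_after_miss = 0.409081
p_after_make = 0.401208

# Phi coefficient: effect size for a 2x2 table, between 0 and 1.
phi = math.sqrt(chi2_stat / n_obs)

# Difference in make rate, in percentage points.
diff_pp = (p_after_make - p_after_miss) * 100

print(f"phi ~ {phi:.4f}")
print(f"make-rate difference ~ {diff_pp:.2f} percentage points")
```

A phi below 0.01 is far under the 0.1 level often treated as a "small" effect, which is consistent with the near-zero pseudo R². With more than 100,000 observations, even a difference of under one percentage point is enough to reach significance.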

This finding is somewhat surprising to me, especially given my personal belief in the existence of the "hot hand" phenomenon. Intuitively, we often perceive players to go on streaks—making multiple shots in a row—suggesting a temporary boost in performance or confidence. Yet the data, at this level of analysis, appear to contradict that narrative.

However, this discrepancy between perception and statistical evidence has been a well-documented tension in previous research, and it likely stems from the **high degree of variability in shot outcomes**. Factors such as shot distance, defensive pressure, game context, and individual shooting style introduce a significant amount of noise that may obscure more subtle sequential patterns. Without accounting for these confounding variables, it becomes difficult to isolate the true effect of a prior shot on subsequent performance.

---

### Improving the Model

Given the size of the dataset, I believe there is an opportunity to refine the analysis by **reducing variation and improving control variables**. A more targeted approach could yield more meaningful insights. For instance, a promising direction would be to:

- **Narrow the analysis to specific shot types**, such as pull-up three-pointers, where mechanics and decision-making are more consistent.
- **Limit the player sample** to those who attempt a high volume of three-pointers per game (e.g., at least 5 attempts per game). This would reduce inter-player variability and focus the analysis on players more likely to exhibit streaky behavior.
- **Include more contextual variables**, such as time remaining on the shot clock, game quarter, or score differential, which could affect shot quality and decision-making.
- **Incorporate player fixed effects** in the model to control for individual shooting tendencies and ability. This would allow the model to focus on within-player variation rather than being dominated by differences between players.
- Consider using **sequence-based models**, such as Markov chains or hidden Markov models, which may be better suited to capturing streaky behavior and underlying "hot" or "cold" shooting states.
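
As a first step toward the player-level ideas above, the comparison can be made within each player before averaging, so that differences in ability between players do not drive the result. This is a simple stand-in for player fixed effects, not a full fixed-effects model, and the data below are invented purely to show the mechanics.

```python
import pandas as pd

# Invented toy data: player A is a better shooter than player B overall.
toy = pd.DataFrame({
    "PLAYER1_NAME":   ["A"] * 6 + ["B"] * 6,
    "Prev Shot Made": [1, 1, 1, 0, 0, 0] * 2,
    "Shot Made":      [1, 1, 0, 1, 0, 0,
                       0, 0, 1, 0, 0, 0],
})

# Make rate per player, split by the previous shot's outcome.
rates = (
    toy.groupby(["PLAYER1_NAME", "Prev Shot Made"])["Shot Made"]
    .mean()
    .unstack("Prev Shot Made")
)

# Within-player difference: rate after a make minus rate after a miss.
rates["diff"] = rates[1] - rates[0]

# Unweighted average of the per-player differences.
avg_within_player_diff = rates["diff"].mean()
print(rates)
print(f"average within-player difference: {avg_within_player_diff:.3f}")
```

On real data, a minimum-attempts threshold per player would be needed so that players with only a handful of shots do not dominate the average, and weighting by attempts is a reasonable alternative to the unweighted mean.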

---

### Final Thoughts

In sum, the current analysis suggests that the hot hand effect is not supported by a simple logistic regression over all jump shots and players; if anything, the estimated effect is slightly negative. However, this does not rule out its existence. Instead, it highlights the importance of controlling for **context and player-level effects** in uncovering more nuanced performance patterns. Future work will focus on refining the dataset and model to better account for variability and test the hot hand hypothesis under more controlled and realistic conditions.
