In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 🏈 Predicting Blitzes Using Pre-Snap Behavior

**Authors:**  
- Chris Doyle (christopherdoyle@college.harvard.edu)  
- Hans Elasri (hanselasri@college.harvard.edu)  
- Thomas Garity (tgarity@college.harvard.edu)  
- Rishi Hazra (rishihazra@college.harvard.edu)  
- Chris Ruaño (cruano@college.harvard.edu)

---

## Project Summary

Blitzing is one of the most aggressive and high-risk strategies in football. When executed well, it can disrupt an offensive drive by forcing the quarterback into rushed decisions; when misread or mistimed, it can leave the defense vulnerable to big plays. Offensive coaches and quarterbacks spend countless hours studying pre-snap cues to anticipate incoming blitzes, while defenses work just as hard to disguise them through subtle shifts, delayed rushes, and simulated pressures.

Our project aims to bring analytics into this equation by predicting whether a defense will blitz, using only pre-snap player tracking data from the NFL Big Data Bowl 2025 dataset. By extracting features such as player positioning, movement trends, alignment depth, and formation structure, we seek to develop machine learning models that systematically classify plays as blitz or non-blitz scenarios. 

A successful model would not only help identify the most telling pre-snap indicators of pressure but also provide a practical tool for offensive strategists to better anticipate and counter defensive blitzes, enhancing both game preparation and real-time decision-making.

---

## Data Structure

Our data can be downloaded from the 2025 Big Data Bowl on [Kaggle](https://www.kaggle.com/competitions/nfl-big-data-bowl-2025/data)  
or using the Kaggle API:

```
kaggle competitions download -c nfl-big-data-bowl-2025
```

**Instructions:**
- Download the dataset and unzip the file `nfl-big-data-bowl-2025.zip`.
- Place the extracted files in a `nfl-big-data-bowl-2025/` directory at the root of the project repo; the code below reads from `./nfl-big-data-bowl-2025/`.
- Ensure that this directory is listed in your `.gitignore` file to avoid pushing large data files to GitHub.

Now, let's check that the data has been downloaded correctly:

In [None]:
# ensure all tables downloaded
print("Data Available: ")
os.listdir('./nfl-big-data-bowl-2025/')

In [None]:
!du -sh nfl-big-data-bowl-2025

In [None]:
plays_df = pd.read_csv('./nfl-big-data-bowl-2025/plays.csv')
players_df = pd.read_csv('./nfl-big-data-bowl-2025/players.csv')
games_df = pd.read_csv('./nfl-big-data-bowl-2025/games.csv')
player_play_df = pd.read_csv('./nfl-big-data-bowl-2025/player_play.csv')
# read weeks 1-9 and concatenate once (cheaper than growing a DataFrame inside the loop)
weekly_frames = [
    pd.read_csv(f'./nfl-big-data-bowl-2025/tracking_week_{week}.csv')
    for week in range(1, 10)
]
tracking_weeks = pd.concat(weekly_frames, ignore_index=True)


## Data Summary

### 1. `games.csv`
- **Purpose:** Info about each game.
- **Important Variables:**
  - `gameId` (primary key)
  - `season`, `week`
  - `homeTeamAbbr`, `visitorTeamAbbr`
- **Project Relevance:** Mostly for joining, basic game context (e.g., week, matchup). Not critical for blitz prediction itself.

---

### 2. `plays.csv`
- **Purpose:** Play-level metadata.
- **Important Variables:**
  - `gameId`, `playId` (keys)
  - `quarter`, `down`, `yardsToGo`
  - `possessionTeam`, `defensiveTeam`
  - `offenseFormation`
  - `playDescription`
  - `isDropback` (Boolean: did QB drop back)
  - `pff_passCoverage` (type of defensive coverage)
  - `pff_manZone` (man vs zone coverage)
  - `playAction` (play-action pass or not)
- **Project Relevance:** Crucial for understanding pre-snap situation and defensive alignment; used for labels or features in blitz prediction.

---

### 3. `players.csv`
- **Purpose:** Static player information.
- **Important Variables:**
  - `nflId` (player key)
  - `position`
  - `displayName`
- **Project Relevance:** Helpful for interpreting player roles; minor for pure blitz prediction unless modeling specific player tendencies.

---

### 4. `player_play.csv`
- **Purpose:** Player-level stats per play.
- **Important Variables:**
  - `gameId`, `playId`, `nflId` (keys)
  - `teamAbbr`
  - `wasInitialPassRusher` (binary, key for identifying blitzers!)
  - `causedPressure`
  - `timeToPressureAsPassRusher`
  - `getOffAsPassRusher`
- **Project Relevance:** Identifies which defenders rushed the passer on each play (including extra rushers on a blitz) and describes the pressure that resulted.

---

### 5. `tracking_week_[1-9].csv`
- **Purpose:** Frame-by-frame tracking data (player movement).
- **Important Variables:**
  - `gameId`, `playId`, `nflId`
  - `frameId`, `time`
  - `x`, `y` (player positions)
  - `s`, `a`, `dis` (speed, acceleration, distance moved)
  - `o` (orientation) and `dir` (direction)
  - `event` (snap, ball release, etc.)
- **Project Relevance:** Used to generate **pre-snap features** like player alignments, movement, speed, and timing at snap. Core inputs for neural networks.
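
The keys listed above determine how the tables join. As a minimal sketch on made-up toy frames (not the real CSVs), the snippet below checks that `playId` is only unique within a game and uses the `validate` argument of `DataFrame.merge` so that an unexpected duplicate key raises an error instead of silently multiplying rows. The frame names and values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for games.csv, plays.csv and player_play.csv (values are made up).
games = pd.DataFrame({"gameId": [1, 2], "week": [1, 1]})
plays = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [56, 80, 56],  # playId repeats across games
    "down": [1, 3, 2],
})
player_play = pd.DataFrame({
    "gameId": [1, 1, 1, 2],
    "playId": [56, 56, 80, 56],
    "nflId": [101, 102, 101, 103],
})

# playId alone is not unique, but the (gameId, playId) pair is.
playid_unique = plays["playId"].is_unique
play_key_unique = not plays.duplicated(["gameId", "playId"]).any()

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand key is duplicated.
merged = (
    player_play
    .merge(plays, on=["gameId", "playId"], how="left", validate="many_to_one")
    .merge(games, on="gameId", how="left", validate="many_to_one")
)
print(playid_unique, play_key_unique, merged.shape)
```

Joining on `playId` alone would match plays from different games, which is why every play-level merge in this notebook should use both `gameId` and `playId`.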


In [None]:
plays_df.head()

In [None]:
players_df.head()

In [None]:
games_df.head()

In [None]:
player_play_df.head()

In [None]:
tracking_weeks.head()

We utilize only BEFORE_SNAP data, as it represents the complete set of information accessible when the decision to blitz must be made during actual gameplay.

In [None]:
tracking_weeks_filtered = tracking_weeks[tracking_weeks['frameType'] == 'BEFORE_SNAP' ]
tracking_weeks_filtered.head()

In [None]:
master_df = pd.DataFrame()

In [None]:
# merge player_play_df with players_df to get player names and positions
master_df = player_play_df.merge(players_df[['nflId', 'displayName', 'position']], on='nflId', how='left')
master_df.head()

In [None]:
# merge master_df with plays_df on both playId and gameId
master_df = master_df.merge(plays_df, on=['playId', 'gameId'], how='left')
master_df.head()

In [None]:
# merge master_df with games_df on gameId
master_df = master_df.merge(games_df, on='gameId', how='left')
master_df.head()

In [None]:
# merge master_df with the BEFORE_SNAP tracking frames on gameId, playId, and nflId
master_df = master_df.merge(tracking_weeks_filtered, on=['gameId', 'playId', 'nflId'], how='left')
master_df.head()

In [None]:
tracking_weeks.head()

## Blitz Labeling Methodology

To label whether a play is a blitz, we adopt a method inspired by Dominic Borsani’s approach in the 2023 NFL Big Data Bowl. Rather than relying directly on provided scouting labels, we infer blitz likelihood based on **pre-snap defender behavior** observable in the tracking data.

Specifically, we use the following steps:

- Merge tracking data (`tracking_week_*`) with play metadata from `plays.csv` to access line of scrimmage (`LOS`) information.
- Identify **frames where `frameType == 'SNAP'`**, which correspond to the official ball snap moment.
- For each defender at ball snap:
  - **Distance from LOS**: Compute the absolute difference between the player’s `x` coordinate and the line of scrimmage (`LOS_x`), adjusted for play direction.
  - **Motion Toward LOS**: Estimate whether a player is moving toward the ball using the `dir` (direction) and `playDirection`.
  - **Velocity Toward LOS**: Calculate as `speed × motion_toward_ball`.
- A defender is flagged as a **likely blitzer** if:
  - They are **within 5 yards** of the line of scrimmage at the snap, **and**
  - They are **moving toward the LOS at greater than 1.5 yards per second**.
- At the play level, we sum the number of players flagged as likely blitzers.

We define a **blitz play** as one where **more than 1 defender** meets the blitz criteria at the snap.  
This threshold aligns with the intuition that sending multiple extra rushers constitutes a true blitz rather than just a normal pass rush.
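
To make the rule concrete, here is a minimal sketch on made-up numbers. It assumes `dist_from_LOS` and `velocity_toward_LOS` have already been computed for each defender at the snap; the toy values and the constant names are illustrative, not part of the dataset.

```python
import pandas as pd

# Hypothetical snap-frame rows for defenders on two plays (values are made up).
# dist_from_LOS is in yards; velocity_toward_LOS is in yards per second.
defenders = pd.DataFrame({
    "gameId": [1, 1, 1, 1, 1, 1],
    "playId": [10, 10, 10, 20, 20, 20],
    "nflId": [1, 2, 3, 1, 2, 3],
    "dist_from_LOS": [1.0, 4.5, 8.0, 2.0, 3.0, 12.0],
    "velocity_toward_LOS": [2.0, 1.8, 3.0, 2.5, 0.4, 0.0],
})

MAX_DIST_YARDS = 5.0     # "within 5 yards" of the line of scrimmage
MIN_CLOSING_SPEED = 1.5  # "greater than 1.5 yards per second" toward the LOS
BLITZ_THRESHOLD = 1      # a blitz play has more than this many flagged defenders

# Player-level flag: close to the LOS and closing on it quickly.
defenders["isLikelyBlitzer"] = (
    (defenders["dist_from_LOS"] <= MAX_DIST_YARDS)
    & (defenders["velocity_toward_LOS"] > MIN_CLOSING_SPEED)
).astype(int)

# Play-level label: count flagged defenders per (gameId, playId).
play_labels = (
    defenders.groupby(["gameId", "playId"])["isLikelyBlitzer"]
    .sum()
    .rename("numLikelyBlitzers")
    .reset_index()
)
play_labels["isBlitzPlay"] = (play_labels["numLikelyBlitzers"] > BLITZ_THRESHOLD).astype(int)
print(play_labels)
```

In this toy data, play 10 has two flagged defenders and is labeled a blitz, while play 20 has only one and is not.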

---

### Reference

Borsani, D. (2023). *Beat the Offensive Line: Using Data to Determine Blitz Strategy.* NFL Big Data Bowl, Finalist Project.


In [None]:
# count unique plays
unique_plays = tracking_weeks[['gameId', 'playId']].drop_duplicates()
print(f"Number of unique plays: {len(unique_plays)}")

In [None]:
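# possessionTeam is carried along so that only defenders are flagged below
plays_subset = plays_df[['gameId', 'playId', 'possessionTeam', 'yardlineNumber', 'yardlineSide', 'absoluteYardlineNumber']]
tracking = tracking_weeks.merge(plays_subset, on=['gameId', 'playId'], how='left')

# keep only frames right at ball snap; .copy() avoids SettingWithCopyWarning
# when the columns below are added
snap_tracking = tracking[tracking['frameType'] == 'SNAP'].copy()

# calculate LOS x-coordinate
# NOTE: this mirrors absoluteYardlineNumber for plays moving left; it is worth
# checking LOS_x against the football's x at the snap before relying on it
def calculate_los_x(playDirection, absoluteYardlineNumber):
    if playDirection == 'left':
        return 100 - absoluteYardlineNumber
    else:
        return absoluteYardlineNumber

snap_tracking['LOS_x'] = snap_tracking.apply(
    lambda row: calculate_los_x(row['playDirection'], row['absoluteYardlineNumber']), axis=1
)

# distance from LOS
snap_tracking['dist_from_LOS'] = np.abs(snap_tracking['x'] - snap_tracking['LOS_x'])

# motion toward the LOS based on play direction
# assumes the Big Data Bowl angle convention: `dir` is measured clockwise from
# the +y axis (0 degrees along +y, 90 degrees along +x), so the component of
# motion along the length of the field is sin(dir), not cos(dir)
def motion_toward_ball(row):
    x_component = np.sin(np.deg2rad(row['dir']))
    # when the offense moves left, the defense lines up on the low-x side,
    # so +x is toward the LOS; the sign flips when the offense moves right
    if row['playDirection'] == 'left':
        return x_component
    else:
        return -x_component

snap_tracking['motion_toward_ball'] = snap_tracking.apply(motion_toward_ball, axis=1)

# velocity toward LOS
snap_tracking['velocity_toward_LOS'] = snap_tracking['s'] * snap_tracking['motion_toward_ball']

# only defenders can be blitzers: exclude the offense and the football
# (the football's rows have no nflId)
is_defender = snap_tracking['nflId'].notna() & (snap_tracking['club'] != snap_tracking['possessionTeam'])

# likely blitzer = defender close to LOS and moving fast toward it
snap_tracking['isLikelyBlitzer'] = (
    is_defender &
    (snap_tracking['dist_from_LOS'] <= 5) &
    (snap_tracking['velocity_toward_LOS'] > 1.5)
).astype(int)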

# count likely blitzers
num_likely_blitzers = snap_tracking['isLikelyBlitzer'].sum()
print(f"Total number of players flagged as likely blitzers: {num_likely_blitzers}")

# count blitzers per play
blitzer_counts_by_play = snap_tracking.groupby(['gameId', 'playId'])['isLikelyBlitzer'].sum().reset_index()
print(blitzer_counts_by_play.head())

In [None]:
# count plays with 1+ blitzer
plays_with_blitzers = blitzer_counts_by_play[blitzer_counts_by_play['isLikelyBlitzer'] > 0]
num_plays_with_blitzers = len(plays_with_blitzers)

# Count total unique plays
total_plays = len(blitzer_counts_by_play)

# calculate percentage
blitz_percentage = (num_plays_with_blitzers / total_plays) * 100 if total_plays > 0 else 0

print(f"Total plays: {total_plays}")
print(f"Plays with at least one blitzer: {num_plays_with_blitzers}")
print(f"Percentage of plays with blitzers: {blitz_percentage:.2f}%")

# count pass plays
pass_plays = plays_df[plays_df['isDropback'] == 1]

# calculate percentage of pass plays with blitzers
# playId is only unique within a game, so match on (gameId, playId) rather than playId alone
pass_plays_with_blitzers = pass_plays.merge(
    plays_with_blitzers[['gameId', 'playId']], on=['gameId', 'playId'], how='inner'
)
num_pass_plays_with_blitzers = len(pass_plays_with_blitzers)
pass_plays_total = len(pass_plays)
pass_blitz_percentage = (num_pass_plays_with_blitzers / pass_plays_total) * 100 if pass_plays_total > 0 else 0
print(f"Total pass plays: {pass_plays_total}")
print(f"Pass plays with at least one blitzer: {num_pass_plays_with_blitzers}")
print(f"Percentage of pass plays with blitzers: {pass_blitz_percentage:.2f}%")

# breakdown of plays by number of blitzers
blitzer_distribution = blitzer_counts_by_play['isLikelyBlitzer'].value_counts().sort_index()
print("\nDistribution of blitzers per play:")
print(blitzer_distribution)

# classify plays as blitzes if > threshold_for_blitz_play likely blitzers
threshold_for_blitz_play = 1
plays_with_significant_blitz = blitzer_counts_by_play[blitzer_counts_by_play['isLikelyBlitzer'] > threshold_for_blitz_play]
print(f"\nPlays with more than {threshold_for_blitz_play} blitzer(s): {len(plays_with_significant_blitz)}")
print(f"Percentage: {(len(plays_with_significant_blitz) / total_plays) * 100:.2f}%")

## Baseline Model: Logistic Regression

### Objective

We attempt to predict whether a **defender will blitz** (`isLikelyBlitzer = 1`) based **only on information available before the snap**. In NFL game situations, quarterbacks and offensive coordinators must make real-time protection adjustments based on pre-snap alignments, player movement, and game context. To mirror this reality, our model is strictly limited to **pre-snap observable features**.

---

## Rationale for Feature Selection

Our features are selected based on what a quarterback or coach could realistically observe live:

- **Tracking Data (Pre-Snap, `tracking_week_*`)**:
  - `x`, `y`: Player's location on the field at the snap.
  - `s`: Player speed (yards/second) — shows if a defender is creeping forward.
  - `a`: Player acceleration (yards/second²) — sudden movement may indicate blitz.
  - `dis`: Distance traveled since the prior frame.
  - `o`: Player orientation (which direction their body is facing).
  - `dir`: Direction of player motion.

- **Play Metadata (`plays.csv`)**:
  - `down`: Current down (1st, 2nd, 3rd, 4th) — blitzes more common on late downs.
  - `yardsToGo`: Distance to first down — long distances can encourage blitzing.
  - `playClockAtSnap`: Remaining play clock time — last-second snaps may alter defensive behavior.
  - `offenseFormation`: Offensive alignment (categorical).
  - `pff_passCoverage`: Defensive coverage scheme (categorical).
  - `pff_manZone`: Whether the defense was playing man or zone.

- **Player Metadata (`players.csv`)**:
  - `position`: Player position group (e.g., LB, CB, S, DE).

All categorical variables are **one-hot encoded** to integrate cleanly with our modeling approach.

By using only pre-snap tracking and contextual information, the model realistically approximates how a quarterback or coach would scout for potential blitzers in real time.

---

## Preprocessing Steps

1. **Filter to defenders only**: Identify players on the defensive team (`club != possessionTeam`).
2. **Restrict frames** to `frameType == 'SNAP'`: Capture exact alignment and motion at the moment of the snap.
3. **Merge metadata**: Add player position and play context (down, distance, formation, coverage).
4. **Handle missing values**: Fill missing categorical fields (e.g., `offenseFormation`) with "Unknown".
5. **One-hot encode** categorical features: Transform `position`, `offenseFormation`, `pff_passCoverage`, and `pff_manZone`.
6. **Prepare X and y**:
   - **X**: Pre-snap physical and contextual features.
   - **y**: Binary label (`isLikelyBlitzer`), engineered from motion and distance heuristics.
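
Steps 4 through 6 can be sketched on a toy frame as follows. The rows, the two categorical columns, and the two numeric columns are made-up stand-ins for the real defender table; the actual feature set is the one listed in the previous section.

```python
import pandas as pd

# Toy defender rows at the snap (values are made up).
toy = pd.DataFrame({
    "s": [0.5, 2.1, 1.0, 0.0],
    "down": [1, 3, 3, 2],
    "position": ["ILB", "SS", None, "CB"],
    "offenseFormation": ["SHOTGUN", "SHOTGUN", "EMPTY", None],
    "isLikelyBlitzer": [0, 1, 0, 0],
})

categorical_cols = ["position", "offenseFormation"]

# Step 4: make missing categories an explicit "Unknown" level.
toy[categorical_cols] = toy[categorical_cols].fillna("Unknown")

# Step 5: one-hot encode; drop_first drops one reference level per variable.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

# Step 6: separate the features from the label.
X = encoded.drop(columns="isLikelyBlitzer")
y = encoded["isLikelyBlitzer"]
print(X.columns.tolist())
```

With `drop_first=True`, the alphabetically first level of each variable (here `CB` and `EMPTY`) becomes the implicit baseline, which avoids perfectly collinear columns in a linear model.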

---

## Baseline Model

- **Model:** Logistic Regression
- **Class Imbalance Handling:** 
  - Apply `class_weight='balanced'` to automatically adjust for the minority class (blitzers).
  - This prevents the model from simply predicting "no blitz" every time, and encourages sensitivity to rare blitz behaviors.
- **Train-Test Split:** 
  - 80% of data used for training, 20% held out for testing.
  - Stratified sampling is used to preserve the blitz/non-blitz ratio.
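
The setup above can be illustrated on synthetic data before running it on the tracking features. This is a sketch under stated assumptions: the three random features and the rule that generates the minority class are invented for the example, and the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 rows, 3 features, and a minority positive class.
n = 1000
X_demo = rng.normal(size=(n, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n) > 1.3).astype(int)

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)

print(f"positive rate: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
print(classification_report(y_te, clf.predict(X_te)))
```

Because the classes are imbalanced, precision and recall on the positive class in the report are more informative than overall accuracy.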

Rishi, we should merge these into master_df, unless that would make compute time intractable.

In [None]:
# merge in play-level context if not already there
if 'possessionTeam' not in snap_tracking.columns:
    snap_tracking = snap_tracking.merge(plays_df[['gameId', 'playId', 'possessionTeam']], on=['gameId', 'playId'], how='left')

# filter to defenders (the football's rows have no nflId, so drop them too)
snap_tracking_defense = snap_tracking[
    snap_tracking['nflId'].notna() &
    (snap_tracking['club'] != snap_tracking['possessionTeam'])
].copy()

# merge player positions
if 'position' not in snap_tracking_defense.columns:
    snap_tracking_defense = snap_tracking_defense.merge(players_df[['nflId', 'position']], on='nflId', how='left')

# merge play metadata
metadata_cols = ['gameId', 'playId', 'down', 'yardsToGo', 'offenseFormation', 'playClockAtSnap', 'pff_passCoverage', 'pff_manZone']
if not all(col in snap_tracking_defense.columns for col in ['down', 'yardsToGo', 'offenseFormation']):
    snap_tracking_defense = snap_tracking_defense.merge(plays_df[metadata_cols], on=['gameId', 'playId'], how='left')

I didn't run code, so explicitly check for missingness before imputing NAs.
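
One way to do that check is sketched below on a toy frame; with the real data the same two lines would be applied to `snap_tracking_defense`. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for snap_tracking_defense (values are made up).
toy = pd.DataFrame({
    "position": ["CB", None, "ILB", "SS"],
    "offenseFormation": ["SHOTGUN", "EMPTY", None, None],
    "playClockAtSnap": [5.0, np.nan, 12.0, 1.0],
    "down": [1, 2, 3, 4],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(missing)
```

Columns with a large share of missing values may deserve a closer look before being filled with "Unknown", since the imputed level could then dominate the encoded feature.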

In [None]:
# fill missing categorical fields
snap_tracking_defense['position'] = snap_tracking_defense['position'].fillna('Unknown')
snap_tracking_defense['offenseFormation'] = snap_tracking_defense['offenseFormation'].fillna('Unknown')
snap_tracking_defense['pff_passCoverage'] = snap_tracking_defense['pff_passCoverage'].fillna('Unknown')
snap_tracking_defense['pff_manZone'] = snap_tracking_defense['pff_manZone'].fillna('Unknown')

In [None]:
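# one-hot encode categorical columns
categorical_cols = ['position', 'offenseFormation', 'pff_passCoverage', 'pff_manZone']
snap_tracking_encoded = pd.get_dummies(
    snap_tracking_defense,
    columns=categorical_cols,
    drop_first=True
)

In [None]:
# build X and y from the features listed in the feature-selection section
numeric_cols = ['x', 'y', 's', 'a', 'dis', 'o', 'dir', 'down', 'yardsToGo', 'playClockAtSnap']
dummy_prefixes = tuple(f'{col}_' for col in categorical_cols)
dummy_cols = [col for col in snap_tracking_encoded.columns if col.startswith(dummy_prefixes)]
# LogisticRegression cannot handle NaNs; dropping incomplete rows is one simple option
model_df = snap_tracking_encoded[numeric_cols + dummy_cols + ['isLikelyBlitzer']].dropna()
X = model_df[numeric_cols + dummy_cols]
y = model_df['isLikelyBlitzer']

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)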

In [None]:
# logistic regression
logreg = LogisticRegression(max_iter=1000, class_weight='balanced')
logreg.fit(X_train, y_train)

In [None]:
# evaluate results
y_pred = logreg.predict(X_test)
print("Classification Report (Logistic Regression Baseline):")
print(classification_report(y_test, y_pred))