# Blitz Data Acquisition

## Objective
Extract NFL play-by-play (PBP) data from NFLfastR and prepare a dataset for blitz prediction.

## Data Pipeline
1. Load raw PBP data from NFLfastR
2. Extract required columns for blitz model
3. Clean data (handle missing values, remove invalid rows)
4. Validate that all required columns are present
5. Save to processed directory

In [11]:
import sys
import logging
from pathlib import Path

import pandas as pd
import numpy as np

# Setup paths
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import from src
from src.utils.config import (
    BLITZ_COLUMNS,
    RAW_DATA_PATH,
    PROCESSED_DATA_PATH,
    BLITZ_TARGET,
)
from src.data.load_data import load_nfl_pbp, extract_blitz_features
from src.data.clean_data import clean_blitz_data, validate_blitz_data, get_class_distribution

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create data directories
RAW_DATA_PATH.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Raw data path: {RAW_DATA_PATH}")
print(f"Processed data path: {PROCESSED_DATA_PATH}")

Project root: c:\Users\quays\source\repos\Defensive-Intelligence-Predictor
Raw data path: c:\Users\quays\source\repos\Defensive-Intelligence-Predictor\data\raw
Processed data path: c:\Users\quays\source\repos\Defensive-Intelligence-Predictor\data\processed


In [14]:
# Install missing dependencies for nfl_data_py
import subprocess
import sys

# Install required dependencies
subprocess.check_call([sys.executable, "-m", "pip", "install", "appdirs", "requests", "-q"])
print("✓ Dependencies installed")

# Verify nfl_data_py is installed
try:
    import nfl_data_py as nfl
    print("✓ nfl_data_py is available")
except ImportError as e:
    print(f"✗ nfl_data_py import failed: {e}")


✓ Dependencies installed
✓ nfl_data_py is available


In [9]:
import subprocess
import sys

# Install nfl_data_py without rebuilding dependencies
subprocess.check_call([sys.executable, "-m", "pip", "install", "nfl_data_py", "--no-deps", "-q"])
print("nfl_data_py installed successfully!")


nfl_data_py installed successfully!


## Step 1: Load NFL Play-by-Play Data from NFLfastR

Load three seasons (2021-2023) so the model has enough plays to train on.

In [21]:
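# Reload src.data.load_data so recent edits to the module take effect
# without restarting the kernel, then re-import the refreshed function
import importlib
import sys
if 'src.data.load_data' in sys.modules:
    importlib.reload(sys.modules['src.data.load_data'])
    from src.data.load_data import load_nfl_pbp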

# Select seasons to load
seasons = [2021, 2022, 2023]

try:
    # Load raw PBP data
    pbp_raw = load_nfl_pbp(
        seasons=seasons,
        columns=BLITZ_COLUMNS,
    )
    
    print(f"\nLoaded {len(pbp_raw)} total plays")
    print(f"Sample of columns: {list(pbp_raw.columns[:20])}")
    print(f"'number_of_pass_rushers' in pbp_raw: {'number_of_pass_rushers' in pbp_raw.columns}")
    print(f"'blitz' in pbp_raw: {'blitz' in pbp_raw.columns}")
    
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    pbp_raw = None


INFO:src.data.load_data:Loading NFL PBP data for seasons: [2021, 2022, 2023]


2021 done.
2022 done.
2023 done.
Downcasting floats.


INFO:src.data.load_data:Loaded 149021 plays
INFO:src.data.load_data:Filtered to 106796 offensive plays
INFO:src.data.load_data:Created blitz feature from number_of_pass_rushers



Loaded 106796 total plays
Sample of columns: ['play_id', 'game_id', 'old_game_id_x', 'home_team', 'away_team', 'season_type', 'week', 'posteam', 'posteam_type', 'defteam', 'side_of_field', 'yardline_100', 'game_date', 'quarter_seconds_remaining', 'half_seconds_remaining', 'game_seconds_remaining', 'game_half', 'quarter_end', 'drive', 'sp']
'number_of_pass_rushers' in pbp_raw: True
'blitz' in pbp_raw: True
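
The log line "Created blitz feature from number_of_pass_rushers" indicates that the target is derived from the rusher count, but the rule itself lives in `src.data.load_data` and is not shown here. The sketch below is one plausible version on toy data. The threshold of five rushers and the helper name `add_blitz_flag` are assumptions for illustration, not the project's confirmed logic.

```python
import numpy as np
import pandas as pd

# Assumption: a blitz means five or more pass rushers. The real rule is
# defined in src.data.load_data and is not visible in this notebook.
BLITZ_RUSHER_THRESHOLD = 5


def add_blitz_flag(plays: pd.DataFrame, threshold: int = BLITZ_RUSHER_THRESHOLD) -> pd.DataFrame:
    """Return a copy of `plays` with a binary `blitz` column (hypothetical helper)."""
    out = plays.copy()
    # Comparisons against NaN evaluate to False, so plays without a
    # recorded rusher count end up with blitz = 0 rather than a missing target.
    out["blitz"] = (out["number_of_pass_rushers"] >= threshold).astype("int8")
    return out


toy_plays = pd.DataFrame({"number_of_pass_rushers": [4, 5, 6, np.nan]})
flagged = add_blitz_flag(toy_plays)
print(flagged["blitz"].tolist())  # [0, 1, 1, 0]
```

Under this rule a missing rusher count silently becomes a non-blitz label. That would be consistent with the cleaning log reporting zero rows with a missing target, though the notebook does not confirm it.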


## Step 2: Extract Blitz Features

Select only the columns needed for the blitz model.

In [None]:
# Extract blitz features (skipped if the loading cell failed and set pbp_raw to None)
if pbp_raw is not None:
    pbp_features = extract_blitz_features(pbp_raw, BLITZ_COLUMNS)

    print(f"\nFeatures extracted shape: {pbp_features.shape}")
    print(f"Columns: {list(pbp_features.columns)}")
    print("\nMissing values before cleaning:")
    print(pbp_features.isnull().sum())
    print("\nData types:")
    print(pbp_features.dtypes)
else:
    print("Error: pbp_raw is None. Run the data loading cell (Step 1) first.")
    pbp_features = None


INFO:src.data.load_data:Extracting blitz features...
INFO:src.data.load_data:Extracted features shape: (106796, 12)


Columns in pbp_raw: ['play_id', 'game_id', 'old_game_id_x', 'home_team', 'away_team', 'season_type', 'week', 'posteam', 'posteam_type', 'defteam', 'side_of_field', 'yardline_100', 'game_date', 'quarter_seconds_remaining', 'half_seconds_remaining']... (398 total)
'blitz' in pbp_raw: True

BLITZ_COLUMNS: ['down', 'ydstogo', 'yardline_100', 'quarter', 'game_seconds_remaining', 'score_differential', 'offense_personnel', 'defense_personnel', 'formation', 'shotgun', 'motion', 'blitz']

Features extracted shape: (106796, 12)
Columns in pbp_features: ['down', 'ydstogo', 'yardline_100', 'quarter', 'game_seconds_remaining', 'score_differential', 'offense_personnel', 'defense_personnel', 'formation', 'shotgun', 'motion', 'blitz']
'blitz' in pbp_features: True

✓ Blitz column successfully extracted!
Blitz value counts:
blitz
0    89747
1    17049
Name: count, dtype: int64

Missing values before cleaning:
down                        410
ydstogo                       0
yardline_100                  
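
The extraction step reduces the 398-column frame to the 12 columns in `BLITZ_COLUMNS`. A minimal sketch of that kind of selection follows, on toy data. `select_columns` is a hypothetical stand-in for `extract_blitz_features`, whose actual implementation is not shown in this notebook.

```python
import pandas as pd


def select_columns(plays: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for extract_blitz_features: keep `columns`, in order."""
    missing = [col for col in columns if col not in plays.columns]
    if missing:
        # Fail early with the full list instead of a bare KeyError on the first column.
        raise KeyError(f"Columns missing from play-by-play data: {missing}")
    # .copy() avoids later chained-assignment surprises when cleaning the subset.
    return plays.loc[:, columns].copy()


toy_plays = pd.DataFrame(
    {"play_id": [1, 2], "down": [1.0, None], "ydstogo": [10, 7], "blitz": [0, 1]}
)
features = select_columns(toy_plays, ["down", "ydstogo", "blitz"])
print(features.shape)                     # (2, 3)
print(features.isnull().sum().to_dict())  # {'down': 1, 'ydstogo': 0, 'blitz': 0}
```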

## Step 3: Clean Data

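Handle missing values and check data quality. In the output below, `down` still has 410 missing values after cleaning; every other column is complete.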

In [23]:
# Clean blitz data
pbp_cleaned = clean_blitz_data(pbp_features, target_col=BLITZ_TARGET)

print(f"\nCleaned data shape: {pbp_cleaned.shape}")
print(f"\nRemaining missing values:")
print(pbp_cleaned.isnull().sum())

# Get class distribution
class_dist = get_class_distribution(pbp_cleaned, target_col=BLITZ_TARGET)
print(f"\nClass distribution: {class_dist}")

INFO:src.data.clean_data:Starting data cleaning. Shape: (106796, 12)
INFO:src.data.clean_data:Removed 0 rows with missing target
INFO:src.data.clean_data:Removed 0 rows with all null features
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  clean_df[col].fillna("Unknown", inplace=True)


Cleaned data shape: (106796, 12)

Remaining missing values:
down                      410
ydstogo                     0
yardline_100                0
quarter                     0
game_seconds_remaining      0
score_differential          0
offense_personnel           0
defense_personnel           0
formation                   0
shotgun                     0
motion                      0
blitz                       0
dtype: int64

Class distribution: {'counts': {0: 89747, 1: 17049}, 'percentages': {0: 84.03591894827521, 1: 15.964081051724785}}
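
The pandas warning in the output above points at `clean_df[col].fillna("Unknown", inplace=True)` inside `clean_blitz_data`. The sketch below shows the two replacements the warning text itself suggests, on toy data. The column names in `categorical_cols` are an assumption, since the cleaning module's source is not shown here.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "formation": ["SHOTGUN", None, "I_FORM"],
        "offense_personnel": [None, "1 RB, 1 TE, 3 WR", None],
    }
)

# Assumption: these are the categorical columns the cleaning step fills.
categorical_cols = ["formation", "offense_personnel"]

clean_df = toy.copy()
for col in categorical_cols:
    # Assign the result back instead of calling fillna(..., inplace=True)
    # on the intermediate object clean_df[col].
    clean_df[col] = clean_df[col].fillna("Unknown")

# Equivalent single call using a column-to-value mapping.
clean_df_alt = toy.fillna({col: "Unknown" for col in categorical_cols})

print(clean_df["formation"].tolist())  # ['SHOTGUN', 'Unknown', 'I_FORM']
```

Both forms operate on the frame directly, so they do not depend on the chained in-place behavior that the warning says will stop working in pandas 3.0.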


## Step 4: Validate Data

Check that all required columns are present and valid. Note that validation passes here even though `down` still reports 410 missing values.

In [24]:
# Validate cleaned data
try:
    validate_blitz_data(pbp_cleaned, BLITZ_COLUMNS)
    print("\n✓ Data validation passed!")
except ValueError as e:
    print(f"\n✗ Validation error: {e}")

down    410
dtype: int64
INFO:src.data.clean_data:Data validation passed



✓ Data validation passed!
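
Because validation passes while `down` still has missing values, a stricter check may be worth adding before modeling. The following is a minimal sketch on toy data. `find_columns_with_nulls` is a hypothetical helper, not part of `src.data.clean_data`, and dropping the affected rows is only one possible policy.

```python
import pandas as pd


def find_columns_with_nulls(df: pd.DataFrame, required: list[str]) -> dict[str, int]:
    """Hypothetical check: map each required column containing nulls to its null count."""
    null_counts = df[required].isnull().sum()
    return {col: int(count) for col, count in null_counts.items() if count > 0}


toy = pd.DataFrame({"down": [1.0, None, 3.0], "ydstogo": [10, 5, 2], "blitz": [0, 1, 0]})
problems = find_columns_with_nulls(toy, ["down", "ydstogo", "blitz"])
print(problems)  # {'down': 1}

# One possible policy (an assumption): drop rows where `down` is missing.
strict = toy.dropna(subset=["down"])
print(len(strict))  # 2
```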


## Step 5: Save Cleaned Data

Save the cleaned dataset to the processed directory for the next phase.

In [25]:
# Save cleaned data
output_file = PROCESSED_DATA_PATH / "blitz_data_cleaned.csv"
pbp_cleaned.to_csv(output_file, index=False)

print(f"\n✓ Saved cleaned data to: {output_file}")
print(f"  Shape: {pbp_cleaned.shape}")
print(f"  Size: {output_file.stat().st_size / 1024 / 1024:.2f} MB")

# Save data info
info_file = PROCESSED_DATA_PATH / "blitz_data_info.txt"
with open(info_file, "w", encoding="utf-8") as f:
    f.write("Blitz Model Dataset Info\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Total plays: {len(pbp_cleaned)}\n")
    f.write(f"Features: {len(pbp_cleaned.columns) - 1}\n")
    f.write(f"Blitz plays: {(pbp_cleaned[BLITZ_TARGET] == 1).sum()}\n")
    f.write(f"No blitz plays: {(pbp_cleaned[BLITZ_TARGET] == 0).sum()}\n")
    f.write(f"\nColumns: {', '.join(pbp_cleaned.columns)}\n")

print(f"✓ Saved info to: {info_file}")


✓ Saved cleaned data to: c:\Users\quays\source\repos\Defensive-Intelligence-Predictor\data\processed\blitz_data_cleaned.csv
  Shape: (106796, 12)
  Size: 10.14 MB
✓ Saved info to: c:\Users\quays\source\repos\Defensive-Intelligence-Predictor\data\processed\blitz_data_info.txt
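
Before moving on, it can help to confirm that the saved CSV reads back with the same shape, columns, and missing-value counts. The sketch below shows such a round-trip check on toy data written to a temporary directory. The real file path is replaced by a temporary one, and the toy frame is illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame(
    {"down": [1.0, None], "formation": ["SHOTGUN", "Unknown"], "blitz": [0, 1]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "blitz_data_cleaned.csv"
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Compare structure and missing-value counts rather than exact dtypes.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_nulls = reloaded.isnull().sum().to_dict() == toy.isnull().sum().to_dict()
print(same_shape, same_columns, same_nulls)  # True True True
```

CSV stores no dtype information, so columns that were downcast during loading may come back with default dtypes. If exact dtypes matter downstream, they would need to be passed to `pd.read_csv` explicitly.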


## Summary

✓ Loaded NFL PBP data from NFLfastR  
✓ Extracted blitz features  
✓ Cleaned and validated data  
✓ Saved to processed directory  

**Next Phase**: Feature Engineering & Model Training (02_feature_engineering.ipynb)