# College Basketball Play-by-Play Data Collection
## Using CollegeBasketballData.com API for 2025/26 Season

This notebook demonstrates how to use the CollegeBasketballData.com (CBBD) API to collect play-by-play data for the 2025/26 college basketball season.

### Prerequisites
- Python 3.7+
- API key from CollegeBasketballData.com (register at https://collegebasketballdata.com)
- Install required packages: `pip install cbbd pandas numpy`

## 1. Setup and Import Libraries

In [1]:
# Import required libraries
import cbbd
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from cbbd.rest import ApiException
from pprint import pprint

# Change working directory to Desktop
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
os.chdir(desktop_path)
print(f"Working directory changed to: {os.getcwd()}")

Working directory changed to: C:\Users\shank.subramani_betf\Desktop


## 2. Configure API Authentication

The CBBD API uses Bearer token authentication. You'll be prompted to enter your API key securely.

In [2]:
# Configure API key
import getpass

# Prompt user for API key (input will be hidden for security)
print("Please enter your CollegeBasketballData.com API key")
print("(Get your free key at: https://collegebasketballdata.com)")
api_key = getpass.getpass("API Key: ")

# Alternatively, use environment variable if already set
# api_key = os.environ.get('CBBD_API_KEY')

# Validate that a key was entered
if not api_key or api_key.strip() == '':
    raise ValueError("API key is required. Please run this cell again and enter your key.")

# Configure the API client
configuration = cbbd.Configuration(
    host="https://api.collegebasketballdata.com",
    access_token=api_key
)

print("✓ API configuration completed successfully!")

Please enter your CollegeBasketballData.com API key
(Get your free key at: https://collegebasketballdata.com)
✓ API configuration completed successfully!
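
For unattended runs, the commented-out environment-variable alternative in the cell above can be combined with the hidden prompt. The sketch below is one way to do that. `load_api_key` is a hypothetical helper written for this illustration, not part of `cbbd`, and `CBBD_API_KEY` is simply the variable name used in the comment above.

```python
import getpass
import os


def load_api_key(env_var="CBBD_API_KEY", prompt=getpass.getpass):
    """Return the API key from the environment, prompting only if it is unset.

    Hypothetical helper for illustration; not part of the cbbd package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden interactive prompt
        key = (prompt("API Key: ") or "").strip()
    if not key:
        raise ValueError("API key is required.")
    return key
```

The result can be passed to `cbbd.Configuration(access_token=...)` exactly as in the cell above.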


## 3. Set Season Parameter

For the 2025/26 season, use `season=2026`. Seasons are identified by the end year of the academic year.

In [3]:
# Set the season (use 2026 for 2025/26 season)
season = 2026
print(f"Season set to: {season-1}/{season}")

Season set to: 2025/2026
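
When starting from a game date rather than a fixed season, the end-year convention can be derived from the date. This is a minimal sketch, assuming a season runs from July of one year through June of the next. The July cutoff is an assumption made here for illustration, not something taken from the API, and both helpers are hypothetical.

```python
from datetime import date


def season_for_date(d, first_month=7):
    """Map a calendar date to its season label (the year in which the season ends).

    Assumes a season starts in `first_month` (July by default, an illustrative cutoff).
    """
    return d.year + 1 if d.month >= first_month else d.year


def season_label(season):
    """Format a season number like 2026 as '2025/2026', matching the cell above."""
    return f"{season - 1}/{season}"
```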


## 4. Choose Your Data Collection Method

The CBBD API provides several endpoints for accessing play-by-play data. Choose the method that best fits your needs:

- **Method 1 (4a)**: Get plays for a **specific game** (when you know the game ID)
- **Method 2 (4b)**: Get plays by **date range** (all games between two dates)
- **Method 3 (4c)**: Get plays by **team** (entire season for one team)
- **Method 4 (4d)**: Get **shooting plays only** (for shot charts and shooting analysis)

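**You only need to run ONE of the methods below (4a, 4b, 4c, or 4d) based on your use case.** Section 5 reads from `plays_df`, which only Method 1 (4a) creates. If you use another method, assign its result to `plays_df` first (for example, `plays_df = team_plays_df`).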

### 4a. Method 1: Get Plays for a Specific Game

First, find the games played on a specific date to get their game IDs, then retrieve the plays for each of those games.

In [12]:
# Combined: Get games for a date and collect play-by-play for all of them
from datetime import datetime
import time

# Prompt user for the date
date_input = input("Enter date to search for games (YYYY-MM-DD) or press Enter for today: ").strip()

# Use today's date if nothing entered
if not date_input:
    search_date = datetime.now()
else:
    # Parse the string into a datetime object
    search_date = datetime.strptime(date_input, '%Y-%m-%d')

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get games for the specified date
        games = games_api.get_games(
            start_date_range=search_date,
            end_date_range=search_date,
            season=season
        )
        
        # Convert to DataFrame
        games_df = pd.DataFrame([g.to_dict() for g in games])
        
        if len(games_df) > 0:
            # Collect play-by-play for all games
            all_plays = []
            
            for idx, game in games_df.iterrows():
                game_id = game['id']
                
                try:
                    # Get plays for this game
                    plays = plays_api.get_plays(game_id=game_id)
                    
                    # Convert to list of dicts
                    game_plays = [p.to_dict() for p in plays]
                    
                    # Add to our collection
                    all_plays.extend(game_plays)
                    
                    # Small delay to be nice to the API
                    time.sleep(0.5)
                    
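                except ApiException as e:
                    # Report the failure instead of skipping silently, then move on
                    print(f"Skipping game {game_id}: {e}")
                    continue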
            
            # Convert all plays to DataFrame
            plays_df = pd.DataFrame(all_plays)
            
        else:
            plays_df = pd.DataFrame()
        
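    except ApiException as e:
        # Report the failure and fall back to empty frames so the summary below still runs
        print(f"Exception when calling GamesApi->get_games: {e}")
        games_df = pd.DataFrame()
        plays_df = pd.DataFrame()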

# Quick summary
print(f"Games: {len(games_df)}, Plays: {len(plays_df):,}")

Games: 11, Plays: 5,172


In [13]:
plays_df

Unnamed: 0,id,sourceId,gameId,gameSourceId,gameStartDate,season,seasonType,gameType,playType,period,...,playText,participants,isHomeTeam,teamId,team,conference,opponentId,opponent,opponentConference,shotInfo
0,43562565,401822888116889907,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Start game,[],,,,,,,,
1,43562567,401822888116889909,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball lost by Butler,"[{'name': 'Drayton Jones', 'id': 1610}]",True,34.0,Butler,Big East,279.0,St. John's,Big East,
2,43562566,401822888116889908,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball won by St. John's,"[{'name': 'Zuby Ejiofor', 'id': 1424}]",False,279.0,St. John's,Big East,34.0,Butler,Big East,
3,43562568,401822888116889913,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,1,...,Ian Jackson misses 13-foot jumper,"[{'name': 'Ian Jackson', 'id': 243}]",False,279.0,St. John's,Big East,34.0,Butler,Big East,"{'shooter': {'name': 'Ian Jackson', 'id': 243}..."
4,43562569,401822888116889914,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,1,...,Drayton Jones Defensive Rebound.,"[{'name': 'Drayton Jones', 'id': 1610}]",True,34.0,Butler,Big East,279.0,St. John's,Big East,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,43555113,401825423116900592,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,2,...,Aday Mara Defensive Rebound.,"[{'name': 'Aday Mara', 'id': 535}]",False,170.0,Michigan,Big Ten,226.0,Penn State,Big Ten,
5168,43555117,401825423116900987,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Game,2,...,End of Game,[],,,,,,,,
5169,43555116,401825423116900986,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Period,2,...,End of 2nd half,[],,,,,,,,
5170,43555115,401825423116900942,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Dead Ball Rebound,2,...,Penn State Deadball Team Rebound.,[],True,226.0,Penn State,Big Ten,170.0,Michigan,Big Ten,
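
A quick per-game summary helps confirm that each game returned a plausible number of plays. This is a minimal sketch using the camelCase columns shown above. The three-row frame is made-up data so the block runs without an API call, and `summarize_games` is a hypothetical helper.

```python
import pandas as pd


def summarize_games(plays):
    """Count plays and distinct play types per game."""
    return (
        plays.groupby("gameId")
        .agg(plays=("id", "count"), play_types=("playType", "nunique"))
        .reset_index()
    )


# Made-up rows standing in for plays_df
demo_plays = pd.DataFrame({
    "id": [1, 2, 3],
    "gameId": [214340, 214340, 214336],
    "playType": ["Jumpball", "JumpShot", "JumpShot"],
})
summary = summarize_games(demo_plays)
```

On the real data, `summarize_games(plays_df)` should return one row per game, 11 rows for the date collected above.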


### 4b. Method 2: Get Plays by Date Range

This retrieves all plays for all games within a date range, which is useful for analyzing a specific time period. `get_plays_by_date` takes a single required `var_date` argument (calling it with `start_date`/`end_date` raises a `ValidationError` for the missing `var_date` field), so the cell below requests one day at a time.

In [6]:
# Define date range for play-by-play collection
import time

start_date = '2025-11-04'
end_date = '2025-11-07'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    date_plays = []
    try:
        # Request plays one day at a time, from start_date through end_date inclusive.
        # var_date is passed as a datetime, like the date filters in 4a.
        current = datetime.strptime(start_date, '%Y-%m-%d')
        last = datetime.strptime(end_date, '%Y-%m-%d')
        while current <= last:
            plays_by_date = plays_api.get_plays_by_date(var_date=current)
            date_plays.extend(p.to_dict() for p in plays_by_date)
            current += timedelta(days=1)
            time.sleep(0.5)  # Small delay to be nice to the API
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_date: {e}")

# Convert to DataFrame (to_dict() keeps the API's camelCase names, e.g. gameId and playType)
plays_date_df = pd.DataFrame(date_plays)

print(f"Total plays from {start_date} to {end_date}: {len(plays_date_df)}")
print(f"\nUnique games: {plays_date_df['gameId'].nunique() if len(plays_date_df) > 0 else 0}")
print(f"Play types: {plays_date_df['playType'].value_counts().head(10) if len(plays_date_df) > 0 else 'No data'}")

### 4c. Method 3: Get Plays by Team

This retrieves all plays for a specific team's entire season - ideal for team-specific analysis

In [None]:
# Get play-by-play for a specific team (example: Duke)
team_name = 'Duke'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get all plays for the team's season
        team_plays = plays_api.get_plays_by_team(
            season=season,
            team=team_name
        )
        
        # Convert to DataFrame
        team_plays_df = pd.DataFrame([p.to_dict() for p in team_plays])
        
        print(f"Total plays for {team_name} in {season-1}/{season} season: {len(team_plays_df)}")
        if len(team_plays_df) > 0:
            # to_dict() keeps the API's camelCase names (gameId, playType)
            print(f"\nGames played: {team_plays_df['gameId'].nunique()}")
            print(f"\nPlay type distribution:")
            print(team_plays_df['playType'].value_counts())
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_team: {e}")
        team_plays_df = pd.DataFrame()

## 5. Process a Single Game (Start Here!)

Before processing all games, let's focus on getting ONE game right. This approach helps us:
- Understand the exact data structure
- Debug any parsing issues
- Validate our output before scaling up

In [14]:
# Step 5a: Pick ONE game and examine its structure
# This is crucial - we need to understand the data before processing it

# Get unique game IDs
if 'plays_df' in locals() and len(plays_df) > 0:
    unique_games = plays_df['gameId'].unique()
    print(f"Total unique games: {len(unique_games)}")
    
    # Pick the first game
    sample_game_id = unique_games[0]
    print(f"\nSelected game ID: {sample_game_id}")
    
    # Filter for just this game
    single_game_df = plays_df[plays_df['gameId'] == sample_game_id].copy()
    print(f"Total plays in this game: {len(single_game_df)}")
    
    # Get teams playing
    teams = single_game_df[single_game_df['team'].notna()]['team'].unique()
    print(f"Teams: {teams}")
    
    # Show all columns available
    print(f"\nAll columns ({len(single_game_df.columns)}):")
    for col in single_game_df.columns:
        print(f"  - {col}")
else:
    print("No plays_df available - run a data collection method first (4a, 4b, 4c, or 4d)")

Total unique games: 11

Selected game ID: 214340
Total plays in this game: 429
Teams: ['Butler' "St. John's"]

All columns (29):
  - id
  - sourceId
  - gameId
  - gameSourceId
  - gameStartDate
  - season
  - seasonType
  - gameType
  - playType
  - period
  - clock
  - secondsRemaining
  - homeScore
  - awayScore
  - homeWinProbability
  - scoringPlay
  - shootingPlay
  - scoreValue
  - wallclock
  - playText
  - participants
  - isHomeTeam
  - teamId
  - team
  - conference
  - opponentId
  - opponent
  - opponentConference
  - shotInfo


In [15]:
# Step 5b: Examine the shotInfo structure for ONE shooting play
# This is where the previous code failed - shotInfo is a DICTIONARY, not an object!

# Get shooting plays from our single game
shooting_plays_single = single_game_df[single_game_df['shootingPlay'] == True]
print(f"Shooting plays in this game: {len(shooting_plays_single)}")

if len(shooting_plays_single) > 0:
    # Get the first shooting play with shotInfo
    sample_shot = shooting_plays_single[shooting_plays_single['shotInfo'].notna()].iloc[0]
    
    print(f"\n{'='*60}")
    print("SAMPLE SHOOTING PLAY")
    print(f"{'='*60}")
    print(f"Play Type: {sample_shot['playType']}")
    print(f"Play Text: {sample_shot['playText']}")
    print(f"Team: {sample_shot['team']}")
    
    print(f"\n{'='*60}")
    print("shotInfo STRUCTURE (this is the key!)")
    print(f"{'='*60}")
    print(f"Type: {type(sample_shot['shotInfo'])}")
    print(f"\nFull shotInfo content:")
    pprint(sample_shot['shotInfo'])
    
    # Show all available keys
    if isinstance(sample_shot['shotInfo'], dict):
        print(f"\nAvailable keys in shotInfo:")
        for key in sample_shot['shotInfo'].keys():
            print(f"  - {key}: {type(sample_shot['shotInfo'][key])}")
else:
    print("No shooting plays found in this game")

Shooting plays in this game: 147

SAMPLE SHOOTING PLAY
Play Type: JumpShot
Play Text: Ian Jackson misses 13-foot jumper
Team: St. John's

shotInfo STRUCTURE (this is the key!)
Type: <class 'dict'>

Full shotInfo content:
{'assisted': False,
 'assistedBy': {'id': None, 'name': None},
 'location': {'x': 780.2, 'y': 320},
 'made': False,
 'range': 'jumper',
 'shooter': {'id': 243, 'name': 'Ian Jackson'}}

Available keys in shotInfo:
  - shooter: <class 'dict'>
  - made: <class 'bool'>
  - range: <class 'str'>
  - assisted: <class 'bool'>
  - assistedBy: <class 'dict'>
  - location: <class 'dict'>
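
Since every `shotInfo` entry is a plain nested dictionary, `pd.json_normalize` is one way to flatten many of them at once. This is a minimal sketch, assuming every entry is a dict with the same nesting as the sample above. The first entry is the sample printed above; the second is made up to show a shot without coordinates.

```python
import pandas as pd

# The sample shotInfo printed above, plus a second made-up shot without a location
shot_infos = [
    {
        "assisted": False,
        "assistedBy": {"id": None, "name": None},
        "location": {"x": 780.2, "y": 320},
        "made": False,
        "range": "jumper",
        "shooter": {"id": 243, "name": "Ian Jackson"},
    },
    {
        "assisted": False,
        "assistedBy": {"id": None, "name": None},
        "location": {"x": None, "y": None},
        "made": True,
        "range": "jumper",
        "shooter": {"id": 1610, "name": "Drayton Jones"},
    },
]

# Nested keys become dotted column names such as 'shooter.name' and 'location.x'
flat = pd.json_normalize(shot_infos)
```

Rows whose `shotInfo` is missing entirely would need to be filtered out first, for example with `.dropna()` on the `shotInfo` column.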


In [16]:
# Step 5c: CORRECT parsing - shotInfo is a DICTIONARY, use .get() not hasattr()!
# Process all shooting plays for our single game
import re

def extract_shot_data(row):
    """Extract all shot data from a single play row - handles dictionary structure correctly."""
    shot_info = row.get('shotInfo')
    play_text = row.get('playText') or ''
    
    # Start with basic play info
    shot_data = {
        'play_id': row.get('id'),
        'game_id': row.get('gameId'),
        'period': row.get('period'),
        'clock': row.get('clock'),
        'seconds_remaining': row.get('secondsRemaining'),
        'team': row.get('team'),
        'team_id': row.get('teamId'),
        'opponent': row.get('opponent'),
        'is_home_team': row.get('isHomeTeam'),
        'play_type': row.get('playType'),
        'play_text': play_text,
        'score_value': row.get('scoreValue'),
        'home_score': row.get('homeScore'),
        'away_score': row.get('awayScore'),
    }
    
    # Parse from play text
    # Distance: "24-foot" -> 24
    distance_match = re.search(r'(\d+)-foot', play_text)
    shot_data['distance'] = int(distance_match.group(1)) if distance_match else None
    
    # Three pointer: "three point" (with space)
    shot_data['is_three_from_text'] = 'three point' in play_text.lower()
    
    # Made/missed from text
    shot_data['made_from_text'] = ' makes ' in play_text.lower() or ' made ' in play_text.lower()
    
    # Extract from shotInfo dictionary (THIS IS THE KEY FIX!)
    if shot_info and isinstance(shot_info, dict):
        # Basic shot info
        shot_data['made'] = shot_info.get('made')
        shot_data['assisted'] = shot_info.get('assisted')
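        # Note: the sample shotInfo in 5b has no 'threePointer' key, so this stays None
        # for every shot; use is_three_from_text (or shotInfo['range']) instead.
        shot_data['three_pointer'] = shot_info.get('threePointer')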
        
        # Shooter info (nested dict)
        shooter = shot_info.get('shooter')
        if shooter and isinstance(shooter, dict):
            shot_data['shooter_name'] = shooter.get('name')
            shot_data['shooter_id'] = shooter.get('id')
        
        # Location info (nested dict)
        location = shot_info.get('location')
        if location and isinstance(location, dict):
            shot_data['x'] = location.get('x')
            shot_data['y'] = location.get('y')
    
    return shot_data

# Process our single game
print(f"Processing game: {sample_game_id}")
print(f"Shooting plays to process: {len(shooting_plays_single)}")

# Extract shot data for each shooting play
single_game_shots = []
for idx, row in shooting_plays_single.iterrows():
    shot_data = extract_shot_data(row.to_dict())
    single_game_shots.append(shot_data)

# Convert to DataFrame
single_game_shots_df = pd.DataFrame(single_game_shots)

print(f"\n{'='*60}")
print("EXTRACTION RESULTS")
print(f"{'='*60}")
print(f"Total shots extracted: {len(single_game_shots_df)}")
print(f"\nColumns extracted ({len(single_game_shots_df.columns)}):")
print(single_game_shots_df.columns.tolist())

# Check what we got
print(f"\n{'='*60}")
print("DATA QUALITY CHECK")
print(f"{'='*60}")
for col in ['made', 'distance', 'x', 'y', 'shooter_name', 'three_pointer', 'is_three_from_text']:
    if col in single_game_shots_df.columns:
        non_null = single_game_shots_df[col].notna().sum()
        if col == 'is_three_from_text':
            # Boolean - count True values
            true_count = single_game_shots_df[col].sum()
            print(f"  {col}: {true_count} three-pointers found")
        else:
            print(f"  {col}: {non_null}/{len(single_game_shots_df)} ({non_null/len(single_game_shots_df)*100:.1f}%)")

Processing game: 214340
Shooting plays to process: 147

EXTRACTION RESULTS
Total shots extracted: 147

Columns extracted (24):
['play_id', 'game_id', 'period', 'clock', 'seconds_remaining', 'team', 'team_id', 'opponent', 'is_home_team', 'play_type', 'play_text', 'score_value', 'home_score', 'away_score', 'distance', 'is_three_from_text', 'made_from_text', 'made', 'assisted', 'three_pointer', 'shooter_name', 'shooter_id', 'x', 'y']

DATA QUALITY CHECK
  made: 147/147 (100.0%)
  distance: 77/147 (52.4%)
  x: 114/147 (77.6%)
  y: 114/147 (77.6%)
  shooter_name: 147/147 (100.0%)
  three_pointer: 0/147 (0.0%)
  is_three_from_text: 36 three-pointers found
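
Because `made` is populated for every shot, it can be used to validate the text-derived `made_from_text` flag. The sketch below lists rows where the two disagree. The three-row frame is made-up data in the shape of `single_game_shots_df`, and `made_mismatches` is a hypothetical helper.

```python
import pandas as pd


def made_mismatches(shots):
    """Return rows where the shotInfo flag and the text-derived flag disagree."""
    # Only compare rows where shotInfo actually supplied a value
    checked = shots[shots["made"].notna()]
    return checked[checked["made"].astype(bool) != checked["made_from_text"].astype(bool)]


# Made-up rows standing in for single_game_shots_df
demo_shots = pd.DataFrame({
    "play_id": [1, 2, 3],
    "made": [False, True, True],
    "made_from_text": [False, True, False],
})
mismatches = made_mismatches(demo_shots)
```

An empty result on real data suggests the text parsing agrees with `shotInfo`; any rows returned are worth reading by hand.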


In [None]:
# Step 5d: Get roster data using get_game_players API
# Note: API filters by team/date, not game_id

# Get game info from our single game
game_date = single_game_df['gameStartDate'].iloc[0]
game_teams = single_game_df[single_game_df['team'].notna()]['team'].unique()

print(f"Game date: {game_date}")
print(f"Teams: {game_teams}")

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    
    all_players = []
    
    for team in game_teams:
        try:
            # Get players for this team on this date
            game_players = games_api.get_game_players(
                start_date_range=game_date,
                end_date_range=game_date,
                team=team,
                season=season
            )
            
            # Convert to list of dicts
            players = [p.to_dict() for p in game_players]
            all_players.extend(players)
            print(f"  {team}: {len(players)} player records")
            
        except ApiException as e:
            print(f"  {team}: Error - {e}")

# Convert to DataFrame
game_players_df = pd.DataFrame(all_players)

print(f"\nTotal rows: {len(game_players_df)}")
print(f"Note: Each row is a team's full roster - players are nested in 'players' column")

In [19]:
# Step 5e: Extract players from nested structure
all_players_flat = []

for idx, row in game_players_df.iterrows():
    team = row['team']
    players_list = row['players']
    
    if players_list:
        for player in players_list:
            player['team'] = team  # Add team to each player
            all_players_flat.append(player)

players_flat_df = pd.DataFrame(all_players_flat)

print(f"Total players: {len(players_flat_df)}")
print(f"\nColumns: {players_flat_df.columns.tolist()}")

# Show starters
print(f"\n{'='*60}")
print("STARTERS")
print(f"{'='*60}")
for team in players_flat_df['team'].unique():
    team_starters = players_flat_df[(players_flat_df['team'] == team) & (players_flat_df['starter'] == True)]
    print(f"\n{team} ({len(team_starters)}):")
    for idx, p in team_starters.iterrows():
        print(f"  {p['position']} - {p['name']} ({p['minutes']} min)")


Total players: 19

Columns: ['rebounds', 'freeThrows', 'threePointFieldGoals', 'twoPointFieldGoals', 'fieldGoals', 'offensiveReboundPct', 'freeThrowRate', 'assistsTurnoverRatio', 'gameScore', 'trueShootingPct', 'effectiveFieldGoalPct', 'netRating', 'defensiveRating', 'offensiveRating', 'usage', 'blocks', 'steals', 'assists', 'fouls', 'turnovers', 'points', 'minutes', 'ejected', 'starter', 'position', 'name', 'athleteSourceId', 'athleteId', 'team']

STARTERS

Butler (5):
  G - Michael Ajayi (30 min)
  F - Drayton Jones (21 min)
  G - Azavier Robinson (17 min)
  G - Jamie Kaiser Jr. (28 min)
  G - Finley Bizjack (35 min)

St. John's (5):
  F - Dillon Mitchell (33 min)
  F - Zuby Ejiofor (30 min)
  F - Bryce Hopkins (24 min)
  G - Oziyah Sellers (33 min)
  G - Ian Jackson (22 min)
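
The starters can also be collected into a per-team dictionary, which is the shape the lineup tracker in Step 5g expects. This is a minimal sketch using `groupby`. The frame holds a few names from the output above plus one made-up bench row, and the result is stored under a demo name so it does not shadow the real `starters_by_team`.

```python
import pandas as pd

# A few starters from the output above, plus one made-up bench row
demo_players = pd.DataFrame({
    "team": ["Butler", "Butler", "St. John's", "St. John's"],
    "name": ["Michael Ajayi", "Bench Player", "Zuby Ejiofor", "Ian Jackson"],
    "starter": [True, False, True, True],
})

# Keep starters only, then collect names into one list per team
demo_starters = (
    demo_players[demo_players["starter"]]
    .groupby("team")["name"]
    .apply(list)
    .to_dict()
)
```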


In [None]:
# Step 5f: Build lineup tracker
# Parse substitutions to track who's on court

import re

def parse_substitution(play_text):
    """
    Parse substitution text.
    Format: "Player Name subbing in/out for Team"
    Returns: {'player': name, 'action': 'in' or 'out', 'team': team_name}
    """
    # Pattern: "Player Name subbing in for Team" or "Player Name subbing out for Team"
    pattern = r'(.+?)\s+subbing\s+(in|out)\s+for\s+(.+?)$'
    match = re.search(pattern, play_text, re.IGNORECASE)
    if match:
        return {
            'player': match.group(1).strip(),
            'action': match.group(2).lower(),
            'team': match.group(3).strip()
        }
    return None

# Get substitution plays from our single game
subs = single_game_df[single_game_df['playType'] == 'Substitution'].copy()
print(f"Total substitutions in game: {len(subs)}")

# Test on sample subs
print("\nTesting substitution parsing:")
print("="*60)
for idx, sub in subs.head(10).iterrows():
    result = parse_substitution(sub['playText'])
    print(f"Text: {sub['playText']}")
    print(f"Parsed: {result}")
    print()

# Count in vs out
subs['parsed'] = subs['playText'].apply(parse_substitution)
subs_valid = subs[subs['parsed'].notna()]
in_count = sum(1 for p in subs_valid['parsed'] if p['action'] == 'in')
out_count = sum(1 for p in subs_valid['parsed'] if p['action'] == 'out')
print(f"Summary: {in_count} 'in', {out_count} 'out' (should be equal)")
print(f"Parse failures: {len(subs) - len(subs_valid)}")
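
The in/out totals above cover the whole game. Checking the balance per team narrows down where a missing substitution occurred. This is a minimal, standalone sketch that reuses the same pattern as `parse_substitution`; the play texts are made up in the format described above, and `sub_balance` is a hypothetical helper.

```python
import re
from collections import Counter

SUB_PATTERN = re.compile(r"(.+?)\s+subbing\s+(in|out)\s+for\s+(.+?)$", re.IGNORECASE)


def sub_balance(play_texts):
    """Return {team: ins - outs}; a balanced team maps to 0."""
    balance = Counter()
    for text in play_texts:
        match = SUB_PATTERN.search(text)
        if match:
            team = match.group(3).strip()
            balance[team] += 1 if match.group(2).lower() == "in" else -1
    return dict(balance)


# Made-up substitution texts in the format described above
demo_texts = [
    "Player A subbing out for Butler",
    "Player B subbing in for Butler",
    "Player C subbing out for St. John's",
]
demo_balance = sub_balance(demo_texts)
```

On real data this could be called as `sub_balance(subs['playText'])`; a nonzero value points at the team whose substitutions do not pair up.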

In [None]:
# Step 5g: Track full lineups throughout the game
# Use REAL starters from API, then track substitutions
# Process all subs at same timestamp together before logging

def track_lineups_with_real_starters(game_df, starters_by_team):
    """Track which 5 players are on court for each team throughout the game."""
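    # Get teams, home team first, so teams[0]/teams[1] match the home_/away_ columns logged below
    team_rows = game_df[game_df['team'].notna()]
    home = team_rows.loc[team_rows['isHomeTeam'] == True, 'team'].unique().tolist()
    teams = home + [t for t in team_rows['team'].unique() if t not in home]
    print(f"Teams: {list(teams)}")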
    
    # Initialize lineups with REAL starters from API
    lineups = {}
    for team in teams:
        if team in starters_by_team:
            lineups[team] = set(starters_by_team[team])
            print(f"\n{team} starters (from API): {len(lineups[team])}")
            for p in lineups[team]:
                print(f"  - {p}")
        else:
            lineups[team] = set()
            print(f"\n{team}: No starters found in API data")
    
    # Group plays by (period, clock) to process subs together
    game_df_sorted = game_df.sort_values(['period', 'secondsRemaining'], ascending=[True, False])
    
    lineup_log = []
    current_period = None
    current_clock = None
    pending_plays = []
    
    def process_pending_plays():
        """Process all pending plays and log lineup state once at the end."""
        nonlocal pending_plays
        if not pending_plays:
            return
            
        # First, apply all substitutions
        for play in pending_plays:
            if play.get('playType') == 'Substitution':
                team = play.get('team')
                sub_info = parse_substitution(play.get('playText', ''))
                if sub_info and team:
                    player = sub_info['player']
                    action = sub_info['action']
                    if action == 'in':
                        lineups[team].add(player)
                    elif action == 'out':
                        lineups[team].discard(player)
        
        # Then log lineup state for each play at this timestamp
        for play in pending_plays:
            lineup_log.append({
                'play_id': play.get('id'),
                'period': play.get('period'),
                'clock': play.get('clock'),
                'seconds_remaining': play.get('secondsRemaining'),
                'play_type': play.get('playType'),
                'team': play.get('team'),
                'home_lineup': list(lineups.get(teams[0], set())) if len(teams) > 0 else [],
                'away_lineup': list(lineups.get(teams[1], set())) if len(teams) > 1 else [],
                'home_lineup_size': len(lineups.get(teams[0], set())) if len(teams) > 0 else 0,
                'away_lineup_size': len(lineups.get(teams[1], set())) if len(teams) > 1 else 0,
            })
        
        pending_plays = []
    
    # Process plays
    for idx, play in game_df_sorted.iterrows():
        period = play.get('period')
        clock = play.get('clock')
        
        # If we're at a new timestamp, process pending plays first
        if (period, clock) != (current_period, current_clock):
            process_pending_plays()
            current_period = period
            current_clock = clock
        
        pending_plays.append(play.to_dict())
    
    # Process final batch
    process_pending_plays()
    
    return pd.DataFrame(lineup_log), lineups

# Create starters_by_team if not already created
if 'starters_by_team' not in locals():
    starters_by_team = {}
    for team in players_flat_df['team'].unique():
        team_starters = players_flat_df[(players_flat_df['team'] == team) & (players_flat_df['starter'] == True)]
        starters_by_team[team] = team_starters['name'].tolist()

# Run lineup tracking with real starters
lineup_df, final_lineups = track_lineups_with_real_starters(single_game_df, starters_by_team)

print(f"\n{'='*60}")
print("LINEUP TRACKING RESULTS")
print(f"{'='*60}")
print(f"Total plays tracked: {len(lineup_df)}")

# Check lineup sizes (should stay at 5)
print(f"\nLineup size check (should be exactly 5 per team):")
for col in ['home_lineup_size', 'away_lineup_size']:
    sizes = lineup_df[col]
    print(f"  {col}: min={sizes.min()}, max={sizes.max()}, mode={sizes.mode().iloc[0]}")
    
# Show any plays where lineup size != 5
bad_lineups = lineup_df[(lineup_df['home_lineup_size'] != 5) | (lineup_df['away_lineup_size'] != 5)]
if len(bad_lineups) > 0:
    print(f"\n⚠️  {len(bad_lineups)} plays with lineup size != 5")
    print(bad_lineups[['period', 'clock', 'play_type', 'home_lineup_size', 'away_lineup_size']].head(10))
else:
    print(f"\n✓ All plays have exactly 5 players per team!")
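
The reason for batching substitutions by timestamp is easiest to see on a toy lineup: if the lineup were read after each individual substitution, it would briefly hold four or six players. This is a minimal sketch of the batch-then-read idea, with made-up single-letter player names; `apply_substitutions` is a hypothetical helper, not the tracker above.

```python
def apply_substitutions(lineup, subs):
    """Apply a batch of (action, player) substitutions and return the new lineup.

    All substitutions that share a timestamp are applied before the lineup is read,
    so the intermediate 4- or 6-player states are never recorded.
    """
    updated = set(lineup)
    for action, player in subs:
        if action == "in":
            updated.add(player)
        elif action == "out":
            updated.discard(player)
    return updated


# Made-up five-player lineup and a double substitution at one timestamp
on_court = {"A", "B", "C", "D", "E"}
batch = [("in", "F"), ("in", "G"), ("out", "A"), ("out", "B")]
after = apply_substitutions(on_court, batch)
```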

In [None]:
# Step 5h: Create full PBP with lineups for every play
# Merge lineup tracking data with full play details

# lineup_df has: play_id, period, clock, play_type, team, home_lineup, away_lineup
# single_game_df has: all the original play details (shotInfo, scores, etc.)

# Merge on play_id
pbp_with_lineups = single_game_df.merge(
    lineup_df[['play_id', 'home_lineup', 'away_lineup', 'home_lineup_size', 'away_lineup_size']],
    left_on='id',
    right_on='play_id',
    how='left'
)

print(f"Full PBP with lineups: {len(pbp_with_lineups)} plays")
print(f"\nNew columns added: home_lineup, away_lineup, home_lineup_size, away_lineup_size")

# Show sample
print(f"\n{'='*60}")
print("SAMPLE: Play with lineup context")
print(f"{'='*60}")
sample = pbp_with_lineups[pbp_with_lineups['playType'] == 'JumpShot'].iloc[0]
print(f"Play: {sample['playText']}")
print(f"Home on court: {sample['home_lineup']}")
print(f"Away on court: {sample['away_lineup']}")
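
A left merge silently keeps plays that found no lineup row. pandas can check this at merge time through the `validate` and `indicator` parameters of `DataFrame.merge`. This is a minimal sketch on made-up frames standing in for `single_game_df` and `lineup_df`.

```python
import pandas as pd

# Made-up stand-ins for single_game_df and lineup_df
demo_game = pd.DataFrame({"id": [1, 2, 3], "playText": ["a", "b", "c"]})
demo_lineups = pd.DataFrame({"play_id": [1, 2], "home_lineup_size": [5, 5]})

checked = demo_game.merge(
    demo_lineups,
    left_on="id",
    right_on="play_id",
    how="left",
    validate="one_to_one",  # raises MergeError if a play ID is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Plays that did not find a matching lineup row
unmatched = checked[checked["_merge"] == "left_only"]
```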

In [None]:
# Step 5i: Create FLAT dataframes for analysis

# Helper to flatten lineup lists into string keys
def lineup_to_key(lineup_list):
    """Convert lineup list to sorted string for grouping."""
    if isinstance(lineup_list, list):
        return ' | '.join(sorted(lineup_list))
    return None

# 1. SHOTS_DF - All shooting plays with extracted details + lineups (FLAT)
shots_df = pbp_with_lineups[pbp_with_lineups['shootingPlay'] == True].copy()

# Extract shot details from shotInfo
def flatten_shot_info(shot_info):
    """Flatten one shotInfo dict into scalar fields (all None when shotInfo is missing)."""
    info = shot_info if isinstance(shot_info, dict) else {}
    # Nested values can be None, so fall back to {} before calling .get()
    shooter = info.get('shooter') or {}
    assisted_by = info.get('assistedBy') or {}
    location = info.get('location') or {}
    return {
        'shooter_name': shooter.get('name'),
        'shooter_id': shooter.get('id'),
        'made': info.get('made'),
        'assisted': info.get('assisted'),
        'assisted_by': assisted_by.get('name'),
        'shot_range': info.get('range'),
        'x': location.get('x'),
        'y': location.get('y'),
    }

shot_details = pd.DataFrame(shots_df['shotInfo'].apply(flatten_shot_info).tolist(), index=shots_df.index)
shots_df = pd.concat([shots_df, shot_details], axis=1)

# Parse distance from text
shots_df['distance'] = shots_df['playText'].str.extract(r'(\d+)-foot', expand=False).astype(float)
shots_df['is_three'] = shots_df['playText'].str.contains('three point', case=False, na=False)

# Flatten lineups to strings
shots_df['home_lineup_key'] = shots_df['home_lineup'].apply(lineup_to_key)
shots_df['away_lineup_key'] = shots_df['away_lineup'].apply(lineup_to_key)

# Drop the list columns, keep flat ones
shots_df = shots_df.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"SHOTS_DF: {len(shots_df)} shots")
print(f"  Key columns: shooter_name, made, x, y, distance, is_three, home_lineup_key, away_lineup_key")

# 2. PLAYERS_DF - Player stats from this game (nested stat dicts are flattened below)
players_df = players_flat_df.copy()
players_df['gameId'] = sample_game_id

# Flatten nested stat dicts
for col in ['rebounds', 'freeThrows', 'threePointFieldGoals', 'twoPointFieldGoals', 'fieldGoals']:
    if col in players_df.columns:
        for idx, row in players_df.iterrows():
            if isinstance(row[col], dict):
                for key, val in row[col].items():
                    players_df.at[idx, f'{col}_{key}'] = val
        players_df = players_df.drop(columns=[col])

print(f"\nPLAYERS_DF: {len(players_df)} players")
print(f"  Key columns: name, team, starter, position, minutes, points, assists, steals, blocks")

# 3. LINEUP_STINTS_DF - Track when each lineup was on court (FLAT)
def get_lineup_stints(pbp_df):
    """Calculate stint lengths for each unique lineup combination."""
    stints = []
    
    prev_home = None
    prev_away = None
    stint_start = None
    stint_start_score = None
    
    for idx, row in pbp_df.iterrows():
        home = tuple(sorted(row['home_lineup']))
        away = tuple(sorted(row['away_lineup']))
        
        if (home, away) != (prev_home, prev_away):
            if prev_home is not None:
                stints.append({
                    'home_lineup_key': ' | '.join(prev_home),
                    'away_lineup_key': ' | '.join(prev_away),
                    'start_seconds': stint_start,
                    'end_seconds': row['secondsRemaining'],
                    'start_home_score': stint_start_score[0],
                    'start_away_score': stint_start_score[1],
                    'end_home_score': row['homeScore'],
                    'end_away_score': row['awayScore'],
                })
            
            prev_home = home
            prev_away = away
            stint_start = row['secondsRemaining']
            stint_start_score = (row['homeScore'], row['awayScore'])
    
    if prev_home is not None:
        stints.append({
            'home_lineup_key': ' | '.join(prev_home),
            'away_lineup_key': ' | '.join(prev_away),
            'start_seconds': stint_start,
            'end_seconds': 0,
            'start_home_score': stint_start_score[0],
            'start_away_score': stint_start_score[1],
            'end_home_score': pbp_df.iloc[-1]['homeScore'],
            'end_away_score': pbp_df.iloc[-1]['awayScore'],
        })
    
    return pd.DataFrame(stints)

# Sort by period first so plays from different periods are never interleaved
lineup_stints_df = get_lineup_stints(
    pbp_with_lineups.sort_values(['period', 'secondsRemaining'], ascending=[True, False])
)
lineup_stints_df['gameId'] = sample_game_id
lineup_stints_df['duration_seconds'] = lineup_stints_df['start_seconds'] - lineup_stints_df['end_seconds']
lineup_stints_df['home_pts_scored'] = lineup_stints_df['end_home_score'] - lineup_stints_df['start_home_score']
lineup_stints_df['away_pts_scored'] = lineup_stints_df['end_away_score'] - lineup_stints_df['start_away_score']
lineup_stints_df['home_plus_minus'] = lineup_stints_df['home_pts_scored'] - lineup_stints_df['away_pts_scored']

print(f"\nLINEUP_STINTS_DF: {len(lineup_stints_df)} stints")
print(f"  Key columns: home_lineup_key, away_lineup_key, duration_seconds, home_plus_minus")

# 4. PBP_FLAT - Full PBP flattened
pbp_flat = pbp_with_lineups.copy()
pbp_flat['home_lineup_key'] = pbp_flat['home_lineup'].apply(lineup_to_key)
pbp_flat['away_lineup_key'] = pbp_flat['away_lineup'].apply(lineup_to_key)
pbp_flat = pbp_flat.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"\nPBP_FLAT: {len(pbp_flat)} plays")
print(f"  Key columns: playType, playText, homeScore, awayScore, home_lineup_key, away_lineup_key")

# Summary
print(f"\n{'='*60}")
print("FLAT DATAFRAMES READY FOR ANALYSIS")
print(f"{'='*60}")
print(f"  pbp_flat          : {len(pbp_flat):>4} rows - Full PBP (no nested objects)")
print(f"  shots_df          : {len(shots_df):>4} rows - Shots with x/y, made, lineups")
print(f"  players_df        : {len(players_df):>4} rows - Player box score stats")
print(f"  lineup_stints_df  : {len(lineup_stints_df):>4} rows - Lineup stints with +/-")
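
To compare lineups rather than individual stints, the stint rows can be rolled up by lineup key. The sketch below is illustrative only: it uses a tiny hand-made frame with the same column names as `lineup_stints_df`, and the lineup labels and numbers are invented.

```python
import pandas as pd

# Hypothetical stints using the same columns that lineup_stints_df carries.
toy_stints = pd.DataFrame({
    'home_lineup_key': ['A | B | C | D | E', 'A | B | C | D | F', 'A | B | C | D | E'],
    'duration_seconds': [120, 90, 60],
    'home_pts_scored': [6, 2, 4],
    'away_pts_scored': [4, 5, 0],
})
toy_stints['home_plus_minus'] = toy_stints['home_pts_scored'] - toy_stints['away_pts_scored']

# Total time on court and net points per home lineup, most-used lineup first.
lineup_summary = (
    toy_stints.groupby('home_lineup_key', as_index=False)
    .agg(stints=('duration_seconds', 'size'),
         seconds=('duration_seconds', 'sum'),
         plus_minus=('home_plus_minus', 'sum'))
    .sort_values('seconds', ascending=False)
    .reset_index(drop=True)
)

# Net points per 40 minutes (2400 seconds); very noisy for lineups with little court time.
lineup_summary['plus_minus_per_40'] = (
    lineup_summary['plus_minus'] / lineup_summary['seconds'] * 2400
)
print(lineup_summary)
```

The same `groupby` should apply to the real `lineup_stints_df`, provided stint durations are non-negative.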

In [None]:
# Step 5j: Track possessions — clean state-machine approach
# Possessions change ONLY on explicit outcome events, never on team-attribution changes.
#
# Possession-ending events:
#   Made FG       → opponent inbounds
#   Turnover      → opponent gets ball (Lost Ball Turnover; Steal is the other side of same event)
#   Def Rebound   → rebounding team gets ball (prior miss already happened)
#   Dead Ball Reb  → after missed last FT, opponent inbounds
#   Last FT make  → opponent inbounds
#   Last FT miss  → live rebound (not an immediate possession end)
#   End Period    → possession over
#
# Non-ending events:
#   Off Rebound   → same team, same possession continues
#   PersonalFoul  → may lead to FTs but no possession change
#   Steal         → paired with turnover; the turnover is the possession-ender
#   Substitution / Timeout / Jumpball → no effect

def track_possessions_v2(game_df):
    """
    State-machine possession tracker.
    Returns DataFrame with possession_id, possession_team, and outcome per play.
    """
    game_df = game_df.sort_values(
        ['period', 'secondsRemaining'], ascending=[True, False]
    ).copy()

    teams = [t for t in game_df['team'].unique() if pd.notna(t)]

    def other_team(t):
        others = [x for x in teams if x != t]
        return others[0] if others else None

    FG_TYPES = {'JumpShot', 'LayUpShot', 'DunkShot', 'TipShot'}

    poss_id = 0
    poss_team = None   # who currently has the ball
    records = []

    for idx, row in game_df.iterrows():
        pt   = row.get('playType', '')
        txt  = (row.get('playText') or '').lower()
        team = row.get('team')

        outcome = None
        end_poss = False
        next_team = None

        # ---- Jumpball (tip-off / held ball) ----
        if pt == 'Jumpball':
            if 'won' in txt and team:
                # Winning team gets first possession
                if poss_team is None:
                    poss_team = team

        # ---- Field Goal attempt ----
        elif pt in FG_TYPES:
            # If we don't yet know who has the ball, it's the shooting team
            if poss_team is None:
                poss_team = team
            if 'makes' in txt:
                outcome = 'made_fg'
                end_poss = True
                next_team = other_team(poss_team)
            # else: miss → ball is loose, wait for rebound

        # ---- Free Throw ----
        elif pt == 'MadeFreeThrow':
            if poss_team is None and team:
                poss_team = team
            # Check if this is the LAST FT in the sequence
            is_last = any(f'{n} of {n}' in txt for n in ('1', '2', '3'))
            if is_last:
                if 'makes' in txt:
                    outcome = 'made_ft'
                    end_poss = True
                    next_team = other_team(poss_team)
                else:
                    outcome = 'missed_last_ft'
                    # Ball is loose — rebound determines next possession

        # ---- Turnover (Lost Ball Turnover, etc.) ----
        elif 'Turnover' in pt:
            if poss_team is None and team:
                poss_team = team
            outcome = 'turnover'
            end_poss = True
            next_team = other_team(poss_team)

        # ---- Steal (paired with turnover — don't double-count) ----
        elif pt == 'Steal':
            # The turnover event handles the possession change.
            # Just record that a steal occurred.
            outcome = 'steal'

        # ---- Defensive Rebound ----
        elif pt == 'Defensive Rebound':
            outcome = 'def_rebound'
            end_poss = True
            next_team = team  # rebounding team now has the ball

        # ---- Offensive Rebound ----
        elif pt == 'Offensive Rebound':
            outcome = 'off_rebound'
            # Same team, same possession — no change

        # ---- Dead Ball Rebound ----
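        elif pt == 'Dead Ball Rebound':
            outcome = 'dead_ball_rebound'
            # Only a change of control ends the possession. A dead-ball rebound
            # credited to the team that already has the ball (e.g. after a missed
            # non-final free throw) continues the same possession.
            if team != poss_team:
                end_poss = True
                next_team = team  # credited team gets the ball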

        # ---- End of Period / Game ----
        elif pt in ('End Period', 'End Game'):
            outcome = 'end_period'
            end_poss = True
            next_team = None  # no one has ball

        # ---- Everything else (fouls, subs, timeouts) ----
        # No possession change

        # Record this play
        records.append({
            'play_id': row.get('id'),
            'gameId': row.get('gameId'),
            'possession_id': poss_id,
            'possession_team': poss_team,
            'play_type': pt,
            'play_text': row.get('playText', ''),
            'team': team,
            'outcome': outcome,
        })

        # Advance possession if this play ended it
        if end_poss:
            poss_id += 1
            poss_team = next_team

    return pd.DataFrame(records)


# Run the new tracker on our single game
possessions_df = track_possessions_v2(single_game_df)

# Merge with full PBP
pbp_with_possessions = pbp_with_lineups.merge(
    possessions_df[['play_id', 'possession_id', 'possession_team', 'outcome']],
    left_on='id',
    right_on='play_id',
    how='left'
)

# ---- Summary ----
total_poss = possessions_df['possession_id'].nunique()
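teams = [t for t in single_game_df['team'].unique() if pd.notna(t)]
# NOTE: these are the first two teams in order of appearance in the PBP. They serve
# only as display labels and are not guaranteed to be the actual home/away teams.
home_team, away_team = teams[0], teams[1]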

home_poss = possessions_df[possessions_df['possession_team'] == home_team]['possession_id'].nunique()
away_poss = possessions_df[possessions_df['possession_team'] == away_team]['possession_id'].nunique()

print(f"Total plays: {len(pbp_with_possessions)}")
print(f"Total possessions: {total_poss}")
print(f"  {home_team}: {home_poss}")
print(f"  {away_team}: {away_poss}")

# Possession outcomes
print(f"\n{'='*60}")
print("POSSESSION OUTCOMES")
print(f"{'='*60}")
outcomes = possessions_df[possessions_df['outcome'].notna()]['outcome'].value_counts()
print(outcomes)

# Estimate possessions using KenPom formula for comparison
print(f"\n{'='*60}")
print("KENPOM FORMULA CHECK")
print(f"{'='*60}")
print("Possessions from PBP state machine vs KenPom estimate")
print("(KenPom: FGA - ORB + TOV + 0.475*FTA)")
for team in teams:
    tp = single_game_df[single_game_df['team'] == team]
    fga = len(tp[tp['playType'].isin({'JumpShot','LayUpShot','DunkShot','TipShot'})])
    orb = len(tp[tp['playType'] == 'Offensive Rebound'])
    tov = len(tp[tp['playType'].str.contains('Turnover', na=False)])
    fta = len(tp[tp['playType'] == 'MadeFreeThrow'])
    kp_est = fga - orb + tov + 0.475 * fta
    pbp_count = possessions_df[possessions_df['possession_team'] == team]['possession_id'].nunique()
    print(f"  {team}: PBP={pbp_count}, KenPom={kp_est:.1f}  (FGA={fga} ORB={orb} TOV={tov} FTA={fta})")

# Show sample possessions
print(f"\n{'='*60}")
print("SAMPLE POSSESSIONS (first 5)")
print(f"{'='*60}")
for poss_id in range(5):
    poss_plays = possessions_df[possessions_df['possession_id'] == poss_id]
    if len(poss_plays) > 0:
        team = poss_plays['possession_team'].iloc[0]
        outcomes_in_poss = poss_plays['outcome'].dropna()
        final_outcome = outcomes_in_poss.iloc[-1] if len(outcomes_in_poss) > 0 else 'ongoing'
        print(f"\nPoss {poss_id} ({team}) → {final_outcome}")
        for _, p in poss_plays.iterrows():
            marker = "  *" if p['outcome'] else "   "
            print(f"{marker} {p['play_type']:<25} {str(p['play_text'])[:55]}")
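
To see the state-machine rules in isolation, here is a much-reduced sketch run on a few hand-written events. It is not the tracker above: it ignores free throws, dead-ball rebounds and period ends, and it counts completed possessions per team instead of assigning IDs. The function name `count_possessions`, the team labels and the play text are all hypothetical.

```python
FG_TYPES = {'JumpShot', 'LayUpShot', 'DunkShot', 'TipShot'}

# Hypothetical events: (play_type, team, play_text)
toy_events = [
    ('Jumpball',           'Home', 'Jump ball won by Home'),
    ('JumpShot',           'Home', 'Player A misses jump shot'),
    ('Offensive Rebound',  'Home', 'Player B offensive rebound'),
    ('LayUpShot',          'Home', 'Player B makes layup'),
    ('Lost Ball Turnover', 'Away', 'Player C turnover'),
    ('Steal',              'Home', 'Player A steal'),
    ('JumpShot',           'Home', 'Player A misses three point jump shot'),
    ('Defensive Rebound',  'Away', 'Player D defensive rebound'),
]

def count_possessions(events, teams=('Home', 'Away')):
    """Count completed possessions per team using only explicit ending events."""
    opponent = {teams[0]: teams[1], teams[1]: teams[0]}
    counts = {t: 0 for t in teams}
    holder = None  # team currently in possession
    for play_type, team, text in events:
        is_fg = play_type in FG_TYPES
        is_turnover = 'Turnover' in play_type
        # Infer the first holder from the tip winner, a shooter, or a team turning it over.
        if holder is None and (play_type == 'Jumpball' or is_fg or is_turnover):
            holder = team
        if holder is None:
            continue
        if (is_fg and 'makes' in text.lower()) or is_turnover:
            next_holder = opponent[holder]
        elif play_type == 'Defensive Rebound':
            next_holder = team
        else:
            continue  # misses, offensive rebounds, steals, fouls: possession continues
        counts[holder] += 1
        holder = next_holder
    return counts

print(count_possessions(toy_events))  # {'Home': 2, 'Away': 1}
```

The offensive rebound and the steal do not add a possession; only the made layup, the turnover and the defensive rebound do.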


In [None]:
# Step 5k: Previous Possession Ender & Possession Type Classification
#
# Built on top of track_possessions_v2 output (possessions_df) and raw PBP (single_game_df).
#
# PREVIOUS POSSESSION ENDER — refined version of how each possession ended:
#   live_ball_turnover   : turnover accompanied by a steal
#   dead_ball_turnover   : turnover without a steal (dead ball, opponent inbounds)
#   fga_def_rebound      : defensive rebound after a missed field goal attempt
#   fta_def_rebound      : defensive rebound after a missed free throw
#   made_fg              : made field goal (opponent inbounds)
#   made_ft              : made final free throw (opponent inbounds)
#   start_of_period      : first possession of a period (tip-off / start of half)
#   block_oob            : blocked shot with no rebound logged → ball went OOB
#   dead_ball_rebound    : dead ball rebound (lane violation, etc.)
#   end_period           : clock expired
#
# POSSESSION TYPE — classification of how the possession played out:
#   transition           : first FGA within 7s of possession start
#   half_court           : first FGA beyond 7s (set offense)
#   scramble_putback     : OREB then FGA within 3s of the rebound
#   second_chance        : post-OREB, FGA beyond 3s (reset offense)
#   intentional_foul     : non-shooting foul within 10s, no FGA, under 2 min left in period

FG_TYPES_K = {'JumpShot', 'LayUpShot', 'DunkShot', 'TipShot'}

def classify_possessions(possessions_df, game_df):
    """
    Build possession-level features from track_possessions_v2 output.

    Returns DataFrame with one row per possession, including:
      - refined_outcome : finer-grained version of the raw possession outcome
      - prev_poss_ender : what ended the *previous* possession (= why this team has the ball)
      - possession_type : transition / half_court / scramble_putback / second_chance / intentional_foul
    """
    # --- Merge timing info from game_df ---
    game_sorted = game_df.sort_values(
        ['period', 'secondsRemaining'], ascending=[True, False]
    ).reset_index(drop=True)
    game_sorted['play_order'] = range(len(game_sorted))

    time_info = game_sorted[['id', 'secondsRemaining', 'period', 'play_order']].rename(
        columns={'id': 'play_id'}
    )
    poss = possessions_df.merge(time_info, on='play_id', how='left')
    poss = poss.sort_values('play_order').reset_index(drop=True)

    # --- Per-possession analysis ---
    possession_rows = []

    for pid in sorted(poss['possession_id'].unique()):
        grp = poss[poss['possession_id'] == pid].sort_values('play_order')
        plays = grp.to_dict('records')

        game_id = grp['gameId'].iloc[0] if 'gameId' in grp.columns else None
        poss_team = grp['possession_team'].iloc[0]
        period = grp['period'].iloc[0]
        start_sec = grp['secondsRemaining'].iloc[0]
        end_sec = grp['secondsRemaining'].iloc[-1]
        duration = start_sec - end_sec

        # Outcomes
        # pd.notna also filters NaN, which pandas may substitute for None after a merge
        all_outcomes = [p['outcome'] for p in plays if pd.notna(p['outcome'])]
        final_outcome = all_outcomes[-1] if all_outcomes else None
        outcome_set = set(all_outcomes)

        has_steal = 'steal' in outcome_set
        has_oreb = 'off_rebound' in outcome_set

        # ============================================================
        # 1. REFINED OUTCOME
        # ============================================================

        # -- Turnover: live vs dead --
        if final_outcome == 'turnover':
            refined = 'live_ball_turnover' if has_steal else 'dead_ball_turnover'

        # -- Defensive rebound: after FGA miss vs FTA miss --
        elif final_outcome == 'def_rebound':
            miss_type = 'fga'  # default
            # Walk backward from the last def_rebound to find the preceding miss
            found_dreb = False
            for p in reversed(plays):
                if not found_dreb:
                    if p['outcome'] == 'def_rebound':
                        found_dreb = True
                    continue
                if p['play_type'] in FG_TYPES_K:
                    miss_type = 'fga'
                    break
                if p['play_type'] == 'MadeFreeThrow' and 'misses' in (p.get('play_text') or '').lower():
                    miss_type = 'fta'
                    break
            refined = f'{miss_type}_def_rebound'

        # -- Already distinct categories --
        elif final_outcome in ('made_fg', 'made_ft', 'end_period', 'dead_ball_rebound'):
            refined = final_outcome

        else:
            refined = final_outcome  # fallback (e.g. missed_last_ft edge cases)

        # -- Block → OOB override --
        # A blocked FGA with no subsequent rebound in this possession
        for i, p in enumerate(plays):
            if (p['play_type'] in FG_TYPES_K
                    and 'block' in (p.get('play_text') or '').lower()
                    and 'misses' in (p.get('play_text') or '').lower()):
                remaining = plays[i + 1:]
                has_rebound_after = any(
                    r['outcome'] in ('def_rebound', 'off_rebound', 'dead_ball_rebound')
                    for r in remaining
                )
                if not has_rebound_after:
                    # Confirm by checking next possession exists
                    next_poss = poss[poss['possession_id'] == pid + 1]
                    if len(next_poss) > 0:
                        refined = 'block_oob'
                    break  # only need the first unrebound block

        # ============================================================
        # 2. POSSESSION TYPE
        # ============================================================

        # First FGA timing
        fga_plays = [p for p in plays if p['play_type'] in FG_TYPES_K]
        first_fga_sec = fga_plays[0]['secondsRemaining'] if fga_plays else None
        time_to_first_fga = (start_sec - first_fga_sec) if first_fga_sec is not None else None

        # OREB → next FGA timing
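        # Use play order rather than a strict clock comparison, so a putback logged
        # in the same second as the rebound is still found.
        oreb_list = [p for p in plays if p['outcome'] == 'off_rebound']
        time_oreb_to_fga = None
        if oreb_list:
            oreb_sec = oreb_list[0]['secondsRemaining']
            oreb_order = oreb_list[0]['play_order']
            post_oreb_fga = [p for p in fga_plays if p['play_order'] > oreb_order]
            if post_oreb_fga:
                time_oreb_to_fga = oreb_sec - post_oreb_fga[0]['secondsRemaining']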

        # Non-shooting foul check (for intentional foul)
        foul_plays = [
            p for p in plays
            if 'Foul' in (p.get('play_type') or '')
            and 'shooting' not in (p.get('play_text') or '').lower()
        ]
        foul_within_10s = False
        if foul_plays:
            first_foul_sec = foul_plays[0]['secondsRemaining']
            if (start_sec - first_foul_sec) <= 10:
                foul_within_10s = True

        # Classify
        if has_oreb:
            if time_oreb_to_fga is not None and time_oreb_to_fga <= 3:
                poss_type = 'scramble_putback'
            else:
                poss_type = 'second_chance'
        elif foul_plays and foul_within_10s and not fga_plays and start_sec <= 120:
            poss_type = 'intentional_foul'
        elif time_to_first_fga is not None:
            poss_type = 'transition' if time_to_first_fga <= 7 else 'half_court'
        else:
            poss_type = 'half_court'  # no FGA (turnover before getting a shot off)

        possession_rows.append({
            'gameId': game_id,
            'possession_id': pid,
            'possession_team': poss_team,
            'period': period,
            'start_seconds': start_sec,
            'end_seconds': end_sec,
            'duration_sec': duration,
            'raw_outcome': final_outcome,
            'refined_outcome': refined,
            'possession_type': poss_type,
            'has_oreb': has_oreb,
            'time_to_first_fga': time_to_first_fga,
            'time_oreb_to_fga': time_oreb_to_fga,
        })

    result = pd.DataFrame(possession_rows)

    # ============================================================
    # 3. PREVIOUS POSSESSION ENDER
    # ============================================================
    # For each possession, what ended the *previous* one
    prev_enders = ['start_of_period']  # first possession always
    for i in range(1, len(result)):
        if result.iloc[i]['period'] != result.iloc[i - 1]['period']:
            prev_enders.append('start_of_period')
        else:
            prev_enders.append(result.iloc[i - 1]['refined_outcome'])
    result['prev_poss_ender'] = prev_enders

    return result


# ============================================================
# RUN ON SINGLE GAME (Butler vs St. John's)
# ============================================================
poss_enriched = classify_possessions(possessions_df, single_game_df)

teams = [t for t in single_game_df['team'].unique() if pd.notna(t)]
home_team, away_team = teams[0], teams[1]

# ---- Summary: Refined Outcome ----
print(f"{'='*65}")
print("REFINED OUTCOME DISTRIBUTION")
print(f"{'='*65}")
print(poss_enriched['refined_outcome'].value_counts().to_string())

# ---- Summary: Previous Possession Ender ----
print(f"\n{'='*65}")
print("PREVIOUS POSSESSION ENDER DISTRIBUTION")
print(f"{'='*65}")
print(poss_enriched['prev_poss_ender'].value_counts().to_string())

print(f"\nBy team:")
for team in teams:
    team_poss = poss_enriched[poss_enriched['possession_team'] == team]
    print(f"\n  {team} ({len(team_poss)} possessions):")
    print(f"    {team_poss['prev_poss_ender'].value_counts().to_string()}")

# ---- Summary: Possession Type ----
print(f"\n{'='*65}")
print("POSSESSION TYPE DISTRIBUTION")
print(f"{'='*65}")
print(poss_enriched['possession_type'].value_counts().to_string())

print(f"\nBy team:")
for team in teams:
    team_poss = poss_enriched[poss_enriched['possession_team'] == team]
    print(f"\n  {team} ({len(team_poss)} possessions):")
    print(f"    {team_poss['possession_type'].value_counts().to_string()}")

# ---- Cross-tab: Possession Type x Previous Ender ----
print(f"\n{'='*65}")
print("CROSS-TAB: Possession Type x Previous Ender")
print(f"{'='*65}")
ct = pd.crosstab(poss_enriched['possession_type'], poss_enriched['prev_poss_ender'])
print(ct.to_string())

# ---- Sample possessions with all features ----
print(f"\n{'='*65}")
print("SAMPLE POSSESSIONS (first 10)")
print(f"{'='*65}")
for _, row in poss_enriched.head(10).iterrows():
    print(f"\nPoss {row['possession_id']} ({row['possession_team']}) "
          f"| {row['duration_sec']:.0f}s "
          f"| ender={row['prev_poss_ender']} "
          f"| type={row['possession_type']} "
          f"| outcome={row['refined_outcome']}")
    # Show plays in this possession
    poss_plays = possessions_df[possessions_df['possession_id'] == row['possession_id']]
    for _, p in poss_plays.iterrows():
        marker = " *" if p['outcome'] else "  "
        print(f"  {marker} {p['play_type']:<25} {str(p['play_text'])[:55]}")
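
The possession-type rules can also be read as a small pure function, which makes the thresholds (7s for transition, 3s for a putback, 120s for the late-game foul window) easy to test on their own. This is a sketch of the same decision order used in `classify_possessions`; the function name and argument names are hypothetical.

```python
def possession_type(time_to_first_fga=None, has_oreb=False, time_oreb_to_fga=None,
                    early_nonshooting_foul=False, start_sec=None):
    """Apply the possession-type rules in the same priority order as the cell above."""
    if has_oreb:
        # Quick shot after the offensive rebound vs. a reset offense
        if time_oreb_to_fga is not None and time_oreb_to_fga <= 3:
            return 'scramble_putback'
        return 'second_chance'
    if (early_nonshooting_foul and time_to_first_fga is None
            and start_sec is not None and start_sec <= 120):
        return 'intentional_foul'
    if time_to_first_fga is not None and time_to_first_fga <= 7:
        return 'transition'
    return 'half_court'  # late first shot, or no shot at all

print(possession_type(time_to_first_fga=5))                       # transition
print(possession_type(time_to_first_fga=15))                      # half_court
print(possession_type(has_oreb=True, time_oreb_to_fga=2))         # scramble_putback
print(possession_type(early_nonshooting_foul=True, start_sec=45)) # intentional_foul
```

Because the offensive-rebound check comes first, any possession with an OREB is labelled by what happened after the rebound, regardless of how quickly the first shot went up.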

In [32]:
# Step 6: Save play-by-play data to CSV
output_dir = 'cbbd_data'

# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Create a timestamp for the filename
date_str = search_date.strftime('%Y%m%d')

# Save games data
if 'games_df' in locals() and len(games_df) > 0:
    filename = f'{output_dir}/games_{date_str}_{season}.csv'
    games_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(games_df)} games to {filename}")

# Save all plays data
if 'plays_df' in locals() and len(plays_df) > 0:
    filename = f'{output_dir}/plays_{date_str}_{season}.csv'
    plays_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(plays_df):,} plays to {filename}")

# Save shot locations data
if 'shot_locations_df' in locals() and len(shot_locations_df) > 0:
    filename = f'{output_dir}/shot_locations_{date_str}_{season}.csv'
    shot_locations_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(shot_locations_df)} shot locations to {filename}")


# Tag for single-game analysis files (must be defined before its first use below)
game_tag = f'{date_str}_{sample_game_id}'

# Save flattened PBP with lineups (from Step 5i)
if 'pbp_flat' in locals() and len(pbp_flat) > 0:
    filename = f'{output_dir}/pbp_flat_{game_tag}.csv'
    pbp_flat.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(pbp_flat):,} flattened PBP rows to {filename}")

# Save single-game analysis DataFrames

# Possessions (play-level tracking from Step 5j)
if 'possessions_df' in locals() and len(possessions_df) > 0:
    filename = f'{output_dir}/possessions_{game_tag}.csv'
    possessions_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(possessions_df):,} possession plays to {filename}")

# Enriched possessions (possession-level summary from Step 5k)
if 'poss_enriched' in locals() and len(poss_enriched) > 0:
    filename = f'{output_dir}/possessions_enriched_{game_tag}.csv'
    poss_enriched.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(poss_enriched)} possessions to {filename}")

# Shots with lineup context (from Step 5i)
if 'shots_df' in locals() and len(shots_df) > 0:
    filename = f'{output_dir}/shots_{game_tag}.csv'
    shots_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(shots_df)} shots to {filename}")

# Lineup stints (from Step 5i)
if 'lineup_stints_df' in locals() and len(lineup_stints_df) > 0:
    filename = f'{output_dir}/lineup_stints_{game_tag}.csv'
    lineup_stints_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(lineup_stints_df)} lineup stints to {filename}")

# Player stats (from Step 5i)
if 'players_df' in locals() and len(players_df) > 0:
    filename = f'{output_dir}/players_{game_tag}.csv'
    players_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(players_df)} player rows to {filename}")

# Four factors for all games (from Step 7d)
if 'all_ff_df' in locals() and len(all_ff_df) > 0:
    filename = f'{output_dir}/four_factors_{date_str}_{season}.csv'
    all_ff_df.to_csv(filename, index=False)
    print(f"\u2713 Saved {len(all_ff_df)} team-game four factor rows to {filename}")

print(f"\n{'='*50}")
print(f"All data saved to {output_dir}/")
print(f"{'='*50}")


✓ Saved 11 games to cbbd_data/games_20260107_2026.csv
✓ Saved 5,172 plays to cbbd_data/plays_20260107_2026.csv

All data saved to cbbd_data/
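
The save cell repeats the same check-then-write pattern for every DataFrame. One possible way to fold that into a helper is sketched below; `save_frame` is a hypothetical name, and the example writes to a temporary directory rather than `cbbd_data` so it can be run safely on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_frame(df, output_dir, stem):
    """Write df to <output_dir>/<stem>.csv; return the path, or None if df is missing or empty."""
    if df is None or len(df) == 0:
        return None
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if the directory already exists
    path = out / f'{stem}.csv'
    df.to_csv(path, index=False)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical frames: one with rows, one empty (which should be skipped).
    frames = {
        'games_demo': pd.DataFrame({'gameId': [1, 2]}),
        'plays_demo': pd.DataFrame(),
    }
    saved = {name: save_frame(df, tmp, name) for name, df in frames.items()}
    saved_names = sorted(name for name, path in saved.items() if path is not None)
    games_exists = saved['games_demo'].exists()

print(saved_names)
```

In the notebook, frames that may not exist yet could be looked up with `globals().get('shots_df')` and passed in, which yields `None` for undefined names.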


In [33]:
# Summary statistics for collected data
if 'team_plays_df' in locals() and len(team_plays_df) > 0:
    print("=" * 50)
    print("DATA SUMMARY")
    print("=" * 50)
    print(f"\nTeam: {team_name}")
    print(f"Season: {season-1}/{season}")
    print(f"Total Plays: {len(team_plays_df):,}")
    print(f"Total Games: {team_plays_df['game_id'].nunique()}")
    
    print("\nPlay Type Distribution:")
    print(team_plays_df['type'].value_counts())
    
    # Check for missing data
    print("\nMissing Data Check:")
    print(team_plays_df.isnull().sum())
    
    # Period distribution
    if 'period' in team_plays_df.columns:
        print("\nPlays by Period:")
        print(team_plays_df['period'].value_counts().sort_index())
    
    print("\n" + "=" * 50)

---

## Next Steps

Now that you have play-by-play data, you can:
1. Create shot charts using matplotlib/seaborn
2. Calculate advanced metrics (offensive rating, pace, etc.)
3. Analyze lineup performance
4. Build predictive models
5. Track player development over the season

For shot chart creation, check out the CBBD blog post: https://blog.collegefootballdata.com/talking-tech-generating-shot-charts-using-the-basketball-api/

---


## 7. Compute Four Factors from PBP & Validate Against KenPom/Barttorvik

The goal is to check that the Four Factors published by KenPom and Barttorvik can be derived accurately from play-by-play data.

**Four Factors, plus tempo and 3PA rate (per team per game):**
| Stat | Formula | What it measures |
|------|---------|-----------------|
| eFG% | (FGM + 0.5 × 3PM) / FGA | Shooting efficiency (weights threes) |
| TO% | TOV / Possessions | Turnover rate per possession |
| ORB% | ORB / (ORB + Opp DRB) | Offensive rebounding rate |
| FT Rate | FTA / FGA | Free throw rate |
| Tempo | Possessions / (Minutes / 40) | Pace of play |
| 3PA Rate | 3PA / FGA | Three-point attempt rate |
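
As a worked instance of the formulas in the table, the snippet below plugs in Butler's raw counts as reported in the Step 7a output for game 214340 (St. John's 21 defensive rebounds serve as the opponent DRB). It is a plain arithmetic check, not a replacement for `compute_four_factors`.

```python
# Butler's raw counts from the Step 7a output (game 214340)
fga, fgm, three_pa, three_pm = 48, 24, 17, 7
fta, tov, orb = 18, 21, 6
opp_drb = 21          # St. John's defensive rebounds
game_minutes = 40     # regulation game, no overtime

# Possession estimate used throughout the notebook
possessions = fga - orb + tov + 0.475 * fta

factors = {
    'eFG%':     (fgm + 0.5 * three_pm) / fga * 100,
    'TO%':      tov / possessions * 100,
    'ORB%':     orb / (orb + opp_drb) * 100,
    'FT_Rate':  fta / fga * 100,
    '3PA_Rate': three_pa / fga * 100,
    'Tempo':    possessions / (game_minutes / 40),
}

for name, value in factors.items():
    print(f'{name:<9}{value:6.1f}')
```

The printed values should line up with the Butler row of the Four Factors output in Step 7a.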

**Play types we need from PBP:**
- `JumpShot`, `LayUpShot`, `DunkShot`, `TipShot` → FGA (check `made` + `is_three`)
- `MadeFreeThrow` → FTA (both makes AND misses use this type; parse text)
- Any play type containing `Turnover` (e.g. `Lost Ball Turnover`) → TOV
- `Offensive Rebound` → ORB
- `Defensive Rebound` → DRB
- `PersonalFoul` → context for FT trips
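
The mapping above can be sketched as a per-row tagging function. The play texts in the example are invented to resemble the feed; real wording may differ, so the substring checks (`makes`, `three point`) are assumptions carried over from the notebook's own code, and `tag_play` is a hypothetical helper name.

```python
FG_PLAY_TYPES = {'JumpShot', 'LayUpShot', 'DunkShot', 'TipShot'}

def tag_play(play_type, play_text):
    """Return which box-score counters a single PBP row contributes to."""
    text = (play_text or '').lower()
    is_fga = play_type in FG_PLAY_TYPES
    is_fta = play_type == 'MadeFreeThrow'   # this type covers makes AND misses
    made = 'makes' in text
    is_three = 'three point' in text
    return {
        'FGA': is_fga,
        'FGM': is_fga and made,
        '3PA': is_fga and is_three,
        '3PM': is_fga and is_three and made,
        'FTA': is_fta,
        'FTM': is_fta and made,
        'TOV': 'Turnover' in (play_type or ''),
    }

# Hypothetical rows for illustration only.
examples = [
    ('JumpShot', 'Player A makes three point jump shot'),
    ('MadeFreeThrow', 'Player B misses free throw 1 of 2'),
    ('Lost Ball Turnover', 'Player C turnover'),
]
for play_type, play_text in examples:
    flags = tag_play(play_type, play_text)
    print(play_type, [name for name, hit in flags.items() if hit])
```

Summing these flags per team gives the same raw counts that the vectorised `str.contains` calls produce in Step 7a.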

In [34]:
# Step 7a: Compute Four Factors from PBP for a single game
# Uses single_game_df from Step 5a (Butler vs St. John's)

import re

FG_PLAY_TYPES = {'JumpShot', 'LayUpShot', 'DunkShot', 'TipShot'}

def compute_four_factors(game_df):
    """
    Compute Four Factors for each team from raw play-by-play data.
    Returns a dict keyed by team name.
    """
    teams = game_df[game_df['team'].notna()]['team'].unique()
    results = {}

    for team in teams:
        team_plays = game_df[game_df['team'] == team]
        opp_name = [t for t in teams if t != team][0]
        opp_plays = game_df[game_df['team'] == opp_name]

        # --- Field Goals ---
        fg_plays = team_plays[team_plays['playType'].isin(FG_PLAY_TYPES)]
        fga = len(fg_plays)
        fgm = fg_plays['playText'].str.contains('makes', case=False, na=False).sum()

        # Three-pointers (from play text since shotInfo.threePointer is often null)
        three_pa = fg_plays['playText'].str.contains('three point', case=False, na=False).sum()
        three_pm = fg_plays[
            fg_plays['playText'].str.contains('three point', case=False, na=False) &
            fg_plays['playText'].str.contains('makes', case=False, na=False)
        ].shape[0]

        two_pa = fga - three_pa
        two_pm = fgm - three_pm

        # --- Free Throws ---
        # MadeFreeThrow is used for BOTH makes and misses
        ft_plays = team_plays[team_plays['playType'] == 'MadeFreeThrow']
        fta = len(ft_plays)
        ftm = ft_plays['playText'].str.contains('makes', case=False, na=False).sum()

        # --- Turnovers ---
        tov = len(team_plays[team_plays['playType'].str.contains('Turnover', na=False)])

        # --- Rebounds ---
        orb = len(team_plays[team_plays['playType'] == 'Offensive Rebound'])
        # Opponent's defensive rebounds (needed for ORB%)
        opp_drb = len(opp_plays[opp_plays['playType'] == 'Defensive Rebound'])
        # Team's own defensive rebounds (for opponent's ORB%)
        drb = len(team_plays[team_plays['playType'] == 'Defensive Rebound'])

        # --- Possessions (KenPom formula) ---
        # Possessions ≈ FGA - ORB + TOV + 0.475 * FTA
        possessions = fga - orb + tov + 0.475 * fta

        # --- Four Factors ---
        efg_pct = (fgm + 0.5 * three_pm) / fga * 100 if fga > 0 else 0
        to_pct = tov / possessions * 100 if possessions > 0 else 0
        orb_pct = orb / (orb + opp_drb) * 100 if (orb + opp_drb) > 0 else 0
        ft_rate = fta / fga * 100 if fga > 0 else 0  # FTA/FGA as percentage
        three_pa_rate = three_pa / fga * 100 if fga > 0 else 0

        # Tempo: possessions per 40 min (regulation game = 40 min)
        # For OT, would need to account for extra periods
        n_periods = game_df['period'].max()
        game_minutes = 40 if n_periods <= 2 else 40 + (n_periods - 2) * 5
        tempo = possessions / (game_minutes / 40)

        results[team] = {
            'FGA': fga, 'FGM': fgm,
            '3PA': three_pa, '3PM': three_pm,
            '2PA': two_pa, '2PM': two_pm,
            'FTA': fta, 'FTM': ftm,
            'TOV': tov,
            'ORB': orb, 'DRB': drb, 'Opp_DRB': opp_drb,
            'Possessions': round(possessions, 1),
            'eFG%': round(efg_pct, 1),
            'TO%': round(to_pct, 1),
            'ORB%': round(orb_pct, 1),
            'FT_Rate': round(ft_rate, 1),
            '3PA_Rate': round(three_pa_rate, 1),
            'Tempo': round(tempo, 1),
        }

    return results

# Run on our single game
four_factors = compute_four_factors(single_game_df)

# Display results
print(f"{'='*70}")
print(f"FOUR FACTORS: Butler vs St. John's (Game {sample_game_id})")
print(f"{'='*70}")

ff_df = pd.DataFrame(four_factors).T
print("\nRaw counts:")
print(ff_df[['FGA', 'FGM', '3PA', '3PM', '2PA', '2PM', 'FTA', 'FTM', 'TOV', 'ORB', 'DRB', 'Possessions']])
print("\nFour Factors:")
print(ff_df[['eFG%', 'TO%', 'ORB%', 'FT_Rate', '3PA_Rate', 'Tempo']])

FOUR FACTORS: Butler vs St. John's (Game 214340)

Raw counts:
             FGA   FGM   3PA  3PM   2PA   2PM   FTA   FTM   TOV  ORB   DRB  \
Butler      48.0  24.0  17.0  7.0  31.0  17.0  18.0  15.0  21.0  6.0  28.0   
St. John's  66.0  33.0  19.0  7.0  47.0  26.0  15.0  11.0   5.0  7.0  21.0   

            Possessions  
Butler             71.5  
St. John's         71.1  

Four Factors:
            eFG%   TO%  ORB%  FT_Rate  3PA_Rate  Tempo
Butler      57.3  29.4  22.2     37.5      35.4   71.5
St. John's  55.3   7.0  20.0     22.7      28.8   71.1


In [36]:
# Step 7b: Cross-validate PBP Four Factors against API box score stats
# The players_df already has per-player stats from get_game_players
# Aggregate those to team level and compare

def aggregate_box_score(players_df):
    """Aggregate player-level box scores to team totals."""
    team_stats = {}
    for team in players_df['team'].unique():
        tp = players_df[players_df['team'] == team]
        team_stats[team] = {
            'box_FGM': tp['fieldGoals_made'].sum() if 'fieldGoals_made' in tp.columns else None,
            'box_FGA': tp['fieldGoals_attempted'].sum() if 'fieldGoals_attempted' in tp.columns else None,
            'box_3PM': tp['threePointFieldGoals_made'].sum() if 'threePointFieldGoals_made' in tp.columns else None,
            'box_3PA': tp['threePointFieldGoals_attempted'].sum() if 'threePointFieldGoals_attempted' in tp.columns else None,
            'box_FTM': tp['freeThrows_made'].sum() if 'freeThrows_made' in tp.columns else None,
            'box_FTA': tp['freeThrows_attempted'].sum() if 'freeThrows_attempted' in tp.columns else None,
            'box_TOV': tp['turnovers'].sum(),
            'box_REB': tp['rebounds'].sum() if 'rebounds' in tp.columns else None,
            'box_PTS': tp['points'].sum(),
        }
    return team_stats

box_stats = aggregate_box_score(players_df)

# Compare PBP-derived vs Box Score
print(f"{'='*70}")
print("VALIDATION: PBP-derived vs API Box Score")
print(f"{'='*70}")

for team in four_factors:
    pbp = four_factors[team]
    box = box_stats.get(team, {})
    print(f"\n--- {team} ---")
    print(f"{'Stat':<10} {'PBP':>8} {'Box':>8} {'Match':>8}")
    print(f"{'-'*36}")

    comparisons = [
        ('FGA', pbp['FGA'], box.get('box_FGA')),
        ('FGM', pbp['FGM'], box.get('box_FGM')),
        ('3PA', pbp['3PA'], box.get('box_3PA')),
        ('3PM', pbp['3PM'], box.get('box_3PM')),
        ('FTA', pbp['FTA'], box.get('box_FTA')),
        ('FTM', pbp['FTM'], box.get('box_FTM')),
        ('TOV', pbp['TOV'], box.get('box_TOV')),
    ]

    for stat, pbp_val, box_val in comparisons:
        if box_val is not None:
            match = "OK" if pbp_val == box_val else f"DIFF ({box_val - pbp_val:+.0f})"
            print(f"{stat:<10} {pbp_val:>8} {box_val:>8.0f} {match:>8}")
        else:
            print(f"{stat:<10} {pbp_val:>8} {'N/A':>8}")

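    # Compute eFG% from box score for comparison (needs FGM, 3PM, and a non-zero FGA)
    box_fgm, box_3pm, box_fga = box.get('box_FGM'), box.get('box_3PM'), box.get('box_FGA')
    if box_fgm is not None and box_3pm is not None and box_fga:
        box_efg = (box_fgm + 0.5 * box_3pm) / box_fga * 100
        print(f"\n{'eFG%':<10} {pbp['eFG%']:>7.1f}% {box_efg:>7.1f}%")

    # Check that points add up: 2 per made two, 3 per made three, 1 per made free throw
    pbp_pts = pbp['2PM'] * 2 + pbp['3PM'] * 3 + pbp['FTM']
    box_pts = box.get('box_PTS')
    if box_pts is not None:
        pts_match = "OK" if pbp_pts == box_pts else f"DIFF ({box_pts - pbp_pts:+.0f})"
        print(f"{'Points':<10} {pbp_pts:>8} {box_pts:>8.0f} {pts_match:>8}")
    else:
        print(f"{'Points':<10} {pbp_pts:>8} {'N/A':>8}")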

VALIDATION: PBP-derived vs API Box Score

--- Butler ---
Stat            PBP      Box    Match
------------------------------------
FGA              48       48       OK
FGM              24       24       OK
3PA              17       17       OK
3PM               7        7       OK
FTA              18       18       OK
FTM              15       15       OK
TOV              21       20 DIFF (-1)

eFG%          57.3%    57.3%
Points           70       70       OK

--- St. John's ---
Stat            PBP      Box    Match
------------------------------------
FGA              66       66       OK
FGM              33       33       OK
3PA              19       19       OK
3PM               7        7       OK
FTA              15       15       OK
FTM              11       11       OK
TOV               5        5       OK

eFG%          55.3%    55.3%
Points           84       84       OK
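
As a sanity check on the output above, the eFG% and points figures can be reproduced by hand from the printed totals. A minimal sketch, with the totals copied from the validation output; the helper names `efg_pct` and `points_from_makes` are illustrative and not part of the notebook:

```python
# Team totals copied from the validation output above
totals = {
    "Butler":     {"FGM": 24, "FGA": 48, "3PM": 7, "FTM": 15},
    "St. John's": {"FGM": 33, "FGA": 66, "3PM": 7, "FTM": 11},
}

def efg_pct(fgm, fga, tpm):
    """Effective FG%: a made three counts as 1.5 made field goals."""
    return (fgm + 0.5 * tpm) / fga * 100

def points_from_makes(fgm, tpm, ftm):
    """Points = 2 per made two + 3 per made three + 1 per made free throw."""
    two_pm = fgm - tpm  # made twos are made field goals minus made threes
    return 2 * two_pm + 3 * tpm + ftm

for team, t in totals.items():
    efg = efg_pct(t["FGM"], t["FGA"], t["3PM"])
    pts = points_from_makes(t["FGM"], t["3PM"], t["FTM"])
    print(f"{team:<12} eFG% = {efg:.1f}  points = {pts}")
```

Both teams reproduce the printed values: 57.3 and 70 for Butler, 55.3 and 84 for St. John's.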


In [37]:
# Step 7d: Compute Four Factors for ALL games in plays_df
# This gives us a multi-game dataset to compare against season averages

all_game_factors = []

for game_id in plays_df['gameId'].unique():
    game_df = plays_df[plays_df['gameId'] == game_id].copy()
    teams = game_df[game_df['team'].notna()]['team'].unique()

    if len(teams) < 2:
        continue

    try:
        ff = compute_four_factors(game_df)
        for team, stats in ff.items():
            stats['game_id'] = game_id
            stats['team'] = team
            stats['opponent'] = [t for t in teams if t != team][0]
            all_game_factors.append(stats)
    except Exception as e:
        print(f"Error on game {game_id}: {e}")

all_ff_df = pd.DataFrame(all_game_factors)

print(f"Four Factors computed for {len(all_ff_df)} team-games across {all_ff_df['game_id'].nunique()} games")
print(f"\n{'='*70}")
print("ALL GAMES - Four Factors")
print(f"{'='*70}")
display_cols = ['team', 'opponent', 'FGA', 'FGM', '3PA', '3PM', 'FTA', 'FTM', 'TOV',
                'ORB', 'Possessions', 'eFG%', 'TO%', 'ORB%', 'FT_Rate', 'Tempo']
print(all_ff_df[display_cols].to_string(index=False))

# Per-team averages over the games in this dataset (unweighted means of
# per-game values, not full-season averages; a team with one game keeps that game's values)
print(f"\n{'='*70}")
print("TEAM AVERAGES (across games in dataset)")
print(f"{'='*70}")
avg_cols = ['eFG%', 'TO%', 'ORB%', 'FT_Rate', '3PA_Rate', 'Tempo', 'Possessions']
team_avgs = all_ff_df.groupby('team')[avg_cols].mean().round(1)
print(team_avgs.to_string())

Four Factors computed for 22 team-games across 11 games

ALL GAMES - Four Factors
            team         opponent  FGA  FGM  3PA  3PM  FTA  FTM  TOV  ORB  Possessions  eFG%  TO%  ORB%  FT_Rate  Tempo
          Butler       St. John's   48   24   17    7   18   15   21    6         71.5  57.3 29.4  22.2     37.5   71.5
      St. John's           Butler   66   33   19    7   15   11    5    7         71.1  55.3  7.0  20.0     22.7   71.1
         Georgia          Florida   70   29   19    4   23   15    9   11         78.9  44.3 11.4  24.4     32.9   78.9
         Florida          Georgia   75   35   25    6   23   16   13   22         76.9  50.7 16.9  47.8     30.7   76.9
        Syracuse     Georgia Tech   61   26   19    7   31   23   15   14         76.7  48.4 19.6  37.8     50.8   76.7
    Georgia Tech         Syracuse   64   25   18    4   26   18    9   11         74.3  42.2 12.1  26.8     40.6   74.3
      Ball State Eastern Michigan   51   17   16    5   15   13    7    7     