# College Basketball Play-by-Play Data Collection
## Using CollegeBasketballData.com API for 2025/26 Season

This notebook demonstrates how to use the CollegeBasketballData.com (CBBD) API to collect play-by-play data for the 2025/26 college basketball season.

### Prerequisites
- Python 3.7+
- API key from CollegeBasketballData.com (register at https://collegebasketballdata.com)
- Install required packages: `pip install cbbd pandas numpy`

## 1. Setup and Import Libraries

In [1]:
# Import required libraries
import cbbd
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from cbbd.rest import ApiException
from pprint import pprint

# Change working directory to Desktop
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
os.chdir(desktop_path)
print(f"Working directory changed to: {os.getcwd()}")

Working directory changed to: C:\Users\shank.subramani_betf\Desktop


## 2. Configure API Authentication

The CBBD API uses Bearer token authentication. You'll be prompted to enter your API key securely.

In [2]:
# Configure API key
import getpass

# Prompt user for API key (input will be hidden for security)
print("Please enter your CollegeBasketballData.com API key")
print("(Get your free key at: https://collegebasketballdata.com)")
api_key = getpass.getpass("API Key: ")

# Alternatively, use environment variable if already set
# api_key = os.environ.get('CBBD_API_KEY')

# Validate that a key was entered
if not api_key or api_key.strip() == '':
    raise ValueError("API key is required. Please run this cell again and enter your key.")

# Configure the API client
configuration = cbbd.Configuration(
    host="https://api.collegebasketballdata.com",
    access_token=api_key
)

print("✓ API configuration completed successfully!")

Please enter your CollegeBasketballData.com API key
(Get your free key at: https://collegebasketballdata.com)
✓ API configuration completed successfully!


## 3. Set Season Parameter

For the 2025/26 season, use `season=2026` (NCAA uses the end year of the academic year)

In [3]:
# Set the season (use 2026 for 2025/26 season)
season = 2026
print(f"Season set to: {season-1}/{season}")

Season set to: 2025/2026


## 4. Choose Your Data Collection Method

The CBBD API provides several endpoints for accessing play-by-play data. Choose the method that best fits your needs:

- **Method 1 (4a)**: Get plays for a **specific game** (when you know the game ID)
- **Method 2 (4b)**: Get plays by **date range** (all games between two dates)
- **Method 3 (4c)**: Get plays by **team** (entire season for one team)
- **Method 4 (4d)**: Get **shooting plays only** (for shot charts and shooting analysis)

**You only need to run ONE of the methods below (4a, 4b, 4c, or 4d) based on your use case.**

### 4a. Method 1: Get Plays for a Specific Game

First, find games for a specific date to get game IDs, then retrieve plays for a specific game

In [12]:
# Combined: Get games for a date and collect play-by-play for all of them
from datetime import datetime
import time

# Prompt user for the date
date_input = input("Enter date to search for games (YYYY-MM-DD) or press Enter for today: ").strip()

# Use today's date if nothing entered
if not date_input:
    search_date = datetime.now()
else:
    # Parse the string into a datetime object
    search_date = datetime.strptime(date_input, '%Y-%m-%d')

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get games for the specified date
        games = games_api.get_games(
            start_date_range=search_date,
            end_date_range=search_date,
            season=season
        )
        
        # Convert to DataFrame
        games_df = pd.DataFrame([g.to_dict() for g in games])
        
        if len(games_df) > 0:
            # Collect play-by-play for all games
            all_plays = []
            
            for idx, game in games_df.iterrows():
                game_id = game['id']
                
                try:
                    # Get plays for this game
                    plays = plays_api.get_plays(game_id=game_id)
                    
                    # Convert to list of dicts
                    game_plays = [p.to_dict() for p in plays]
                    
                    # Add to our collection
                    all_plays.extend(game_plays)
                    
                    # Small delay to be nice to the API
                    time.sleep(0.5)
                    
                except ApiException as e:
                    continue
            
            # Convert all plays to DataFrame
            plays_df = pd.DataFrame(all_plays)
            
        else:
            plays_df = pd.DataFrame()
        
    except ApiException as e:
        games_df = pd.DataFrame()
        plays_df = pd.DataFrame()

# Quick summary
print(f"Games: {len(games_df)}, Plays: {len(plays_df):,}")

Games: 11, Plays: 5,172


In [13]:
plays_df

Unnamed: 0,id,sourceId,gameId,gameSourceId,gameStartDate,season,seasonType,gameType,playType,period,...,playText,participants,isHomeTeam,teamId,team,conference,opponentId,opponent,opponentConference,shotInfo
0,43562565,401822888116889907,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Start game,[],,,,,,,,
1,43562567,401822888116889909,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball lost by Butler,"[{'name': 'Drayton Jones', 'id': 1610}]",True,34.0,Butler,Big East,279.0,St. John's,Big East,
2,43562566,401822888116889908,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball won by St. John's,"[{'name': 'Zuby Ejiofor', 'id': 1424}]",False,279.0,St. John's,Big East,34.0,Butler,Big East,
3,43562568,401822888116889913,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,1,...,Ian Jackson misses 13-foot jumper,"[{'name': 'Ian Jackson', 'id': 243}]",False,279.0,St. John's,Big East,34.0,Butler,Big East,"{'shooter': {'name': 'Ian Jackson', 'id': 243}..."
4,43562569,401822888116889914,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,1,...,Drayton Jones Defensive Rebound.,"[{'name': 'Drayton Jones', 'id': 1610}]",True,34.0,Butler,Big East,279.0,St. John's,Big East,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,43555113,401825423116900592,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,2,...,Aday Mara Defensive Rebound.,"[{'name': 'Aday Mara', 'id': 535}]",False,170.0,Michigan,Big Ten,226.0,Penn State,Big Ten,
5168,43555117,401825423116900987,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Game,2,...,End of Game,[],,,,,,,,
5169,43555116,401825423116900986,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Period,2,...,End of 2nd half,[],,,,,,,,
5170,43555115,401825423116900942,214336,401825423,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Dead Ball Rebound,2,...,Penn State Deadball Team Rebound.,[],True,226.0,Penn State,Big Ten,170.0,Michigan,Big Ten,


### 4b. Method 2: Get Plays by Date Range

This retrieves all plays for all games within a date range - useful for analyzing a specific time period

In [6]:
# Define date range for play-by-play collection
start_date = '2025-11-04'
end_date = '2025-11-07'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get plays for all games in date range
        plays_by_date = plays_api.get_plays_by_date(
            season=season,
            start_date=start_date,
            end_date=end_date
        )
        
        # Convert to DataFrame
        plays_date_df = pd.DataFrame([p.to_dict() for p in plays_by_date])
        
        print(f"Total plays from {start_date} to {end_date}: {len(plays_date_df)}")
        print(f"\nUnique games: {plays_date_df['game_id'].nunique() if len(plays_date_df) > 0 else 0}")
        print(f"Play types: {plays_date_df['type'].value_counts().head(10) if len(plays_date_df) > 0 else 'No data'}")
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_date: {e}")
        plays_date_df = pd.DataFrame()

ValidationError: 1 validation error for GetPlaysByDate
var_date
  field required (type=value_error.missing)

### 4c. Method 3: Get Plays by Team

This retrieves all plays for a specific team's entire season - ideal for team-specific analysis

In [None]:
# Get play-by-play for a specific team (example: Duke)
team_name = 'Duke'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get all plays for the team's season
        team_plays = plays_api.get_plays_by_team(
            season=season,
            team=team_name
        )
        
        # Convert to DataFrame
        team_plays_df = pd.DataFrame([p.to_dict() for p in team_plays])
        
        print(f"Total plays for {team_name} in {season-1}/{season} season: {len(team_plays_df)}")
        if len(team_plays_df) > 0:
            print(f"\nGames played: {team_plays_df['game_id'].nunique()}")
            print(f"\nPlay type distribution:")
            print(team_plays_df['type'].value_counts())
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_team: {e}")
        team_plays_df = pd.DataFrame()

### 4d. Method 4: Get Shooting Plays Only

Filter for only shooting plays - essential for shot charts and shooting efficiency analysis

In [None]:
# Get shooting plays only for a team
with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get shooting plays only
        shooting_plays = plays_api.get_plays_by_team(
            season=season,
            team=team_name,
            shooting_plays_only=True  # This filters for only shooting plays
        )
        
        # Convert to DataFrame
        shooting_df = pd.DataFrame([p.to_dict() for p in shooting_plays])
        
        print(f"Shooting plays for {team_name}: {len(shooting_df)}")
        if len(shooting_df) > 0:
            print(f"\nShot location data available: {'shot' in shooting_df.columns}")
            print(f"\nSample shooting plays:")
            print(shooting_df[['period', 'clock', 'team', 'type', 'description']].head())
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_team: {e}")
        shooting_df = pd.DataFrame()

## 5. Process a Single Game (Start Here!)

Before processing all games, let's focus on getting ONE game right. This approach helps us:
- Understand the exact data structure
- Debug any parsing issues
- Validate our output before scaling up

In [14]:
# Step 5a: Pick ONE game and examine its structure
# This is crucial - we need to understand the data before processing it

# Get unique game IDs
if 'plays_df' in locals() and len(plays_df) > 0:
    unique_games = plays_df['gameId'].unique()
    print(f"Total unique games: {len(unique_games)}")
    
    # Pick the first game
    sample_game_id = unique_games[0]
    print(f"\nSelected game ID: {sample_game_id}")
    
    # Filter for just this game
    single_game_df = plays_df[plays_df['gameId'] == sample_game_id].copy()
    print(f"Total plays in this game: {len(single_game_df)}")
    
    # Get teams playing
    teams = single_game_df[single_game_df['team'].notna()]['team'].unique()
    print(f"Teams: {teams}")
    
    # Show all columns available
    print(f"\nAll columns ({len(single_game_df.columns)}):")
    for col in single_game_df.columns:
        print(f"  - {col}")
else:
    print("No plays_df available - run a data collection method first (4a, 4b, 4c, or 4d)")

Total unique games: 11

Selected game ID: 214340
Total plays in this game: 429
Teams: ['Butler' "St. John's"]

All columns (29):
  - id
  - sourceId
  - gameId
  - gameSourceId
  - gameStartDate
  - season
  - seasonType
  - gameType
  - playType
  - period
  - clock
  - secondsRemaining
  - homeScore
  - awayScore
  - homeWinProbability
  - scoringPlay
  - shootingPlay
  - scoreValue
  - wallclock
  - playText
  - participants
  - isHomeTeam
  - teamId
  - team
  - conference
  - opponentId
  - opponent
  - opponentConference
  - shotInfo


In [15]:
# Step 5b: Examine the shotInfo structure for ONE shooting play
# This is where the previous code failed - shotInfo is a DICTIONARY, not an object!

# Get shooting plays from our single game
shooting_plays_single = single_game_df[single_game_df['shootingPlay'] == True]
print(f"Shooting plays in this game: {len(shooting_plays_single)}")

if len(shooting_plays_single) > 0:
    # Get the first shooting play with shotInfo
    sample_shot = shooting_plays_single[shooting_plays_single['shotInfo'].notna()].iloc[0]
    
    print(f"\n{'='*60}")
    print("SAMPLE SHOOTING PLAY")
    print(f"{'='*60}")
    print(f"Play Type: {sample_shot['playType']}")
    print(f"Play Text: {sample_shot['playText']}")
    print(f"Team: {sample_shot['team']}")
    
    print(f"\n{'='*60}")
    print("shotInfo STRUCTURE (this is the key!)")
    print(f"{'='*60}")
    print(f"Type: {type(sample_shot['shotInfo'])}")
    print(f"\nFull shotInfo content:")
    pprint(sample_shot['shotInfo'])
    
    # Show all available keys
    if isinstance(sample_shot['shotInfo'], dict):
        print(f"\nAvailable keys in shotInfo:")
        for key in sample_shot['shotInfo'].keys():
            print(f"  - {key}: {type(sample_shot['shotInfo'][key])}")
else:
    print("No shooting plays found in this game")

Shooting plays in this game: 147

SAMPLE SHOOTING PLAY
Play Type: JumpShot
Play Text: Ian Jackson misses 13-foot jumper
Team: St. John's

shotInfo STRUCTURE (this is the key!)
Type: <class 'dict'>

Full shotInfo content:
{'assisted': False,
 'assistedBy': {'id': None, 'name': None},
 'location': {'x': 780.2, 'y': 320},
 'made': False,
 'range': 'jumper',
 'shooter': {'id': 243, 'name': 'Ian Jackson'}}

Available keys in shotInfo:
  - shooter: <class 'dict'>
  - made: <class 'bool'>
  - range: <class 'str'>
  - assisted: <class 'bool'>
  - assistedBy: <class 'dict'>
  - location: <class 'dict'>


In [16]:
# Step 5c: CORRECT parsing - shotInfo is a DICTIONARY, use .get() not hasattr()!
# Process all shooting plays for our single game
import re

def extract_shot_data(row):
    """Extract all shot data from a single play row - handles dictionary structure correctly."""
    shot_info = row.get('shotInfo')
    play_text = row.get('playText') or ''
    
    # Start with basic play info
    shot_data = {
        'play_id': row.get('id'),
        'game_id': row.get('gameId'),
        'period': row.get('period'),
        'clock': row.get('clock'),
        'seconds_remaining': row.get('secondsRemaining'),
        'team': row.get('team'),
        'team_id': row.get('teamId'),
        'opponent': row.get('opponent'),
        'is_home_team': row.get('isHomeTeam'),
        'play_type': row.get('playType'),
        'play_text': play_text,
        'score_value': row.get('scoreValue'),
        'home_score': row.get('homeScore'),
        'away_score': row.get('awayScore'),
    }
    
    # Parse from play text
    # Distance: "24-foot" -> 24
    distance_match = re.search(r'(\d+)-foot', play_text)
    shot_data['distance'] = int(distance_match.group(1)) if distance_match else None
    
    # Three pointer: "three point" (with space)
    shot_data['is_three_from_text'] = 'three point' in play_text.lower()
    
    # Made/missed from text
    shot_data['made_from_text'] = ' makes ' in play_text.lower() or ' made ' in play_text.lower()
    
    # Extract from shotInfo dictionary (THIS IS THE KEY FIX!)
    if shot_info and isinstance(shot_info, dict):
        # Basic shot info
        shot_data['made'] = shot_info.get('made')
        shot_data['assisted'] = shot_info.get('assisted')
        shot_data['three_pointer'] = shot_info.get('threePointer')
        
        # Shooter info (nested dict)
        shooter = shot_info.get('shooter')
        if shooter and isinstance(shooter, dict):
            shot_data['shooter_name'] = shooter.get('name')
            shot_data['shooter_id'] = shooter.get('id')
        
        # Location info (nested dict)
        location = shot_info.get('location')
        if location and isinstance(location, dict):
            shot_data['x'] = location.get('x')
            shot_data['y'] = location.get('y')
    
    return shot_data

# Process our single game
print(f"Processing game: {sample_game_id}")
print(f"Shooting plays to process: {len(shooting_plays_single)}")

# Extract shot data for each shooting play
single_game_shots = []
for idx, row in shooting_plays_single.iterrows():
    shot_data = extract_shot_data(row.to_dict())
    single_game_shots.append(shot_data)

# Convert to DataFrame
single_game_shots_df = pd.DataFrame(single_game_shots)

print(f"\n{'='*60}")
print("EXTRACTION RESULTS")
print(f"{'='*60}")
print(f"Total shots extracted: {len(single_game_shots_df)}")
print(f"\nColumns extracted ({len(single_game_shots_df.columns)}):")
print(single_game_shots_df.columns.tolist())

# Check what we got
print(f"\n{'='*60}")
print("DATA QUALITY CHECK")
print(f"{'='*60}")
for col in ['made', 'distance', 'x', 'y', 'shooter_name', 'three_pointer', 'is_three_from_text']:
    if col in single_game_shots_df.columns:
        non_null = single_game_shots_df[col].notna().sum()
        if col == 'is_three_from_text':
            # Boolean - count True values
            true_count = single_game_shots_df[col].sum()
            print(f"  {col}: {true_count} three-pointers found")
        else:
            print(f"  {col}: {non_null}/{len(single_game_shots_df)} ({non_null/len(single_game_shots_df)*100:.1f}%)")

Processing game: 214340
Shooting plays to process: 147

EXTRACTION RESULTS
Total shots extracted: 147

Columns extracted (24):
['play_id', 'game_id', 'period', 'clock', 'seconds_remaining', 'team', 'team_id', 'opponent', 'is_home_team', 'play_type', 'play_text', 'score_value', 'home_score', 'away_score', 'distance', 'is_three_from_text', 'made_from_text', 'made', 'assisted', 'three_pointer', 'shooter_name', 'shooter_id', 'x', 'y']

DATA QUALITY CHECK
  made: 147/147 (100.0%)
  distance: 77/147 (52.4%)
  x: 114/147 (77.6%)
  y: 114/147 (77.6%)
  shooter_name: 147/147 (100.0%)
  three_pointer: 0/147 (0.0%)
  is_three_from_text: 36 three-pointers found


In [None]:
# Step 5f: Extract players from nested structure
all_players_flat = []

for idx, row in game_players_df.iterrows():
    team = row['team']
    players_list = row['players']
    
    if players_list:
        for player in players_list:
            player['team'] = team  # Add team to each player
            all_players_flat.append(player)

players_flat_df = pd.DataFrame(all_players_flat)

print(f"Total players: {len(players_flat_df)}")
print(f"\nColumns: {players_flat_df.columns.tolist()}")

# Show starters
print(f"\n{'='*60}")
print("STARTERS")
print(f"{'='*60}")
for team in players_flat_df['team'].unique():
    team_starters = players_flat_df[(players_flat_df['team'] == team) & (players_flat_df['starter'] == True)]
    print(f"\n{team} ({len(team_starters)}):")
    for idx, p in team_starters.iterrows():
        print(f"  {p['position']} - {p['name']} ({p['minutes']} min)")


Game date: 2026-01-07 00:00:00+00:00
Teams: ['Butler' "St. John's"]
  Butler: 1 players
  St. John's: 1 players

Total players: 2

Columns available:
  - gameId
  - season
  - seasonLabel
  - seasonType
  - startDate
  - startTimeTbd
  - teamId
  - team
  - conference
  - opponentId
  - opponent
  - opponentConference
  - neutralSite
  - conferenceGame
  - gameType
  - gameMinutes
  - gamePace
  - players
  - notes

SAMPLE PLAYER DATA
   gameId  season seasonLabel          seasonType                 startDate  startTimeTbd  teamId        team conference  opponentId    opponent opponentConference  neutralSite  conferenceGame gameType  gameMinutes  gamePace                                                                                                                                                                                                                                                                                                                                                 

In [19]:
# Step 5f: Extract players from nested structure
all_players_flat = []

for idx, row in game_players_df.iterrows():
    team = row['team']
    players_list = row['players']
    
    if players_list:
        for player in players_list:
            player['team'] = team  # Add team to each player
            all_players_flat.append(player)

players_flat_df = pd.DataFrame(all_players_flat)

print(f"Total players: {len(players_flat_df)}")
print(f"\nColumns: {players_flat_df.columns.tolist()}")

# Show starters
print(f"\n{'='*60}")
print("STARTERS")
print(f"{'='*60}")
for team in players_flat_df['team'].unique():
    team_starters = players_flat_df[(players_flat_df['team'] == team) & (players_flat_df['starter'] == True)]
    print(f"\n{team} ({len(team_starters)}):")
    for idx, p in team_starters.iterrows():
        print(f"  {p['position']} - {p['name']} ({p['minutes']} min)")


Total players: 19

Columns: ['rebounds', 'freeThrows', 'threePointFieldGoals', 'twoPointFieldGoals', 'fieldGoals', 'offensiveReboundPct', 'freeThrowRate', 'assistsTurnoverRatio', 'gameScore', 'trueShootingPct', 'effectiveFieldGoalPct', 'netRating', 'defensiveRating', 'offensiveRating', 'usage', 'blocks', 'steals', 'assists', 'fouls', 'turnovers', 'points', 'minutes', 'ejected', 'starter', 'position', 'name', 'athleteSourceId', 'athleteId', 'team']

STARTERS

Butler (5):
  G - Michael Ajayi (30 min)
  F - Drayton Jones (21 min)
  G - Azavier Robinson (17 min)
  G - Jamie Kaiser Jr. (28 min)
  G - Finley Bizjack (35 min)

St. John's (5):
  F - Dillon Mitchell (33 min)
  F - Zuby Ejiofor (30 min)
  F - Bryce Hopkins (24 min)
  G - Oziyah Sellers (33 min)
  G - Ian Jackson (22 min)


In [None]:
# Step 5g: Build lineup tracker
# Parse substitutions to track who's on court

import re

def parse_substitution(play_text):
    """
    Parse substitution text.
    Format: "Player Name subbing in/out for Team"
    Returns: {'player': name, 'action': 'in' or 'out', 'team': team_name}
    """
    # Pattern: "Player Name subbing in for Team" or "Player Name subbing out for Team"
    pattern = r'(.+?)\s+subbing\s+(in|out)\s+for\s+(.+?)$'
    match = re.search(pattern, play_text, re.IGNORECASE)
    if match:
        return {
            'player': match.group(1).strip(),
            'action': match.group(2).lower(),
            'team': match.group(3).strip()
        }
    return None

# Get substitution plays from our single game
subs = single_game_df[single_game_df['playType'] == 'Substitution'].copy()
print(f"Total substitutions in game: {len(subs)}")

# Test on sample subs
print("\nTesting substitution parsing:")
print("="*60)
for idx, sub in subs.head(10).iterrows():
    result = parse_substitution(sub['playText'])
    print(f"Text: {sub['playText']}")
    print(f"Parsed: {result}")
    print()

# Count in vs out
subs['parsed'] = subs['playText'].apply(parse_substitution)
subs_valid = subs[subs['parsed'].notna()]
in_count = sum(1 for p in subs_valid['parsed'] if p['action'] == 'in')
out_count = sum(1 for p in subs_valid['parsed'] if p['action'] == 'out')
print(f"Summary: {in_count} 'in', {out_count} 'out' (should be equal)")
print(f"Parse failures: {len(subs) - len(subs_valid)}")

In [None]:
# Step 5h: Track full lineups throughout the game
# Use REAL starters from API, then track substitutions
# Process all subs at same timestamp together before logging

def track_lineups_with_real_starters(game_df, starters_by_team):
    """Track which 5 players are on court for each team throughout the game."""
    # Get teams
    teams = game_df[game_df['team'].notna()]['team'].unique()
    print(f"Teams: {list(teams)}")
    
    # Initialize lineups with REAL starters from API
    lineups = {}
    for team in teams:
        if team in starters_by_team:
            lineups[team] = set(starters_by_team[team])
            print(f"\n{team} starters (from API): {len(lineups[team])}")
            for p in lineups[team]:
                print(f"  - {p}")
        else:
            lineups[team] = set()
            print(f"\n{team}: No starters found in API data")
    
    # Group plays by (period, clock) to process subs together
    game_df_sorted = game_df.sort_values(['period', 'secondsRemaining'], ascending=[True, False])
    
    lineup_log = []
    current_period = None
    current_clock = None
    pending_plays = []
    
    def process_pending_plays():
        """Process all pending plays and log lineup state once at the end."""
        nonlocal pending_plays
        if not pending_plays:
            return
            
        # First, apply all substitutions
        for play in pending_plays:
            if play.get('playType') == 'Substitution':
                team = play.get('team')
                sub_info = parse_substitution(play.get('playText', ''))
                if sub_info and team:
                    player = sub_info['player']
                    action = sub_info['action']
                    if action == 'in':
                        lineups[team].add(player)
                    elif action == 'out':
                        lineups[team].discard(player)
        
        # Then log lineup state for each play at this timestamp
        for play in pending_plays:
            lineup_log.append({
                'play_id': play.get('id'),
                'period': play.get('period'),
                'clock': play.get('clock'),
                'seconds_remaining': play.get('secondsRemaining'),
                'play_type': play.get('playType'),
                'team': play.get('team'),
                'home_lineup': list(lineups.get(teams[0], set())) if len(teams) > 0 else [],
                'away_lineup': list(lineups.get(teams[1], set())) if len(teams) > 1 else [],
                'home_lineup_size': len(lineups.get(teams[0], set())) if len(teams) > 0 else 0,
                'away_lineup_size': len(lineups.get(teams[1], set())) if len(teams) > 1 else 0,
            })
        
        pending_plays = []
    
    # Process plays
    for idx, play in game_df_sorted.iterrows():
        period = play.get('period')
        clock = play.get('clock')
        
        # If we're at a new timestamp, process pending plays first
        if (period, clock) != (current_period, current_clock):
            process_pending_plays()
            current_period = period
            current_clock = clock
        
        pending_plays.append(play.to_dict())
    
    # Process final batch
    process_pending_plays()
    
    return pd.DataFrame(lineup_log), lineups

# Create starters_by_team if not already created
if 'starters_by_team' not in locals():
    starters_by_team = {}
    for team in players_flat_df['team'].unique():
        team_starters = players_flat_df[(players_flat_df['team'] == team) & (players_flat_df['starter'] == True)]
        starters_by_team[team] = team_starters['name'].tolist()

# Run lineup tracking with real starters
lineup_df, final_lineups = track_lineups_with_real_starters(single_game_df, starters_by_team)

print(f"\n{'='*60}")
print("LINEUP TRACKING RESULTS")
print(f"{'='*60}")
print(f"Total plays tracked: {len(lineup_df)}")

# Check lineup sizes (should stay at 5)
print(f"\nLineup size check (should be exactly 5 per team):")
for col in ['home_lineup_size', 'away_lineup_size']:
    sizes = lineup_df[col]
    print(f"  {col}: min={sizes.min()}, max={sizes.max()}, mode={sizes.mode().iloc[0]}")
    
# Show any plays where lineup size != 5
bad_lineups = lineup_df[(lineup_df['home_lineup_size'] != 5) | (lineup_df['away_lineup_size'] != 5)]
if len(bad_lineups) > 0:
    print(f"\n⚠️  {len(bad_lineups)} plays with lineup size != 5")
    print(bad_lineups[['period', 'clock', 'play_type', 'home_lineup_size', 'away_lineup_size']].head(10))
else:
    print(f"\n✓ All plays have exactly 5 players per team!")

In [None]:
# Step 5j: Create FLAT dataframes for analysis

# Helper to flatten lineup lists into string keys
def lineup_to_key(lineup_list):
    """Convert lineup list to sorted string for grouping."""
    if isinstance(lineup_list, list):
        return ' | '.join(sorted(lineup_list))
    return None

# 1. SHOTS_DF - All shooting plays with extracted details + lineups (FLAT)
shots_df = pbp_with_lineups[pbp_with_lineups['shootingPlay'] == True].copy()

# Extract shot details from shotInfo
for idx, row in shots_df.iterrows():
    shot_info = row.get('shotInfo')
    if shot_info and isinstance(shot_info, dict):
        shots_df.at[idx, 'shooter_name'] = shot_info.get('shooter', {}).get('name')
        shots_df.at[idx, 'shooter_id'] = shot_info.get('shooter', {}).get('id')
        shots_df.at[idx, 'made'] = shot_info.get('made')
        shots_df.at[idx, 'assisted'] = shot_info.get('assisted')
        shots_df.at[idx, 'assisted_by'] = shot_info.get('assistedBy', {}).get('name')
        shots_df.at[idx, 'shot_range'] = shot_info.get('range')
        shots_df.at[idx, 'x'] = shot_info.get('location', {}).get('x')
        shots_df.at[idx, 'y'] = shot_info.get('location', {}).get('y')

# Parse distance from text
shots_df['distance'] = shots_df['playText'].str.extract(r'(\d+)-foot').astype(float)
shots_df['is_three'] = shots_df['playText'].str.contains('three point', case=False, na=False)

# Flatten lineups to strings
shots_df['home_lineup_key'] = shots_df['home_lineup'].apply(lineup_to_key)
shots_df['away_lineup_key'] = shots_df['away_lineup'].apply(lineup_to_key)

# Drop the list columns, keep flat ones
shots_df = shots_df.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"SHOTS_DF: {len(shots_df)} shots")
print(f"  Key columns: shooter_name, made, x, y, distance, is_three, home_lineup_key, away_lineup_key")

# 2. PLAYERS_DF - Player stats from this game (FLAT - already flat!)
players_df = players_flat_df.copy()

# Flatten nested stat dicts
for col in ['rebounds', 'freeThrows', 'threePointFieldGoals', 'twoPointFieldGoals', 'fieldGoals']:
    if col in players_df.columns:
        for idx, row in players_df.iterrows():
            if isinstance(row[col], dict):
                for key, val in row[col].items():
                    players_df.at[idx, f'{col}_{key}'] = val
        players_df = players_df.drop(columns=[col])

print(f"\nPLAYERS_DF: {len(players_df)} players")
print(f"  Key columns: name, team, starter, position, minutes, points, assists, steals, blocks")

# 3. LINEUP_STINTS_DF - Track when each lineup was on court (FLAT)
def get_lineup_stints(pbp_df):
    """Calculate stint lengths for each unique lineup combination."""
    stints = []
    
    prev_home = None
    prev_away = None
    stint_start = None
    stint_start_score = None
    
    for idx, row in pbp_df.iterrows():
        home = tuple(sorted(row['home_lineup']))
        away = tuple(sorted(row['away_lineup']))
        
        if (home, away) != (prev_home, prev_away):
            if prev_home is not None:
                stints.append({
                    'home_lineup_key': ' | '.join(prev_home),
                    'away_lineup_key': ' | '.join(prev_away),
                    'start_seconds': stint_start,
                    'end_seconds': row['secondsRemaining'],
                    'start_home_score': stint_start_score[0],
                    'start_away_score': stint_start_score[1],
                    'end_home_score': row['homeScore'],
                    'end_away_score': row['awayScore'],
                })
            
            prev_home = home
            prev_away = away
            stint_start = row['secondsRemaining']
            stint_start_score = (row['homeScore'], row['awayScore'])
    
    if prev_home is not None:
        stints.append({
            'home_lineup_key': ' | '.join(prev_home),
            'away_lineup_key': ' | '.join(prev_away),
            'start_seconds': stint_start,
            'end_seconds': 0,
            'start_home_score': stint_start_score[0],
            'start_away_score': stint_start_score[1],
            'end_home_score': pbp_df.iloc[-1]['homeScore'],
            'end_away_score': pbp_df.iloc[-1]['awayScore'],
        })
    
    return pd.DataFrame(stints)

lineup_stints_df = get_lineup_stints(pbp_with_lineups.sort_values('secondsRemaining', ascending=False))
lineup_stints_df['duration_seconds'] = lineup_stints_df['start_seconds'] - lineup_stints_df['end_seconds']
lineup_stints_df['home_pts_scored'] = lineup_stints_df['end_home_score'] - lineup_stints_df['start_home_score']
lineup_stints_df['away_pts_scored'] = lineup_stints_df['end_away_score'] - lineup_stints_df['start_away_score']
lineup_stints_df['home_plus_minus'] = lineup_stints_df['home_pts_scored'] - lineup_stints_df['away_pts_scored']

print(f"\nLINEUP_STINTS_DF: {len(lineup_stints_df)} stints")
print(f"  Key columns: home_lineup_key, away_lineup_key, duration_seconds, home_plus_minus")

# 4. PBP_FLAT - Full PBP flattened
pbp_flat = pbp_with_lineups.copy()
pbp_flat['home_lineup_key'] = pbp_flat['home_lineup'].apply(lineup_to_key)
pbp_flat['away_lineup_key'] = pbp_flat['away_lineup'].apply(lineup_to_key)
pbp_flat = pbp_flat.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"\nPBP_FLAT: {len(pbp_flat)} plays")
print(f"  Key columns: playType, playText, homeScore, awayScore, home_lineup_key, away_lineup_key")

# Summary
print(f"\n{'='*60}")
print("FLAT DATAFRAMES READY FOR ANALYSIS")
print(f"{'='*60}")
print(f"  pbp_flat          : {len(pbp_flat):>4} rows - Full PBP (no nested objects)")
print(f"  shots_df          : {len(shots_df):>4} rows - Shots with x/y, made, lineups")
print(f"  players_df        : {len(players_df):>4} rows - Player box score stats")
print(f"  lineup_stints_df  : {len(lineup_stints_df):>4} rows - Lineup stints with +/-")

In [36]:
# Step 5j: Create FLAT dataframes for analysis

# Helper to flatten lineup lists into string keys
def lineup_to_key(lineup_list):
    """Convert lineup list to sorted string for grouping."""
    if isinstance(lineup_list, list):
        return ' | '.join(sorted(lineup_list))
    return None

# 1. SHOTS_DF - All shooting plays with extracted details + lineups (FLAT)
shots_df = pbp_with_lineups[pbp_with_lineups['shootingPlay'] == True].copy()

# Extract shot details from shotInfo
for idx, row in shots_df.iterrows():
    shot_info = row.get('shotInfo')
    if shot_info and isinstance(shot_info, dict):
        shots_df.at[idx, 'shooter_name'] = shot_info.get('shooter', {}).get('name')
        shots_df.at[idx, 'shooter_id'] = shot_info.get('shooter', {}).get('id')
        shots_df.at[idx, 'made'] = shot_info.get('made')
        shots_df.at[idx, 'assisted'] = shot_info.get('assisted')
        shots_df.at[idx, 'assisted_by'] = shot_info.get('assistedBy', {}).get('name')
        shots_df.at[idx, 'shot_range'] = shot_info.get('range')
        shots_df.at[idx, 'x'] = shot_info.get('location', {}).get('x')
        shots_df.at[idx, 'y'] = shot_info.get('location', {}).get('y')

# Parse distance from text
shots_df['distance'] = shots_df['playText'].str.extract(r'(\d+)-foot').astype(float)
shots_df['is_three'] = shots_df['playText'].str.contains('three point', case=False, na=False)

# Flatten lineups to strings
shots_df['home_lineup_key'] = shots_df['home_lineup'].apply(lineup_to_key)
shots_df['away_lineup_key'] = shots_df['away_lineup'].apply(lineup_to_key)

# Drop the list columns, keep flat ones
shots_df = shots_df.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"SHOTS_DF: {len(shots_df)} shots")

# 2. PLAYERS_DF - Flatten nested stat dicts
players_df = players_flat_df.copy()
for col in ['rebounds', 'freeThrows', 'threePointFieldGoals', 'twoPointFieldGoals', 'fieldGoals']:
    if col in players_df.columns:
        for idx, row in players_df.iterrows():
            if isinstance(row[col], dict):
                for key, val in row[col].items():
                    players_df.at[idx, f'{col}_{key}'] = val
        players_df = players_df.drop(columns=[col])

print(f"PLAYERS_DF: {len(players_df)} players")

# 3. LINEUP_STINTS_DF - Flat
def get_lineup_stints(pbp_df):
    stints = []
    prev_home, prev_away, stint_start, stint_start_score = None, None, None, None
    
    for idx, row in pbp_df.iterrows():
        home = tuple(sorted(row['home_lineup']))
        away = tuple(sorted(row['away_lineup']))
        
        if (home, away) != (prev_home, prev_away):
            if prev_home is not None:
                stints.append({
                    'home_lineup_key': ' | '.join(prev_home),
                    'away_lineup_key': ' | '.join(prev_away),
                    'start_seconds': stint_start,
                    'end_seconds': row['secondsRemaining'],
                    'start_home_score': stint_start_score[0],
                    'start_away_score': stint_start_score[1],
                    'end_home_score': row['homeScore'],
                    'end_away_score': row['awayScore'],
                })
            prev_home, prev_away = home, away
            stint_start = row['secondsRemaining']
            stint_start_score = (row['homeScore'], row['awayScore'])
    
    if prev_home is not None:
        stints.append({
            'home_lineup_key': ' | '.join(prev_home),
            'away_lineup_key': ' | '.join(prev_away),
            'start_seconds': stint_start, 'end_seconds': 0,
            'start_home_score': stint_start_score[0],
            'start_away_score': stint_start_score[1],
            'end_home_score': pbp_df.iloc[-1]['homeScore'],
            'end_away_score': pbp_df.iloc[-1]['awayScore'],
        })
    return pd.DataFrame(stints)

lineup_stints_df = get_lineup_stints(pbp_with_lineups.sort_values('secondsRemaining', ascending=False))
lineup_stints_df['duration_seconds'] = lineup_stints_df['start_seconds'] - lineup_stints_df['end_seconds']
lineup_stints_df['home_plus_minus'] = (lineup_stints_df['end_home_score'] - lineup_stints_df['start_home_score']) - \
                                       (lineup_stints_df['end_away_score'] - lineup_stints_df['start_away_score'])

print(f"LINEUP_STINTS_DF: {len(lineup_stints_df)} stints")

# 4. PBP_FLAT
pbp_flat = pbp_with_lineups.copy()
pbp_flat['home_lineup_key'] = pbp_flat['home_lineup'].apply(lineup_to_key)
pbp_flat['away_lineup_key'] = pbp_flat['away_lineup'].apply(lineup_to_key)
pbp_flat = pbp_flat.drop(columns=['home_lineup', 'away_lineup', 'shotInfo', 'participants'])

print(f"PBP_FLAT: {len(pbp_flat)} plays")

print(f"\n{'='*60}")
print("FLAT DATAFRAMES READY")
print(f"{'='*60}")


SHOTS_DF: 147 shots
PLAYERS_DF: 19 players
LINEUP_STINTS_DF: 153 stints
PBP_FLAT: 429 plays

FLAT DATAFRAMES READY


In [None]:
# Step 5k: Track possessions and how they end
# A possession ends with: made shot, turnover, defensive rebound, end of period, or opponent foul shots

def track_possessions(game_df):
    """
    Track each possession and how it ended.
    Returns DataFrame with possession_id, team, and outcome for each play.
    """
    game_df = game_df.sort_values(['period', 'secondsRemaining'], ascending=[True, False]).copy()
    
    possessions = []
    possession_id = 0
    current_team = None
    possession_plays = []
    
    # Possession-ending events
    def get_possession_outcome(play):
        """Determine if/how this play ends a possession."""
        play_type = play.get('playType', '')
        play_text = play.get('playText', '').lower()
        
        # Made shot (not free throw) - possession ends
        if play_type in ['JumpShot', 'LayUpShot', 'DunkShot', 'TipShot']:
            if 'makes' in play_text or 'made' in play_text:
                return 'made_fg'
            # Miss doesn't end possession (need rebound)
            return None
        
        # Made free throw on last attempt ends possession
        if play_type == 'MadeFreeThrow':
            if 'makes' in play_text and ('3 of 3' in play_text or '2 of 2' in play_text or '1 of 1' in play_text):
                return 'made_ft'
            if 'misses' in play_text and ('3 of 3' in play_text or '2 of 2' in play_text or '1 of 1' in play_text):
                return 'missed_ft'  # Last FT miss - need rebound
            return None
        
        # Turnovers end possession
        if 'Turnover' in play_type or 'turnover' in play_text:
            return 'turnover'
        
        # Defensive rebound ends possession for shooting team
        if play_type == 'Defensive Rebound':
            return 'missed_fg'  # The miss + def rebound = possession over
        
        # End of period
        if play_type in ['End Period', 'End Game']:
            return 'end_period'
        
        return None
    
    for idx, play in game_df.iterrows():
        team = play.get('team')
        outcome = get_possession_outcome(play)
        play_type = play.get('playType', '')
        
        # Skip non-team plays for possession tracking
        if team is None and play_type not in ['End Period', 'End Game']:
            continue
        
        # Check for possession change
        if team and team != current_team and play_type != 'Defensive Rebound':
            # New possession starting
            if current_team is not None:
                possession_id += 1
            current_team = team
        
        # Record this play's possession
        possessions.append({
            'play_id': play.get('id'),
            'possession_id': possession_id,
            'possession_team': current_team,
            'play_type': play_type,
            'outcome': outcome
        })
        
        # If possession ended, increment for next one
        if outcome and outcome != 'end_period':
            possession_id += 1
            # Determine who gets ball next
            if outcome in ['made_fg', 'made_ft']:
                # Other team gets ball
                teams = game_df[game_df['team'].notna()]['team'].unique()
                current_team = [t for t in teams if t != current_team][0] if len(teams) > 1 else None
            elif outcome == 'turnover':
                # Other team gets ball
                teams = game_df[game_df['team'].notna()]['team'].unique()
                current_team = [t for t in teams if t != current_team][0] if len(teams) > 1 else None
            elif outcome == 'missed_fg':
                # Defensive rebound means other team has it
                current_team = team  # The rebounding team
    
    return pd.DataFrame(possessions)

# Run possession tracking
possessions_df = track_possessions(single_game_df)

# Merge with full PBP
pbp_with_possessions = pbp_with_lineups.merge(
    possessions_df[['play_id', 'possession_id', 'possession_team', 'outcome']],
    left_on='id',
    right_on='play_id',
    how='left'
)

print(f"Total plays: {len(pbp_with_possessions)}")
print(f"Total possessions: {possessions_df['possession_id'].nunique()}")

# Summarize possession outcomes
print(f"\n{'='*60}")
print("POSSESSION OUTCOMES")
print(f"{'='*60}")
outcomes = possessions_df[possessions_df['outcome'].notna()]['outcome'].value_counts()
print(outcomes)

# Show sample possessions
print(f"\n{'='*60}")
print("SAMPLE POSSESSIONS")
print(f"{'='*60}")
for poss_id in range(3):
    poss_plays = pbp_with_possessions[pbp_with_possessions['possession_id'] == poss_id]
    if len(poss_plays) > 0:
        team = poss_plays['possession_team'].iloc[0]
        outcome = poss_plays['outcome'].dropna().iloc[-1] if poss_plays['outcome'].notna().any() else 'ongoing'
        print(f"\nPossession {poss_id} ({team}) → {outcome}")
        for idx, p in poss_plays.iterrows():
            print(f"  {p['clock']} - {p['playType']}: {p['playText'][:50]}...")

In [37]:
pbp_flat

Unnamed: 0,id,sourceId,gameId,gameSourceId,gameStartDate,season,seasonType,gameType,playType,period,...,team,conference,opponentId,opponent,opponentConference,play_id,home_lineup_size,away_lineup_size,home_lineup_key,away_lineup_key
0,43562565,401822888116889907,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,,,,,,43562565,5,5,Azavier Robinson | Drayton Jones | Finley Bizj...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
1,43562567,401822888116889909,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Butler,Big East,279.0,St. John's,Big East,43562567,5,5,Azavier Robinson | Drayton Jones | Finley Bizj...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
2,43562566,401822888116889908,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,St. John's,Big East,34.0,Butler,Big East,43562566,5,5,Azavier Robinson | Drayton Jones | Finley Bizj...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
3,43562568,401822888116889913,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,1,...,St. John's,Big East,34.0,Butler,Big East,43562568,5,5,Azavier Robinson | Drayton Jones | Finley Bizj...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
4,43562569,401822888116889914,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,1,...,Butler,Big East,279.0,St. John's,Big East,43562569,5,5,Azavier Robinson | Drayton Jones | Finley Bizj...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
424,43563343,401822888116900166,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,2,...,St. John's,Big East,34.0,Butler,Big East,43563343,5,5,Efeosa Oliogu-Elabor | Evan Haywood | Finley B...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
425,43563344,401822888116900167,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,2,...,Butler,Big East,279.0,St. John's,Big East,43563344,5,5,Efeosa Oliogu-Elabor | Evan Haywood | Finley B...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
426,43563346,401822888116900172,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Period,2,...,,,,,,43563346,5,5,Efeosa Oliogu-Elabor | Evan Haywood | Finley B...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...
427,43563347,401822888116900175,214340,401822888,2026-01-07 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Game,2,...,,,,,,43563347,5,5,Efeosa Oliogu-Elabor | Evan Haywood | Finley B...,Bryce Hopkins | Dillon Mitchell | Ian Jackson ...


In [None]:
# Step 6: Save play-by-play data to CSV
output_dir = 'cbbd_data'

# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Create a timestamp for the filename
date_str = search_date.strftime('%Y%m%d')

# Save games data
if 'games_df' in locals() and len(games_df) > 0:
    filename = f'{output_dir}/games_{date_str}_{season}.csv'
    games_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(games_df)} games to {filename}")

# Save all plays data
if 'plays_df' in locals() and len(plays_df) > 0:
    filename = f'{output_dir}/plays_{date_str}_{season}.csv'
    plays_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(plays_df):,} plays to {filename}")

# Save shot locations data
if 'shot_locations_df' in locals() and len(shot_locations_df) > 0:
    filename = f'{output_dir}/shot_locations_{date_str}_{season}.csv'
    shot_locations_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(shot_locations_df)} shot locations to {filename}")

print(f"\n{'='*50}")
print(f"All data saved to {output_dir}/")
print(f"{'='*50}")

Created directory: cbbd_data
✓ Saved 23 games to cbbd_data/games_20251230_2026.csv
✓ Saved 11,191 plays to cbbd_data/plays_20251230_2026.csv
✓ Saved 3746 shot locations to cbbd_data/shot_locations_20251230_2026.csv

All data saved to cbbd_data/


In [None]:
# Summary statistics for collected data
if 'team_plays_df' in locals() and len(team_plays_df) > 0:
    print("=" * 50)
    print("DATA SUMMARY")
    print("=" * 50)
    print(f"\nTeam: {team_name}")
    print(f"Season: {season-1}/{season}")
    print(f"Total Plays: {len(team_plays_df):,}")
    print(f"Total Games: {team_plays_df['game_id'].nunique()}")
    
    print("\nPlay Type Distribution:")
    print(team_plays_df['type'].value_counts())
    
    # Check for missing data
    print("\nMissing Data Check:")
    print(team_plays_df.isnull().sum())
    
    # Period distribution
    if 'period' in team_plays_df.columns:
        print("\nPlays by Period:")
        print(team_plays_df['period'].value_counts().sort_index())
    
    print("\n" + "=" * 50)

## Appendix A: Get Play Types Reference

Understanding what play types are available in the data

## Appendix B: Batch Collection for Multiple Teams

Automate data collection for multiple teams at once

In [None]:
# Define teams to collect data for
teams = ['Duke', 'North Carolina', 'Kansas', 'Kentucky', 'Gonzaga']

# Dictionary to store all team data
all_team_data = {}

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    for team in teams:
        try:
            print(f"\nCollecting data for {team}...")
            
            # Get all plays for the team
            team_plays = plays_api.get_plays_by_team(
                season=season,
                team=team
            )
            
            # Convert to DataFrame
            df = pd.DataFrame([p.to_dict() for p in team_plays])
            all_team_data[team] = df
            
            print(f"  Collected {len(df)} plays for {team}")
            
            # Optional: add a small delay to avoid rate limiting
            import time
            time.sleep(0.5)
            
        except ApiException as e:
            print(f"  Error collecting data for {team}: {e}")

print(f"\nTotal teams collected: {len(all_team_data)}")

## Appendix C: Calculate Team Shooting Efficiency

Example analysis: Calculate shooting statistics from play-by-play data

In [None]:
# Calculate shooting statistics from play-by-play data
if 'shooting_df' in locals() and len(shooting_df) > 0:
    # Filter for made/missed shots
    shot_results = shooting_df[shooting_df['type'].str.contains('shot|three|jumper|layup|dunk', case=False, na=False)]
    
    if len(shot_results) > 0:
        # Calculate shooting percentages
        total_shots = len(shot_results)
        made_shots = len(shot_results[shot_results.get('made', False) == True])
        
        fg_pct = (made_shots / total_shots * 100) if total_shots > 0 else 0
        
        print(f"\nShooting Statistics for {team_name}:")
        print(f"  Total Shot Attempts: {total_shots}")
        print(f"  Made Shots: {made_shots}")
        print(f"  Field Goal %: {fg_pct:.2f}%")
        
        # Three-point shooting
        three_pt_shots = shot_results[shot_results['type'].str.contains('three', case=False, na=False)]
        if len(three_pt_shots) > 0:
            three_made = len(three_pt_shots[three_pt_shots.get('made', False) == True])
            three_pct = (three_made / len(three_pt_shots) * 100)
            print(f"\n  Three-Point Attempts: {len(three_pt_shots)}")
            print(f"  Three-Pointers Made: {three_made}")
            print(f"  Three-Point %: {three_pct:.2f}%")

## Appendix D: Tips and Best Practices

### Rate Limiting
- Be mindful of API rate limits when making bulk requests
- Add small delays between requests when collecting data for multiple teams
- Use the `_with_http_info` method variant to check remaining API calls:

```python
response = plays_api.get_plays_by_team_with_http_info(season=season, team=team)
remaining_calls = response.headers.get('X-CallLimit-Remaining')
```

### Data Storage
- Save data incrementally to avoid losing progress
- Use parquet format for large datasets (more efficient than CSV)
- Include timestamps in filenames for version control

### Season Timing
- NCAA basketball season typically runs November through April
- Use `season=2026` for the 2025/26 academic year
- Check game schedules before attempting to collect data

### Error Handling
- Always wrap API calls in try-except blocks
- Log errors for debugging
- Implement retry logic for transient failures

---

## Next Steps

Now that you have play-by-play data, you can:
1. Create shot charts using matplotlib/seaborn
2. Calculate advanced metrics (offensive rating, pace, etc.)
3. Analyze lineup performance
4. Build predictive models
5. Track player development over the season

For shot chart creation, check out the CBBD blog post: https://blog.collegefootballdata.com/talking-tech-generating-shot-charts-using-the-basketball-api/

---

**End of Notebook**