# College Basketball Play-by-Play Data Collection
## Using CollegeBasketballData.com API for 2025/26 Season

This notebook demonstrates how to use the CollegeBasketballData.com (CBBD) API to collect play-by-play data for the 2025/26 college basketball season.

### Prerequisites
- Python 3.7+
- API key from CollegeBasketballData.com (register at https://collegebasketballdata.com)
- Install required packages: `pip install cbbd pandas numpy`

## 1. Setup and Import Libraries

In [18]:
# Import required libraries
import cbbd
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from cbbd.rest import ApiException
from pprint import pprint

# Change working directory to Desktop
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
os.chdir(desktop_path)
print(f"Working directory changed to: {os.getcwd()}")

Working directory changed to: C:\Users\shank.subramani_betf\Desktop


## 2. Configure API Authentication

The CBBD API uses Bearer token authentication. You'll be prompted to enter your API key securely.

In [2]:
# Configure API key
import getpass

# Prompt user for API key (input will be hidden for security)
print("Please enter your CollegeBasketballData.com API key")
print("(Get your free key at: https://collegebasketballdata.com)")
api_key = getpass.getpass("API Key: ")

# Alternatively, use environment variable if already set
# api_key = os.environ.get('CBBD_API_KEY')

# Validate that a key was entered
if not api_key or api_key.strip() == '':
    raise ValueError("API key is required. Please run this cell again and enter your key.")

# Configure the API client
configuration = cbbd.Configuration(
    host="https://api.collegebasketballdata.com",
    access_token=api_key
)

print("✓ API configuration completed successfully!")

Please enter your CollegeBasketballData.com API key
(Get your free key at: https://collegebasketballdata.com)
✓ API configuration completed successfully!


## 3. Set Season Parameter

For the 2025/26 season, use `season=2026` (NCAA uses the end year of the academic year)

In [3]:
# Set the season (use 2026 for 2025/26 season)
season = 2026
print(f"Season set to: {season-1}/{season}")

Season set to: 2025/2026


## 4. Choose Your Data Collection Method

The CBBD API provides several endpoints for accessing play-by-play data. Choose the method that best fits your needs:

- **Method 1 (4a)**: Get plays for a **specific game** (when you know the game ID)
- **Method 2 (4b)**: Get plays by **date range** (all games between two dates)
- **Method 3 (4c)**: Get plays by **team** (entire season for one team)
- **Method 4 (4d)**: Get **shooting plays only** (for shot charts and shooting analysis)

**You only need to run ONE of the methods below (4a, 4b, 4c, or 4d) based on your use case.**

### 4a. Method 1: Get Plays for a Specific Game

First, find games for a specific date to get game IDs, then retrieve plays for a specific game

In [34]:
# Combined: Get games for a date and collect play-by-play for all of them
from datetime import datetime
import time

# Prompt user for the date
date_input = input("Enter date to search for games (YYYY-MM-DD) or press Enter for today: ").strip()

# Use today's date if nothing entered
if not date_input:
    search_date = datetime.now()
else:
    # Parse the string into a datetime object
    search_date = datetime.strptime(date_input, '%Y-%m-%d')

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get games for the specified date
        games = games_api.get_games(
            start_date_range=search_date,
            end_date_range=search_date,
            season=season
        )
        
        # Convert to DataFrame
        games_df = pd.DataFrame([g.to_dict() for g in games])
        
        if len(games_df) > 0:
            # Collect play-by-play for all games
            all_plays = []
            
            for idx, game in games_df.iterrows():
                game_id = game['id']
                
                try:
                    # Get plays for this game
                    plays = plays_api.get_plays(game_id=game_id)
                    
                    # Convert to list of dicts
                    game_plays = [p.to_dict() for p in plays]
                    
                    # Add to our collection
                    all_plays.extend(game_plays)
                    
                    # Small delay to be nice to the API
                    time.sleep(0.5)
                    
                except ApiException as e:
                    continue
            
            # Convert all plays to DataFrame
            plays_df = pd.DataFrame(all_plays)
            
        else:
            plays_df = pd.DataFrame()
        
    except ApiException as e:
        games_df = pd.DataFrame()
        plays_df = pd.DataFrame()

# Quick summary
print(f"Games: {len(games_df)}, Plays: {len(plays_df):,}")

Games: 0, Plays: 0


In [33]:
plays_df

Unnamed: 0,id,sourceId,gameId,gameSourceId,gameStartDate,season,seasonType,gameType,playType,period,...,playText,participants,isHomeTeam,teamId,team,conference,opponentId,opponent,opponentConference,shotInfo
0,39735570,401829440114718653,215404,401829440,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Start game,[],,,,,,,,
1,39735572,401829440114718659,215404,401829440,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball lost by Erskine,"[{'name': 'Ruben Salazar', 'id': 211483}]",False,497.0,Erskine,,356.0,Wofford,SoCon,
2,39735571,401829440114718656,215404,401829440,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Jumpball,1,...,Jump Ball won by Wofford,"[{'name': 'Rex Stirling', 'id': 203476}]",True,356.0,Wofford,SoCon,497.0,Erskine,,
3,39735573,401829440114718729,215404,401829440,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,1,...,Chace Watley missed Three Point Jumper.,"[{'name': 'Chace Watley', 'id': 203480}]",True,356.0,Wofford,SoCon,497.0,Erskine,,"{'shooter': {'name': 'Chace Watley', 'id': 203..."
4,39735574,401829440114718790,215404,401829440,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Offensive Rebound,1,...,Rex Stirling Offensive Rebound.,"[{'name': 'Rex Stirling', 'id': 203476}]",True,356.0,Wofford,SoCon,497.0,Erskine,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7723,39739599,401820809114744828,213366,401820809,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,JumpShot,2,...,Brycen Blaine missed Three Point Jumper.,"[{'name': 'Brycen Blaine', 'id': 15242}]",False,47.0,Charleston Southern,Big South,74.0,East Carolina,American,"{'shooter': {'name': 'Brycen Blaine', 'id': 15..."
7724,39739600,401820809114744829,213366,401820809,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,Defensive Rebound,2,...,Jordan Riley Defensive Rebound.,"[{'name': 'Jordan Riley', 'id': 2167}]",True,74.0,East Carolina,American,47.0,Charleston Southern,Big South,
7725,39739601,401820809114744831,213366,401820809,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,LayUpShot,2,...,Jordan Riley made Layup.,"[{'name': 'Jordan Riley', 'id': 2167}]",True,74.0,East Carolina,American,47.0,Charleston Southern,Big South,"{'shooter': {'name': 'Jordan Riley', 'id': 216..."
7726,39739603,401820809114745219,213366,401820809,2025-11-22 00:00:00+00:00,2026,SeasonType.REGULAR,STD,End Game,2,...,End of Game,[],,,,,,,,


### 4b. Method 2: Get Plays by Date Range

This retrieves all plays for all games within a date range - useful for analyzing a specific time period

In [None]:
# Define date range for play-by-play collection
start_date = '2025-11-04'
end_date = '2025-11-07'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get plays for all games in date range
        plays_by_date = plays_api.get_plays_by_date(
            season=season,
            start_date=start_date,
            end_date=end_date
        )
        
        # Convert to DataFrame
        plays_date_df = pd.DataFrame([p.to_dict() for p in plays_by_date])
        
        print(f"Total plays from {start_date} to {end_date}: {len(plays_date_df)}")
        print(f"\nUnique games: {plays_date_df['game_id'].nunique() if len(plays_date_df) > 0 else 0}")
        print(f"Play types: {plays_date_df['type'].value_counts().head(10) if len(plays_date_df) > 0 else 'No data'}")
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_date: {e}")
        plays_date_df = pd.DataFrame()

### 4c. Method 3: Get Plays by Team

This retrieves all plays for a specific team's entire season - ideal for team-specific analysis

In [None]:
# Get play-by-play for a specific team (example: Duke)
team_name = 'Duke'

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get all plays for the team's season
        team_plays = plays_api.get_plays_by_team(
            season=season,
            team=team_name
        )
        
        # Convert to DataFrame
        team_plays_df = pd.DataFrame([p.to_dict() for p in team_plays])
        
        print(f"Total plays for {team_name} in {season-1}/{season} season: {len(team_plays_df)}")
        if len(team_plays_df) > 0:
            print(f"\nGames played: {team_plays_df['game_id'].nunique()}")
            print(f"\nPlay type distribution:")
            print(team_plays_df['type'].value_counts())
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_team: {e}")
        team_plays_df = pd.DataFrame()

### 4d. Method 4: Get Shooting Plays Only

Filter for only shooting plays - essential for shot charts and shooting efficiency analysis

In [None]:
# Get shooting plays only for a team
with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        # Get shooting plays only
        shooting_plays = plays_api.get_plays_by_team(
            season=season,
            team=team_name,
            shooting_plays_only=True  # This filters for only shooting plays
        )
        
        # Convert to DataFrame
        shooting_df = pd.DataFrame([p.to_dict() for p in shooting_plays])
        
        print(f"Shooting plays for {team_name}: {len(shooting_df)}")
        if len(shooting_df) > 0:
            print(f"\nShot location data available: {'shot' in shooting_df.columns}")
            print(f"\nSample shooting plays:")
            print(shooting_df[['period', 'clock', 'team', 'type', 'description']].head())
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_plays_by_team: {e}")
        shooting_df = pd.DataFrame()

## 5. Process a Single Game (Start Here!)

Before processing all games, let's focus on getting ONE game right. This approach helps us:
- Understand the exact data structure
- Debug any parsing issues
- Validate our output before scaling up

In [None]:
# Step 5a: Pick ONE game and examine its structure
# This is crucial - we need to understand the data before processing it

# Get unique game IDs
if 'plays_df' in locals() and len(plays_df) > 0:
    unique_games = plays_df['gameId'].unique()
    print(f"Total unique games: {len(unique_games)}")
    
    # Pick the first game
    sample_game_id = unique_games[0]
    print(f"\nSelected game ID: {sample_game_id}")
    
    # Filter for just this game
    single_game_df = plays_df[plays_df['gameId'] == sample_game_id].copy()
    print(f"Total plays in this game: {len(single_game_df)}")
    
    # Get teams playing
    teams = single_game_df[single_game_df['team'].notna()]['team'].unique()
    print(f"Teams: {teams}")
    
    # Show all columns available
    print(f"\nAll columns ({len(single_game_df.columns)}):")
    for col in single_game_df.columns:
        print(f"  - {col}")
else:
    print("No plays_df available - run a data collection method first (4a, 4b, 4c, or 4d)")

In [None]:
# Step 5b: Examine the shotInfo structure for ONE shooting play
# This is where the previous code failed - shotInfo is a DICTIONARY, not an object!

# Get shooting plays from our single game
shooting_plays_single = single_game_df[single_game_df['shootingPlay'] == True]
print(f"Shooting plays in this game: {len(shooting_plays_single)}")

if len(shooting_plays_single) > 0:
    # Get the first shooting play with shotInfo
    sample_shot = shooting_plays_single[shooting_plays_single['shotInfo'].notna()].iloc[0]
    
    print(f"\n{'='*60}")
    print("SAMPLE SHOOTING PLAY")
    print(f"{'='*60}")
    print(f"Play Type: {sample_shot['playType']}")
    print(f"Play Text: {sample_shot['playText']}")
    print(f"Team: {sample_shot['team']}")
    
    print(f"\n{'='*60}")
    print("shotInfo STRUCTURE (this is the key!)")
    print(f"{'='*60}")
    print(f"Type: {type(sample_shot['shotInfo'])}")
    print(f"\nFull shotInfo content:")
    pprint(sample_shot['shotInfo'])
    
    # Show all available keys
    if isinstance(sample_shot['shotInfo'], dict):
        print(f"\nAvailable keys in shotInfo:")
        for key in sample_shot['shotInfo'].keys():
            print(f"  - {key}: {type(sample_shot['shotInfo'][key])}")
else:
    print("No shooting plays found in this game")

In [None]:
# Step 5c: CORRECT parsing - shotInfo is a DICTIONARY, use .get() not hasattr()!
# Process all shooting plays for our single game
import re

def extract_shot_data(row):
    """Extract all shot data from a single play row - handles dictionary structure correctly."""
    shot_info = row.get('shotInfo')
    play_text = row.get('playText') or ''
    
    # Start with basic play info
    shot_data = {
        'play_id': row.get('id'),
        'game_id': row.get('gameId'),
        'period': row.get('period'),
        'clock': row.get('clock'),
        'seconds_remaining': row.get('secondsRemaining'),
        'team': row.get('team'),
        'team_id': row.get('teamId'),
        'opponent': row.get('opponent'),
        'is_home_team': row.get('isHomeTeam'),
        'play_type': row.get('playType'),
        'play_text': play_text,
        'score_value': row.get('scoreValue'),
        'home_score': row.get('homeScore'),
        'away_score': row.get('awayScore'),
    }
    
    # Parse from play text
    # Distance: "24-foot" -> 24
    distance_match = re.search(r'(\d+)-foot', play_text)
    shot_data['distance'] = int(distance_match.group(1)) if distance_match else None
    
    # Three pointer: "three point" (with space)
    shot_data['is_three_from_text'] = 'three point' in play_text.lower()
    
    # Made/missed from text
    shot_data['made_from_text'] = ' makes ' in play_text.lower() or ' made ' in play_text.lower()
    
    # Extract from shotInfo dictionary (THIS IS THE KEY FIX!)
    if shot_info and isinstance(shot_info, dict):
        # Basic shot info
        shot_data['made'] = shot_info.get('made')
        shot_data['assisted'] = shot_info.get('assisted')
        shot_data['three_pointer'] = shot_info.get('threePointer')
        
        # Shooter info (nested dict)
        shooter = shot_info.get('shooter')
        if shooter and isinstance(shooter, dict):
            shot_data['shooter_name'] = shooter.get('name')
            shot_data['shooter_id'] = shooter.get('id')
        
        # Location info (nested dict)
        location = shot_info.get('location')
        if location and isinstance(location, dict):
            shot_data['x'] = location.get('x')
            shot_data['y'] = location.get('y')
    
    return shot_data

# Process our single game
print(f"Processing game: {sample_game_id}")
print(f"Shooting plays to process: {len(shooting_plays_single)}")

# Extract shot data for each shooting play
single_game_shots = []
for idx, row in shooting_plays_single.iterrows():
    shot_data = extract_shot_data(row.to_dict())
    single_game_shots.append(shot_data)

# Convert to DataFrame
single_game_shots_df = pd.DataFrame(single_game_shots)

print(f"\n{'='*60}")
print("EXTRACTION RESULTS")
print(f"{'='*60}")
print(f"Total shots extracted: {len(single_game_shots_df)}")
print(f"\nColumns extracted ({len(single_game_shots_df.columns)}):")
print(single_game_shots_df.columns.tolist())

# Check what we got
print(f"\n{'='*60}")
print("DATA QUALITY CHECK")
print(f"{'='*60}")
for col in ['made', 'distance', 'x', 'y', 'shooter_name', 'three_pointer', 'is_three_from_text']:
    if col in single_game_shots_df.columns:
        non_null = single_game_shots_df[col].notna().sum()
        if col == 'is_three_from_text':
            # Boolean - count True values
            true_count = single_game_shots_df[col].sum()
            print(f"  {col}: {true_count} three-pointers found")
        else:
            print(f"  {col}: {non_null}/{len(single_game_shots_df)} ({non_null/len(single_game_shots_df)*100:.1f}%)")

In [None]:
# Step 5e: Get ACTUAL roster data using get_game_players API
# This gives us starters and all players for each team

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    
    try:
        # Get players for our single game
        game_players = games_api.get_game_players(game_id=sample_game_id)
        
        # Convert to DataFrame
        game_players_df = pd.DataFrame([p.to_dict() for p in game_players])
        
        print(f"Players in game {sample_game_id}: {len(game_players_df)}")
        print(f"\nColumns available:")
        for col in game_players_df.columns:
            print(f"  - {col}")
        
    except ApiException as e:
        print(f"Exception: {e}")
        game_players_df = pd.DataFrame()

# Show sample data
if len(game_players_df) > 0:
    print(f"\n{'='*60}")
    print("SAMPLE PLAYER DATA")
    print(f"{'='*60}")
    print(game_players_df.head(10).to_string())

In [None]:
# Step 5f: Extract starters from game_players_df
# Check if there's a 'starter' column or similar

if len(game_players_df) > 0:
    print("CHECKING FOR STARTER INFO")
    print("="*60)
    
    # Look for starter-related columns
    starter_cols = [col for col in game_players_df.columns if 'start' in col.lower()]
    print(f"Starter-related columns: {starter_cols}")
    
    # Get unique teams
    if 'team' in game_players_df.columns:
        teams = game_players_df['team'].unique()
        print(f"\nTeams: {teams}")
        
        for team in teams:
            team_players = game_players_df[game_players_df['team'] == team]
            print(f"\n{team}:")
            print(f"  Total players: {len(team_players)}")
            
            # Check for starter column
            if 'starter' in game_players_df.columns:
                starters = team_players[team_players['starter'] == True]
                print(f"  Starters: {len(starters)}")
                if 'name' in starters.columns:
                    for idx, p in starters.iterrows():
                        print(f"    - {p['name']}")
            elif 'isStarter' in game_players_df.columns:
                starters = team_players[team_players['isStarter'] == True]
                print(f"  Starters: {len(starters)}")
else:
    print("No game_players_df data")

In [None]:
# Step 5f: Build lineup tracker
# Parse substitutions to track who's on court

import re

def parse_substitution(play_text):
    """Parse 'Player In for Player Out' from substitution text."""
    # Common patterns: "John Smith enters for Mike Jones"
    #                  "John Smith in for Mike Jones"
    pattern = r'(.+?)\s+(?:enters|in)\s+for\s+(.+?)\.?$'
    match = re.search(pattern, play_text, re.IGNORECASE)
    if match:
        return {'in': match.group(1).strip(), 'out': match.group(2).strip()}
    return None

# Test on sample subs
print("Testing substitution parsing:")
print("="*60)
for idx, sub in subs.head(5).iterrows():
    result = parse_substitution(sub['playText'])
    print(f"Text: {sub['playText']}")
    print(f"Parsed: {result}")
    print()

In [None]:
# Step 5g: Track full lineups throughout the game
# Build a mapping of play -> players on court for each team

def get_starters_from_early_plays(game_df, team_name, num_plays=20):
    """Infer starters by looking at first players involved in plays."""
    team_plays = game_df[game_df['team'] == team_name].head(num_plays)
    players = set()
    
    for idx, play in team_plays.iterrows():
        participants = play.get('participants', [])
        if isinstance(participants, list):
            for p in participants:
                if isinstance(p, dict) and 'name' in p:
                    players.add(p['name'])
        
        # Also check shotInfo for shooter
        shot_info = play.get('shotInfo')
        if shot_info and isinstance(shot_info, dict):
            shooter = shot_info.get('shooter')
            if shooter and isinstance(shooter, dict):
                players.add(shooter.get('name'))
    
    return players

def track_lineups(game_df):
    """Track which 5 players are on court for each team throughout the game."""
    # Get teams
    teams = game_df[game_df['team'].notna()]['team'].unique()
    print(f"Teams: {teams}")
    
    # Initialize lineups with players we see early (best guess at starters)
    lineups = {}
    for team in teams:
        lineups[team] = get_starters_from_early_plays(game_df, team)
        print(f"\n{team} initial players spotted: {len(lineups[team])}")
        for p in list(lineups[team])[:5]:
            print(f"  - {p}")
    
    # Process game chronologically and track subs
    lineup_log = []
    
    for idx, play in game_df.iterrows():
        team = play.get('team')
        play_type = play.get('playType')
        
        # Process substitution
        if play_type == 'Substitution' and team:
            sub_info = parse_substitution(play.get('playText', ''))
            if sub_info:
                # Remove player going out, add player coming in
                if sub_info['out'] in lineups.get(team, set()):
                    lineups[team].discard(sub_info['out'])
                lineups[team].add(sub_info['in'])
        
        # Log current lineup for this play
        lineup_log.append({
            'play_id': play.get('id'),
            'period': play.get('period'),
            'clock': play.get('clock'),
            'home_lineup': list(lineups.get(teams[0], set())) if len(teams) > 0 else [],
            'away_lineup': list(lineups.get(teams[1], set())) if len(teams) > 1 else [],
        })
    
    return pd.DataFrame(lineup_log), lineups

# Run lineup tracking
lineup_df, final_lineups = track_lineups(single_game_df)

print(f"\n{'='*60}")
print("LINEUP TRACKING RESULTS")
print(f"{'='*60}")
print(f"Total plays tracked: {len(lineup_df)}")

# Check lineup sizes
print(f"\nLineup size check (should be ~5 per team):")
for col in ['home_lineup', 'away_lineup']:
    sizes = lineup_df[col].apply(len)
    print(f"  {col}: min={sizes.min()}, max={sizes.max()}, avg={sizes.mean():.1f}")

In [13]:
# Step 5: Extract shot locations from shooting plays
if len(plays_df) > 0:
    # Filter for shooting plays only
    shooting_plays = plays_df[plays_df['shootingPlay'] == True].copy()
    
    print(f"\nFound {len(shooting_plays)} shooting plays out of {len(plays_df)} total plays")
    
    if len(shooting_plays) > 0:
        # Extract x, y coordinates from shotInfo
        shot_locations = []
        
        for idx, row in shooting_plays.iterrows():
            if row['shotInfo'] is not None:
                shot_info = row['shotInfo']
                
                # Extract location data if available
                location_data = {
                    'game_id': row['gameId'],
                    'period': row['period'],
                    'clock': row['clock'],
                    'team': row['team'],
                    'opponent': row['opponent'],
                    'play_type': row['playType'],
                    'play_text': row['playText'],
                    'score_value': row['scoreValue'],
                    'home_score': row['homeScore'],
                    'away_score': row['awayScore']
                }
                
                # Add shot-specific info if available
                if hasattr(shot_info, 'made'):
                    location_data['made'] = shot_info.made
                if hasattr(shot_info, 'shotType'):
                    location_data['shot_type'] = shot_info.shotType
                if hasattr(shot_info, 'distance'):
                    location_data['distance'] = shot_info.distance
                    
                # Add coordinates if available
                if hasattr(shot_info, 'location') and shot_info.location is not None:
                    if hasattr(shot_info.location, 'x'):
                        location_data['x'] = shot_info.location.x
                    if hasattr(shot_info.location, 'y'):
                        location_data['y'] = shot_info.location.y
                
                shot_locations.append(location_data)
        
        shot_locations_df = pd.DataFrame(shot_locations)
        
        print(f"\nExtracted {len(shot_locations_df)} shots with detailed data")
        print(f"\nShot locations columns: {shot_locations_df.columns.tolist()}")
        print(f"\nSample shot data:")
        print(shot_locations_df.head(10))
        
        # Show some statistics
        if 'made' in shot_locations_df.columns:
            made_count = shot_locations_df['made'].sum()
            total_shots = len(shot_locations_df)
            print(f"\nShooting statistics:")
            print(f"  Total shots: {total_shots}")
            print(f"  Made: {made_count}")
            print(f"  Field Goal %: {(made_count/total_shots*100):.1f}%")
        
        if 'x' in shot_locations_df.columns and 'y' in shot_locations_df.columns:
            coords_available = shot_locations_df[['x', 'y']].notna().all(axis=1).sum()
            print(f"  Shots with coordinates: {coords_available} ({coords_available/total_shots*100:.1f}%)")
    else:
        print("No shooting plays found in the data")
        shot_locations_df = pd.DataFrame()
else:
    print("No play-by-play data available")
    shot_locations_df = pd.DataFrame()


Found 3746 shooting plays out of 11191 total plays

Extracted 3746 shots with detailed data

Shot locations columns: ['game_id', 'period', 'clock', 'team', 'opponent', 'play_type', 'play_text', 'score_value', 'home_score', 'away_score']

Sample shot data:
   game_id  period  clock                    team                opponent  \
0   210412       1  19:16  Long Island University                 Georgia   
1   210412       1  19:11  Long Island University                 Georgia   
2   210412       1  19:07                 Georgia  Long Island University   
3   210412       1  18:56  Long Island University                 Georgia   
4   210412       1  18:38  Long Island University                 Georgia   
5   210412       1  18:18                 Georgia  Long Island University   
6   210412       1  18:01  Long Island University                 Georgia   
7   210412       1  18:01  Long Island University                 Georgia   
8   210412       1  18:01  Long Island University 

## 6. Save Data to CSV

Export your collected play-by-play data for further analysis or backup

In [19]:
# Step 6: Save play-by-play data to CSV
output_dir = 'cbbd_data'

# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Create a timestamp for the filename
date_str = search_date.strftime('%Y%m%d')

# Save games data
if 'games_df' in locals() and len(games_df) > 0:
    filename = f'{output_dir}/games_{date_str}_{season}.csv'
    games_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(games_df)} games to {filename}")

# Save all plays data
if 'plays_df' in locals() and len(plays_df) > 0:
    filename = f'{output_dir}/plays_{date_str}_{season}.csv'
    plays_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(plays_df):,} plays to {filename}")

# Save shot locations data
if 'shot_locations_df' in locals() and len(shot_locations_df) > 0:
    filename = f'{output_dir}/shot_locations_{date_str}_{season}.csv'
    shot_locations_df.to_csv(filename, index=False)
    print(f"✓ Saved {len(shot_locations_df)} shot locations to {filename}")

print(f"\n{'='*50}")
print(f"All data saved to {output_dir}/")
print(f"{'='*50}")

Created directory: cbbd_data
✓ Saved 23 games to cbbd_data/games_20251230_2026.csv
✓ Saved 11,191 plays to cbbd_data/plays_20251230_2026.csv
✓ Saved 3746 shot locations to cbbd_data/shot_locations_20251230_2026.csv

All data saved to cbbd_data/


## 7. Data Quality Check and Summary Statistics

Review your collected data to ensure quality and completeness

In [20]:
# Summary statistics for collected data
if 'team_plays_df' in locals() and len(team_plays_df) > 0:
    print("=" * 50)
    print("DATA SUMMARY")
    print("=" * 50)
    print(f"\nTeam: {team_name}")
    print(f"Season: {season-1}/{season}")
    print(f"Total Plays: {len(team_plays_df):,}")
    print(f"Total Games: {team_plays_df['game_id'].nunique()}")
    
    print("\nPlay Type Distribution:")
    print(team_plays_df['type'].value_counts())
    
    # Check for missing data
    print("\nMissing Data Check:")
    print(team_plays_df.isnull().sum())
    
    # Period distribution
    if 'period' in team_plays_df.columns:
        print("\nPlays by Period:")
        print(team_plays_df['period'].value_counts().sort_index())
    
    print("\n" + "=" * 50)

# Appendix

---

The following sections provide additional functionality and examples that you may find useful but are not required for the basic workflow.

## Appendix A: Get Play Types Reference

Understanding what play types are available in the data

In [26]:
# Get all available play types
with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    try:
        play_types = plays_api.get_play_types()
        
        # Convert to DataFrame
        play_types_df = pd.DataFrame([pt.to_dict() for pt in play_types])
        
        print("Available Play Types:")
        print(f"Total play types available: {len(play_types_df)}")
        
    except ApiException as e:
        print(f"Exception when calling PlaysApi->get_play_types: {e}")
        play_types_df = pd.DataFrame()

# SQL-style: SELECT playType, COUNT(*) FROM plays_df GROUP BY playType
if 'plays_df' in locals() and len(plays_df) > 0:
    play_type_counts = plays_df.groupby('playType').size().reset_index(name='count')
    play_type_counts = play_type_counts.sort_values('count', ascending=False)

# Display in data editor - just call the variable with no print()
play_type_counts

Available Play Types:
Total play types available: 34


Unnamed: 0,playType,count
19,Substitution,3320
8,JumpShot,1649
4,Defensive Rebound,1150
12,MadeFreeThrow,964
10,LayUpShot,962
15,PersonalFoul,805
11,Lost Ball Turnover,581
13,Offensive Rebound,509
18,Steal,324
14,OfficialTVTimeOut,183


## Appendix B: Batch Collection for Multiple Teams

Automate data collection for multiple teams at once

In [None]:
# Define teams to collect data for
teams = ['Duke', 'North Carolina', 'Kansas', 'Kentucky', 'Gonzaga']

# Dictionary to store all team data
all_team_data = {}

with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    
    for team in teams:
        try:
            print(f"\nCollecting data for {team}...")
            
            # Get all plays for the team
            team_plays = plays_api.get_plays_by_team(
                season=season,
                team=team
            )
            
            # Convert to DataFrame
            df = pd.DataFrame([p.to_dict() for p in team_plays])
            all_team_data[team] = df
            
            print(f"  Collected {len(df)} plays for {team}")
            
            # Optional: add a small delay to avoid rate limiting
            import time
            time.sleep(0.5)
            
        except ApiException as e:
            print(f"  Error collecting data for {team}: {e}")

print(f"\nTotal teams collected: {len(all_team_data)}")

## Appendix C: Calculate Team Shooting Efficiency

Example analysis: Calculate shooting statistics from play-by-play data

In [None]:
# Calculate shooting statistics from play-by-play data
if 'shooting_df' in locals() and len(shooting_df) > 0:
    # Filter for made/missed shots
    shot_results = shooting_df[shooting_df['type'].str.contains('shot|three|jumper|layup|dunk', case=False, na=False)]
    
    if len(shot_results) > 0:
        # Calculate shooting percentages
        total_shots = len(shot_results)
        made_shots = len(shot_results[shot_results.get('made', False) == True])
        
        fg_pct = (made_shots / total_shots * 100) if total_shots > 0 else 0
        
        print(f"\nShooting Statistics for {team_name}:")
        print(f"  Total Shot Attempts: {total_shots}")
        print(f"  Made Shots: {made_shots}")
        print(f"  Field Goal %: {fg_pct:.2f}%")
        
        # Three-point shooting
        three_pt_shots = shot_results[shot_results['type'].str.contains('three', case=False, na=False)]
        if len(three_pt_shots) > 0:
            three_made = len(three_pt_shots[three_pt_shots.get('made', False) == True])
            three_pct = (three_made / len(three_pt_shots) * 100)
            print(f"\n  Three-Point Attempts: {len(three_pt_shots)}")
            print(f"  Three-Pointers Made: {three_made}")
            print(f"  Three-Point %: {three_pct:.2f}%")

## Appendix D: Tips and Best Practices

### Rate Limiting
- Be mindful of API rate limits when making bulk requests
- Add small delays between requests when collecting data for multiple teams
- Use the `_with_http_info` method variant to check remaining API calls:

```python
response = plays_api.get_plays_by_team_with_http_info(season=season, team=team)
remaining_calls = response.headers.get('X-CallLimit-Remaining')
```

### Data Storage
- Save data incrementally to avoid losing progress
- Use parquet format for large datasets (more efficient than CSV)
- Include timestamps in filenames for version control

### Season Timing
- NCAA basketball season typically runs November through April
- Use `season=2026` for the 2025/26 academic year
- Check game schedules before attempting to collect data

### Error Handling
- Always wrap API calls in try-except blocks
- Log errors for debugging
- Implement retry logic for transient failures

---

## Next Steps

Now that you have play-by-play data, you can:
1. Create shot charts using matplotlib/seaborn
2. Calculate advanced metrics (offensive rating, pace, etc.)
3. Analyze lineup performance
4. Build predictive models
5. Track player development over the season

For shot chart creation, check out the CBBD blog post: https://blog.collegefootballdata.com/talking-tech-generating-shot-charts-using-the-basketball-api/

---

**End of Notebook**