NBA Player Rating Engine

A hybrid player rating system combining Regularized Adjusted Plus-Minus (RAPM) with Elo-style incremental updates to produce dynamic, three-dimensional player ratings (Offense, Defense, Pace) for every NBA player.

The player ratings are the core product. Downstream applications (spread prediction, live in-game trading on Kalshi/Polymarket, lineup optimization) consume ratings as inputs.

Architecture Summary

stats.nba.com ──► pbpstats library ──► Local Cache ──► ETL ──► SQLite
                                                                  │
                                              ┌───────────────────┤
                                              ▼                   ▼
                                         Stint-Level         Possession-Level
                                         RAPM (Ridge)        Elo Updates
                                              │                   │
                                              └───────┬───────────┘
                                                      ▼
                                              current_ratings
                                              (3D per player)
                                                      │
                                              ┌───────┼───────────┐
                                              ▼       ▼           ▼
                                          Spreads   Live Model   Lineup Opt.

Tech Stack

Python 3.11+
pbpstats — play-by-play parsing with lineup attribution (primary data source)
nba_api — supplementary data (schedules, player metadata)
SQLite — all processed data and ratings
scikit-learn — ridge regression
numpy / pandas — matrix construction, data manipulation

Data Source

Primary: pbpstats Python library by dblackrun. Fetches raw play-by-play from stats.nba.com, parses into possessions with full lineup attribution, caches locally.

Fallback: REST API at api.pbpstats.com — pre-computed lineup/stint aggregates via get-lineup-opponent-summary. Sufficient for RAPM but not per-possession Elo.

Scope: 2024-25 season (complete). Add 2025-26 after validation.

Project Breakdown

The work is split into discrete, numbered projects. Complete each one fully before starting the next. Each project has clear inputs, outputs, and validation criteria.

Project 0: Repo Setup

Goal: Scaffold the repo, install dependencies, create the database.

Tasks:

Create the directory structure (see Repo Structure below)
Create requirements.txt with: pbpstats, nba_api, scikit-learn, numpy, pandas
Create config.py with all constants (see Config section)
Create db/schema.sql with the full schema (see Schema section)
Create db/init_db.py — reads schema.sql and creates the SQLite database at the configured path
Create data/ directory for pbpstats cache with .gitkeep
Create .gitignore — ignore *.db, data/, __pycache__/, .env, *.pyc, notebooks/.ipynb_checkpoints/

Validation: python db/init_db.py runs without error and creates the database with all tables.

Output: Empty database with schema applied, all directories in place.
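
As a rough illustration of the db/init_db.py task, the sketch below applies a schema file to a SQLite database in one pass. It assumes schema.sql contains plain SQL statements that sqlite3's executescript can run; the function name init_db and its parameters are illustrative and not taken from the repo.

```python
import os
import sqlite3


def init_db(db_path: str, schema_path: str) -> list[str]:
    """Apply schema.sql to the SQLite file at db_path and return the table names."""
    # Create the parent directory (e.g. db/) if it does not exist yet.
    os.makedirs(os.path.dirname(os.path.abspath(db_path)), exist_ok=True)
    with open(schema_path, encoding="utf-8") as f:
        schema_sql = f.read()
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(schema_sql)
        conn.commit()
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows]
```

Called as init_db(config.DB_PATH, config.SCHEMA_PATH), the returned list can be compared against the expected table names. If the schema uses CREATE TABLE without IF NOT EXISTS, a second run against the same file will raise, which may or may not be the desired behavior.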

Project 1: pbpstats Client

Goal: Build a wrapper around the pbpstats library that fetches and parses a single game into structured possession data with lineup attribution.

Tasks:

Create ingestion/pbpstats_client.py with a function parse_game(game_id: str) -> dict that:
- Configures pbpstats with data_directory pointing to config.DATA_DIR
- Fetches the game's play-by-play via pbpstats
- Returns a dict with:
  - game_id, season, game_date, home_team_id, away_team_id
  - possessions: list of dicts, each containing:
    - period, possession_number (sequential)
    - offense_team_id, defense_team_id
    - points_scored (total points on this possession including FTs)
    - fg2a, fg2m, fg3a, fg3m, turnovers, offensive_rebounds, free_throw_points
    - start_time, end_time, start_type, start_score_differential
    - offense_lineup_id (dash-separated sorted player IDs)
    - defense_lineup_id (dash-separated sorted player IDs)
    - Individual player IDs: off_player_1 through off_player_5, def_player_1 through def_player_5 (sorted)
Create ingestion/game_list.py with a function get_season_game_ids(season: str) -> list[str] that returns all game IDs for a season using nba_api or pbpstats.

Important implementation notes:

pbpstats Possession objects have .offense_team_id, .defense_team_id attributes
Lineup info is on the possession's OffenseLineup and DefenseLineup properties
Player IDs should be stored as strings, sorted ascending, to create deterministic lineup keys
The offense_lineup_id is the dash-separated sorted string of 5 player IDs (e.g., "201566-203507-203954-1629029-1630567")
Consult pbpstats docs/source for exact attribute names — they may vary between versions
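
The lineup-key rule in the notes above can be sketched as a small helper. The name make_lineup_id is hypothetical. The sketch sorts IDs by numeric value because that reproduces the example key shown above; a plain string sort would be just as deterministic but would order the same players differently, so whichever is chosen should be used everywhere.

```python
def make_lineup_id(player_ids) -> str:
    """Build a deterministic lineup key from five player IDs."""
    ids = [str(p) for p in player_ids]
    if len(set(ids)) != 5:
        raise ValueError(f"expected 5 unique players, got {ids}")
    # Assumption: numeric sort, chosen to match the example key in the notes.
    return "-".join(sorted(ids, key=int))
```

For example, the five IDs from the note above produce the same key regardless of the order they arrive in.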

Validation: Run parse_game("0022400001") (or any known 2024-25 game ID) and verify:

Returns possessions with valid lineup data
Total points across possessions matches the actual game score
Each possession has exactly 5 offensive and 5 defensive players
Lineup IDs are deterministic (same game parsed twice → same IDs)

Output: Working single-game parser. No database writes yet.

Project 2: ETL Pipeline

Goal: Transform parsed game data into database rows (possessions + stints) and insert into SQLite.

Tasks:

Create ingestion/etl.py with:
- insert_game(db_path: str, parsed_game: dict) -> None
  - Inserts into games table
  - Inserts each possession into possessions table
  - Aggregates possessions into 10-man stints and inserts into stints table
  - Upserts player metadata into players table
  - Uses transactions — entire game is atomic (commit or rollback)
- aggregate_stints(possessions: list[dict]) -> list[dict]
  - Groups possessions by (offense_lineup_id, defense_lineup_id)
  - For each group: sum possessions count, points_scored, compute seconds_played from time data
  - Compute offensive_rating = points_scored / possessions * 100
  - Returns list of stint dicts ready for DB insertion
Handle idempotency: if a game already exists in the DB, skip it (check games table)
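
The grouping step described for aggregate_stints could look roughly like the sketch below. It assumes start_time and end_time are numeric seconds remaining in the period, so elapsed time is start minus end; if pbpstats returns clock strings instead, they would need converting first.

```python
from collections import defaultdict


def aggregate_stints(possessions: list[dict]) -> list[dict]:
    """Group possessions by (offense_lineup_id, defense_lineup_id) into stints."""
    totals = defaultdict(
        lambda: {"possessions": 0, "points_scored": 0, "seconds_played": 0.0}
    )
    for p in possessions:
        t = totals[(p["offense_lineup_id"], p["defense_lineup_id"])]
        t["possessions"] += 1
        t["points_scored"] += p["points_scored"]
        # Assumption: times are seconds remaining in the period.
        t["seconds_played"] += p["start_time"] - p["end_time"]

    stints = []
    for (off_id, def_id), t in totals.items():
        stints.append({
            "offense_lineup_id": off_id,
            "defense_lineup_id": def_id,
            **t,
            # Points per 100 possessions for this matchup.
            "offensive_rating": t["points_scored"] / t["possessions"] * 100,
        })
    return stints
```

Because each stint is keyed by which lineup is on offense, the same ten players appear in two stints per matchup, one for each direction.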

Validation: Parse and ETL a single game. Then query:

-- Total points should match actual game score
SELECT offense_team_id, SUM(points_scored) FROM possessions WHERE game_id = ? GROUP BY offense_team_id;

-- Stints should cover all possessions
SELECT SUM(possessions) FROM stints WHERE game_id = ?;
-- Should equal total possessions in the game

-- Every stint should have exactly 10 unique players
-- (verify programmatically)
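
One way to do the programmatic check on stint players is sketched below. It assumes the lineup keys are the dash-separated strings described in Project 1; the function name is illustrative.

```python
def stint_has_ten_unique_players(offense_lineup_id: str, defense_lineup_id: str) -> bool:
    """True when the two dash-separated lineup keys hold 10 distinct player IDs."""
    offense = offense_lineup_id.split("-")
    defense = defense_lineup_id.split("-")
    return (
        len(offense) == 5
        and len(defense) == 5
        and len(set(offense + defense)) == 10
    )
```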

Output: Single-game ETL working end-to-end. Parse → transform → SQLite.

Project 3: Backfill Script

Goal: Backfill the entire 2024-25 season into the database.

Tasks:

Create ingestion/backfill.py that:
- Gets all game IDs for the 2024-25 season
- For each game (in chronological order):
  - Skip if already in games table
  - Parse via pbpstats_client.parse_game()
  - ETL via etl.insert_game()
  - Sleep 2-3 seconds between games (rate limiting for stats.nba.com)
  - Log progress: "[423/1230] Game 0022400423 — LAL vs BOS — 198 possessions, 14 stints"
- Handle errors gracefully: log failures, continue to next game, report summary at end
- Support resume: since ETL is idempotent, re-running picks up where it left off
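
The skip, parse, insert, and sleep loop above might be structured as in this sketch. The parse and insert steps are passed in as callables so the loop can be exercised without network access; in the real script, insert_game would likely be etl.insert_game bound to a database path (for instance with functools.partial). The summary dict layout is an assumption.

```python
import logging
import time
from typing import Callable, Iterable

log = logging.getLogger("backfill")


def backfill(
    game_ids: Iterable[str],
    already_ingested: set[str],
    parse_game: Callable[[str], dict],
    insert_game: Callable[[dict], None],
    sleep_seconds: float = 2.5,
) -> dict:
    """Ingest games in order, skipping known ones and collecting failures."""
    game_ids = list(game_ids)
    summary = {"ingested": 0, "skipped": 0, "failed": []}
    for i, game_id in enumerate(game_ids, start=1):
        if game_id in already_ingested:
            summary["skipped"] += 1
            continue
        try:
            parsed = parse_game(game_id)
            insert_game(parsed)
            summary["ingested"] += 1
            log.info("[%d/%d] Game %s: %d possessions",
                     i, len(game_ids), game_id, len(parsed["possessions"]))
        except Exception:
            # Log and move on so one bad game does not stop the run.
            log.exception("[%d/%d] Game %s failed", i, len(game_ids), game_id)
            summary["failed"].append(game_id)
        # Pause between fetches to stay polite to stats.nba.com.
        time.sleep(sleep_seconds)
    return summary
```

Re-running with the failed IDs (or the full list) resumes naturally, since games already in the database are skipped.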

After backfill completes, compute and insert league averages:

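INSERT INTO league_averages (season, avg_ppp, avg_pace, total_possessions)
SELECT
    season,
    CAST(SUM(points_scored) AS REAL) / SUM(possessions),
    -- Pace per 48 minutes (2880 s), using the same formula as the RAPM pace target.
    -- Assumes stints.seconds_played is populated by the ETL; a per-game pace
    -- stored on the games table would be an alternative source.
    CAST(SUM(possessions) AS REAL) / SUM(seconds_played) * 2880,
    SUM(possessions)
FROM stints
GROUP BY season;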

Validation:

~1,200-1,230 games ingested for 2024-25
Total possessions across season is ~475K-525K (sanity check)
League average PPP is ~1.10-1.15 (sanity check)
No games with 0 possessions or 0 stints
Run SELECT COUNT(DISTINCT offense_lineup_id) FROM stints; — should be several thousand unique lineups

Output: Full 2024-25 season in SQLite. Ready for RAPM.

Project 4: RAPM Model (Full-Season)

Goal: Fit ridge regression on stint-level data to produce offense, defense, and pace ratings for every player.

Tasks:

Create models/rapm.py with:
- build_design_matrix(db_path: str, season: str) -> tuple[scipy.sparse.csr_matrix, np.ndarray, np.ndarray, list[str]]
  - Query all stints for the season
  - Build player index: map each unique player_id to a column index
  - For each stint row:
    - Set +1 at columns for the 5 offensive player indices
    - Set -1 at columns for the 5 defensive player indices
  - Use scipy.sparse — the matrix is very sparse (~10 nonzeros per row out of ~500+ columns)
  - Weight each row by sqrt(possessions)
  - Target y_offense: (points_scored / possessions - league_avg_ppp) * sqrt(possessions)
  - Target y_defense: (points_allowed / possessions - league_avg_ppp) * sqrt(possessions)
  - Return: X matrix, y_offense, y_defense, player_id list (column order)
- build_pace_target(db_path: str, season: str) -> np.ndarray
  - Target: (possessions / seconds_played * 2880 - league_avg_pace) * sqrt(possessions)
  - Same design matrix X, different target
- fit_rapm(X, y, alpha=5000) -> np.ndarray
  - sklearn.linear_model.Ridge(alpha=alpha, fit_intercept=False)
  - Returns coefficient array
- run_full_season_rapm(db_path: str, season: str, alpha=5000) -> None
  - Orchestrator: build matrix → fit offense → fit defense → fit pace
  - Insert results into rapm_ratings table
  - Update current_ratings table with phase = 'rapm_full'
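
The matrix construction above can be sketched as follows for the net-RAPM case. It takes stints as plain dicts rather than querying SQLite, and the keys offense_players and defense_players (lists of player IDs) are hypothetical; in practice they could be derived by splitting the lineup IDs. The helper names build_triplets and fit_net_rapm are also illustrative.

```python
import math


def build_triplets(stints: list[dict], league_avg_ppp: float):
    """Return COO triplets, weighted targets, and the column order."""
    player_ids = sorted(
        {p for s in stints for p in s["offense_players"] + s["defense_players"]}
    )
    col = {pid: j for j, pid in enumerate(player_ids)}
    rows, cols, vals, y = [], [], [], []
    for i, s in enumerate(stints):
        w = math.sqrt(s["possessions"])  # row weight
        for pid in s["offense_players"]:
            rows.append(i)
            cols.append(col[pid])
            vals.append(w)   # +1 for offense, scaled by the weight
        for pid in s["defense_players"]:
            rows.append(i)
            cols.append(col[pid])
            vals.append(-w)  # -1 for defense, scaled by the weight
        ppp = s["points_scored"] / s["possessions"]
        y.append((ppp - league_avg_ppp) * w)
    return rows, cols, vals, y, player_ids


def fit_net_rapm(stints: list[dict], league_avg_ppp: float, alpha: float = 5000):
    """Fit ridge on the sparse design matrix; returns {player_id: coefficient}."""
    # Imported here so build_triplets stays usable without SciPy / scikit-learn.
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.linear_model import Ridge

    rows, cols, vals, y, player_ids = build_triplets(stints, league_avg_ppp)
    X = csr_matrix((vals, (rows, cols)), shape=(len(stints), len(player_ids)))
    model = Ridge(alpha=alpha, fit_intercept=False).fit(X, np.asarray(y))
    return dict(zip(player_ids, model.coef_))
```

Building the matrix from (row, column, value) triplets avoids ever materializing the dense array, which matters once a season's worth of stints is loaded.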

Important design notes:

fit_intercept=False because the target is already centered on the league average.
The design matrix above has one column per player, with +1 on offense and -1 on defense, so a single regression on the offensive team's points per possession (minus league average) yields one net-impact coefficient per player. It does not separate ORAPM from DRAPM on its own.
To get separate offense and defense ratings, there are two options:
- Fit two separate regressions: one for offensive possessions (target = points scored - avg), one for defensive possessions (target = points allowed - avg, sign-flipped so lower is better). A positive offense coefficient means good offense; a negative defense coefficient means good defense (allows fewer points).
- Alternative (simpler): fit one regression where the coefficient is net impact, then use on/off splits from the data to decompose.
Start with net RAPM first, decompose later.
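
For reference, the row weighting by sqrt(possessions) in the tasks above corresponds to a possession-weighted ridge objective. The notation here is introduced only for illustration: n_i is the possession count of stint i, r_i is its points per possession minus the league average, x_i is its row of +1/-1 indicators, and alpha is the ridge penalty.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i} n_i \left( r_i - x_i^{\top}\beta \right)^2
    + \alpha \lVert \beta \rVert_2^2
  = \arg\min_{\beta} \sum_{i} \left( \sqrt{n_i}\, r_i - \sqrt{n_i}\, x_i^{\top}\beta \right)^2
    + \alpha \lVert \beta \rVert_2^2
```

The second form is what scaling both the matrix rows and the targets by sqrt(n_i) achieves, so an unweighted Ridge fit on the scaled data gives the weighted solution.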

Validation:

Print top 20 and bottom 20 players by overall rating
Sanity check: stars (Jokic, SGA, Luka) should be near the top; end-of-bench guys near zero (not bottom — ridge shrinks them)
Coefficient distribution should be roughly normal, centered near 0
Correlation with public RAPM sources > 0.7

Output: rapm_ratings and current_ratings populated for all players in 2024-25.

Project 5: Rolling-Window RAPM

Goal: Replace full-season RAPM with a 30-game rolling window, re-fit nightly.

Tasks:

P5.1 — Rolling RAPM + Nightly Job Script

Create ingestion/ingest_daily.py:
- get_games_since(db_path: str, since_date: str) -> list[str]
  - Query games table for the max game_date already ingested
  - Fetch all game IDs from the LeagueGameLog endpoint in nba_api (nba_api.stats.endpoints.leaguegamelog) on or after that date
  - Return only game IDs not yet in the games table
- ingest_new_games(db_path: str) -> int
  - Call get_games_since, parse + ETL each new game, return count of games ingested
Add to models/rapm.py:
- get_player_window(db_path: str, player_id: str, as_of_date: str, window_size=30) -> tuple[date, date]
  - Find the last 30 games this player appeared in, on or before as_of_date
  - Return (window_start_date, window_end_date)
- build_rolling_design_matrix(db_path: str, season: str, as_of_date: str, window_size=30)
  - Compute the union window: earliest window_start_date across all active players
  - Query stints where game_date >= union_start AND game_date <= as_of_date
  - Same matrix construction as full-season
- run_rolling_rapm(db_path: str, season: str, as_of_date: str, alpha=5000)
  - Fit and store rolling RAPM ratings with window metadata
Create pipeline/nightly_job.py:
- Ingest new games via ingest_daily.ingest_new_games()
- Recalculate league averages
- Run rolling RAPM as of today
- Update current_ratings with phase = 'rapm_rolling'
- Log a summary: games ingested, players updated, timestamp
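
The window logic for rolling RAPM can be sketched independently of the database. The version below works on a player's list of game dates rather than querying SQLite, and the function names are illustrative; returning None for a player with no games yet is an assumption about how that case should be handled.

```python
from datetime import date


def player_window(game_dates: list[date], as_of: date, window_size: int = 30):
    """Return (start, end) of the last window_size games on or before as_of."""
    eligible = sorted(d for d in game_dates if d <= as_of)
    if not eligible:
        return None  # player has not appeared yet
    window = eligible[-window_size:]
    return window[0], window[-1]


def union_window_start(windows: dict[str, tuple[date, date]]) -> date:
    """Earliest window start across players, used as the stint query lower bound."""
    return min(start for start, _ in windows.values())
```

Note that the union window is driven by the least active player: someone who has missed many games pulls the start date back for everyone, so the fitted window can span well more than 30 team games.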

P5.2 — Scheduling Infrastructure

Set up Windows Task Scheduler to run nightly_job.py automatically:

Create scripts/run_nightly.bat — a batch file that activates the uv environment and runs nightly_job.py with PYTHONPATH set correctly
Create scripts/install_task.ps1 — a PowerShell script that registers the task in Windows Task Scheduler:
- Trigger: daily at 4:00 AM (after games have finished and data is available)
- Action: run run_nightly.bat
- Working directory: project root
- On failure: retry once after 30 minutes
Document the setup steps in a SETUP.md (or section in README) so it can be re-installed on a new machine
Log output to logs/nightly_YYYY-MM-DD.log — nightly_job.py should write structured log lines, and the batch file should redirect stdout/stderr to a dated log file

Validation:

Rolling ratings should be correlated with but not identical to full-season ratings
Players who had a strong recent stretch should rate higher in rolling vs full-season
Window metadata in rapm_ratings should show correct date ranges
Run scripts/install_task.ps1 and verify the task appears in Task Scheduler; trigger it manually and confirm logs/ gets a log file with expected output

Output: Nightly-updatable rolling RAPM pipeline that runs automatically each morning.

Project 6: Elo Layer

Goal: Add per-possession Elo updates between RAPM re-fits.

Tasks:

Create models/elo.py:
- sigmoid(x: float) -> float — standard logistic function
- elo_update(possession: dict, current_elos: dict, league_avg_ppp: float, K=2.0) -> dict
  - Compute expected outcome from current Elo ratings of 10 players
  - Compute surprise (actual - expected)
  - Return updated Elo deltas for all 10 players
- replay_game_elo(db_path: str, game_id: str, current_elos: dict, league_avg_ppp: float, K=2.0) -> dict
  - Fetch all possessions for game in chronological order
  - Apply Elo updates sequentially
  - Apply game-level pace Elo update after all possessions
  - Store Elo snapshots in elo_ratings table
  - Return updated elos
- reset_elo_to_rapm(db_path: str) -> dict
  - Load latest RAPM ratings
  - Return elo dict with all deltas = 0, bases = RAPM values
Create models/composite.py:
- update_current_ratings(db_path: str) -> None
  - For each player: offense = rapm_base + elo_delta, same for defense, pace
  - Update current_ratings with phase = 'elo'
Integrate into nightly_job.py:
- After RAPM re-fit: reset Elo → replay today's games → update composite ratings

Validation:

Elo deltas should be small relative to RAPM base (< 10% magnitude)
After a blowout win, offensive players' Elo should tick up, defensive opponents' should tick down
Calibration plot: bin possessions by predicted probability, plot actual scoring rate. Should be roughly diagonal.
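
The binning behind the calibration plot could be computed as below before handing the rows to any plotting library. It assumes each possession has a predicted scoring probability in [0, 1] and a boolean for whether any points were scored; equal-width bins and the output field names are choices made for this sketch.

```python
def calibration_table(predicted: list[float], scored: list[bool], n_bins: int = 10):
    """Bin predictions into equal-width buckets and compare with observed rates."""
    # Per bin: [sum of predictions, number of scoring possessions, count]
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, s in zip(predicted, scored):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b][0] += p
        sums[b][1] += int(s)
        sums[b][2] += 1
    return [
        {"bin": b, "mean_predicted": ps / n, "actual_rate": k / n, "count": n}
        for b, (ps, k, n) in enumerate(sums)
        if n > 0
    ]
```

Plotting mean_predicted against actual_rate for each row gives the diagonal check described above; bins with small counts are noisy and are worth flagging rather than over-interpreting.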

Output: Full hybrid rating system: RAPM base + Elo adjustments, updated per-possession.

Config

# config.py
import os

# Paths
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
DB_PATH = os.path.join(PROJECT_ROOT, "db", "nba_ratings.db")
DATA_DIR = os.path.join(PROJECT_ROOT, "data")
SCHEMA_PATH = os.path.join(PROJECT_ROOT, "db", "schema.sql")

# Seasons
INITIAL_SEASON = "2024-25"
SEASON_TYPE = "Regular Season"

# RAPM
RIDGE_ALPHA = 5000
ROLLING_WINDOW_GAMES = 30

# Elo
ELO_K_OFFENSE_DEFENSE = 2.0
ELO_K_PACE = 1.0

# Ingestion
BACKFILL_SLEEP_SECONDS = 2.5

Schema

See db/schema.sql — full schema is in the spec document. Key tables:

Table	Purpose	Phase
`possessions`	Per-possession data with 10-man lineups	P2 (populated), P6 (consumed by Elo)
`stints`	Aggregated 10-man matchup data	P2 (populated), P4 (consumed by RAPM)
`games`	Game metadata, used for idempotency	P2
`players`	Player metadata	P2
`league_averages`	Season-level PPP and pace	P3
`rapm_ratings`	RAPM coefficients with window metadata	P4-5
`elo_ratings`	Per-possession Elo snapshots	P6
`current_ratings`	Best current rating per player	P4+

Repo Structure

nba-rating-engine/
├── README.md
├── CLAUDE.md
├── requirements.txt
├── config.py
├── .gitignore
├── db/
│   ├── schema.sql
│   ├── init_db.py
│   └── nba_ratings.db          (gitignored)
├── data/                        (gitignored — pbpstats cache)
│   └── .gitkeep
├── ingestion/
│   ├── __init__.py
│   ├── pbpstats_client.py
│   ├── game_list.py
│   ├── etl.py
│   ├── backfill.py
│   └── ingest_daily.py
├── models/
│   ├── __init__.py
│   ├── rapm.py
│   ├── elo.py
│   └── composite.py
├── pipeline/
│   ├── __init__.py
│   ├── nightly_job.py
│   └── phase1_full_season.py
├── analysis/
│   ├── validate_ratings.py
│   ├── calibration.py
│   └── notebooks/
│       ├── rapm_exploration.ipynb
│       └── elo_tuning.ipynb
├── downstream/
│   ├── __init__.py
│   ├── spread_model.py
│   ├── live_model.py
│   └── lineup_optimizer.py
└── tests/
    ├── __init__.py
    ├── test_etl.py
    ├── test_rapm.py
    ├── test_elo.py
    └── test_pipeline.py
