BattleBots AI Fight Predictor

Scrape BattleBots data with Bright Data, store it in SQLite, index Reddit fan sentiment for retrieval with a basic RAG pipeline, and generate AI scouting reports for any bot matchup, all through a FastAPI backend and a React UI.

What it does

Scrape bot profiles, fight history, and Reddit posts/comments via Bright Data Web Unlocker
Store everything in a local SQLite database
Embed Reddit text into vector chunks for retrieval at prediction time
Predict winners with an LLM (OpenAI or Anthropic), citing numbered evidence facts
Explore the data and read reports in the web app

Project status

Area	Status
Bright Data scrapers (bots, matches, sentiment)	Done
Reddit deep scrape (pagination, comments, multi-subreddit)	Done
RAG pipeline (embed + retrieve Reddit chunks)	Done
LLM prediction engine with evidence catalog	Done
FastAPI REST API	Done
React + Vite frontend	Done

Layout

backend/
├── .env.example
├── config.py                 # Pydantic settings
├── main.py                   # FastAPI app
├── predictor.py              # Matchup prediction orchestration
├── ai/
│   ├── llm_client.py         # OpenAI / Anthropic scouting reports
│   ├── embeddings.py         # OpenAI embedding calls
│   ├── rag.py                # Chunk indexing + similarity retrieval
│   ├── index_sentiment.py    # Re-embed existing sentiment without re-scraping
│   ├── evidence.py           # Numbered fact catalog for LLM citations
│   └── prompts.py
├── api/                      # REST routes (bots, predict, explorer, logs, stats)
├── db/
│   ├── schema.py             # bots, matches, sentiment, sentiment_chunks, predictions
│   ├── database.py
│   └── repositories.py
├── scrapers/
│   ├── brightdata_client.py
│   ├── scrape_bots.py
│   ├── scrape_matches.py
│   ├── scrape_sentiment.py   # Reddit posts + comments + RAG indexing
│   ├── run_all_scrapers.py
│   └── parsers/
└── tests/

frontend/
├── src/
│   ├── pages/                # Home, predictions list, prediction detail
│   ├── components/           # Bot picker, scouting report, data explorer, …
│   └── lib/api.ts            # Calls backend via /api proxy in dev
└── vite.config.ts

Quick start

1. Install dependencies

uv is recommended. From the repo root:

uv sync --extra dev
cd frontend && npm install && cd ..

2. Configure secrets

cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env   # optional in local dev

Edit backend/.env:

Required for scrapers	Variable
Yes	`BRIGHTDATA_API_TOKEN`
Yes	`BRIGHTDATA_WEB_UNLOCKER_ZONE`

Required for predictions	Variable
One of	`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`
Set	`LLM_PROVIDER=openai` or `anthropic`

Required for RAG	Variable
Yes	`OPENAI_API_KEY` (uses `text-embedding-3-small`)
Optional	`RAG_ENABLED=true` (default)

Scrapers need only the Bright Data variables. Predictions need the API key for the selected LLM provider. RAG needs an OpenAI key for embeddings, even when LLM_PROVIDER=anthropic; without it, predictions still work and fall back to a slice of stored posts.
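
The rules above can be summarized as a small check. This is a minimal sketch, not the project's config.py: the Capabilities class and detect_capabilities helper are hypothetical names, and the accepted spellings of a true RAG_ENABLED value are an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Capabilities:
    scrapers: bool      # Bright Data token and zone are both set
    predictions: bool   # the key for the selected LLM provider is set
    rag: bool           # RAG is enabled and an OpenAI key is available


def _is_true(value: str) -> bool:
    # Assumption: these spellings count as "enabled".
    return value.strip().lower() in {"1", "true", "yes", "on"}


def detect_capabilities(env: Mapping[str, str]) -> Capabilities:
    """Hypothetical helper: report which features an environment can run."""
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    key_for_provider = {
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
    }.get(provider)

    scrapers = bool(env.get("BRIGHTDATA_API_TOKEN")) and bool(
        env.get("BRIGHTDATA_WEB_UNLOCKER_ZONE")
    )
    predictions = key_for_provider is not None and bool(env.get(key_for_provider))
    # Embeddings always use OpenAI, regardless of the chat provider.
    rag = _is_true(env.get("RAG_ENABLED", "true")) and bool(env.get("OPENAI_API_KEY"))
    return Capabilities(scrapers=scrapers, predictions=predictions, rag=rag)
```

Calling detect_capabilities(os.environ) after loading backend/.env would show which stages are runnable before starting a scrape.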

3. Scrape data

# All three stages: bots → matches → sentiment (+ RAG indexing)
uv run python -m backend.scrapers.run_all_scrapers

# Sentiment only (Reddit + embed for RAG)
uv run python -m backend.scrapers.scrape_sentiment

# Re-embed existing sentiment without re-scraping
uv run python -m backend.ai.index_sentiment

Quick smoke test:

uv run python -m backend.scrapers.run_all_scrapers --limit-bots 3

Database path: backend/data/battlebots.db

4. Run the app

Terminal 1 — API:

uv run uvicorn backend.main:app --reload --host 127.0.0.1 --port 8000

Terminal 2 — frontend:

cd frontend && npm run dev

Open http://localhost:5173/ — Vite proxies /api to the backend, so no CORS setup is needed in local dev.

API docs: http://127.0.0.1:8000/docs

API overview

Method	Path	Purpose
GET	`/health`	Health check
GET	`/stats`	Row counts + active LLM provider/model
GET	`/bots`	Bot roster
GET	`/bots/{id}`	Bot profile, matches, sentiment
POST	`/predict`	Generate scouting report for two bots
GET	`/predictions`	Recent predictions
GET	`/predictions/{id}`	Full prediction detail
GET	`/explorer/*`	Paginated bots, matches, sentiment, predictions
GET	`/logs`	Live scraper/server log stream

Bright Data setup

Scrapers call Bright Data's Web Unlocker API:

POST https://api.brightdata.com/request
Authorization: Bearer <BRIGHTDATA_API_TOKEN>
Body: { "zone": "<ZONE>", "url": "...", "format": "raw", "country": "us" }

Variable	Where to find it
`BRIGHTDATA_API_TOKEN`	Dashboard → Account Settings → API Tokens
`BRIGHTDATA_WEB_UNLOCKER_ZONE`	Proxies & Scraping Infrastructure → Web Unlocker → zone name

Optional tuning: BRIGHTDATA_COUNTRY, BRIGHTDATA_TIMEOUT_SECONDS, SCRAPER_REQUEST_DELAY_SECONDS
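
For reference, the request shape above could be built in Python as follows. This is a sketch using only the standard library, not the code in brightdata_client.py: the function names, the Content-Type header, and the 60-second default timeout are assumptions.

```python
import json
import urllib.request

BRIGHTDATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlocker_request(token, zone, url, country="us"):
    """Build (but do not send) a Web Unlocker request for one target URL."""
    body = {"zone": zone, "url": url, "format": "raw", "country": country}
    return urllib.request.Request(
        BRIGHTDATA_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the JSON body is declared explicitly.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch(request, timeout_seconds=60.0):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call makes the payload easy to unit-test without a Bright Data token.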

Environment reference

Scraper & Reddit

Variable	Default	Notes
`SENTIMENT_MAX_QUOTES`	`50`	Max posts + comments per bot
`REDDIT_SEARCH_LIMIT`	`100`	Posts per Reddit API page
`REDDIT_SEARCH_PAGES`	`2`	Pagination depth per search
`REDDIT_COMMENT_POSTS`	`12`	Top threads to fetch comments from
`REDDIT_MAX_COMMENTS_PER_POST`	`30`	Comments per thread
`REDDIT_SUBREDDITS`	`Battlebots,robotwars`	Comma-separated

LLM & RAG

Variable	Default	Notes
`LLM_PROVIDER`	`openai`	`openai` or `anthropic`
`OPENAI_MODEL`	`gpt-5.4`	Chat model for reports
`ANTHROPIC_MODEL`	`claude-sonnet-4-20250514`	Chat model for reports
`RAG_ENABLED`	`true`	Set `false` to skip retrieval
`EMBEDDING_MODEL`	`text-embedding-3-small`	Used when indexing/retrieving
`RAG_TOP_K_PER_BOT`	`8`	Chunks retrieved per bot at predict time

See backend/.env.example for the full list.

How prediction works

Pick two bots
    → Load profiles, match history, sentiment from SQLite
    → RAG: embed matchup query, retrieve top-K Reddit chunks per bot
    → Build numbered evidence catalog (facts must be cited as [F001], …)
    → LLM writes scouting report + winner + confidence
    → Cache result in predictions table
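
The evidence catalog step can be illustrated with a short sketch. It is not the code in evidence.py: the Fact class, the helper names, and the rendered line format are hypothetical, chosen only to show how numbered facts make citations checkable.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str      # e.g. "F001"
    text: str         # the claim shown to the LLM
    source_url: str   # where the claim came from


def build_catalog(items):
    """Number (text, source_url) pairs as F001, F002, ... in input order."""
    return [Fact(f"F{i:03d}", text, url) for i, (text, url) in enumerate(items, start=1)]


def render_catalog(catalog):
    """Format the catalog as one prompt line per fact."""
    return "\n".join(f"[{f.fact_id}] {f.text} (source: {f.source_url})" for f in catalog)


CITATION = re.compile(r"\[(F\d{3,})\]")


def unknown_citations(report, catalog):
    """Return cited fact IDs that do not exist in the catalog."""
    known = {f.fact_id for f in catalog}
    return sorted({c for c in CITATION.findall(report) if c not in known})
```

A report that cites an ID missing from the catalog can then be rejected or regenerated before it is cached.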

Sentiment scraping also indexes chunks into sentiment_chunks automatically. To rebuild embeddings from existing rows without calling Bright Data again, run uv run python -m backend.ai.index_sentiment.

Running tests

The test suite needs no network access or API keys:

uv run pytest

Inspecting the database

sqlite3 backend/data/battlebots.db

sqlite> SELECT name, weapon_type FROM bots LIMIT 10;
sqlite> SELECT COUNT(*) FROM matches;
sqlite> SELECT COUNT(*) FROM sentiment;
sqlite> SELECT COUNT(*) FROM sentiment_chunks;
sqlite> SELECT COUNT(*) FROM predictions;
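
The same row counts can be collected from Python with the standard sqlite3 module. This is a minimal sketch: count_rows is a hypothetical helper, and the table names are the ones listed for db/schema.py.

```python
import sqlite3

TABLES = ("bots", "matches", "sentiment", "sentiment_chunks", "predictions")


def count_rows(conn, tables=TABLES):
    """Return {table_name: row_count} for each table."""
    counts = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return counts
```

For example, count_rows(sqlite3.connect("backend/data/battlebots.db")) gives a quick sanity check after a scrape.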

Publishing to GitHub

Never commit secrets:

File	Commit?
`backend/.env.example`	Yes
`backend/.env`	No
`frontend/.env.example`	Yes
`frontend/.env`	No

git status   # .env files must not be staged

Rotate Bright Data and LLM keys if they were ever committed or shared.

Design notes

Parsers are pure — HTML/JSON parsing lives under scrapers/parsers/ with no I/O
Scrapers are idempotent — INSERT … ON CONFLICT … DO UPDATE refreshes rows in place
Failures are isolated — one bot timing out does not stop the full run
Evidence is auditable — every LLM claim maps to a numbered fact with a source URL
RAG is optional at runtime — if embeddings are missing, predictions fall back to stored post samples
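
The idempotent-scraper note can be sketched with SQLite's upsert syntax. This is an illustration rather than the code in repositories.py: the simplified bots table, the UNIQUE constraint on name, and the upsert_bot helper are assumptions. The syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def upsert_bot(conn, name, weapon_type):
    """Insert a bot, or refresh its weapon_type if the name already exists."""
    conn.execute(
        """
        INSERT INTO bots (name, weapon_type)
        VALUES (?, ?)
        ON CONFLICT(name) DO UPDATE SET weapon_type = excluded.weapon_type
        """,
        (name, weapon_type),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the conflict target must be a UNIQUE or PRIMARY KEY column.
    conn.execute(
        "CREATE TABLE bots (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, weapon_type TEXT)"
    )
    upsert_bot(conn, "Example Bot", "vertical spinner")
    upsert_bot(conn, "Example Bot", "horizontal spinner")  # second scrape, same bot
    return conn.execute("SELECT id, name, weapon_type FROM bots").fetchall()


if __name__ == "__main__":
    print(demo())  # [(1, 'Example Bot', 'horizontal spinner')]
```

Running the scraper twice leaves one row per bot with the latest values, and the row keeps its original id, so foreign keys from other tables stay valid.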

Known limitations

battlebots.com is built on Webflow, so DOM changes can reduce how many bots the listing scrape finds (a seed roster serves as a fallback)
Fandom tournament tables vary by season; some rows are logged as unparseable summary tables
Reddit occasionally returns non-JSON through Bright Data on comment fetches — those requests are skipped, not fatal
X (Twitter) scraping is implemented but disabled by default (pass --sources reddit x to opt in)
RAG uses in-memory cosine similarity over SQLite-stored vectors — fine at this scale, not built for millions of chunks
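
To make the last limitation concrete, here is a sketch of in-memory cosine retrieval over vectors stored in SQLite. It is not the code in rag.py: the column names (bot_id, text, embedding), the JSON encoding of vectors, and the top_k_chunks helper are assumptions.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(conn, bot_id, query_vec, k=8):
    """Load every chunk for one bot, score it in Python, and keep the best k."""
    rows = conn.execute(
        "SELECT text, embedding FROM sentiment_chunks WHERE bot_id = ?", (bot_id,)
    ).fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Every query reads and scores all of a bot's chunks, so cost grows linearly with the number of stored chunks. That is why this approach suits a roster-sized dataset but would need a vector index at much larger scale.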
