Scrape BattleBots data with Bright Data, store it in SQLite, enrich Reddit fan sentiment with a basic RAG pipeline, and generate AI scouting reports for any bot matchup — all through a FastAPI backend and React UI.
- Scrape bot profiles, fight history, and Reddit posts/comments via Bright Data Web Unlocker
- Store everything in a local SQLite database
- Embed Reddit text into vector chunks for retrieval at prediction time
- Predict winners with an LLM (OpenAI or Anthropic), citing numbered evidence facts
- Explore the data and read reports in the web app
| Area | Status |
|---|---|
| Bright Data scrapers (bots, matches, sentiment) | Done |
| Reddit deep scrape (pagination, comments, multi-subreddit) | Done |
| RAG pipeline (embed + retrieve Reddit chunks) | Done |
| LLM prediction engine with evidence catalog | Done |
| FastAPI REST API | Done |
| React + Vite frontend | Done |
backend/
├── .env.example
├── config.py # Pydantic settings
├── main.py # FastAPI app
├── predictor.py # Matchup prediction orchestration
├── ai/
│ ├── llm_client.py # OpenAI / Anthropic scouting reports
│ ├── embeddings.py # OpenAI embedding calls
│ ├── rag.py # Chunk indexing + similarity retrieval
│ ├── index_sentiment.py # Re-embed existing sentiment without re-scraping
│ ├── evidence.py # Numbered fact catalog for LLM citations
│ └── prompts.py
├── api/ # REST routes (bots, predict, explorer, logs, stats)
├── db/
│ ├── schema.py # bots, matches, sentiment, sentiment_chunks, predictions
│ ├── database.py
│ └── repositories.py
├── scrapers/
│ ├── brightdata_client.py
│ ├── scrape_bots.py
│ ├── scrape_matches.py
│ ├── scrape_sentiment.py # Reddit posts + comments + RAG indexing
│ ├── run_all_scrapers.py
│ └── parsers/
└── tests/
frontend/
├── src/
│ ├── pages/ # Home, predictions list, prediction detail
│ ├── components/ # Bot picker, scouting report, data explorer, …
│ └── lib/api.ts # Calls backend via /api proxy in dev
└── vite.config.ts
uv is recommended. From the repo root:
uv sync --extra dev
cd frontend && npm install && cd ..
cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env # optional in local dev
Edit backend/.env:
| Required for scrapers | Variable |
|---|---|
| Yes | BRIGHTDATA_API_TOKEN |
| Yes | BRIGHTDATA_WEB_UNLOCKER_ZONE |
| Required for predictions | Variable |
|---|---|
| One of | OPENAI_API_KEY or ANTHROPIC_API_KEY |
| Set | LLM_PROVIDER=openai or anthropic |
| Required for RAG | Variable |
|---|---|
| Yes | OPENAI_API_KEY (uses text-embedding-3-small) |
| Optional | RAG_ENABLED=true (default) |
You can run scrapers with Bright Data only. Predictions need an LLM key. RAG needs an OpenAI key for embeddings (predictions still work without RAG, using a fallback slice of stored posts).
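The key requirements above can be read as a small feature check. The sketch below is illustrative only: `available_features` is a hypothetical helper, not something in the repo, and it assumes the defaults this README lists (`LLM_PROVIDER=openai`, `RAG_ENABLED=true`).

```python
from typing import Mapping


def available_features(env: Mapping[str, str]) -> dict[str, bool]:
    """Report which pipeline stages a given set of env values can run.

    Hypothetical helper for illustration; the real checks live in the
    project's own settings code and may differ.
    """

    def has(name: str) -> bool:
        return bool(env.get(name, "").strip())

    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    provider_key = {
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
    }.get(provider)
    rag_enabled = env.get("RAG_ENABLED", "true").strip().lower() == "true"

    return {
        "scrapers": has("BRIGHTDATA_API_TOKEN") and has("BRIGHTDATA_WEB_UNLOCKER_ZONE"),
        "predictions": provider_key is not None and has(provider_key),
        # Embeddings need an OpenAI key even when Anthropic writes the reports.
        "rag": rag_enabled and has("OPENAI_API_KEY"),
    }
```

With Bright Data credentials and only an Anthropic key set, this would report scrapers and predictions as available and RAG as unavailable, which corresponds to the fallback path described above.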
# All three stages: bots → matches → sentiment (+ RAG indexing)
uv run python -m backend.scrapers.run_all_scrapers
# Sentiment only (Reddit + embed for RAG)
uv run python -m backend.scrapers.scrape_sentiment
# Re-embed existing sentiment without re-scraping
uv run python -m backend.ai.index_sentiment
Quick smoke test:
uv run python -m backend.scrapers.run_all_scrapers --limit-bots 3
Database path: backend/data/battlebots.db
Terminal 1 (API):
uv run uvicorn backend.main:app --reload --host 127.0.0.1 --port 8000
Terminal 2 (frontend):
cd frontend && npm run dev
Open http://localhost:5173/. Vite proxies /api to the backend, so no CORS setup is needed in local dev.
API docs: http://127.0.0.1:8000/docs
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Health check |
| GET | /stats | Row counts + active LLM provider/model |
| GET | /bots | Bot roster |
| GET | /bots/{id} | Bot profile, matches, sentiment |
| POST | /predict | Generate scouting report for two bots |
| GET | /predictions | Recent predictions |
| GET | /predictions/{id} | Full prediction detail |
| GET | /explorer/* | Paginated bots, matches, sentiment, predictions |
| GET | /logs | Live scraper/server log stream |
Scrapers call Bright Data's Web Unlocker API:
POST https://api.brightdata.com/request
Authorization: Bearer <BRIGHTDATA_API_TOKEN>
Body: { "zone": "<ZONE>", "url": "...", "format": "raw", "country": "us" }
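As a rough illustration of the request shape above, the sketch below builds the same POST with the Python standard library. The function names are hypothetical (the repo's own client lives in `scrapers/brightdata_client.py` and may look different), and the `Content-Type` header is an assumption that the README does not state.

```python
import json
import urllib.request

BRIGHTDATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlocker_request(
    token: str, zone: str, url: str, country: str = "us"
) -> urllib.request.Request:
    """Build (but do not send) a Web Unlocker request for one target URL."""
    payload = {"zone": zone, "url": url, "format": "raw", "country": country}
    return urllib.request.Request(
        BRIGHTDATA_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumed: JSON bodies are normally sent with this header.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_raw(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call means the payload can be checked in tests without an API token.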
| Variable | Where to find it |
|---|---|
| BRIGHTDATA_API_TOKEN | Dashboard → Account Settings → API Tokens |
| BRIGHTDATA_WEB_UNLOCKER_ZONE | Proxies & Scraping Infrastructure → Web Unlocker → zone name |
Optional tuning: BRIGHTDATA_COUNTRY, BRIGHTDATA_TIMEOUT_SECONDS, SCRAPER_REQUEST_DELAY_SECONDS
| Variable | Default | Notes |
|---|---|---|
| SENTIMENT_MAX_QUOTES | 50 | Max posts + comments per bot |
| REDDIT_SEARCH_LIMIT | 100 | Posts per Reddit API page |
| REDDIT_SEARCH_PAGES | 2 | Pagination depth per search |
| REDDIT_COMMENT_POSTS | 12 | Top threads to fetch comments from |
| REDDIT_MAX_COMMENTS_PER_POST | 30 | Comments per thread |
| REDDIT_SUBREDDITS | Battlebots,robotwars | Comma-separated |
| Variable | Default | Notes |
|---|---|---|
| LLM_PROVIDER | openai | openai or anthropic |
| OPENAI_MODEL | gpt-5.4 | Chat model for reports |
| ANTHROPIC_MODEL | claude-sonnet-4-20250514 | Chat model for reports |
| RAG_ENABLED | true | Set false to skip retrieval |
| EMBEDDING_MODEL | text-embedding-3-small | Used when indexing/retrieving |
| RAG_TOP_K_PER_BOT | 8 | Chunks retrieved per bot at predict time |
See backend/.env.example for the full list.
Pick two bots
→ Load profiles, match history, sentiment from SQLite
→ RAG: embed matchup query, retrieve top-K Reddit chunks per bot
→ Build numbered evidence catalog (facts must be cited as [F001], …)
→ LLM writes scouting report + winner + confidence
→ Cache result in predictions table
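The evidence-catalog step in the flow above could look roughly like the following. This is a minimal sketch, not the code in `ai/evidence.py`: the `Fact` class and all three function names are hypothetical, and only the `[F001]` citation format and the fact-to-source-URL mapping come from this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    text: str
    source_url: str


def build_catalog(items: list[tuple[str, str]]) -> list[Fact]:
    """Assign sequential IDs (F001, F002, ...) to (text, source_url) pairs."""
    return [
        Fact(f"F{index:03d}", text, url)
        for index, (text, url) in enumerate(items, start=1)
    ]


def render_catalog(facts: list[Fact]) -> str:
    """Format the catalog as one numbered line per fact for the LLM prompt."""
    return "\n".join(f"[{f.fact_id}] {f.text} ({f.source_url})" for f in facts)


CITATION = re.compile(r"\[(F\d{3})\]")


def unknown_citations(report: str, facts: list[Fact]) -> set[str]:
    """Return cited IDs in the report that are not present in the catalog."""
    known = {f.fact_id for f in facts}
    return set(CITATION.findall(report)) - known
```

Checking the finished report for unknown IDs is one way to keep citations auditable: any ID the model invents shows up in the returned set.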
Sentiment scraping also indexes chunks into sentiment_chunks automatically.
Run python -m backend.ai.index_sentiment to rebuild embeddings from existing
rows without hitting Bright Data again.
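Retrieval over the indexed chunks can be pictured as a plain cosine-similarity scan. The sketch below makes several assumptions: the column names (`bot_id`, `text`, `embedding`) and the JSON encoding of vectors are hypothetical, and only the `sentiment_chunks` table name and the default top-K of 8 come from this README.

```python
import json
import math
import sqlite3


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    conn: sqlite3.Connection, bot_id: int, query_vec: list[float], k: int = 8
) -> list[tuple[float, str]]:
    """Score every stored chunk for one bot in memory and keep the best k."""
    rows = conn.execute(
        # Assumed schema: embeddings stored as JSON arrays in a text column.
        "SELECT text, embedding FROM sentiment_chunks WHERE bot_id = ?",
        (bot_id,),
    ).fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A full scan like this is linear in the number of chunks per bot, which is workable for a few thousand rows.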
No network or API keys required for the test suite:
uv run pytest

sqlite3 backend/data/battlebots.db
sqlite> SELECT name, weapon_type FROM bots LIMIT 10;
sqlite> SELECT COUNT(*) FROM matches;
sqlite> SELECT COUNT(*) FROM sentiment;
sqlite> SELECT COUNT(*) FROM sentiment_chunks;
sqlite> SELECT COUNT(*) FROM predictions;
Never commit secrets:
| File | Commit? |
|---|---|
| backend/.env.example | Yes |
| backend/.env | No |
| frontend/.env.example | Yes |
| frontend/.env | No |
git status # .env files must not be staged
Rotate Bright Data and LLM keys if they were ever committed or shared.
- Parsers are pure: HTML/JSON parsing lives under scrapers/parsers/ with no I/O
- Scrapers are idempotent: INSERT … ON CONFLICT … DO UPDATE refreshes rows in place
- Failures are isolated: one bot timing out does not stop the full run
- Evidence is auditable: every LLM claim maps to a numbered fact with a source URL
- RAG is optional at runtime: if embeddings are missing, predictions fall back to stored post samples
- battlebots.com is Webflow-rendered; DOM changes can reduce listing scrape yield (seed roster fallback exists)
- Fandom tournament tables vary by season; some rows log as unparseable summary tables
- Reddit occasionally returns non-JSON through Bright Data on comment fetches; those requests are skipped, not fatal
- X (Twitter) scraping is implemented but disabled by default (--sources reddit x to opt in)
- RAG uses in-memory cosine similarity over SQLite-stored vectors; fine at this scale, not built for millions of chunks
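The idempotent-scraper note above relies on SQLite's upsert syntax. A minimal sketch follows, assuming a `UNIQUE` constraint on `bots.name` (the real conflict target in `db/schema.py` may differ); `upsert_bot` is a hypothetical helper, and the `name` and `weapon_type` columns are the ones queried earlier in this README.

```python
import sqlite3

# Assumes bots.name is UNIQUE; requires SQLite 3.24 or newer for ON CONFLICT.
UPSERT_BOT = """
INSERT INTO bots (name, weapon_type)
VALUES (?, ?)
ON CONFLICT(name) DO UPDATE SET weapon_type = excluded.weapon_type
"""


def upsert_bot(conn: sqlite3.Connection, name: str, weapon_type: str) -> None:
    """Insert a bot, or refresh its weapon_type if the name already exists."""
    conn.execute(UPSERT_BOT, (name, weapon_type))
    conn.commit()
```

Running the same scrape twice then leaves one row per bot, with the later values winning.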