techwithtim/BattleBotsApp

BattleBots AI Fight Predictor

Scrape BattleBots data with Bright Data, store it in SQLite, enrich Reddit fan sentiment with a basic RAG pipeline, and generate AI scouting reports for any bot matchup — all through a FastAPI backend and React UI.

What it does

  1. Scrape bot profiles, fight history, and Reddit posts/comments via Bright Data Web Unlocker
  2. Store everything in a local SQLite database
  3. Embed Reddit text into vector chunks for retrieval at prediction time
  4. Predict winners with an LLM (OpenAI or Anthropic), citing numbered evidence facts
  5. Explore the data and read reports in the web app

Project status

Area                                                         Status
Bright Data scrapers (bots, matches, sentiment)              Done
Reddit deep scrape (pagination, comments, multi-subreddit)   Done
RAG pipeline (embed + retrieve Reddit chunks)                Done
LLM prediction engine with evidence catalog                  Done
FastAPI REST API                                             Done
React + Vite frontend                                        Done

Layout

backend/
├── .env.example
├── config.py                 # Pydantic settings
├── main.py                   # FastAPI app
├── predictor.py              # Matchup prediction orchestration
├── ai/
│   ├── llm_client.py         # OpenAI / Anthropic scouting reports
│   ├── embeddings.py         # OpenAI embedding calls
│   ├── rag.py                # Chunk indexing + similarity retrieval
│   ├── index_sentiment.py    # Re-embed existing sentiment without re-scraping
│   ├── evidence.py           # Numbered fact catalog for LLM citations
│   └── prompts.py
├── api/                      # REST routes (bots, predict, explorer, logs, stats)
├── db/
│   ├── schema.py             # bots, matches, sentiment, sentiment_chunks, predictions
│   ├── database.py
│   └── repositories.py
├── scrapers/
│   ├── brightdata_client.py
│   ├── scrape_bots.py
│   ├── scrape_matches.py
│   ├── scrape_sentiment.py   # Reddit posts + comments + RAG indexing
│   ├── run_all_scrapers.py
│   └── parsers/
└── tests/

frontend/
├── src/
│   ├── pages/                # Home, predictions list, prediction detail
│   ├── components/           # Bot picker, scouting report, data explorer, …
│   └── lib/api.ts            # Calls backend via /api proxy in dev
└── vite.config.ts

Quick start

1. Install dependencies

uv is recommended. From the repo root:

uv sync --extra dev
cd frontend && npm install && cd ..

2. Configure secrets

cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env   # optional in local dev

Edit backend/.env:

Required for scrapers      Variable
Yes                        BRIGHTDATA_API_TOKEN
Yes                        BRIGHTDATA_WEB_UNLOCKER_ZONE

Required for predictions   Variable
One of                     OPENAI_API_KEY or ANTHROPIC_API_KEY
Set                        LLM_PROVIDER=openai or anthropic

Required for RAG           Variable
Yes                        OPENAI_API_KEY (uses text-embedding-3-small)
Optional                   RAG_ENABLED=true (default)

You can run scrapers with Bright Data only. Predictions need an LLM key. RAG needs an OpenAI key for embeddings (predictions still work without RAG, using a fallback slice of stored posts).
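
The rules above can be sketched as a small check that reports which stages a given environment can run. This is an illustration only, not code from the repository: the function name `enabled_features` is hypothetical, and it assumes the API key must match the provider selected by LLM_PROVIDER.

```python
import os
from typing import Mapping


def enabled_features(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which pipeline stages the given environment can run.

    ASSUMPTION: this helper is hypothetical and only mirrors the rules in the
    README; the project's real settings live in backend/config.py.
    """

    def has(name: str) -> bool:
        return bool(env.get(name, "").strip())

    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    # ASSUMPTION: the key that matters is the one for the selected provider.
    llm_key = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}.get(provider)
    rag_on = env.get("RAG_ENABLED", "true").strip().lower() == "true"

    return {
        "scrapers": has("BRIGHTDATA_API_TOKEN") and has("BRIGHTDATA_WEB_UNLOCKER_ZONE"),
        "predictions": llm_key is not None and has(llm_key),
        "rag": rag_on and has("OPENAI_API_KEY"),
    }
```

Calling `enabled_features()` with no arguments reads the live process environment; passing a plain dict makes the check easy to test.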

3. Scrape data

# All three stages: bots → matches → sentiment (+ RAG indexing)
uv run python -m backend.scrapers.run_all_scrapers

# Sentiment only (Reddit + embed for RAG)
uv run python -m backend.scrapers.scrape_sentiment

# Re-embed existing sentiment without re-scraping
uv run python -m backend.ai.index_sentiment

Quick smoke test:

uv run python -m backend.scrapers.run_all_scrapers --limit-bots 3

Database path: backend/data/battlebots.db

4. Run the app

Terminal 1 — API:

uv run uvicorn backend.main:app --reload --host 127.0.0.1 --port 8000

Terminal 2 — frontend:

cd frontend && npm run dev

Open http://localhost:5173/ — Vite proxies /api to the backend, so no CORS setup is needed in local dev.

API docs: http://127.0.0.1:8000/docs


API overview

Method   Path                Purpose
GET      /health             Health check
GET      /stats              Row counts + active LLM provider/model
GET      /bots               Bot roster
GET      /bots/{id}          Bot profile, matches, sentiment
POST     /predict            Generate scouting report for two bots
GET      /predictions        Recent predictions
GET      /predictions/{id}   Full prediction detail
GET      /explorer/*         Paginated bots, matches, sentiment, predictions
GET      /logs               Live scraper/server log stream
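
As a rough illustration of calling POST /predict from Python, the sketch below builds the request with the standard library. The README does not document the request body, so the field names `bot_a_id` and `bot_b_id` are assumptions; check http://127.0.0.1:8000/docs for the actual schema before relying on them.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # backend address from the quick start


def build_predict_request(bot_a_id: int, bot_b_id: int) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request."""
    # ASSUMPTION: "bot_a_id" and "bot_b_id" are hypothetical field names.
    payload = json.dumps({"bot_a_id": bot_a_id, "bot_b_id": bot_b_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(bot_a_id: int, bot_b_id: int, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response (needs the API running)."""
    request = build_predict_request(bot_a_id, bot_b_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The generous default timeout is a guess to allow for the LLM call behind the endpoint; tune it to whatever latency you observe.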

Bright Data setup

Scrapers call Bright Data's Web Unlocker API:

POST https://api.brightdata.com/request
Authorization: Bearer <BRIGHTDATA_API_TOKEN>
Body: { "zone": "<ZONE>", "url": "...", "format": "raw", "country": "us" }

Variable                       Where to find it
BRIGHTDATA_API_TOKEN           Dashboard → Account Settings → API Tokens
BRIGHTDATA_WEB_UNLOCKER_ZONE   Proxies & Scraping Infrastructure → Web Unlocker → zone name

Optional tuning: BRIGHTDATA_COUNTRY, BRIGHTDATA_TIMEOUT_SECONDS, SCRAPER_REQUEST_DELAY_SECONDS
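
The request shape shown above can be reproduced in Python as follows. This is a minimal sketch and not the contents of scrapers/brightdata_client.py: the function names are hypothetical, and the Content-Type header and the 60-second default timeout are assumptions that the README does not state.

```python
import json
import os
import urllib.request

BRIGHTDATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlocker_request(
    target_url: str, token: str, zone: str, country: str = "us"
) -> urllib.request.Request:
    """Build (but do not send) a Web Unlocker request for target_url."""
    body = {"zone": zone, "url": target_url, "format": "raw", "country": country}
    return urllib.request.Request(
        BRIGHTDATA_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # ASSUMPTION: a JSON content type is sent with the JSON body.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_raw(target_url: str, timeout: float = 60.0) -> str:
    """Fetch a page through Web Unlocker using credentials from the environment."""
    request = build_unlocker_request(
        target_url,
        token=os.environ["BRIGHTDATA_API_TOKEN"],
        zone=os.environ["BRIGHTDATA_WEB_UNLOCKER_ZONE"],
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Splitting request construction from sending keeps the first function free of I/O, so it can be checked without network access or real credentials.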


Environment reference

Scraper & Reddit

Variable                       Default                Notes
SENTIMENT_MAX_QUOTES           50                     Max posts + comments per bot
REDDIT_SEARCH_LIMIT            100                    Posts per Reddit API page
REDDIT_SEARCH_PAGES            2                      Pagination depth per search
REDDIT_COMMENT_POSTS           12                     Top threads to fetch comments from
REDDIT_MAX_COMMENTS_PER_POST   30                     Comments per thread
REDDIT_SUBREDDITS              Battlebots,robotwars   Comma-separated

LLM & RAG

Variable            Default                    Notes
LLM_PROVIDER        openai                     openai or anthropic
OPENAI_MODEL        gpt-5.4                    Chat model for reports
ANTHROPIC_MODEL     claude-sonnet-4-20250514   Chat model for reports
RAG_ENABLED         true                       Set false to skip retrieval
EMBEDDING_MODEL     text-embedding-3-small     Used when indexing/retrieving
RAG_TOP_K_PER_BOT   8                          Chunks retrieved per bot at predict time

See backend/.env.example for the full list.


How prediction works

Pick two bots
    → Load profiles, match history, sentiment from SQLite
    → RAG: embed matchup query, retrieve top-K Reddit chunks per bot
    → Build numbered evidence catalog (facts must be cited as [F001], …)
    → LLM writes scouting report + winner + confidence
    → Cache result in predictions table
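
To make the evidence catalog step concrete, here is one way numbered facts and a citation check could look. It is a sketch under assumptions: the `Fact` fields and all function names are hypothetical and are not taken from backend/ai/evidence.py. Only the [F001] citation format comes from the pipeline above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str      # e.g. "F001"
    text: str         # the claim shown to the LLM
    source_url: str   # where the claim came from


def build_catalog(items: list[tuple[str, str]]) -> list[Fact]:
    """Number (text, source_url) pairs as F001, F002, ... in input order."""
    return [Fact(f"F{i:03d}", text, url) for i, (text, url) in enumerate(items, start=1)]


def render_catalog(facts: list[Fact]) -> str:
    """Render the catalog as prompt text, one fact per line."""
    return "\n".join(f"[{f.fact_id}] {f.text} ({f.source_url})" for f in facts)


CITATION = re.compile(r"\[(F\d{3})\]")


def unknown_citations(report: str, facts: list[Fact]) -> set[str]:
    """Return fact IDs cited in the report that are not in the catalog."""
    known = {f.fact_id for f in facts}
    return set(CITATION.findall(report)) - known
```

A non-empty result from `unknown_citations` would indicate that the report cites a fact the catalog never contained, which is one simple way to audit an LLM response before caching it.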

Sentiment scraping also indexes chunks into sentiment_chunks automatically. Run python -m backend.ai.index_sentiment to rebuild embeddings from existing rows without hitting Bright Data again.


Running tests

No network or API keys required for the test suite:

uv run pytest

Inspecting the database

sqlite3 backend/data/battlebots.db

sqlite> SELECT name, weapon_type FROM bots LIMIT 10;
sqlite> SELECT COUNT(*) FROM matches;
sqlite> SELECT COUNT(*) FROM sentiment;
sqlite> SELECT COUNT(*) FROM sentiment_chunks;
sqlite> SELECT COUNT(*) FROM predictions;
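
The same row counts can be collected from Python with the standard sqlite3 module, which is handy in scripts or notebooks. The table names are the ones listed for db/schema.py; the helper name `row_counts` is hypothetical.

```python
import sqlite3

# Table names as listed for backend/db/schema.py.
TABLES = ("bots", "matches", "sentiment", "sentiment_chunks", "predictions")


def row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of rows in each known table."""
    counts = {}
    for table in TABLES:
        # Table names come from the fixed tuple above, never from user input,
        # so interpolating them into the SQL string is safe here.
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


# Usage, from the repo root after a scrape:
#   conn = sqlite3.connect("backend/data/battlebots.db")
#   print(row_counts(conn))
#   conn.close()
```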

Publishing to GitHub

Never commit secrets:

File                    Commit?
backend/.env.example    Yes
backend/.env            No
frontend/.env.example   Yes
frontend/.env           No

git status   # .env files must not be staged

Rotate Bright Data and LLM keys if they were ever committed or shared.


Design notes

  • Parsers are pure — HTML/JSON parsing lives under scrapers/parsers/ with no I/O
  • Scrapers are idempotent: INSERT … ON CONFLICT … DO UPDATE refreshes rows in place
  • Failures are isolated — one bot timing out does not stop the full run
  • Evidence is auditable — every LLM claim maps to a numbered fact with a source URL
  • RAG is optional at runtime — if embeddings are missing, predictions fall back to stored post samples
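
The idempotent-scraper note can be illustrated with a minimal SQLite upsert. The two-column table here is a simplified stand-in, and using `name` as the conflict key is an assumption; the real schema in db/schema.py may use a different key and more columns. The ON CONFLICT … DO UPDATE syntax requires SQLite 3.24 or newer.

```python
import sqlite3

# ASSUMPTION: simplified stand-in for the real bots table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bots (
    name        TEXT PRIMARY KEY,
    weapon_type TEXT
)
"""

UPSERT = """
INSERT INTO bots (name, weapon_type) VALUES (?, ?)
ON CONFLICT(name) DO UPDATE SET weapon_type = excluded.weapon_type
"""


def upsert_bot(conn: sqlite3.Connection, name: str, weapon_type: str) -> None:
    """Insert a bot, or refresh its row in place if the name already exists."""
    conn.execute(UPSERT, (name, weapon_type))
    conn.commit()
```

Running the scraper twice over the same bot then leaves one row with the latest values instead of a duplicate or a constraint error.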

Known limitations

  1. battlebots.com is Webflow-rendered; DOM changes can reduce listing scrape yield (seed roster fallback exists)
  2. Fandom tournament tables vary by season; some rows log as unparseable summary tables
  3. Reddit occasionally returns non-JSON through Bright Data on comment fetches — those requests are skipped, not fatal
  4. X (Twitter) scraping is implemented but disabled by default (--sources reddit x to opt in)
  5. RAG uses in-memory cosine similarity over SQLite-stored vectors — fine at this scale, not built for millions of chunks
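
For limitation 5, the following sketch shows what in-memory cosine retrieval amounts to and why it scales linearly with the number of chunks. It uses plain Python lists; how the project serializes vectors in sentiment_chunks is not stated in this README, so loading them is left out, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], chunks: list[tuple[str, list[float]]], k: int = 8
) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best as (score, text).

    Every call scans all chunks, which is fine for thousands of rows but is the
    reason this approach is not suited to millions.
    """
    scored = [(cosine(query, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The default `k=8` mirrors the documented RAG_TOP_K_PER_BOT default. At larger scale, a vector index or a NumPy matrix multiply would replace the per-chunk loop.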
