spider-rs/gottem
gottem — universal scraper that always gets the data


What is gottem

gottem is one CLI and one Rust library that talks to every major scraping vendor — and your local browser — through a single tiered ladder. You give it a URL; it tries the cheapest way first, escalates when blocked, races vendors when speed matters, and stops when it gets clean content.

Each "way to fetch" is called a route. Routes are described in TOML and matched to one of a small set of adapters (plain HTTP, JSON API, streaming JSONL, headless Chrome over CDP, CAPTCHA solver). Adding a new vendor is a TOML row — no code change, no release.

Powered by spider.


Install

cd crates/gottem-cli
cargo install --path .

This installs the gottem binary on your PATH.


Try it in 30 seconds

# Inspect what's available — no API keys needed yet.
gottem routes list
gottem routes show spider.cloud.smart

# Tell gottem which vendor keys you have.
export FIRECRAWL_API_KEY=fc-...
export SPIDER_CLOUD_API_KEY=sk-...

# Fetch a URL. gottem starts cheap, escalates if the cheap routes fail.
gottem fetch https://example.com --show-meta

# Race three routes in parallel — fastest valid response wins.
gottem fetch https://example.com --mode race --routes firecrawl.scrape,spider.cloud.http,zenrows.basic

# Hedge: start cheap, fire a backup at the next tier after a delay.
gottem fetch https://example.com --mode hedge --hedge-delay-ms 2000

# Probe every tier on a target URL — useful for picking a baseline.
gottem probe https://hard-to-scrape.test

The tier ladder

Lower tier = cheaper and faster. Higher tier = handles tougher anti-bot defenses. gottem walks the ladder cheapest-first by default and stops at the first route that returns valid content.

| Tier | Typical cost per fetch | What's at this level |
|------|------------------------|----------------------|
| T0 | free | direct local HTTP (you bring the URL, we send a GET) |
| T1–T3 | varies | local HTTP through a proxy, or local headless Chrome |
| T4 | $0.001 | basic cloud HTTP (Firecrawl, Spider Cloud HTTP, ScrapingBee, ZenRows) |
| T5 | $0.005 | cloud HTTP with JS render |
| T6 | $0.0075 | cloud HTTP + residential proxy |
| T7 | $0.008–0.010 | smart unblockers: auto-fallback inside the vendor (Spider Smart, Zyte, Brightdata Unblocker) |
| T8 | $0.010–0.015 | browser-as-a-service over CDP (Brightdata Scraping Browser, Browserless, Spider Browser Cloud) |
| T9 | $0.02+ | last resort: multi-step actors, premium scraping APIs, CAPTCHA solvers |

You can pin the tier band you want with --tier-min / --tier-max, or hard-cap cost per fetch with --budget-mc.
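
To make the selection rule concrete, here is a minimal sketch of how a cheapest-first ladder could order its candidates. `RouteInfo` and `ladder_order` are hypothetical names for illustration, not gottem's real types, and the budget is modelled as a simple per-route cost cap, which is an assumption about how `--budget-mc` behaves.

```rust
/// Hypothetical, simplified view of a route (not gottem's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct RouteInfo {
    pub id: &'static str,
    pub tier: u8,
    /// Cost per fetch, in the same units as `--budget-mc`.
    pub cost: u32,
}

/// Returns route ids in the order a cheapest-first ladder would try them:
/// keep routes inside the tier band whose cost fits the budget,
/// then sort by tier and break ties by cost.
pub fn ladder_order(
    routes: &[RouteInfo],
    tier_min: u8,
    tier_max: u8,
    budget: u32,
) -> Vec<&'static str> {
    let mut eligible: Vec<&RouteInfo> = routes
        .iter()
        .filter(|r| r.tier >= tier_min && r.tier <= tier_max && r.cost <= budget)
        .collect();
    eligible.sort_by_key(|r| (r.tier, r.cost));
    eligible.into_iter().map(|r| r.id).collect()
}
```

With a band of T4 to T7 and a budget of 100, a catalog containing a T0, a T4 (cost 10), a T7 (cost 80), and a T8 route would yield only the T4 and T7 routes, in that order.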


Built-in vendors

20 routes across 11 services. To enable a vendor, set its env var.

| Vendor | Routes (count) | Env var |
|--------|----------------|---------|
| Spider Cloud | 4 | SPIDER_CLOUD_API_KEY |
| Firecrawl | 2 | FIRECRAWL_API_KEY |
| ZenRows | 3 | ZENROWS_API_KEY |
| ScrapingBee | 3 | SCRAPINGBEE_API_KEY |
| Brightdata Web Unlocker | 1 | BRIGHTDATA_TOKEN |
| Zyte API | 1 | ZYTE_API_KEY |
| Brightdata Scraping Browser | 1 | BRIGHTDATA_BROWSER |
| Browserless | 1 | BROWSERLESS_TOKEN |
| Spider Browser Cloud | 1 | SPIDER_CLOUD_API_KEY (shared) |
| Apify | 1 | APIFY_API_TOKEN |
| Oxylabs Web Scraper | 1 | OXYLABS_USER + OXYLABS_PASS |
| 2Captcha solver | 1 | 2CAPTCHA_API_KEY (¹) |

Don't see your vendor? Drop a TOML file in crates/gottem-routes-builtin/routes/ and you're done. See "Routes are config, not code" below.

¹ 2CAPTCHA_API_KEY starts with a digit, so POSIX shells (bash, zsh) refuse export 2CAPTCHA_API_KEY=.... Use a .env loader, prefix the binary with env 2CAPTCHA_API_KEY=..., or inject through your CI's secret store. Rust reads it via std::env::var regardless of how it got set.


The three modes

--mode ladder (default)

Try cheapest first. If the response fails validation (too short, WAF challenge, 5xx), escalate one tier and try again. Stop at the first valid response, the budget ceiling, or --max-retries.

Best for: most batch jobs. Cost-optimal.
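
The validation step that decides whether to escalate can be pictured with a small sketch. This is an illustration under assumptions, not gottem's validator: `Verdict`, `validate`, and the two challenge-page marker strings are made up for the example.

```rust
/// Outcome of checking one response (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Valid,
    TooShort,
    ServerError,
    WafChallenge,
}

/// Classifies a response the way the ladder description suggests:
/// 5xx, a WAF challenge page, or a too-short body all count as failures.
pub fn validate(status: u16, body: &str, min_bytes: usize) -> Verdict {
    if (500..=599).contains(&status) {
        return Verdict::ServerError;
    }
    // Assumed markers; real challenge pages vary by WAF vendor.
    const MARKERS: [&str; 2] = ["just a moment...", "attention required"];
    let lower = body.to_ascii_lowercase();
    if MARKERS.iter().any(|m| lower.contains(m)) {
        return Verdict::WafChallenge;
    }
    if body.len() < min_bytes {
        return Verdict::TooShort;
    }
    Verdict::Valid
}
```

Any verdict other than `Valid` would send the ladder to the next tier, subject to the budget ceiling and `--max-retries`.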

--mode race

Fire all selected routes in parallel. First valid response wins; the rest are cancelled mid-flight.

Best for: latency-critical fetches when budget allows duplicate cost.
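
As a rough picture of the race pattern, the sketch below uses only the standard library: one thread per route, a channel for results, and a shared flag that tells the losers to stop. `race` and `fake_fetch` are hypothetical helpers; gottem itself is async and its real cancellation mechanism may differ.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs every job on its own thread and returns the first `Some` result.
/// Each job receives a shared cancel flag it can poll to stop early.
pub fn race<T, F>(jobs: Vec<F>) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce(Arc<AtomicBool>) -> Option<T> + Send + 'static,
{
    let cancel = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    for job in jobs {
        let tx = tx.clone();
        let cancel = Arc::clone(&cancel);
        thread::spawn(move || {
            // A send error only means the race is already over.
            let _ = tx.send(job(cancel));
        });
    }
    // Drop the original sender so `rx` ends once every worker has reported.
    drop(tx);
    // Skip `None` (invalid) results; take the first valid one.
    let winner = rx.iter().flatten().next();
    cancel.store(true, Ordering::SeqCst);
    winner
}

/// Stand-in for a vendor call: sleeps in 5 ms steps and checks for cancellation.
pub fn fake_fetch(
    cancel: &AtomicBool,
    latency_ms: u64,
    body: Option<&'static str>,
) -> Option<&'static str> {
    for _ in 0..latency_ms / 5 {
        if cancel.load(Ordering::SeqCst) {
            return None;
        }
        thread::sleep(Duration::from_millis(5));
    }
    body
}
```

Note the trade-off this makes visible: every job starts immediately, so every vendor in the race may bill for its attempt even though only one result is used.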

--mode hedge

Fire route 0 at t=0. If it hasn't returned by t = --hedge-delay-ms, fire route 1. If neither has returned by 2× that delay, fire route 2, and so on. The first valid response wins. The delay shrinks adaptively when latency variance is high, so slow tails are hedged more aggressively.

Best for: high-throughput pipelines where most fetches are cheap but the long tail kills you.
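
The staggered launch schedule can be written down in a few lines. This sketch covers only the fixed schedule (route `i` fires at `i` times the delay); the adaptive shrinking is not modelled because the source does not specify it. `hedge_offsets` and `routes_started` are hypothetical names.

```rust
use std::time::Duration;

/// Launch offsets for a hedged fetch: route `i` fires at `i * delay`.
pub fn hedge_offsets(delay: Duration, routes: u32) -> Vec<Duration> {
    (0..routes).map(|i| delay * i).collect()
}

/// How many routes have been fired after `elapsed`, assuming none has answered yet.
pub fn routes_started(elapsed: Duration, delay: Duration, routes: u32) -> u32 {
    if routes == 0 {
        return 0;
    }
    if delay.is_zero() {
        // With no delay, hedging degenerates into a race.
        return routes;
    }
    let fired = elapsed.as_nanos() / delay.as_nanos() + 1;
    fired.min(u128::from(routes)) as u32
}
```

With `--hedge-delay-ms 2000` and three routes, the offsets are 0 s, 2 s, and 4 s, so a fetch that answers within two seconds only ever pays for the first route.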


CAPTCHA chains

gottem ships a 2Captcha adapter at T9 that you compose into your pipeline when a vendor returns a challenge page:

  1. Run the primary fetch through the ladder.
  2. Detect a CAPTCHA in the response (your code or a validator).
  3. Call the captcha.2captcha route, passing siteKey + captchaType in req.extra.
  4. Receive a solved token as content.
  5. Replay the original URL with the token embedded (cookie / form field / header — depends on the captcha).

The solver handles 2Captcha's two-step submit-then-poll protocol internally — you just call it once. Supports reCAPTCHA v2, hCaptcha, and Cloudflare Turnstile.
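
The five steps above can be sketched as one function. Everything here is illustrative: `find_site_key` and `fetch_with_captcha` are hypothetical helpers, the `fetch` and `solve` closures stand in for the ladder and the captcha.2captcha route, and detecting a challenge by looking for a `data-sitekey` attribute is an assumption about the page markup.

```rust
/// Extracts the value of the first `data-sitekey="..."` attribute, if any.
pub fn find_site_key(html: &str) -> Option<&str> {
    let marker = "data-sitekey=\"";
    let start = html.find(marker)? + marker.len();
    let end = html[start..].find('"')? + start;
    Some(&html[start..end])
}

/// Fetch, detect a CAPTCHA, solve it, and replay with the token.
/// `fetch(url, token)` and `solve(site_key, captcha_type)` are supplied by the caller.
pub fn fetch_with_captcha<F, S>(url: &str, fetch: F, solve: S, captcha_type: &str) -> String
where
    F: Fn(&str, Option<&str>) -> String,
    S: Fn(&str, &str) -> String,
{
    // Step 1: primary fetch.
    let first = fetch(url, None);
    // Step 2: detect a challenge in the response.
    let site_key = find_site_key(&first).map(str::to_owned);
    match site_key {
        // No challenge found: the first response is the content.
        None => first,
        Some(key) => {
            // Steps 3 and 4: ask the solver for a token.
            let token = solve(&key, captcha_type);
            // Step 5: replay the original URL with the token attached.
            fetch(url, Some(&token))
        }
    }
}
```

How the token is attached in step 5 (cookie, form field, or header) depends on the CAPTCHA, so the sketch leaves that to the `fetch` closure.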


Routes are config, not code

Every vendor in gottem is one TOML row. Here's the entire Firecrawl route:

[[route]]
id          = "firecrawl.scrape"
adapter     = "http_json"
endpoint    = "https://api.firecrawl.dev/v1/scrape"
method      = "POST"
tier        = 4
cost        = 10
timeout_ms  = 30000

[route.auth]
kind = "bearer"
env  = "FIRECRAWL_API_KEY"

[route.body]
kind     = "json"
template = '''{"url":"{{url}}","formats":["markdown"]}'''

[route.parse]
kind = "json_path"
path = "$.data.markdown"

[[route.validate]]
kind = "min_bytes"
n    = 500

Adding ZenRows-style query-string auth is the same pattern with {{env:NAME}} in the endpoint URL. There are five adapters that cover essentially every scraping API in the wild:

  • direct_http — plain GET/POST
  • http_json — POST JSON, parse JSON (Firecrawl, Zyte, Brightdata, Apify, Oxylabs)
  • http_jsonl_stream — POST JSON, parse streaming JSONL (Spider Cloud)
  • chrome_cdp — WebSocket CDP (Brightdata Scraping Browser, Browserless)
  • captcha_2captcha — submit + poll (2Captcha)

You can also point gottem at your own --config routes.toml to layer custom routes on top of the built-ins.
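
To show what the `{{url}}` and `{{env:NAME}}` placeholders imply, here is a minimal sketch of a template renderer. It is not gottem's implementation: `render` is a hypothetical function, the environment is passed in as a lookup closure so the example stays deterministic, and no JSON or percent escaping is applied to the substituted values.

```rust
/// Replaces `{{url}}` and `{{env:NAME}}` placeholders in a route template.
/// `lookup` resolves env var names; a missing variable is reported as an error.
pub fn render<F>(template: &str, url: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(open) = rest.find("{{") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..open]);
        let after = &rest[open + 2..];
        let close = after
            .find("}}")
            .ok_or_else(|| "unclosed placeholder".to_string())?;
        let name = after[..close].trim();
        if name == "url" {
            out.push_str(url);
        } else if let Some(var) = name.strip_prefix("env:") {
            let value = lookup(var).ok_or_else(|| format!("env var {} is not set", var))?;
            out.push_str(&value);
        } else {
            return Err(format!("unknown placeholder {}", name));
        }
        rest = &after[close + 2..];
    }
    out.push_str(rest);
    Ok(out)
}
```

In a real program the lookup would typically be `|name| std::env::var(name).ok()`, which also works for names such as 2CAPTCHA_API_KEY that a shell refuses to export.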


Modes recap

gottem fetch URL                                  # ladder, default
gottem fetch URL --mode race --routes a,b,c       # race A B C in parallel
gottem fetch URL --mode hedge --hedge-count 2     # primary + 2 staggered backups
gottem fetch URL --budget-mc 100                  # cap at $0.01 per fetch
gottem fetch URL --tier-min 4 --tier-max 7        # skip local; cap below T8
gottem fetch URL --require-js                     # only routes that render JS
gottem fetch URL --format json                    # structured output with metadata

Using it as a library

use std::sync::Arc;
use gottem_core::{Budget, CancelToken, LadderStrategy, Orchestrator,
                  RouteCatalogBuilder, ScrapeRequest, Tier, AdapterRegistry, Capabilities};
use url::Url;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
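    // Build the route catalog from the built-in vendor TOML.
    let catalog = Arc::new(
        gottem_routes_builtin::register_all(RouteCatalogBuilder::new())?.build()
    );

    // Register the adapters that routes can be matched to.
    let mut registry = AdapterRegistry::new();
    gottem_adapters_http::register_all(&mut registry, None);
    registry.register(gottem_adapters_spider::SpiderAdapter::arc());

    // The orchestrator ties together the catalog, the adapters, and a spend ceiling.
    let orch = Arc::new(Orchestrator::new(
        catalog.clone(),
        Arc::new(registry),
        Arc::new(Budget::new(1_000)),  // $0.10 ceiling
    ));

    // Ladder strategy spanning the full tier range, T0 through T9.
    let strategy = Arc::new(LadderStrategy::new(
        catalog.clone(), Tier::T0, Tier::T9, Capabilities::default(), 5,
    ));

    // Fetch the URL; the CancelToken lets the caller abort the request.
    let resp = orch.fetch_cheap(
        ScrapeRequest::get(Url::parse("https://example.com")?),
        strategy,
        CancelToken::new(),
    ).await?;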

    println!("{}", resp.content.unwrap_or_default());
    Ok(())
}

Inspecting the catalog

gottem routes list           # tabular view of every loaded route
gottem routes show <id>      # full detail for one route
gottem routes validate       # check that every route's env var is set

routes validate exits with status 0 when every env var is present and with status 2, listing the missing variables, otherwise. That makes it handy in CI.
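
The check behind that exit code is easy to picture. The sketch below is an assumption about the logic, not the CLI's source: `missing_vars` and `exit_code` are hypothetical names, and the environment is again a lookup closure so the behaviour can be tested without touching the process environment.

```rust
/// Returns the names in `required` that `lookup` cannot resolve, in order.
pub fn missing_vars<F>(required: &[&str], lookup: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    required
        .iter()
        .copied()
        .filter(|name| lookup(*name).is_none())
        .map(String::from)
        .collect()
}

/// Maps the result to the documented exit status: 0 if nothing is missing, else 2.
pub fn exit_code(missing: &[String]) -> i32 {
    if missing.is_empty() {
        0
    } else {
        2
    }
}
```

A binary built on this would pass `|name| std::env::var(name).ok()` as the lookup, print the missing names, and call `std::process::exit` with the computed code.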


What's inside

gottem/
├── assets/                          logo, dark-mode logo, icon
└── crates/
    ├── gottem-core                  traits, types, orchestrator, retry strategies
    ├── gottem-adapters-http         direct_http · http_json · http_jsonl_stream
    ├── gottem-adapters-spider       T0–T3 local fetching via spider::Website
    ├── gottem-adapters-chrome       T8 CDP via spider::chromiumoxide
    ├── gottem-adapters-captcha      T9 2Captcha solver chain primitive
    ├── gottem-routes-builtin        embedded vendor TOML, feature-gated per vendor
    └── gottem-cli                   `gottem` binary — fetch · probe · routes

Every adapter and every vendor is behind a Cargo feature, so you can build a CLI with only the routes you actually need.


License

Apache-2.0 OR MIT, your choice.
