Self-hosted AI-powered web scraping. Tell it what you want to track in plain English, and it turns any page into structured data, watches for changes, and notifies you by email or Telegram. Bring your own API key from OpenAI, Anthropic, or OpenRouter — SmartScrape never ships with keys or phones home.
- Ask for what you want in plain English. "Track laptop prices and alert me on drops > 10%" becomes a full scrape config with extraction schema, comparison key, and notification rules (see the sketch after this list).
- JS-rendered pages handled. Static HTML with Cheerio when it works, headless Chromium (Playwright) when it doesn't — auto-detected.
- Change detection that's useful. SHA-256 hashes per row, matched across runs by a user-chosen comparison key. You get added / removed / changed items, not a full diff dump.
- Notifications that actually fit. Rule types: any change, new items, removed items, field threshold (`price < 500`), field value change (`stock_status` flipped). Sent via email, Telegram, or both.
- Everywhere your data goes, you control it. Export to Google Sheets (OAuth) or CSV. No third-party analytics, no outbound data.
- Security built in. CORS locked to your frontend, Helmet headers, bcrypt-hashed passwords, AES-256-GCM encrypted API keys + OAuth tokens, JWT with rotating refresh tokens, SSRF guard on user-supplied URLs, prompt-injection hardening on every extraction.
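For a concrete feel, here is a rough sketch of the kind of config the plain-English step might produce for the laptop example above. The field names are illustrative, not the exact shape SmartScrape persists.

```ts
// Hypothetical wizard output for "Track laptop prices and alert me on
// drops > 10%". Field names are illustrative; the persisted shape may differ.
const proposedJob = {
  name: "Laptop price tracker",
  urls: ["https://example.com/laptops"],
  scrape_method: "auto",            // Cheerio first, Playwright fallback
  schedule: "0 * * * *",            // hourly cron preset
  schema: {
    title: "string",
    price: "number",
    url: "string",
    stock_status: "string",
  },
  comparison_key: "url",            // how rows are matched across runs
  notification_rules: [
    // closest documented rule type for a price-drop alert
    { type: "field_change", field: "price", message: "{field_name}: {old} → {new} ({url})" },
  ],
  notify_channels: ["email", "telegram"],
};
```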
Paste a URL, describe the goal, pick a model. SmartScrape fetches the page, cleans the HTML, and proposes an extraction schema + notification rules. Accept, tweak, or start over with manual setup.
Filter by active / paused / failed. Schedules are plain cron or presets (manual, hourly, daily, weekly). Toggle to pause without deleting.
Latest runs at a glance: status, items extracted, tokens burned, duration, source URLs. CSV per run, diff vs previous run, Push to Sheets on demand.
Prompt, schema, scrape method, schedule, AI model, notification rules, and linked Google Sheet — all editable, including a picker that pulls from your Drive.
AI provider keys (never displayed after save), Google Sheets OAuth, Telegram bot setup. Keys are encrypted at rest; the test button round-trips a real auth call so misconfigurations surface immediately.
Prereqs: Docker (or Node 20+ for the dev path), npm 10+.
git clone https://github.com/9ny4/smartscrape.git
cd smartscrape
cp .env.example .env
# Fill in the three required secrets — generation snippets are inside .env.example.
docker compose --profile app up -d --build

Open http://localhost:3000. Migrations run on container start; the API also serves the built SPA from the same origin.
git clone https://github.com/9ny4/smartscrape.git
cd smartscrape
npm install
npx playwright install chromium
cp .env.example .env
# Same three secrets as above.
docker compose up -d # Postgres + Redis only
npm run migrate:up --workspace server
npm run dev              # API on :3000, Vite SPA on :5173

- Frontend: http://localhost:5173
- API: http://localhost:3000
- Health: http://localhost:3000/api/health
Create an account, add at least one AI provider key under Settings → AI Providers, then head to Jobs → New job.
- Google Sheets. Create a Google Cloud OAuth client (web type), enable the Sheets API + Drive API, set `GOOGLE_CLIENT_ID`/`GOOGLE_CLIENT_SECRET`, and add `http://localhost:3000/api/google/callback` as an authorized redirect URI. Connect under Settings → Google Sheets.
- Email. Fill in `SMTP_HOST`/`SMTP_PORT`/`SMTP_USER`/`SMTP_PASS` (or use Ethereal for dev). Without SMTP, emails are written to the server log instead of sent.
- Telegram. Create a bot via @BotFather, set `TELEGRAM_BOT_TOKEN`. Each user then pastes their own `chat_id` under Settings → Telegram.
| Command | What it does |
|---|---|
| `npm run dev` | Server + client with hot reload |
| `npm run build` | Production build, both workspaces |
| `npm run typecheck` | TypeScript check |
| `npm run lint` | ESLint |
| `npm run migrate:up --workspace server` | Apply database migrations |
| `npm run test:e2e --workspace client` | Playwright smoke suite (requires dev server running) |
| `npm run docs:screenshots --workspace client` | Regenerate the README screenshots |
@smartscrape/cli is a headless client for the REST API — built so cron jobs, scripts, and external agents can drive SmartScrape without a browser. Source lives in cli/.
# Build the CLI workspace
npm install
npm run -w @smartscrape/cli build
# Sign in (writes ~/.smartscrape/config.json) — or set SMARTSCRAPE_URL + SMARTSCRAPE_TOKEN
node cli/dist/index.js auth login --url http://localhost:3000 --email you@example.com --password '...'
# Drive everything
node cli/dist/index.js jobs list --json
node cli/dist/index.js jobs run <job-id> --wait --json
node cli/dist/index.js results <job-id> --json
node cli/dist/index.js export <job-id> --csv > out.csv

Every command supports `--json`, `--quiet`, `--server-url`, `--token`, `--api-key`. Exit codes: 0 success, 1 generic, 2 auth, 3 not found, 4 validation. See cli/README.md for the full command list.
Long-running automation should use a personal access token instead of a JWT: `smartscrape auth tokens create --name ci-runner` mints one, and the plaintext is shown only once. Send it on every request via the `SMARTSCRAPE_API_KEY` env var or the `X-API-Key` header; a minimal sketch follows. Revoke any token from Settings or via `smartscrape auth tokens revoke <id>`.
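For illustration, a script hitting the REST API with such a token might look like the sketch below. The `/api/jobs` path is an assumption here; check the server routes (or cli/README.md) for the exact endpoints.

```ts
// Sketch: call the REST API headlessly with a personal access token.
// SMARTSCRAPE_URL / SMARTSCRAPE_API_KEY mirror what the CLI reads; the
// /api/jobs path is an assumption, check the server routes for exact paths.
const base = process.env.SMARTSCRAPE_URL ?? "http://localhost:3000";
const apiKey = process.env.SMARTSCRAPE_API_KEY;
if (!apiKey) throw new Error("SMARTSCRAPE_API_KEY is not set");

const res = await fetch(`${base}/api/jobs`, {
  headers: { "X-API-Key": apiKey },   // token minted via `auth tokens create`
});
if (!res.ok) throw new Error(`API error: ${res.status}`);
console.log(await res.json());
```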
Each job has four knobs that help against targets with active bot detection:
- `stealth_mode` — turns on UA rotation (a small pool of recent real-browser UAs, deterministic per job id so the target sees a stable identity across runs) and injects a minimal Playwright stealth init script (hides `navigator.webdriver`, plausibly populates `plugins`/`languages`, fixes the headless-Chrome `permissions.query` quirk).
- `proxy_url` — per-job `http(s)://[user:pass@]host:port`. Applies to both the static (Cheerio) and rendered (Playwright) paths.
- `pacing_min_ms` / `pacing_max_ms` — uniform-random sleep between successive URLs in a multi-URL job. The existing per-host throttle still runs underneath (see the sketch after this list).
- Process-wide `HTTPS_PROXY`/`HTTP_PROXY` env — when set at server start, every outbound fetch (scraper, AI SDKs, webhook delivery) routes through the configured proxy. Per-job `proxy_url` overrides it on the scrape path.
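As a rough sketch of what the pacing knobs amount to (the function and parameter names below are illustrative, not the runner's actual API):

```ts
// Sketch of per-job pacing: a uniform-random delay between successive URLs
// in a multi-URL job. Names are illustrative; the per-host throttle the
// README mentions still runs underneath this in the real runner.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeWithPacing(
  urls: string[],
  pacingMinMs: number,
  pacingMaxMs: number,
  scrapeUrl: (url: string) => Promise<void>, // injected scrape step (assumed)
) {
  for (const [i, url] of urls.entries()) {
    if (i > 0) {
      const delay = pacingMinMs + Math.random() * (pacingMaxMs - pacingMinMs);
      await sleep(delay);
    }
    await scrapeUrl(url);
  }
}
```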
The per-job knobs can be set from the CLI:

smartscrape jobs edit <job-id> --stealth --proxy-url http://user:pass@proxy:3128 --pacing-min 500 --pacing-max 1500

Failed runs are classified into one of seven buckets — `timeout`, `blocked`, `parse_error`, `ai_error`, `network_error`, `quota_error`, `unknown` — and the type lands on the run row (`error_type`), the jobs list (`last_run_error_type`), and any webhook payload. After three consecutive failures, a job is auto-paused (`enabled=false`) and a `job_failed` notification fires on the user's configured channels. Re-enable with the toggle endpoint or `smartscrape jobs toggle <id>`.
Configure a webhook on a job and SmartScrape POSTs the run results to that URL after every terminal run (completed or failed). Set the URL on create or edit:
smartscrape jobs edit <job-id> --webhook-url https://example.com/hook --webhook-secret '<long-secret>'
smartscrape jobs webhook test <job-id>   # send a synthetic payload now

Payload shape: `{ event, job_id, job_name, run_id, status, items_count, urls_scraped, tokens_used, error_message, started_at, completed_at, changes: { added, removed, changed }, items: [...] }`. When a secret is configured, the request carries `X-Webhook-Signature: sha256=<hmac>` over the raw body and `X-Webhook-Timestamp`. Delivery retries up to 3 times with 1s → 4s backoff; the outcome is persisted in `webhook_status`, `webhook_attempts`, and `webhook_last_error` on the run row.
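On the receiving end, signature verification might look roughly like this sketch. It assumes the HMAC is computed over the raw body with the shared secret, as described above; `WEBHOOK_SECRET` is just an example env var name.

```ts
// Sketch of a webhook receiver that verifies X-Webhook-Signature.
// Assumes the HMAC-SHA256 is computed over the raw request body with the
// configured secret; WEBHOOK_SECRET is an example env var name.
import crypto from "node:crypto";
import express from "express";

const app = express();
const secret = process.env.WEBHOOK_SECRET ?? "";

app.post("/hook", express.raw({ type: "application/json" }), (req, res) => {
  const received = req.header("X-Webhook-Signature") ?? ""; // "sha256=<hex>"
  const expected =
    "sha256=" + crypto.createHmac("sha256", secret).update(req.body).digest("hex");

  const valid =
    received.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(received), Buffer.from(expected));
  if (!valid) return res.status(401).send("bad signature");

  const run = JSON.parse(req.body.toString("utf8"));
  console.log(run.event, run.status, run.changes); // fields documented above
  return res.sendStatus(204);
});

app.listen(4000);
```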
- You hit Run now (or cron fires the job).
- Each URL is scraped. Auto mode tries Cheerio first; if the page looks empty it falls back to Playwright. The method is configurable per job.
- The HTML is sanitized — scripts, hidden elements, and instruction-shaped text stripped before anything is sent to the model.
- The AI gets a bounded extraction prompt with your schema and returns a JSON array. Output is validated (schema types, size cap, secret-leak guard).
- Each extracted item is hashed and compared to the previous run using your comparison key to determine added / removed / changed.
- Notification rules run against the diff. Email, Telegram, or both — based on the job's `notify_channels` and each rule's trigger.
- Data is stored, optionally pushed to Google Sheets, and the run is marked completed. (A sketch of the hash-and-diff step appears after this list.)
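A condensed sketch of the hash-and-diff step, assuming items are matched by the comparison key; the function names are illustrative, not the server's actual module API.

```ts
// Sketch of change detection: hash each extracted item, match rows across
// runs by the job's comparison key, and bucket them into added / removed /
// changed. Names are illustrative, not the server's actual module API.
import { createHash } from "node:crypto";

type Item = Record<string, unknown>;

const hashItem = (item: Item) =>
  createHash("sha256").update(JSON.stringify(item)).digest("hex");

function diffRuns(prev: Item[], curr: Item[], comparisonKey: string) {
  const prevByKey = new Map(prev.map((i) => [String(i[comparisonKey]), i]));
  const currByKey = new Map(curr.map((i) => [String(i[comparisonKey]), i]));

  const added = curr.filter((i) => !prevByKey.has(String(i[comparisonKey])));
  const removed = prev.filter((i) => !currByKey.has(String(i[comparisonKey])));
  const changed = curr.filter((i) => {
    const old = prevByKey.get(String(i[comparisonKey]));
    return old !== undefined && hashItem(old) !== hashItem(i);
  });

  return { added, removed, changed };
}
```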
┌──────────────┐
│ React SPA │ Vite + Tailwind, JWT in localStorage
│ (port 5173) │
└──────┬───────┘
│ /api/* (CORS locked to APP_URL)
┌──────▼───────┐ ┌─────────────────┐
│ Express API │ enqueue │ BullMQ worker │
│ (port 3000) │────────►│ (in same proc) │
└──┬─────────┬─┘ └────────┬────────┘
│ │ │ each tick:
│ │ │ scrape → AI extract →
│ │ │ diff → notify → store
│ │ │
┌──────▼─┐ ┌──▼────┐ ┌────▼─────────┐
│Postgres│ │ Redis │ │ AI provider │
│ 16 │ │ 7 │ │ (user key) │
└────────┘ └───────┘ └──────────────┘
Server (server/) — Express + TypeScript. Auth (bcrypt + JWT + hashed refresh tokens), scrape job CRUD, an AI setup wizard, the runner (BullMQ scheduler + Cheerio/Playwright fetcher + AI extractor + change detector + notifier), and Google Sheets / Telegram / email integrations.
Client (client/) — React + Vite + Tailwind SPA. Routes for the dashboard, jobs list, job detail with diff view, new-job wizard, settings, and notification history. Talks to /api via the Vite dev proxy in development; same-origin in production.
Database (server/migrations/) — Postgres. Users, refresh tokens, encrypted API keys, scrape jobs, runs, extracted data (with SHA-256 hashes for diffing), notification log, Google connections, settings, AI setup logs.
Queue — Redis-backed BullMQ. Manual triggers go through enqueueNow; scheduled jobs go through upsertJobScheduler keyed per job. The worker runs in the same Node process as the API in v1; splitting it is a one-line change for production.
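A rough sketch of that wiring, using BullMQ's plain repeatable-job form to keep the idea visible; the real scheduler keys one entry per job via `upsertJobScheduler`, and the names and connection details below are illustrative.

```ts
// Sketch of the queue wiring with BullMQ. Names are illustrative.
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };
const scrapeQueue = new Queue("scrape", { connection });

// Assumed runner entry point, stubbed for the sketch.
async function runScrapeJob(jobId: string): Promise<void> {
  console.log(`scrape → AI extract → diff → notify → store for ${jobId}`);
}

// Manual trigger ("Run now")
await scrapeQueue.add("run", { jobId: "abc-123" });

// Scheduled trigger: one repeatable entry per job, keyed by its id
await scrapeQueue.add(
  "run",
  { jobId: "abc-123" },
  { repeat: { pattern: "0 * * * *" }, jobId: "schedule:abc-123" },
);

// Worker: in v1 this runs in the same process as the API
new Worker("scrape", async (job) => runScrapeJob(job.data.jobId), { connection });
```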
Scraping pipeline — auto mode tries Cheerio first; if visible-text length is below threshold (likely an SPA), it falls back to a headless Chromium context with route interception that aborts requests resolving to private IPs (SSRF defense). HTML is sanitized before reaching the model.
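The private-IP rejection might look roughly like the sketch below. It is a simplification: the real guard also runs inside Playwright's route interception, so redirects and subresources are covered.

```ts
// Sketch of the SSRF guard: resolve the hostname and reject targets in
// private, loopback, link-local, or metadata ranges before fetching.
import { lookup } from "node:dns/promises";

function isPrivateIPv4(ip: string): boolean {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 10 ||                           // 10.0.0.0/8
    a === 127 ||                          // loopback
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 169 && b === 254)              // link-local / cloud metadata
  );
}

export async function assertSafeUrl(raw: string): Promise<URL> {
  const url = new URL(raw);
  if (!/^https?:$/.test(url.protocol)) throw new Error("unsupported scheme");
  const { address, family } = await lookup(url.hostname);
  const privateV6 = address === "::1" || /^f[cd]/i.test(address);
  if ((family === 4 && isPrivateIPv4(address)) || (family === 6 && privateV6)) {
    throw new Error("private address blocked");
  }
  return url;
}
```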
Extraction pipeline — provider-agnostic chat call (ai-providers.ts) → JSON parse with fence stripping → schema type validation → secret-leak guard (split into floored API keys + unfloored emails) → size cap → HTML-escape sanitize → SHA-256 hash → store. The validator (validateExtractedItems) is pure and unit-tested.
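A sketch of just the schema-type stage of that pipeline; the real `validateExtractedItems` also applies the size cap, secret-leak guard, and HTML-escape step, and its exact signature may differ.

```ts
// Sketch of the schema-type validation step: every extracted item must
// match the declared field types. Illustrative only.
type FieldType = "string" | "number" | "boolean" | "array" | "object";
type Schema = Record<string, FieldType>;

function matchesType(value: unknown, type: FieldType): boolean {
  switch (type) {
    case "array": return Array.isArray(value);
    case "object": return typeof value === "object" && value !== null && !Array.isArray(value);
    default: return typeof value === type;
  }
}

export function validateItems(items: unknown, schema: Schema): Record<string, unknown>[] {
  if (!Array.isArray(items)) throw new Error("model output is not a JSON array");
  return items.map((item, i) => {
    if (typeof item !== "object" || item === null) throw new Error(`item ${i} is not an object`);
    for (const [field, type] of Object.entries(schema)) {
      const value = (item as Record<string, unknown>)[field];
      if (value !== undefined && value !== null && !matchesType(value, type))
        throw new Error(`item ${i}: field "${field}" is not a ${type}`);
    }
    return item as Record<string, unknown>;
  });
}
```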
Change detection — items matched across runs by user-chosen comparison key. New / removed / changed buckets feed the rule engine, which renders templated messages and dispatches via email and/or Telegram.
- Job — a URL (or up to 10), what to extract, how often, and what to do on changes.
- Run — a single execution of a job. Every job has a run history; each run is reproducible and diffable.
- Schema — `{ field_name: "string" | "number" | "boolean" | "array" | "object" }`. Types are enforced on AI output.
- Comparison key — the field that uniquely identifies an item across runs (`url`, `sku`, `id`). Without one, change detection can't match items, so you only get all-or-nothing "data changed."
- Notification rule — `any_change`, `new_items`, `removed_items`, `field_threshold`, `field_change`. Each can have a templated message with `{field_name}`, `{old}`, `{new}`, `{count}`, `{url}` (see the rendering sketch below).
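As an illustration of how those placeholders might be filled, here is a sketch of a template renderer; it is illustrative, not the server's actual renderer.

```ts
// Sketch of placeholder substitution for notification messages.
function renderTemplate(template: string, vars: Record<string, string | number>): string {
  return template.replace(/\{(\w+)\}/g, (match, key: string) =>
    key in vars ? String(vars[key]) : match,
  );
}

// Example: a field_change rule watching `price`
const message = renderTemplate("{field_name} moved from {old} to {new} at {url}", {
  field_name: "price",
  old: 499,
  new: 429,
  url: "https://example.com/laptops/x1",
});
// -> "price moved from 499 to 429 at https://example.com/laptops/x1"
```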
- User passwords: bcrypt, 12+ rounds
- JWT: 15-minute access token + 7-day refresh token, refresh tokens stored hashed (SHA-256) so they're revocable
- Provider keys and Google OAuth tokens: AES-256-GCM encrypted at rest; the encryption key is a required env var (see the sketch after this list)
- AI extraction: data-boundary-marked prompts, output validated (schema, size, secret leak), never trusts HTML content as instructions
- User URLs: private / loopback / metadata ranges rejected (SSRF)
- Rate limits: 5/min on auth entry routes, 100 runs/user/day
- HTTP: Helmet defaults, CORS locked to `APP_URL`
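A sketch of the encrypt-at-rest step; `ENCRYPTION_KEY` and the iv/tag storage layout are assumptions, since the README only states that the key comes from a required env var.

```ts
// Sketch of AES-256-GCM encryption at rest for provider keys and OAuth
// tokens. ENCRYPTION_KEY and the iv.tag.ciphertext layout are assumptions;
// the server's actual column format may differ.
import crypto from "node:crypto";

const key = Buffer.from(process.env.ENCRYPTION_KEY!, "hex"); // 32 bytes

export function encryptSecret(plaintext: string): string {
  const iv = crypto.randomBytes(12);                      // GCM nonce
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((b) => b.toString("base64")).join(".");
}

export function decryptSecret(stored: string): string {
  const [iv, tag, ciphertext] = stored.split(".").map((s) => Buffer.from(s, "base64"));
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```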
smartscrape/
server/ # Express + TS API, migrations, BullMQ worker
client/ # React + Vite + Tailwind SPA + Playwright smoke tests
cli/ # Headless CLI client (commander + native fetch)
docs/ # Screenshots, assets
docker-compose.yml # Postgres 16 + Redis 7 for local dev
MIT.







