
SmartScrape

CI License: MIT Node TypeScript React Postgres

Self-hosted AI-powered web scraping. Tell it what you want to track in plain English, and it turns any page into structured data, watches for changes, and notifies you by email or Telegram. Bring your own API key from OpenAI, Anthropic, or OpenRouter — SmartScrape never ships with keys or phones home.

Home

What it's good at

  • Ask for what you want in plain English. "Track laptop prices and alert me on drops > 10%" becomes a full scrape config with extraction schema, comparison key, and notification rules.
  • JS-rendered pages handled. Static HTML with Cheerio when it works, headless Chromium (Playwright) when it doesn't — auto-detected.
  • Change detection that's useful. SHA-256 hashes per row, matched across runs by a user-chosen comparison key. You get added / removed / changed items, not a full diff dump.
  • Notifications that actually fit. Rule types: any change, new items, removed items, field threshold (price < 500), field value change (stock_status flipped). Sent via email, Telegram, or both.
  • Everywhere your data goes, you control. Export to Google Sheets (OAuth) or CSV. No third-party analytics, no outbound data.
  • Security built in. CORS locked to your frontend, Helmet headers, bcrypt-hashed passwords, AES-256-GCM encrypted API keys + OAuth tokens, JWT with rotating refresh tokens, SSRF guard on user-supplied URLs, prompt-injection hardening on every extraction.

A five-minute tour

Describe what you want, let AI do the rest

New job wizard

Paste a URL, describe the goal, pick a model. SmartScrape fetches the page, cleans the HTML, and proposes an extraction schema + notification rules. Accept, tweak, or start over with manual setup.

Jobs, one row each

Jobs list

Filter by active / paused / failed. Schedules are plain cron or presets (manual, hourly, daily, weekly). Toggle to pause without deleting.

One job, full context

Job Detail

Latest runs at a glance: status, items extracted, tokens burned, duration, source URLs. CSV per run, diff vs previous run, Push to Sheets on demand.

Edit anything

Edit job

Prompt, schema, scrape method, schedule, AI model, notification rules, and linked Google Sheet — all editable, including a picker that pulls from your Drive.

One place for credentials + integrations

Settings

AI provider keys (never displayed after save), Google Sheets OAuth, Telegram bot setup. Keys are encrypted at rest; the test button round-trips a real auth call so misconfigurations surface immediately.

Mobile works too

Home on mobile Jobs on mobile

Quickstart

Prereqs: Docker (or Node 20+ for the dev path), npm 10+.

Self-host (Docker, recommended)

git clone https://github.com/9ny4/smartscrape.git
cd smartscrape
cp .env.example .env
# Fill in the three required secrets — generation snippets are inside .env.example.

docker compose --profile app up -d --build

Open http://localhost:3000. Migrations run on container start; the API also serves the built SPA from the same origin.

Develop

git clone https://github.com/9ny4/smartscrape.git
cd smartscrape
npm install
npx playwright install chromium

cp .env.example .env
# Same three secrets as above.

docker compose up -d                       # Postgres + Redis only
npm run migrate:up --workspace server
npm run dev                                # API on :3000, Vite SPA on :5173

Create an account, add at least one AI provider key under Settings → AI Providers, then head to Jobs → New job.

Optional integrations

  • Google Sheets. Create a Google Cloud OAuth client (web type), enable the Sheets API + Drive API, set GOOGLE_CLIENT_ID / GOOGLE_CLIENT_SECRET, and add http://localhost:3000/api/google/callback as an authorized redirect URI. Connect under Settings → Google Sheets.
  • Email. Fill in SMTP_HOST / SMTP_PORT / SMTP_USER / SMTP_PASS (or use Ethereal for dev). Without SMTP, emails are written to the server log instead of sent.
  • Telegram. Create a bot via @BotFather, set TELEGRAM_BOT_TOKEN. Each user then pastes their own chat_id under Settings → Telegram.

Commands

Command                                       What it does
npm run dev                                   Server + client with hot reload
npm run build                                 Production build, both workspaces
npm run typecheck                             TypeScript check
npm run lint                                  ESLint
npm run migrate:up --workspace server         Apply database migrations
npm run test:e2e --workspace client           Playwright smoke suite (requires dev server running)
npm run docs:screenshots --workspace client   Regenerate the README screenshots

CLI

@smartscrape/cli is a headless client for the REST API — built so cron jobs, scripts, and external agents can drive SmartScrape without a browser. Source lives in cli/.

# Build the CLI workspace
npm install
npm run -w @smartscrape/cli build

# Sign in (writes ~/.smartscrape/config.json) — or set SMARTSCRAPE_URL + SMARTSCRAPE_TOKEN
node cli/dist/index.js auth login --url http://localhost:3000 --email you@example.com --password '...'

# Drive everything
node cli/dist/index.js jobs list --json
node cli/dist/index.js jobs run <job-id> --wait --json
node cli/dist/index.js results <job-id> --json
node cli/dist/index.js export <job-id> --csv > out.csv

Every command supports --json, --quiet, --server-url, --token, --api-key. Exit codes: 0 success, 1 generic, 2 auth, 3 not found, 4 validation. See cli/README.md for the full command list.

Long-running automation should use a personal access token rather than a JWT: smartscrape auth tokens create --name ci-runner mints one (the plaintext is shown once), and you send it on every request via the SMARTSCRAPE_API_KEY env var or the X-API-Key header. Revoke any token from Settings or via smartscrape auth tokens revoke <id>.

Anti-bot resilience

Each job has four knobs that help against targets with active bot detection:

  • stealth_mode — turns on UA rotation (a small pool of recent real-browser UAs, deterministic per job id so the target sees a stable identity across runs) and injects a minimal Playwright stealth init script (hides navigator.webdriver, plausibly populates plugins/languages, fixes the headless-Chrome permissions.query quirk).
  • proxy_url — per-job http(s)://[user:pass@]host:port. Applies to both the static (Cheerio) and rendered (Playwright) paths.
  • pacing_min_ms / pacing_max_ms — uniform-random sleep between successive URLs in a multi-URL job. The existing per-host throttle still runs underneath.
  • Process-wide HTTPS_PROXY / HTTP_PROXY env — when set at server start, every outbound fetch (scraper, AI SDKs, webhook delivery) routes through the configured proxy. A per-job proxy_url overrides it on the scrape path.
smartscrape jobs edit <job-id> --stealth --proxy-url http://user:pass@proxy:3128 --pacing-min 500 --pacing-max 1500
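
Under the hood, the pacing knobs amount to a uniform-random sleep between URL fetches. A minimal sketch of that idea (illustrative only, not the shipped runner code):

```typescript
// Illustrative sketch of per-URL pacing; not SmartScrape's actual runner code.
// Picks a uniform-random delay in [minMs, maxMs] between successive URL fetches.
function pacingDelay(minMs: number, maxMs: number): number {
  return minMs + Math.random() * (maxMs - minMs);
}

async function scrapeWithPacing(
  urls: string[],
  minMs: number,
  maxMs: number,
  fetchOne: (url: string) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < urls.length; i++) {
    await fetchOne(urls[i]);
    // Sleep between URLs, but not after the last one.
    if (i < urls.length - 1) {
      await new Promise((resolve) => setTimeout(resolve, pacingDelay(minMs, maxMs)));
    }
  }
}
```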

Failure classification + auto-pause

Failed runs are classified into one of seven buckets — timeout, blocked, parse_error, ai_error, network_error, quota_error, unknown — and the type lands on the run row (error_type), the jobs list (last_run_error_type), and any webhook payload. After three consecutive failures, a job is auto-paused (enabled=false) and a job_failed notification fires on the user's configured channels. Re-enable with the toggle endpoint or smartscrape jobs toggle <id>.
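
The classification can be pictured as pattern-matching on the failure; the patterns below are hypothetical, chosen only to show the bucketing idea, not the server's real rules:

```typescript
type ErrorType =
  | 'timeout' | 'blocked' | 'parse_error'
  | 'ai_error' | 'network_error' | 'quota_error' | 'unknown';

// Hypothetical classifier sketch: buckets a failure by message/status patterns.
// The real rules live server-side; ai_error is omitted here for brevity.
function classifyFailure(message: string, httpStatus?: number): ErrorType {
  if (/timed? ?out/i.test(message)) return 'timeout';
  if (httpStatus === 403 || httpStatus === 429 || /captcha|cloudflare/i.test(message)) {
    return 'blocked';
  }
  if (/json|parse/i.test(message)) return 'parse_error';
  if (/quota|rate limit/i.test(message)) return 'quota_error';
  if (/ECONNREFUSED|ENOTFOUND|socket/i.test(message)) return 'network_error';
  return 'unknown';
}
```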

Webhooks

Configure a webhook on a job and SmartScrape POSTs the run results to that URL after every terminal run (completed or failed). Set the URL on create or edit:

smartscrape jobs edit <job-id> --webhook-url https://example.com/hook --webhook-secret '<long-secret>'
smartscrape jobs webhook test <job-id>     # send a synthetic payload now

Payload shape: { event, job_id, job_name, run_id, status, items_count, urls_scraped, tokens_used, error_message, started_at, completed_at, changes: { added, removed, changed }, items: [...] }. When a secret is configured, the request carries X-Webhook-Signature: sha256=<hmac> over the raw body and X-Webhook-Timestamp. Delivery retries up to 3 times with 1s → 4s backoff; the outcome is persisted on webhook_status, webhook_attempts, and webhook_last_error on the run row.
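
On the receiving end, verifying the signature header might look like this. The header name and sha256=<hmac>-over-raw-body scheme match the description above; everything else is an assumption about your own receiver:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Receiver-side sketch: recompute the HMAC over the raw request body and
// compare it to the X-Webhook-Signature header in constant time.
function verifyWebhook(rawBody: string, secret: string, signatureHeader: string): boolean {
  const expected = 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Compare against the raw body bytes, before any JSON parsing, or the HMAC will not match.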

How a run works

  1. You hit Run now (or cron fires the job).
  2. Each URL is scraped. Auto mode tries Cheerio first; if the page looks empty it falls back to Playwright. The method is configurable per job.
  3. The HTML is sanitized — scripts, hidden elements, and instruction-shaped text stripped before anything is sent to the model.
  4. The AI gets a bounded extraction prompt with your schema and returns a JSON array. Output is validated (schema types, size cap, secret-leak guard).
  5. Each extracted item is hashed and compared to the previous run using your comparison key to determine added / removed / changed.
  6. Notification rules run against the diff. Email, Telegram, or both — based on the job's notify_channels and each rule's trigger.
  7. Data is stored, optionally pushed to Google Sheets, and the run is marked completed.
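
Step 5 in miniature: a sketch of hashing each row and matching across runs by the comparison key (illustrative, not the shipped change detector):

```typescript
import { createHash } from 'node:crypto';

type Item = Record<string, unknown>;

// Hash one extracted row. Note: JSON.stringify is key-order sensitive,
// which is fine for a sketch but worth normalizing in real code.
const rowHash = (item: Item): string =>
  createHash('sha256').update(JSON.stringify(item)).digest('hex');

// Bucket items into added / removed / changed using the comparison key.
function diffRuns(prev: Item[], next: Item[], key: string) {
  const prevByKey = new Map(prev.map((i): [string, Item] => [String(i[key]), i]));
  const nextByKey = new Map(next.map((i): [string, Item] => [String(i[key]), i]));
  const added = next.filter((i) => !prevByKey.has(String(i[key])));
  const removed = prev.filter((i) => !nextByKey.has(String(i[key])));
  const changed = next.filter((i) => {
    const old = prevByKey.get(String(i[key]));
    return old !== undefined && rowHash(old) !== rowHash(i);
  });
  return { added, removed, changed };
}
```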

Architecture

            ┌──────────────┐
            │  React SPA   │  Vite + Tailwind, JWT in localStorage
            │  (port 5173) │
            └──────┬───────┘
                   │ /api/* (CORS locked to APP_URL)
            ┌──────▼───────┐         ┌─────────────────┐
            │ Express API  │ enqueue │  BullMQ worker  │
            │  (port 3000) │────────►│ (in same proc)  │
            └──┬─────────┬─┘         └────────┬────────┘
               │         │                    │ each tick:
               │         │                    │   scrape → AI extract →
               │         │                    │   diff → notify → store
               │         │                    │
        ┌──────▼─┐    ┌──▼────┐         ┌────▼─────────┐
        │Postgres│    │ Redis │         │  AI provider │
        │   16   │    │   7   │         │  (user key)  │
        └────────┘    └───────┘         └──────────────┘

Server (server/) — Express + TypeScript. Auth (bcrypt + JWT + hashed refresh tokens), scrape job CRUD, an AI setup wizard, the runner (BullMQ scheduler + Cheerio/Playwright fetcher + AI extractor + change detector + notifier), and Google Sheets / Telegram / email integrations.

Client (client/) — React + Vite + Tailwind SPA. Routes for the dashboard, jobs list, job detail with diff view, new-job wizard, settings, and notification history. Talks to /api via the Vite dev proxy in development; same-origin in production.

Database (server/migrations/) — Postgres. Users, refresh tokens, encrypted API keys, scrape jobs, runs, extracted data (with SHA-256 hashes for diffing), notification log, Google connections, settings, AI setup logs.

Queue — Redis-backed BullMQ. Manual triggers go through enqueueNow; scheduled jobs go through upsertJobScheduler keyed per job. The worker runs in the same Node process as the API in v1; splitting it is a one-line change for production.

Scraping pipeline — auto mode tries Cheerio first; if visible-text length is below threshold (likely an SPA), it falls back to a headless Chromium context with route interception that aborts requests resolving to private IPs (SSRF defense). HTML is sanitized before reaching the model.
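
A simplified, IPv4-only sketch of that SSRF guard (the real one also resolves hostnames and handles IPv6):

```typescript
import { isIP } from 'node:net';

// Sketch of the private/loopback/link-local rejection described above.
// IPv4 literals only; anything else is rejected out of caution here.
function isForbiddenIPv4(ip: string): boolean {
  if (isIP(ip) !== 4) return true;
  const [a, b] = ip.split('.').map(Number);
  return (
    a === 10 ||                           // 10.0.0.0/8 private
    a === 127 ||                          // loopback
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12 private
    (a === 192 && b === 168) ||           // 192.168.0.0/16 private
    (a === 169 && b === 254)              // link-local, incl. 169.254.169.254 metadata
  );
}
```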

Extraction pipeline — provider-agnostic chat call (ai-providers.ts) → JSON parse with fence stripping → schema type validation → secret-leak guard (split into floored API keys + unfloored emails) → size cap → HTML-escape sanitize → SHA-256 hash → store. The validator (validateExtractedItems) is pure and unit-tested.
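
The fence-strip and type-check stages can be sketched as follows. This is only the shape of the idea; the repo's validateExtractedItems also enforces the size cap, secret-leak guard, and HTML escaping:

```typescript
type FieldType = 'string' | 'number' | 'boolean' | 'array' | 'object';

// Strip an optional markdown code fence the model may wrap its JSON in.
function stripFences(raw: string): string {
  return raw.replace(/^\s*```(?:json)?\s*/i, '').replace(/\s*```\s*$/, '');
}

// Parse model output and enforce the schema's field types.
function validateItems(
  raw: string,
  schema: Record<string, FieldType>,
): Record<string, unknown>[] {
  const parsed = JSON.parse(stripFences(raw));
  if (!Array.isArray(parsed)) throw new Error('model output must be a JSON array');
  for (const item of parsed) {
    for (const [field, type] of Object.entries(schema)) {
      const v = item[field];
      const actual = Array.isArray(v) ? 'array' : typeof v;
      if (v !== undefined && v !== null && actual !== type) {
        throw new Error(`field ${field}: expected ${type}, got ${actual}`);
      }
    }
  }
  return parsed;
}
```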

Change detection — items matched across runs by user-chosen comparison key. New / removed / changed buckets feed the rule engine, which renders templated messages and dispatches via email and/or Telegram.

Concepts glossary

  • Job — a URL (or up to 10), what to extract, how often, and what to do on changes.
  • Run — a single execution of a job. Every job has a run history; each run is reproducible and diffable.
  • Schema — { field_name: "string" | "number" | "boolean" | "array" | "object" }. Types are enforced on AI output.
  • Comparison key — the field that uniquely identifies an item across runs (url, sku, id). Without one, change detection can't match items, so you only get all-or-nothing "data changed."
  • Notification rule — any_change, new_items, removed_items, field_threshold, field_change. Each can have a templated message with {field_name}, {old}, {new}, {count}, {url}.
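
Rendering a rule's templated message reduces to placeholder substitution; a sketch (the shape of the values object here is hypothetical):

```typescript
// Replace {name} placeholders with values; unknown placeholders pass through.
function renderTemplate(
  template: string,
  values: Record<string, string | number>,
): string {
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    name in values ? String(values[name]) : match,
  );
}

// e.g. a field_threshold rule firing on a price drop:
const msg = renderTemplate('{field_name} dropped from {old} to {new} on {url}', {
  field_name: 'price',
  old: 549,
  new: 479,
  url: 'https://example.com/laptop',
});
```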

Security posture

  • User passwords: bcrypt, 12+ rounds
  • JWT: 15-minute access token + 7-day refresh token, refresh tokens stored hashed (SHA-256) so they're revocable
  • Provider keys and Google OAuth tokens: AES-256-GCM encrypted at rest; encryption key is a required env var
  • AI extraction: data-boundary-marked prompts, output validated (schema, size, secret leak), never trusts HTML content as instructions
  • User URLs: private / loopback / metadata ranges rejected (SSRF)
  • Rate limits: 5/min on auth entry routes, 100 runs/user/day
  • HTTP: Helmet defaults, CORS locked to APP_URL

Project layout

smartscrape/
  server/    # Express + TS API, migrations, BullMQ worker
  client/    # React + Vite + Tailwind SPA + Playwright smoke tests
  cli/       # Headless CLI client (commander + native fetch)
  docs/      # Screenshots, assets
  docker-compose.yml     # Postgres 16 + Redis 7 for local dev

License

MIT.
