Skip to content

Add opt-in browser-scraper Worker for JS-rendered monitor pages#14

Merged
valuecodes merged 4 commits intomainfrom
playwright-scraper
May 2, 2026
Merged

Add opt-in browser-scraper Worker for JS-rendered monitor pages#14
valuecodes merged 4 commits intomainfrom
playwright-scraper

Conversation

@valuecodes
Copy link
Copy Markdown
Owner

What

Adds a separate apps/browser-scraper Worker that runs Playwright on Cloudflare Browser Rendering, called from operator via a Service Binding. Native fetch() stays the default; per-monitor use_browser flag opts into the browser path. Includes the previously committed logger metadata-key redaction.

  • New Worker switch-operator-browser-scraper with workers_dev: false (only reachable via service binding); SSRF guards on top-level URL, post-redirect URL, and every subresource via page.route()
  • Operator: new use_browser column (migration 0005), BROWSER_SCRAPER service binding, scrapeUrl switched to options-object signature
  • Shared packages/url-validator workspace, used by both Workers
  • CI: deploy step runs browser-scraper before operator so the binding resolves on first deploy

How to test

  • pnpm typecheck && pnpm lint && pnpm format:check && pnpm test — all green (210 tests)
  • Manual end-to-end requires deploying browser-scraper (no local emulator for Browser Rendering); see plan file for the migration → deploy → wrangler tail flow

Security review

  • Auth / network: new Worker is service-binding-only (workers_dev: false); SSRF validator blocks private/loopback/non-HTTPS on three paths (initial URL, post-redirect, subresources); 2 MB body cap and 15s navigation timeout protect the daily browser budget.
  • Deps: @cloudflare/playwright@1.2.0 added to the new Worker only; operator bundle unchanged.

Copilot AI review requested due to automatic review settings May 2, 2026 05:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new apps/browser-scraper Cloudflare Worker so the operator can optionally scrape JS-rendered pages through Browser Rendering instead of the existing native fetch() path. It also factors URL validation into a shared workspace package, adds the operator-side service binding and schedule flag plumbing, and includes the logger metadata redaction update.

Changes:

  • Adds a new @repo/browser-scraper Worker that uses Playwright on Cloudflare Browser Rendering, plus its config, package metadata, and tests.
  • Moves URL safety checks into a shared @repo/url-validator package and switches operator scraping/validation code to use it.
  • Adds operator-side browser-scraper binding/migration wiring and updates CI deploy order so the new service exists before operator deploys.

Reviewed changes

Copilot reviewed 25 out of 28 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pnpm-lock.yaml Locks new workspace/package dependencies for browser-scraper and url-validator.
packages/url-validator/vitest.config.ts Adds Vitest config for the shared URL validator package.
packages/url-validator/tsconfig.json Adds TS config for the shared URL validator package.
packages/url-validator/src/url-validator.ts Introduces shared source URL validation logic used by operator and browser-scraper.
packages/url-validator/src/url-validator.test.ts Adds unit tests for the shared URL validator.
packages/url-validator/package.json Defines the new shared package metadata and scripts.
packages/url-validator/eslint.config.ts Adds ESLint config for the new shared package.
packages/logger/src/logger.ts Adds top-level metadata redaction for sensitive log keys.
packages/logger/src/__tests__/logger.test.ts Tests the new logger redaction behavior.
apps/operator/wrangler.jsonc Wires the browser-scraper service binding into the operator Worker.
apps/operator/src/types/env.ts Extends operator environment typings with the browser-scraper binding.
apps/operator/src/services/scrape.ts Refactors scraping to support native fetch or browser-scraper via options.
apps/operator/src/services/scrape.test.ts Adds tests for the new browser-scraper scrape path.
apps/operator/src/scheduled.ts Passes per-schedule browser options into monitor execution.
apps/operator/src/modules/telegram/controller.ts Switches monitor URL validation import to the shared package.
apps/operator/src/db/schema.ts Adds the use_browser column to schedules.
apps/operator/package.json Adds the shared URL validator dependency to operator.
apps/operator/migrations/meta/_journal.json Registers the new migration in Drizzle metadata.
apps/operator/migrations/meta/0005_snapshot.json Captures the schema snapshot including use_browser.
apps/operator/migrations/0005_nervous_scorpion.sql Adds the SQL migration for the new schedule column.
apps/browser-scraper/wrangler.jsonc Configures the new browser-scraper Worker and Browser Rendering binding.
apps/browser-scraper/tsconfig.json Adds TS config for the new Worker.
apps/browser-scraper/src/services/playwright.ts Implements Playwright-based rendering and subresource URL checks.
apps/browser-scraper/src/services/playwright.test.ts Adds unit tests for Playwright rendering behavior.
apps/browser-scraper/src/index.ts Exposes the browser-scraper Worker request handler.
apps/browser-scraper/package.json Defines package metadata/scripts/dependencies for the new Worker.
apps/browser-scraper/eslint.config.ts Adds ESLint config for the new Worker.
.github/workflows/main.yml Updates deploy ordering so browser-scraper deploys before operator.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +25 to +27
useBrowser: int("use_browser", { mode: "boolean" })
.notNull()
.default(false),
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional and called out as out-of-scope in the PR plan: enabling a schedule for the browser path is currently a manual step (wrangler d1 execute … "UPDATE schedules SET use_browser = 1 WHERE id = '…'"). Wiring it into createScheduleSchema / the OpenAI create_schedule tool / ScheduleService.create is a deliberate follow-up so we can validate the daily browser-time budget before exposing the toggle to LLM-generated schedules.

Comment on lines +79 to +81
const html = await page.content();
const truncated = html.length > MAX_HTML_CHARS;
const finalHtml = truncated ? html.slice(0, MAX_HTML_CHARS) : html;
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. page.content() has no streaming variant on the @cloudflare/playwright binding, so the only enforcement point the worker exposes is post-serialization. The cap is therefore on the response payload we send back over the service binding (and ultimately into the AI prompt), not on browser memory — the 60s hard timeout and 10 min/day daily budget bound the memory side. Char vs byte: agreed it's not a hard byte cap, but UTF-8 worst case is ~4× and downstream convertContent handles oversized input fine, so the slack is acceptable here.

Comment thread apps/operator/src/services/scrape.ts Outdated
Comment thread apps/browser-scraper/src/index.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d65cd20a7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +110 to 111
- name: Deploy operator to Cloudflare Workers
run: pnpm --filter @repo/operator run deploy
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Run DB migration before deploying the operator

This workflow deploys the updated operator immediately but never applies migration 0005_nervous_scorpion.sql, even though the commit adds a new schedules.use_browser column that is now part of the runtime schema (apps/operator/src/db/schema.ts) and read during scheduled execution. On environments where db:migrate:remote has not been run yet, the newly deployed worker can fail with no such column: use_browser as soon as it queries schedules. Add a migration step (for example pnpm --filter @repo/operator db:migrate:remote) before the operator deploy step.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dedicated workflow at .github/workflows/migrations.yml already applies D1 migrations on push to main whenever apps/operator/migrations/** changes (wrangler d1 migrations apply switch-operator-db --remote). Duplicating that step inside the main deploy job would either run it twice or couple the two workflows. There is a known race window where the deploy can land before the migration workflow finishes; the current mitigation is to merge migration-only changes ahead of the feature commit when the gap matters. Happy to add an explicit guard if the race becomes a real problem.

@valuecodes valuecodes merged commit 33aa260 into main May 2, 2026
7 checks passed
@valuecodes valuecodes deleted the playwright-scraper branch May 2, 2026 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants