Skip to content

[security] fix: guard rss transcript fetches#239

Open
Hinotoi-agent wants to merge 1 commit into
steipete:mainfrom
Hinotoi-agent:fix/rss-transcript-fetch-guard
Open

[security] fix: guard rss transcript fetches#239
Hinotoi-agent wants to merge 1 commit into
steipete:mainfrom
Hinotoi-agent:fix/rss-transcript-fetch-guard

Conversation

@Hinotoi-agent
Copy link
Copy Markdown
Contributor

@Hinotoi-agent Hinotoi-agent commented Jun 5, 2026

Summary

This PR hardens RSS podcast transcript fetching so feed-controlled <podcast:transcript url="..."> values cannot be used to fetch loopback, link-local, private-network, or redirected local targets from the host running Summarize.

The patch keeps the existing podcast transcript feature, but moves the transcript URL across an explicit network safety boundary before fetching:

  • validates transcript URL schemes before requests are made
  • blocks localhost, loopback, link-local, private, unspecified, multicast, and reserved address ranges
  • resolves hostnames and rejects local/private DNS answers before fetch
  • fetches with manual redirects and revalidates every redirect target
  • pins native undici fetch resolution to the already validated DNS answers to reduce DNS rebinding exposure

Security issues covered

  • RSS podcast transcript SSRF via attacker-controlled <podcast:transcript url="..."> values in feed XML.

Before this PR

  • The RSS podcast transcript parser selected a transcript URL from feed XML.
  • The selected URL was passed directly to fetch with automatic redirect following.
  • Local/private URL literals, hostnames resolving to private addresses, and redirects to local/private addresses were not rejected before the host made the request.

After this PR

  • Transcript URLs must be http: or https:.
  • Localhost and local/private IP destinations are rejected before fetch.
  • DNS answers are checked before fetch, and any local/private answer blocks the request.
  • Redirect targets are revalidated before following.
  • Native fetches use an undici dispatcher pinned to the validated address set for hostname targets.

Why this matters

Podcast/RSS feed XML can be controlled by the feed publisher. If a user or daemon summarizes an attacker-controlled feed, the feed can direct the host to fetch transcript content from services that are not reachable from the public internet, such as loopback admin interfaces, RFC1918 services, or link-local metadata endpoints.

That creates an SSRF boundary crossing from untrusted feed metadata into host-side network access. Depending on deployment, this can expose internal service responses or induce requests to internal infrastructure.

Runtime behavior proof

A real feed/runtime proof was run against the patched RSS transcript path (tryFetchTranscriptFromFeedXml) with a synthetic RSS feed containing <podcast:transcript> entries. The proof starts a loopback HTTP service for the blocked case and verifies the service is not contacted, then verifies a public transcript URL still succeeds.

Command run:

pnpm exec tsx /tmp/summarize-rss-transcript-runtime-proof.ts

Redacted output:

RSS transcript runtime proof (redacted)
blocked_local_private_transcript:
  url: http://127.0.0.1:[REDACTED_PORT]/private-transcript.txt
  result: blocked/no transcript returned
  notes: RSS <podcast:transcript> fetch failed: RSS transcript URL resolves to a blocked local network address
  local_service_hits: 0
public_transcript:
  url: https://raw.githubusercontent.com/steipete/summarize/main/README.md
  result: success/transcript returned
  transcript_url: https://raw.githubusercontent.com/steipete/summarize/main/README.md
  text_prefix: # Summarize 📝 — Chrome Side Panel + CLI Fast summaries from URLs, files, and m
  notes: Used RSS <podcast:transcript> (skipped Whisper)

This gives after-fix behavior proof for both sides of the boundary: local/private transcript URLs are rejected before request dispatch (local_service_hits: 0), while a normal public HTTP(S) transcript URL still returns transcript text.

Guard boundary note

This PR keeps the new RSS transcript guard in packages/core/src/content/transcript/providers/podcast/rss-transcript.ts separate from the existing daemon URL fetch guard in src/daemon/url-fetch-guard.ts. That separation is intentional for this patch because the trust boundary here is the core podcast transcript provider: untrusted feed metadata crosses directly into core transcript fetching and must be protected even when the daemon layer is not involved.

The daemon guard remains the boundary for daemon-owned URL fetches. Maintainers may still choose to extract a shared helper later to reduce long-term drift, but this PR does not assume the daemon guard automatically protects core transcript fetches.

Attack flow

  1. Attacker publishes or controls a podcast/RSS feed.
  2. The feed includes a <podcast:transcript> entry whose url points to a local/private target, or to a public URL that redirects to one.
  3. Summarize selects that transcript URL while processing the feed.
  4. Vulnerable code fetches the URL from the host environment.
  5. The response is parsed as transcript text and can be surfaced through the transcript/summarization flow.

Affected code

  • packages/core/src/content/transcript/providers/podcast/rss-transcript.ts
    • RSS <podcast:transcript> extraction and transcript fetch path.

Root cause

  • Feed-controlled transcript URLs crossed directly into host-side fetch.
  • Redirect handling was delegated to automatic fetch behavior, so redirected targets were not revalidated.
  • Hostname targets were not resolved and checked against local/private network ranges before request dispatch.

CVSS assessment

  • Estimated severity: High
  • Estimated CVSS v3.1: 8.1
  • Estimated vector: AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:N

The exact score depends on whether maintainers treat arbitrary podcast/RSS feeds as untrusted remote input in the affected deployment. The key security property is that untrusted feed metadata could cause host-side requests to internal/local network destinations.

Safe reproduction steps

A bounded regression test is included instead of probing real local services:

  1. Construct a feed containing <podcast:transcript url="http://127.0.0.1:8080/admin/transcript.txt" />.
  2. Pass the feed through tryFetchTranscriptFromFeedXml with a mocked fetch implementation.
  3. Verify the result is null, the mocked fetch implementation is not called, and the rejection note mentions a blocked local network target.

Additional tests cover DNS answers that resolve to private/link-local addresses, redirect revalidation, and pinned dispatcher use for public hostname targets.

Expected vulnerable behavior

On vulnerable code, the loopback/private transcript URL path would be passed to fetch, and automatic redirects could move an initially public transcript URL to a local/private destination without a second validation step.

Changes in this PR

  • Adds transcript URL target validation before fetch.
  • Adds IPv4/IPv6 local/private/reserved range checks.
  • Adds hostname resolution before transcript fetch and rejects unsafe DNS answers.
  • Switches transcript fetching to manual redirect handling and validates each redirect target.
  • Adds an undici dispatcher for native fetch so resolved hostnames use the validated address set.
  • Adds security regression tests for direct URL literals, DNS resolution, redirects, and pinned dispatch.

Files changed

  • packages/core/src/content/transcript/providers/podcast/rss-transcript.ts
    • Adds the URL/network guard, redirect revalidation, and pinned dispatcher logic.
  • tests/security.rss-transcript-ssrf.test.ts
    • Adds focused SSRF regression coverage.
  • packages/core/package.json
    • Adds undici for pinned dispatcher support.
  • pnpm-lock.yaml
    • Updates the lockfile for the new dependency.

Maintainer impact

  • Legitimate public HTTP/HTTPS transcript URLs continue to work.
  • Transcript URLs that resolve to local/private networks are intentionally rejected.
  • The change is scoped to RSS podcast transcript fetching and does not change general feed parsing semantics.

Fix rationale

The safe boundary is at the point where feed-controlled metadata becomes a host-side network request. Checking only the original URL is not enough because DNS resolution and redirects can change the eventual target. This PR validates the original target, validates DNS answers, pins the native fetch to those answers where possible, and repeats validation across redirects.

Type of change

  • Security hardening
  • Bug fix
  • Regression tests
  • Documentation-only change

Test plan

Local validation completed:

  • pnpm install
  • pnpm -s test tests/security.rss-transcript-ssrf.test.ts
    • 4 tests passed.
  • pnpm -s test tests/transcript.podcast-provider.podcast-transcript.test.ts
    • 2 tests passed.
  • git diff --check
  • pnpm -s format:check
    • All matched files use the correct format; 969 files checked.
  • pnpm -s lint
    • 0 warnings and 0 errors; 903 files checked.
  • pnpm -s typecheck
  • pnpm -s build

Local note: install/build contexts emitted the existing engine warning because this machine is using Node v22.22.2 while the package declares Node >=24. The commands above completed successfully on this local environment.

Disclosure notes

This PR is intentionally limited to RSS podcast transcript URL fetching. It does not claim to audit every outbound network request in the repository. The patch focuses on the feed-controlled transcript URL path where untrusted RSS metadata can become host-side network access.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 5, 2026

Codex review: needs maintainer review before merge. Reviewed June 5, 2026, 8:55 PM ET / 00:55 UTC.

Summary
The PR adds RSS transcript URL, DNS, IP range, redirect, and pinned-dispatch validation, adds undici to the core package, and adds focused SSRF regression tests.

Reproducibility: yes. from source inspection: current main fetches feed-controlled RSS transcript URLs directly with automatic redirects, and the PR's tests and runtime proof exercise the blocked loopback/private and allowed public paths. I did not run the exploit locally because this review is read-only.

Review metrics: 2 noteworthy metrics.

  • Files touched: 4 files changed. The change spans core runtime code, core package metadata, the lockfile, and a new security test file.
  • Core dependency delta: 1 dependency added. Adding undici to @steipete/summarize-core is supply-chain and package-surface relevant even though the root package already uses the same dependency.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Record maintainer acceptance of the local/private transcript blocking behavior and the scoped duplicate guard, or request shared guard extraction before merge.

Risk before merge

  • [P1] Existing users whose RSS feeds intentionally point transcript URLs at localhost, link-local, or private-network hosts will lose that transcript path after merge; that appears intentional for SSRF hardening but is still compatibility-sensitive.
  • [P1] The new core guard closely mirrors src/daemon/url-fetch-guard.ts, so future security-boundary fixes may drift unless maintainers accept short-term duplication or extract shared logic.
  • [P1] Adding undici to @steipete/summarize-core changes the core dependency surface, though it matches the root package's existing undici version.

Maintainer options:

  1. Accept the scoped guard now (recommended)
    Merge the focused RSS transcript hardening with the known local/private transcript compatibility break and track shared guard extraction separately if maintainers want it.
  2. Extract a shared guard before merge
    Ask the branch to factor the daemon and core transcript URL guard logic through one shared helper with both daemon and RSS transcript regression coverage.
  3. Pause for compatibility policy
    Pause the PR if private-network RSS transcript URLs are an intended supported workflow that needs an opt-in policy instead of unconditional blocking.

Next step before merge

  • [P2] The remaining action is maintainer security and compatibility judgment, not an automated repair task.

Security
Cleared: The diff is security hardening with focused tests and runtime proof; no concrete new secret, CI, publish, or supply-chain defect was found.

Review details

Best possible solution:

Land the scoped RSS transcript guard once maintainers accept the fail-closed compatibility impact and either accept the duplicated guard as a short-term boundary or request a shared helper first.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main fetches feed-controlled RSS transcript URLs directly with automatic redirects, and the PR's tests and runtime proof exercise the blocked loopback/private and allowed public paths. I did not run the exploit locally because this review is read-only.

Is this the best way to solve the issue?

Yes, the PR is a narrow and maintainable security hardening path for the core RSS transcript boundary, but maintainers should explicitly accept the private-network compatibility break and duplicated guard logic before merge.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f40e26b1330e.

Label changes

Label changes:

  • add merge-risk: 🚨 compatibility: The patch intentionally rejects local, link-local, and private-network transcript URLs that existing RSS workflows may have used.
  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes redacted terminal/runtime output showing the blocked loopback transcript path with zero local service hits and a successful public transcript fetch.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes redacted terminal/runtime output showing the blocked loopback transcript path with zero local service hits and a successful public transcript fetch.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.
  • remove rating: 🦪 silver shellfish: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P2: This is a focused security hardening PR for a bounded RSS transcript fetch path, with tests and runtime proof but no emergency repository-wide outage signal.
  • merge-risk: 🚨 compatibility: The patch intentionally rejects local, link-local, and private-network transcript URLs that existing RSS workflows may have used.
  • merge-risk: 🚨 security-boundary: The PR changes SSRF boundary behavior and duplicates similar daemon guard logic, so maintainers should review the boundary choice beyond green tests.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes redacted terminal/runtime output showing the blocked loopback transcript path with zero local service hits and a successful public transcript fetch.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes redacted terminal/runtime output showing the blocked loopback transcript path with zero local service hits and a successful public transcript fetch.
Evidence reviewed

What I checked:

Likely related people:

  • Peter Steinberger: Blame and git history show the current RSS transcript provider, podcast transcript refactors, RSS transcript tests, and daemon URL guard in Peter-authored commits. (role: feature-history owner; confidence: high; commits: fcf8c8e5e98d, d1dbf0c396f8, dc6101c0bca8; files: packages/core/src/content/transcript/providers/podcast/rss-transcript.ts, packages/core/src/content/transcript/providers/podcast.ts, tests/transcript.podcast-provider.podcast-transcript.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. labels Jun 5, 2026
@Hinotoi-agent
Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

I updated the PR body with follow-up proof for the P1 items on head 70d7e0e41da52256f2ed3139b4ef98b01699af82:

  • Added redacted runtime/feed output showing a loopback/private <podcast:transcript> URL is blocked before request dispatch (local_service_hits: 0).
  • Added redacted runtime/feed output showing a public transcript URL still succeeds through the same core RSS transcript path.
  • Clarified the boundary between the new core RSS transcript guard and the existing daemon URL fetch guard: they are intentionally separate in this PR because core transcript fetching must be protected even when the daemon layer is not involved; maintainers can still choose a shared helper later to reduce drift.

Focused local validation rerun after rehydrating this PR:

pnpm exec tsx /tmp/summarize-rss-transcript-runtime-proof.ts
pnpm -s test tests/security.rss-transcript-ssrf.test.ts
pnpm -s test tests/transcript.podcast-provider.podcast-transcript.test.ts
git diff --check

All of the above completed successfully locally. The local machine is still Node v22.22.2 while the package declares Node >=24, so the engine warning remains environmental rather than a test failure.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 6, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant