Migrate Fetch Details step (was: Scrape) to Pydantic AI agent by rafacm · Pull Request #107 · rafacm/ragtime

rafacm · 2026-04-28T11:41:55Z

Summary

Renames pipeline step Scrape → Fetch Details (full rename, BREAKING for env vars + status string).
Migrates the step from the legacy LLMProvider factory to a Pydantic AI agent (episodes/agents/fetch_details.py) — single structured-output call, no tools yet.
Introduces shared episodes/agents/_model.py build_model(model_string, api_key) helper, used by both the new fetch_details agent and the existing recovery agent.

Closes #106. Plan: doc/plans/2026-04-28-fetch-details-agent.md. Feature doc: doc/features/2026-04-28-fetch-details-agent.md.

Why this shape

This is the first of a planned series of step → agent migrations. The Scrape step was structurally simple (one LLM call, no tools), making it a clean SDK swap that locks in the agent shape before subsequent PRs grow it (browser tools absorbed from the recovery agent, eventually deprecating the recovery cluster altogether).

Module purity rule: episodes/agents/fetch_details.py imports only Pydantic AI, Pydantic, stdlib, and _model.py — no Django, no DBOS, no episodes.models. Tests can do from episodes.agents.fetch_details import run, EpisodeDetails without booting Django.

What changed

Status enum: SCRAPING = "scraping" → FETCHING_DETAILS = "fetching_details". Migration 0021_rename_scraping_to_fetching_details rewrites every persisted literal in Episode.status, ProcessingStep.step_name, PipelineEvent.step_name, ProcessingRun.resumed_from_step.
Files: episodes/scraper.py → episodes/fetch_details_step.py. Recovery agent files renamed agents/{agent,browser,deps,tools,resume}.py → recovery_*.py (transitional prefix ahead of deletion).
Function: scrape_episode → fetch_episode_details. Telemetry span name + recovery can_handle() literal flipped.

Env vars (Convention B):

Removed	Added
`RAGTIME_SCRAPING_PROVIDER`	(provider encoded in model string prefix)
`RAGTIME_SCRAPING_MODEL=gpt-4.1-mini`	`RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini`
`RAGTIME_SCRAPING_API_KEY`	`RAGTIME_FETCH_DETAILS_API_KEY`

Configure wizard skips writing _PROVIDER for Convention B subsystems (those whose fields list omits PROVIDER) while continuing to share the API key across the LLM group.

get_scraping_provider deleted from episodes/providers/factory.py.
Step orchestrator delegates the LLM call to the agent via asyncio.run. Preserves fast-path skip when required fields are pre-filled, empty-field-only merge, save-before-fail-step ordering for recovery interaction.
Recovery agent's local _build_model replaced by _model.build_model shared call.

PR shape

Two commits, per the plan's "Option I":

Rename — files, status enum, env vars, configure wizard, recovery can_handle literal, status data migration, README/AGENTS.md/doc/README.md, plan + planning session transcript.
Agent migration — introduce _model.py and agents/fetch_details.py, refactor fetch_details_step.py to call the agent, refactor recovery agent to share _model.py, delete get_scraping_provider, rewrite tests, feature doc + implementation session transcript + CHANGELOG.

Manual diagram updates — deferred

The Excalidraw diagrams cannot be auto-regenerated and are deferred to the follow-up PR that migrates the Download step to its own agent — both PRs will need the same set of edits, so it's cheaper to do them once together.

~~ragtime-processing-pipeline.{excalidraw,svg} — rename "Scrape" box to "Fetch Details".~~ (deferred to Download-agent PR)
~~ragtime-processing-pipeline-with-recovery.{excalidraw,svg} — same rename.~~ (deferred)
~~ragtime-recovery.{excalidraw,svg} — review for any "scraping" labels and update.~~ (deferred)

Test plan

uv run python manage.py test — full suite, 349 tests pass.
uv run python manage.py check — clean.
uv run python manage.py makemigrations --dry-run — no extra migrations needed (model state matches 0021).
On-device: submit a fresh episode, watch it move pending → queued → fetching_details → downloading → .... With RAGTIME_OTEL_COLLECTORS=jaeger the trace span appears as fetch_episode_details.
On-device: trigger a fetch_details failure (e.g. an invalid URL) with RAGTIME_RECOVERY_AGENT_ENABLED=true and confirm AgentStrategy.can_handle matches the new literal.
~~Diagrams updated~~ — deferred to Download-agent PR.

🤖 Generated with Claude Code

Full rename ahead of the Pydantic AI agent migration. Status enum ``scraping`` → ``fetching_details``; ``episodes/scraper.py`` → ``episodes/fetch_details_step.py``; ``episodes/agents/{agent,browser,deps, tools,resume}.py`` → ``recovery_*.py`` (recovery agent files marked transitional ahead of deletion). Function ``scrape_episode`` → ``fetch_episode_details``; telemetry span name + recovery ``can_handle()`` literal flipped accordingly. Env-var convention shifts to Convention B for the new step: ``RAGTIME_SCRAPING_PROVIDER`` is dropped (provider encoded in the model string prefix), ``RAGTIME_SCRAPING_MODEL`` → ``RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini``, ``RAGTIME_SCRAPING_API_KEY`` → ``RAGTIME_FETCH_DETAILS_API_KEY``. Configure wizard updated to skip ``_PROVIDER`` for Convention B subsystems while continuing to share the API key across the LLM group. Status data migration ``0021_rename_scraping_to_fetching_details`` rewrites every ``"scraping"`` literal in ``Episode.status``, ``ProcessingStep.step_name``, ``PipelineEvent.step_name``, and ``ProcessingRun.resumed_from_step``, then alters the ``Episode.status`` choices. ``get_scraping_provider`` is kept (transitional) and rewired to read the new env vars + parse the ``provider:model`` prefix; it is replaced by the ``fetch_details`` agent and deleted in the follow-up commit. Docs (README, AGENTS.md, doc/README.md) updated for the rename. Excalidraw diagrams flagged in the PR description for manual update — they cannot be auto-regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the legacy ``LLMProvider`` call with a Pydantic AI agent. Adds ``episodes/agents/fetch_details.py`` (single structured-output ``Agent.run`` call, no tools yet) and ``episodes/agents/_model.py`` (pure ``build_model(model_string, api_key)`` helper). Recovery agent refactored to share ``_model.build_model`` instead of its own ``_build_model``. ``get_scraping_provider`` deleted from ``episodes/providers/factory.py``. The fetch_details agent module imports only Pydantic AI, Pydantic, stdlib, and ``_model`` — Django imports are inside the lazy ``get_agent()`` factory so the module is bootable from a bare interpreter for unit and eval tests: from episodes.agents.fetch_details import run, EpisodeDetails ``EpisodeDetails`` carries validators for ``published_at`` (``YYYY-MM-DD``, fall back to ``None``) and ``language`` (ISO 639-1 regex, fall back to ``None``). URLs stay ``str | None`` to preserve the relative-URL tolerance from the previous prompt. The step orchestrator (``episodes/fetch_details_step.py``) now calls ``asyncio.run(fetch_details.run(html))`` from its sync DBOS-step body and copies fields with the existing empty-field-only merge. All other behaviors (HTML fetch + clean, fast-path skip, completeness check, save-before-fail-step ordering) are preserved. Tests rewritten around ``Agent.override(model=TestModel(...))``: each LLM-path test wraps the call site in an ``agent.override`` context that feeds a deterministic ``EpisodeDetails`` payload through the validator. The fast-path skip test patches ``_run_agent_sync`` directly and asserts it was not called. Adds a small ``EpisodeDetailsValidatorTests`` class with pure-Pydantic tests (no Django, no agent run). Total: 338 tests pass. Includes feature doc, implementation session transcript, and a ``**BREAKING**`` CHANGELOG entry covering both commits of this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rafacm

Review comments from local pass. Full test suite passed locally, but these issues look worth fixing before merge.

Three fixes from the local review pass on PR #107: **Drop eager re-export from ``episodes/agents/__init__.py``.** ``from .recovery import run_recovery_agent`` was pulling Django (via ``episodes.models``) and Playwright into every importer of the package, defeating the module-purity contract documented on ``fetch_details.py``. ``episodes.agents`` is now an empty package; ``recovery.py`` and ``workflows.py`` import the recovery entry point by full path (``from .agents.recovery import run_recovery_agent``). Verified ``import episodes.agents.fetch_details`` no longer loads recovery or Django. Test patches updated to target the canonical location ``episodes.agents.recovery.run_recovery_agent``. **Wizard syncs Convention B model prefix to the shared provider.** In shared mode for the LLM group, picking a non-default provider (e.g. ``anthropic``) used to leave ``RAGTIME_FETCH_DETAILS_MODEL`` at its stale ``openai:`` prefix, producing an invalid key/model pair. ``configure._prompt_shared_system`` now strips the existing prefix from the displayed default and re-prepends ``<shared_provider>:`` so the proposal tracks the shared provider. Idempotent across re-runs. **``_model.build_model`` no longer mutates ``os.environ``.** The previous ``os.environ.setdefault(env_var, api_key)`` had two defects: (a) an ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY`` silently won over the configured ``RAGTIME_*_API_KEY``, and (b) whichever agent built first locked its key into the process for all subsequent agents — a real cross-agent credential leak in long-lived workers that run both ``fetch_details`` and ``recovery``. The helper now constructs concrete ``AnthropicModel`` / ``GoogleModel`` instances when the SDK is installed (symmetric to the existing OpenAI branch), threading the API key through the provider constructor. Unknown providers and missing SDKs fall through to the raw model string with a warning — no global state mutation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two follow-ups on top of b13d75f: **Wizard proposes a provider-appropriate model suffix.** The previous v1 fix synced only the prefix to the shared provider, so picking ``anthropic`` produced ``anthropic:gpt-4.1-mini`` — invalid on Enter. ``CONVENTION_B_PROVIDER_DEFAULTS`` now maps each known provider to a default model name (openai → gpt-4.1-mini, anthropic → claude-sonnet-4-20250514, google → gemini-2.5-pro), and the wizard's shared-mode loop rebuilds the proposal end-to-end. When the existing prefix already matches the shared provider, the user's suffix is preserved (idempotent across re-runs); otherwise the provider's default suffix is swapped in. **``_model.build_model`` falls back to env-scoped resolution when the provider submodule can't be imported.** The previous v1 fix returned the raw model string in this case, which let ``Agent(model_string)`` resolve through Pydantic AI's ``infer_model`` and pick up any ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY`` — silently overriding the configured ``RAGTIME_*_API_KEY``. The helper now wraps the late-resolution path in ``_temp_env(env_var, api_key)``, a ``contextlib.contextmanager`` that sets the env var, calls ``infer_model`` (which captures the key into the provider's HTTP client during construction), and restores the prior value (or removes the var if it didn't exist) — even on exception. Concurrent / subsequent agents with different configured keys can no longer cross-contaminate. For unknown providers we still return the raw string with a warning, because we don't know which env var to scope. Adds ``episodes/tests/test_agents_model.py`` (10 tests covering the OpenAI / Google concrete branches, ``_temp_env`` set/restore/exception behavior, the Anthropic fallback's env-var restoration on ``ImportError``, and the unknown-provider warning path) and one new configure-wizard test asserting the Anthropic shared-provider scenario produces ``anthropic:claude-sonnet-4-20250514``. Total: 349 tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three fixes from the local review pass on PR #107: **Drop eager re-export from ``episodes/agents/__init__.py``.** ``from .recovery import run_recovery_agent`` was pulling Django (via ``episodes.models``) and Playwright into every importer of the package, defeating the module-purity contract documented on ``fetch_details.py``. ``episodes.agents`` is now an empty package; ``recovery.py`` and ``workflows.py`` import the recovery entry point by full path (``from .agents.recovery import run_recovery_agent``). Verified ``import episodes.agents.fetch_details`` no longer loads recovery or Django. Test patches updated to target the canonical location ``episodes.agents.recovery.run_recovery_agent``. **Wizard syncs Convention B model prefix to the shared provider.** In shared mode for the LLM group, picking a non-default provider (e.g. ``anthropic``) used to leave ``RAGTIME_FETCH_DETAILS_MODEL`` at its stale ``openai:`` prefix, producing an invalid key/model pair. ``configure._prompt_shared_system`` now strips the existing prefix from the displayed default and re-prepends ``<shared_provider>:`` so the proposal tracks the shared provider. Idempotent across re-runs. **``_model.build_model`` no longer mutates ``os.environ``.** The previous ``os.environ.setdefault(env_var, api_key)`` had two defects: (a) an ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY`` silently won over the configured ``RAGTIME_*_API_KEY``, and (b) whichever agent built first locked its key into the process for all subsequent agents — a real cross-agent credential leak in long-lived workers that run both ``fetch_details`` and ``recovery``. The helper now constructs concrete ``AnthropicModel`` / ``GoogleModel`` instances when the SDK is installed (symmetric to the existing OpenAI branch), threading the API key through the provider constructor. Unknown providers and missing SDKs fall through to the raw model string with a warning — no global state mutation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rafacm and others added 2 commits April 28, 2026 13:34

rafacm commented Apr 28, 2026

View reviewed changes

Comment thread episodes/agents/__init__.py Outdated

Comment thread core/management/commands/configure.py Outdated

Comment thread episodes/agents/_model.py Outdated

rafacm and others added 2 commits April 28, 2026 14:00

rafacm merged commit 1629383 into main Apr 28, 2026
1 check passed

rafacm deleted the feature/fetch-details-agent branch April 28, 2026 14:32

This was referenced Apr 28, 2026

Download as agent + podcast-index fallback + DBOS state-of-truth #108

Closed

Download as agent + podcast-index fallback + DBOS state-of-truth #109

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate Fetch Details step (was: Scrape) to Pydantic AI agent#107

Migrate Fetch Details step (was: Scrape) to Pydantic AI agent#107
rafacm merged 4 commits into
mainfrom
feature/fetch-details-agent

rafacm commented Apr 28, 2026 •

edited

Loading

Uh oh!

rafacm left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rafacm commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this shape

What changed

PR shape

Manual diagram updates — deferred

Test plan

Uh oh!

rafacm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rafacm commented Apr 28, 2026 •

edited

Loading