Migrate Fetch Details step (was: Scrape) to Pydantic AI agent #107
Merged
Full rename ahead of the Pydantic AI agent migration. Status enum
``scraping`` → ``fetching_details``; ``episodes/scraper.py`` →
``episodes/fetch_details_step.py``; ``episodes/agents/{agent,browser,deps,
tools,resume}.py`` → ``recovery_*.py`` (recovery agent files marked
transitional ahead of deletion). Function ``scrape_episode`` →
``fetch_episode_details``; telemetry span name + recovery
``can_handle()`` literal flipped accordingly.
Env-var convention shifts to Convention B for the new step:
``RAGTIME_SCRAPING_PROVIDER`` is dropped (provider encoded in the model
string prefix), ``RAGTIME_SCRAPING_MODEL`` →
``RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini``,
``RAGTIME_SCRAPING_API_KEY`` → ``RAGTIME_FETCH_DETAILS_API_KEY``.
Configure wizard updated to skip ``_PROVIDER`` for Convention B
subsystems while continuing to share the API key across the LLM group.
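Since Convention B packs the provider into the model string, one split recovers both parts. The following is a minimal sketch of that parsing, not the PR's code: the helper name ``split_model_string`` is hypothetical, and treating a missing prefix as an error is an assumption.

```python
def split_model_string(model_string: str) -> tuple[str, str]:
    """Split 'provider:model' into (provider, model).

    Hypothetical helper for illustration; not taken from the PR.
    """
    # partition() splits on the first colon only, so model names that
    # themselves contain a colon stay intact in the second element.
    provider, sep, model = model_string.partition(":")
    if not sep or not provider or not model:
        # Assumption: a missing or empty prefix is a configuration error.
        raise ValueError(f"expected 'provider:model', got {model_string!r}")
    return provider, model


provider, model = split_model_string("openai:gpt-4.1-mini")
```

Here ``provider`` is ``"openai"`` and ``model`` is ``"gpt-4.1-mini"``; an old-style value such as ``gpt-4.1-mini`` would raise under this sketch.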
Status data migration ``0021_rename_scraping_to_fetching_details``
rewrites every ``"scraping"`` literal in ``Episode.status``,
``ProcessingStep.step_name``, ``PipelineEvent.step_name``, and
``ProcessingRun.resumed_from_step``, then alters the ``Episode.status``
choices.
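The literal rewrite amounts to the same ``UPDATE`` applied to four columns. The sketch below illustrates this with plain ``sqlite3`` instead of Django's migration framework, which the real ``0021`` migration uses. The table and column names are illustrative stand-ins, not the project's actual schema.

```python
import sqlite3

# Assumption: these (table, column) pairs are stand-ins for Episode.status,
# ProcessingStep.step_name, PipelineEvent.step_name and
# ProcessingRun.resumed_from_step.
TARGETS = [
    ("episode", "status"),
    ("processing_step", "step_name"),
    ("pipeline_event", "step_name"),
    ("processing_run", "resumed_from_step"),
]


def rename_literal(conn: sqlite3.Connection, old: str, new: str) -> int:
    """Rewrite every `old` literal in the target columns; return rows changed."""
    changed = 0
    for table, column in TARGETS:
        # Identifiers come from the fixed TARGETS list; values are bound.
        cur = conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE {column} = ?", (new, old)
        )
        changed += cur.rowcount
    return changed


conn = sqlite3.connect(":memory:")
for table, column in TARGETS:
    conn.execute(f"CREATE TABLE {table} ({column} TEXT)")
conn.execute("INSERT INTO episode VALUES ('scraping'), ('queued')")
conn.execute("INSERT INTO pipeline_event VALUES ('scraping')")

changed = rename_literal(conn, "scraping", "fetching_details")
remaining = conn.execute(
    "SELECT COUNT(*) FROM episode WHERE status = 'scraping'"
).fetchone()[0]
```

Rows holding other statuses (``queued`` here) are left untouched, which is the property a data migration of this kind needs.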
``get_scraping_provider`` is kept (transitional) and rewired to read the
new env vars + parse the ``provider:model`` prefix; it is replaced by the
``fetch_details`` agent and deleted in the follow-up commit.
Docs (README, AGENTS.md, doc/README.md) updated for the rename. Excalidraw
diagrams flagged in the PR description for manual update — they cannot be
auto-regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the legacy ``LLMProvider`` call with a Pydantic AI agent.
Adds ``episodes/agents/fetch_details.py`` (single structured-output
``Agent.run`` call, no tools yet) and ``episodes/agents/_model.py``
(pure ``build_model(model_string, api_key)`` helper). Recovery agent
refactored to share ``_model.build_model`` instead of its own
``_build_model``. ``get_scraping_provider`` deleted from
``episodes/providers/factory.py``.
The fetch_details agent module imports only Pydantic AI, Pydantic,
stdlib, and ``_model`` — Django imports are inside the lazy
``get_agent()`` factory so the module is bootable from a bare
interpreter for unit and eval tests:
    from episodes.agents.fetch_details import run, EpisodeDetails
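A lazy factory keeps heavy imports out of module load time by placing them inside the function body. The sketch below shows the pattern only. The stdlib module ``wave`` stands in for a heavy dependency such as Django, and caching with ``functools.lru_cache`` is an assumption rather than the PR's implementation.

```python
import functools


@functools.lru_cache(maxsize=1)
def get_agent():
    """Build the agent object on first use and reuse it afterwards."""
    # Deferred import: runs on the first call, not when this module is imported.
    # Assumption: `wave` is a placeholder for a heavy framework import.
    import wave as heavy_framework

    return {"backend": heavy_framework.__name__}


first = get_agent()
second = get_agent()  # served from the cache; no second construction
```

Importing a module that defines ``get_agent`` this way costs nothing until the factory is actually called, which is what lets unit tests import the data model without booting the framework.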
``EpisodeDetails`` carries validators for ``published_at``
(``YYYY-MM-DD``, fall back to ``None``) and ``language`` (ISO 639-1
regex, fall back to ``None``). URLs stay ``str | None`` to preserve
the relative-URL tolerance from the previous prompt.
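The fall-back-to-``None`` behavior can be sketched in plain Python, without Pydantic. The function names are hypothetical, and modeling ISO 639-1 as exactly two lowercase ASCII letters is an assumption about the regex, which the commit message does not spell out.

```python
import re
from datetime import date

DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
# Assumption: ISO 639-1 codes are modeled as two lowercase ASCII letters.
LANGUAGE_RE = re.compile(r"[a-z]{2}")


def clean_published_at(value):
    """Return value if it is a real YYYY-MM-DD calendar date, else None."""
    if not isinstance(value, str) or not DATE_RE.fullmatch(value):
        return None
    try:
        date.fromisoformat(value)  # rejects impossible dates like 2026-02-30
    except ValueError:
        return None
    return value


def clean_language(value):
    """Return value if it looks like an ISO 639-1 code, else None."""
    if isinstance(value, str) and LANGUAGE_RE.fullmatch(value):
        return value
    return None
```

Returning ``None`` instead of raising means a malformed field from the model degrades to "unknown" and does not fail the whole structured output.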
The step orchestrator (``episodes/fetch_details_step.py``) now calls
``asyncio.run(fetch_details.run(html))`` from its sync DBOS-step body
and copies fields with the existing empty-field-only merge. All
other behaviors (HTML fetch + clean, fast-path skip, completeness
check, save-before-fail-step ordering) are preserved.
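The sync-to-async bridge and the empty-field-only merge can be sketched with stand-ins. ``run_agent`` below is a hypothetical placeholder for the agent call, and the episode is modeled as a plain dict, not a Django model.

```python
import asyncio


async def run_agent(html: str) -> dict:
    """Stand-in for the async agent call; returns extracted fields."""
    await asyncio.sleep(0)  # placeholder for the real network round trip
    return {"title": "Extracted title", "language": "en", "image_url": None}


def merge_empty_fields(episode: dict, details: dict) -> dict:
    """Copy a detail only when the episode field is empty and the detail is not."""
    merged = dict(episode)
    for field, value in details.items():
        if not merged.get(field) and value:
            merged[field] = value
    return merged


def fetch_details_step(episode: dict, html: str) -> dict:
    """Sync step body: drive the coroutine to completion, then merge."""
    details = asyncio.run(run_agent(html))
    return merge_empty_fields(episode, details)


episode = {"title": "Curated title", "language": "", "image_url": ""}
result = fetch_details_step(episode, "<html></html>")
```

In ``result`` the pre-filled ``title`` survives, the empty ``language`` is filled with ``"en"``, and ``image_url`` stays empty because the agent returned nothing for it. Note that ``asyncio.run`` raises ``RuntimeError`` when called from a thread that already has a running event loop, so this bridge only fits a genuinely synchronous caller.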
Tests rewritten around ``Agent.override(model=TestModel(...))``: each
LLM-path test wraps the call site in an ``agent.override`` context
that feeds a deterministic ``EpisodeDetails`` payload through the
validator. The fast-path skip test patches ``_run_agent_sync``
directly and asserts it was not called. Adds a small
``EpisodeDetailsValidatorTests`` class with pure-Pydantic tests
(no Django, no agent run). Total: 338 tests pass.
Includes feature doc, implementation session transcript, and a
``**BREAKING**`` CHANGELOG entry covering both commits of this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rafacm (Owner, Author) commented on Apr 28, 2026:
Review comments from local pass. Full test suite passed locally, but these issues look worth fixing before merge.
Three fixes from the local review pass on PR #107:

**Drop eager re-export from ``episodes/agents/__init__.py``.** ``from .recovery import run_recovery_agent`` was pulling Django (via ``episodes.models``) and Playwright into every importer of the package, defeating the module-purity contract documented on ``fetch_details.py``. ``episodes.agents`` is now an empty package; ``recovery.py`` and ``workflows.py`` import the recovery entry point by full path (``from .agents.recovery import run_recovery_agent``). Verified ``import episodes.agents.fetch_details`` no longer loads recovery or Django. Test patches updated to target the canonical location ``episodes.agents.recovery.run_recovery_agent``.

**Wizard syncs Convention B model prefix to the shared provider.** In shared mode for the LLM group, picking a non-default provider (e.g. ``anthropic``) used to leave ``RAGTIME_FETCH_DETAILS_MODEL`` at its stale ``openai:`` prefix, producing an invalid key/model pair. ``configure._prompt_shared_system`` now strips the existing prefix from the displayed default and re-prepends ``<shared_provider>:`` so the proposal tracks the shared provider. Idempotent across re-runs.

**``_model.build_model`` no longer mutates ``os.environ``.** The previous ``os.environ.setdefault(env_var, api_key)`` had two defects: (a) an ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY`` silently won over the configured ``RAGTIME_*_API_KEY``, and (b) whichever agent built first locked its key into the process for all subsequent agents, a real cross-agent credential leak in long-lived workers that run both ``fetch_details`` and ``recovery``. The helper now constructs concrete ``AnthropicModel`` / ``GoogleModel`` instances when the SDK is installed (symmetric to the existing OpenAI branch), threading the API key through the provider constructor. Unknown providers and missing SDKs fall through to the raw model string with a warning, with no global state mutation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
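Both ``setdefault`` defects described in this commit come from one property: ``os.environ.setdefault`` never overwrites an existing value. The sketch below reproduces defect (b) with a hypothetical variable name so that no real credential is touched; the helper is illustrative and not the removed code.

```python
import os


def build_with_setdefault(env_var: str, api_key: str) -> str:
    """Mimic the old behavior: set the variable only if it is absent."""
    os.environ.setdefault(env_var, api_key)
    return os.environ[env_var]  # the key a provider SDK would pick up


ENV_VAR = "DEMO_PROVIDER_API_KEY"  # hypothetical name, avoids real keys
os.environ.pop(ENV_VAR, None)

first = build_with_setdefault(ENV_VAR, "key-for-fetch-details")
second = build_with_setdefault(ENV_VAR, "key-for-recovery")
os.environ.pop(ENV_VAR, None)  # clean up the demo variable
```

``second`` comes back as ``"key-for-fetch-details"``: the second agent silently inherits the first agent's key. Defect (a) is the same mechanism with the value already present in the ambient environment before either agent is built.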
Two follow-ups on top of b13d75f:

**Wizard proposes a provider-appropriate model suffix.** The previous v1 fix synced only the prefix to the shared provider, so picking ``anthropic`` produced ``anthropic:gpt-4.1-mini``, which is invalid on Enter. ``CONVENTION_B_PROVIDER_DEFAULTS`` now maps each known provider to a default model name (openai → gpt-4.1-mini, anthropic → claude-sonnet-4-20250514, google → gemini-2.5-pro), and the wizard's shared-mode loop rebuilds the proposal end-to-end. When the existing prefix already matches the shared provider, the user's suffix is preserved (idempotent across re-runs); otherwise the provider's default suffix is swapped in.

**``_model.build_model`` falls back to env-scoped resolution when the provider submodule can't be imported.** The previous v1 fix returned the raw model string in this case, which let ``Agent(model_string)`` resolve through Pydantic AI's ``infer_model`` and pick up any ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY``, silently overriding the configured ``RAGTIME_*_API_KEY``. The helper now wraps the late-resolution path in ``_temp_env(env_var, api_key)``, a ``contextlib.contextmanager`` that sets the env var, calls ``infer_model`` (which captures the key into the provider's HTTP client during construction), and restores the prior value (or removes the var if it didn't exist), even on exception. Concurrent / subsequent agents with different configured keys can no longer cross-contaminate. For unknown providers we still return the raw string with a warning, because we don't know which env var to scope.

Adds ``episodes/tests/test_agents_model.py`` (10 tests covering the OpenAI / Google concrete branches, ``_temp_env`` set/restore/exception behavior, the Anthropic fallback's env-var restoration on ``ImportError``, and the unknown-provider warning path) and one new configure-wizard test asserting the Anthropic shared-provider scenario produces ``anthropic:claude-sonnet-4-20250514``. Total: 349 tests.
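The wizard's proposal rule (keep the suffix when the prefix already matches, otherwise swap in the provider default) can be sketched as a pure function. The mapping values come from the commit message. The function name ``propose_model`` and the handling of unknown providers are assumptions for illustration.

```python
CONVENTION_B_PROVIDER_DEFAULTS = {
    "openai": "gpt-4.1-mini",
    "anthropic": "claude-sonnet-4-20250514",
    "google": "gemini-2.5-pro",
}


def propose_model(current: str, shared_provider: str) -> str:
    """Return the model string the wizard would offer as the default."""
    prefix, sep, suffix = current.partition(":")
    if sep and prefix == shared_provider and suffix:
        # Prefix already matches: preserve the user's chosen model name.
        return current
    default_suffix = CONVENTION_B_PROVIDER_DEFAULTS.get(shared_provider)
    if default_suffix is None:
        # Assumption: with no known default, leave the current value alone.
        return current
    return f"{shared_provider}:{default_suffix}"


proposal = propose_model("openai:gpt-4.1-mini", "anthropic")
```

Feeding ``proposal`` back through ``propose_model`` with the same provider returns it unchanged, which is the idempotence property the commit message calls out.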
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
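A scoped environment override of the kind ``_temp_env`` is described as providing can be sketched with ``contextlib.contextmanager``. This is an illustration of the set/restore/remove behavior under a hypothetical variable name, not the project's implementation.

```python
import contextlib
import os


@contextlib.contextmanager
def temp_env(name: str, value: str):
    """Set os.environ[name] for the duration of the block, then restore it."""
    missing = object()  # sentinel: distinguishes "unset" from an empty string
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        # Runs on normal exit and on exception alike.
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


ENV_VAR = "DEMO_PROVIDER_API_KEY"  # hypothetical name, avoids real keys
os.environ[ENV_VAR] = "ambient"
with temp_env(ENV_VAR, "configured"):
    seen_inside = os.environ[ENV_VAR]
seen_after = os.environ[ENV_VAR]
del os.environ[ENV_VAR]
```

Inside the block the configured key wins over the ambient one, and afterwards the ambient value is back. The override is still process-wide while the block runs, so a sketch like this relies on the block being short and not racing with other threads that read the same variable.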
rafacm added a commit that referenced this pull request on Apr 28, 2026
This was referenced Apr 28, 2026
Summary
- Rename ``Scrape`` → ``Fetch Details`` (full rename, BREAKING for env vars + status string).
- Migrate from the ``LLMProvider`` factory to a Pydantic AI agent (``episodes/agents/fetch_details.py``): single structured-output call, no tools yet.
- Add ``episodes/agents/_model.py`` with a ``build_model(model_string, api_key)`` helper, used by both the new fetch_details agent and the existing recovery agent.

Closes #106. Plan: ``doc/plans/2026-04-28-fetch-details-agent.md``. Feature doc: ``doc/features/2026-04-28-fetch-details-agent.md``.

Why this shape
This is the first of a planned series of step → agent migrations. The Scrape step was structurally simple (one LLM call, no tools), making it a clean SDK swap that locks in the agent shape before subsequent PRs grow it (browser tools absorbed from the recovery agent, eventually deprecating the recovery cluster altogether).
Module purity rule: ``episodes/agents/fetch_details.py`` imports only Pydantic AI, Pydantic, stdlib, and ``_model.py``; no Django, no DBOS, no ``episodes.models``. Tests can do ``from episodes.agents.fetch_details import run, EpisodeDetails`` without booting Django.

What changed
- Status enum: ``SCRAPING = "scraping"`` → ``FETCHING_DETAILS = "fetching_details"``. Migration ``0021_rename_scraping_to_fetching_details`` rewrites every persisted literal in ``Episode.status``, ``ProcessingStep.step_name``, ``PipelineEvent.step_name``, ``ProcessingRun.resumed_from_step``.
- Files: ``episodes/scraper.py`` → ``episodes/fetch_details_step.py``. Recovery agent files renamed ``agents/{agent,browser,deps,tools,resume}.py`` → ``recovery_*.py`` (transitional prefix ahead of deletion).
- Function: ``scrape_episode`` → ``fetch_episode_details``. Telemetry span name + recovery ``can_handle()`` literal flipped.
- Env vars (Convention B):

| Before | After |
| --- | --- |
| ``RAGTIME_SCRAPING_PROVIDER`` | dropped (provider encoded in the model string prefix) |
| ``RAGTIME_SCRAPING_MODEL=gpt-4.1-mini`` | ``RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini`` |
| ``RAGTIME_SCRAPING_API_KEY`` | ``RAGTIME_FETCH_DETAILS_API_KEY`` |

- Configure wizard skips writing ``_PROVIDER`` for Convention B subsystems (those whose ``fields`` list omits ``PROVIDER``) while continuing to share the API key across the LLM group.
- ``get_scraping_provider`` deleted from ``episodes/providers/factory.py``.
- Step orchestrator delegates the LLM call to the agent via ``asyncio.run``. Preserves fast-path skip when required fields are pre-filled, empty-field-only merge, save-before-fail-step ordering for recovery interaction.
- Recovery agent's local ``_build_model`` replaced by the shared ``_model.build_model`` call.

PR shape
Two commits, per the plan's "Option I":
1. Rename: status enum, files, function, env vars, telemetry span, recovery ``can_handle`` literal, status data migration, README/AGENTS.md/doc/README.md, plan + planning session transcript.
2. Agent: add ``_model.py`` and ``agents/fetch_details.py``, refactor ``fetch_details_step.py`` to call the agent, refactor recovery agent to share ``_model.py``, delete ``get_scraping_provider``, rewrite tests, feature doc + implementation session transcript + CHANGELOG.

Manual diagram updates (deferred)
The Excalidraw diagrams cannot be auto-regenerated and are deferred to the follow-up PR that migrates the Download step to its own agent — both PRs will need the same set of edits, so it's cheaper to do them once together.
- (deferred to Download-agent PR) ``ragtime-processing-pipeline.{excalidraw,svg}``: rename "Scrape" box to "Fetch Details".
- (deferred) ``ragtime-processing-pipeline-with-recovery.{excalidraw,svg}``: same rename.
- (deferred) ``ragtime-recovery.{excalidraw,svg}``: review for any "scraping" labels and update.

Test plan
- ``uv run python manage.py test``: full suite, 349 tests pass.
- ``uv run python manage.py check``: clean.
- ``uv run python manage.py makemigrations --dry-run``: no extra migrations needed (model state matches ``0021``).
- Status progression ``pending → queued → fetching_details → downloading → ...``. With ``RAGTIME_OTEL_COLLECTORS=jaeger`` the trace span appears as ``fetch_episode_details``.
- Set ``RAGTIME_RECOVERY_AGENT_ENABLED=true`` and confirm ``AgentStrategy.can_handle`` matches the new literal.
- Diagrams updated: deferred to Download-agent PR.

🤖 Generated with Claude Code