Skip to content

Migrate Fetch Details step (was: Scrape) to Pydantic AI agent#107

Merged
rafacm merged 4 commits into
mainfrom
feature/fetch-details-agent
Apr 28, 2026
Merged

Migrate Fetch Details step (was: Scrape) to Pydantic AI agent#107
rafacm merged 4 commits into
mainfrom
feature/fetch-details-agent

Conversation

@rafacm
Copy link
Copy Markdown
Owner

@rafacm rafacm commented Apr 28, 2026

Summary

  • Renames pipeline step ScrapeFetch Details (full rename, BREAKING for env vars + status string).
  • Migrates the step from the legacy LLMProvider factory to a Pydantic AI agent (episodes/agents/fetch_details.py) — single structured-output call, no tools yet.
  • Introduces shared episodes/agents/_model.py build_model(model_string, api_key) helper, used by both the new fetch_details agent and the existing recovery agent.

Closes #106. Plan: doc/plans/2026-04-28-fetch-details-agent.md. Feature doc: doc/features/2026-04-28-fetch-details-agent.md.

Why this shape

This is the first of a planned series of step → agent migrations. The Scrape step was structurally simple (one LLM call, no tools), making it a clean SDK swap that locks in the agent shape before subsequent PRs grow it (browser tools absorbed from the recovery agent, eventually deprecating the recovery cluster altogether).

Module purity rule: episodes/agents/fetch_details.py imports only Pydantic AI, Pydantic, stdlib, and _model.py — no Django, no DBOS, no episodes.models. Tests can do from episodes.agents.fetch_details import run, EpisodeDetails without booting Django.

What changed

  • Status enum: SCRAPING = "scraping"FETCHING_DETAILS = "fetching_details". Migration 0021_rename_scraping_to_fetching_details rewrites every persisted literal in Episode.status, ProcessingStep.step_name, PipelineEvent.step_name, ProcessingRun.resumed_from_step.

  • Files: episodes/scraper.pyepisodes/fetch_details_step.py. Recovery agent files renamed agents/{agent,browser,deps,tools,resume}.pyrecovery_*.py (transitional prefix ahead of deletion).

  • Function: scrape_episodefetch_episode_details. Telemetry span name + recovery can_handle() literal flipped.

  • Env vars (Convention B):

    Removed Added
    RAGTIME_SCRAPING_PROVIDER (provider encoded in model string prefix)
    RAGTIME_SCRAPING_MODEL=gpt-4.1-mini RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini
    RAGTIME_SCRAPING_API_KEY RAGTIME_FETCH_DETAILS_API_KEY

    Configure wizard skips writing _PROVIDER for Convention B subsystems (those whose fields list omits PROVIDER) while continuing to share the API key across the LLM group.

  • get_scraping_provider deleted from episodes/providers/factory.py.

  • Step orchestrator delegates the LLM call to the agent via asyncio.run. Preserves fast-path skip when required fields are pre-filled, empty-field-only merge, save-before-fail-step ordering for recovery interaction.

  • Recovery agent's local _build_model replaced by _model.build_model shared call.

PR shape

Two commits, per the plan's "Option I":

  1. Rename — files, status enum, env vars, configure wizard, recovery can_handle literal, status data migration, README/AGENTS.md/doc/README.md, plan + planning session transcript.
  2. Agent migration — introduce _model.py and agents/fetch_details.py, refactor fetch_details_step.py to call the agent, refactor recovery agent to share _model.py, delete get_scraping_provider, rewrite tests, feature doc + implementation session transcript + CHANGELOG.

Manual diagram updates — deferred

The Excalidraw diagrams cannot be auto-regenerated and are deferred to the follow-up PR that migrates the Download step to its own agent — both PRs will need the same set of edits, so it's cheaper to do them once together.

  • ragtime-processing-pipeline.{excalidraw,svg} — rename "Scrape" box to "Fetch Details". (deferred to Download-agent PR)
  • ragtime-processing-pipeline-with-recovery.{excalidraw,svg} — same rename. (deferred)
  • ragtime-recovery.{excalidraw,svg} — review for any "scraping" labels and update. (deferred)

Test plan

  • uv run python manage.py test — full suite, 349 tests pass.
  • uv run python manage.py check — clean.
  • uv run python manage.py makemigrations --dry-run — no extra migrations needed (model state matches 0021).
  • On-device: submit a fresh episode, watch it move pending → queued → fetching_details → downloading → .... With RAGTIME_OTEL_COLLECTORS=jaeger the trace span appears as fetch_episode_details.
  • On-device: trigger a fetch_details failure (e.g. an invalid URL) with RAGTIME_RECOVERY_AGENT_ENABLED=true and confirm AgentStrategy.can_handle matches the new literal.
  • Diagrams updated — deferred to Download-agent PR.

🤖 Generated with Claude Code

rafacm and others added 2 commits April 28, 2026 13:34
Full rename ahead of the Pydantic AI agent migration. Status enum
``scraping`` → ``fetching_details``; ``episodes/scraper.py`` →
``episodes/fetch_details_step.py``; ``episodes/agents/{agent,browser,deps,
tools,resume}.py`` → ``recovery_*.py`` (recovery agent files marked
transitional ahead of deletion). Function ``scrape_episode`` →
``fetch_episode_details``; telemetry span name + recovery
``can_handle()`` literal flipped accordingly.

Env-var convention shifts to Convention B for the new step:
``RAGTIME_SCRAPING_PROVIDER`` is dropped (provider encoded in the model
string prefix), ``RAGTIME_SCRAPING_MODEL`` →
``RAGTIME_FETCH_DETAILS_MODEL=openai:gpt-4.1-mini``,
``RAGTIME_SCRAPING_API_KEY`` → ``RAGTIME_FETCH_DETAILS_API_KEY``.
Configure wizard updated to skip ``_PROVIDER`` for Convention B
subsystems while continuing to share the API key across the LLM group.

Status data migration ``0021_rename_scraping_to_fetching_details``
rewrites every ``"scraping"`` literal in ``Episode.status``,
``ProcessingStep.step_name``, ``PipelineEvent.step_name``, and
``ProcessingRun.resumed_from_step``, then alters the ``Episode.status``
choices.

``get_scraping_provider`` is kept (transitional) and rewired to read the
new env vars + parse the ``provider:model`` prefix; it is replaced by the
``fetch_details`` agent and deleted in the follow-up commit.

Docs (README, AGENTS.md, doc/README.md) updated for the rename. Excalidraw
diagrams flagged in the PR description for manual update — they cannot be
auto-regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the legacy ``LLMProvider`` call with a Pydantic AI agent.
Adds ``episodes/agents/fetch_details.py`` (single structured-output
``Agent.run`` call, no tools yet) and ``episodes/agents/_model.py``
(pure ``build_model(model_string, api_key)`` helper). Recovery agent
refactored to share ``_model.build_model`` instead of its own
``_build_model``. ``get_scraping_provider`` deleted from
``episodes/providers/factory.py``.

The fetch_details agent module imports only Pydantic AI, Pydantic,
stdlib, and ``_model`` — Django imports are inside the lazy
``get_agent()`` factory so the module is bootable from a bare
interpreter for unit and eval tests:

    from episodes.agents.fetch_details import run, EpisodeDetails

``EpisodeDetails`` carries validators for ``published_at``
(``YYYY-MM-DD``, fall back to ``None``) and ``language`` (ISO 639-1
regex, fall back to ``None``). URLs stay ``str | None`` to preserve
the relative-URL tolerance from the previous prompt.

The step orchestrator (``episodes/fetch_details_step.py``) now calls
``asyncio.run(fetch_details.run(html))`` from its sync DBOS-step body
and copies fields with the existing empty-field-only merge. All
other behaviors (HTML fetch + clean, fast-path skip, completeness
check, save-before-fail-step ordering) are preserved.

Tests rewritten around ``Agent.override(model=TestModel(...))``: each
LLM-path test wraps the call site in an ``agent.override`` context
that feeds a deterministic ``EpisodeDetails`` payload through the
validator. The fast-path skip test patches ``_run_agent_sync``
directly and asserts it was not called. Adds a small
``EpisodeDetailsValidatorTests`` class with pure-Pydantic tests
(no Django, no agent run). Total: 338 tests pass.

Includes feature doc, implementation session transcript, and a
``**BREAKING**`` CHANGELOG entry covering both commits of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Owner Author

@rafacm rafacm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments from local pass. Full test suite passed locally, but these issues look worth fixing before merge.

Comment thread episodes/agents/__init__.py Outdated
Comment thread core/management/commands/configure.py Outdated
Comment thread episodes/agents/_model.py Outdated
rafacm and others added 2 commits April 28, 2026 14:00
Three fixes from the local review pass on PR #107:

**Drop eager re-export from ``episodes/agents/__init__.py``.**
``from .recovery import run_recovery_agent`` was pulling Django (via
``episodes.models``) and Playwright into every importer of the
package, defeating the module-purity contract documented on
``fetch_details.py``. ``episodes.agents`` is now an empty package;
``recovery.py`` and ``workflows.py`` import the recovery entry point
by full path (``from .agents.recovery import run_recovery_agent``).
Verified ``import episodes.agents.fetch_details`` no longer loads
recovery or Django. Test patches updated to target the canonical
location ``episodes.agents.recovery.run_recovery_agent``.

**Wizard syncs Convention B model prefix to the shared provider.**
In shared mode for the LLM group, picking a non-default provider
(e.g. ``anthropic``) used to leave ``RAGTIME_FETCH_DETAILS_MODEL`` at
its stale ``openai:`` prefix, producing an invalid key/model pair.
``configure._prompt_shared_system`` now strips the existing prefix
from the displayed default and re-prepends ``<shared_provider>:`` so
the proposal tracks the shared provider. Idempotent across re-runs.

**``_model.build_model`` no longer mutates ``os.environ``.**
The previous ``os.environ.setdefault(env_var, api_key)`` had two
defects: (a) an ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY``
silently won over the configured ``RAGTIME_*_API_KEY``, and
(b) whichever agent built first locked its key into the process for
all subsequent agents — a real cross-agent credential leak in
long-lived workers that run both ``fetch_details`` and ``recovery``.
The helper now constructs concrete ``AnthropicModel`` / ``GoogleModel``
instances when the SDK is installed (symmetric to the existing OpenAI
branch), threading the API key through the provider constructor.
Unknown providers and missing SDKs fall through to the raw model
string with a warning — no global state mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups on top of b13d75f:

**Wizard proposes a provider-appropriate model suffix.**
The previous v1 fix synced only the prefix to the shared provider, so
picking ``anthropic`` produced ``anthropic:gpt-4.1-mini`` — invalid
on Enter. ``CONVENTION_B_PROVIDER_DEFAULTS`` now maps each known
provider to a default model name (openai → gpt-4.1-mini,
anthropic → claude-sonnet-4-20250514, google → gemini-2.5-pro), and
the wizard's shared-mode loop rebuilds the proposal end-to-end. When
the existing prefix already matches the shared provider, the user's
suffix is preserved (idempotent across re-runs); otherwise the
provider's default suffix is swapped in.

**``_model.build_model`` falls back to env-scoped resolution when the
provider submodule can't be imported.**
The previous v1 fix returned the raw model string in this case, which
let ``Agent(model_string)`` resolve through Pydantic AI's
``infer_model`` and pick up any ambient ``ANTHROPIC_API_KEY`` /
``GOOGLE_API_KEY`` — silently overriding the configured
``RAGTIME_*_API_KEY``. The helper now wraps the late-resolution path
in ``_temp_env(env_var, api_key)``, a ``contextlib.contextmanager``
that sets the env var, calls ``infer_model`` (which captures the key
into the provider's HTTP client during construction), and restores
the prior value (or removes the var if it didn't exist) — even on
exception. Concurrent / subsequent agents with different configured
keys can no longer cross-contaminate.

For unknown providers we still return the raw string with a warning,
because we don't know which env var to scope.

Adds ``episodes/tests/test_agents_model.py`` (10 tests covering the
OpenAI / Google concrete branches, ``_temp_env`` set/restore/exception
behavior, the Anthropic fallback's env-var restoration on
``ImportError``, and the unknown-provider warning path) and one new
configure-wizard test asserting the Anthropic shared-provider
scenario produces ``anthropic:claude-sonnet-4-20250514``. Total:
349 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafacm rafacm merged commit 1629383 into main Apr 28, 2026
1 check passed
rafacm added a commit that referenced this pull request Apr 28, 2026
Three fixes from the local review pass on PR #107:

**Drop eager re-export from ``episodes/agents/__init__.py``.**
``from .recovery import run_recovery_agent`` was pulling Django (via
``episodes.models``) and Playwright into every importer of the
package, defeating the module-purity contract documented on
``fetch_details.py``. ``episodes.agents`` is now an empty package;
``recovery.py`` and ``workflows.py`` import the recovery entry point
by full path (``from .agents.recovery import run_recovery_agent``).
Verified ``import episodes.agents.fetch_details`` no longer loads
recovery or Django. Test patches updated to target the canonical
location ``episodes.agents.recovery.run_recovery_agent``.

**Wizard syncs Convention B model prefix to the shared provider.**
In shared mode for the LLM group, picking a non-default provider
(e.g. ``anthropic``) used to leave ``RAGTIME_FETCH_DETAILS_MODEL`` at
its stale ``openai:`` prefix, producing an invalid key/model pair.
``configure._prompt_shared_system`` now strips the existing prefix
from the displayed default and re-prepends ``<shared_provider>:`` so
the proposal tracks the shared provider. Idempotent across re-runs.

**``_model.build_model`` no longer mutates ``os.environ``.**
The previous ``os.environ.setdefault(env_var, api_key)`` had two
defects: (a) an ambient ``ANTHROPIC_API_KEY`` / ``GOOGLE_API_KEY``
silently won over the configured ``RAGTIME_*_API_KEY``, and
(b) whichever agent built first locked its key into the process for
all subsequent agents — a real cross-agent credential leak in
long-lived workers that run both ``fetch_details`` and ``recovery``.
The helper now constructs concrete ``AnthropicModel`` / ``GoogleModel``
instances when the SDK is installed (symmetric to the existing OpenAI
branch), threading the API key through the provider constructor.
Unknown providers and missing SDKs fall through to the raw model
string with a warning — no global state mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafacm rafacm deleted the feature/fetch-details-agent branch April 28, 2026 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fetch Details agent (was: Scrape step)

1 participant