Fix pipeline bugs: scraper recovery overwrite + resolver wikidata_id overflow by rafacm · Pull Request #85 · rafacm/ragtime

rafacm · 2026-03-23T15:56:41Z

Summary

Bug 1 (scraper): When LLM extraction returns incomplete metadata, fail_step() triggers agent recovery synchronously via signal. The agent succeeds and saves status=TRANSCRIBING, but the scraper's subsequent episode.save() overwrites it back to FAILED from a stale local object. The transcriber then sees status=failed and exits early — the pipeline gets stuck.
Fix 1: Split the shared episode.save() into success/failure branches so the failure path saves before fail_step(), matching the pattern every other pipeline step already uses.
Bug 2 (resolver): The resolving step crashes with "value too long for type character varying(20)" because the LLM returns malformed wikidata_id values (e.g., "Q172}]} Explanation: ") that exceed Entity.wikidata_id max_length=20.
Fix 2: Add _sanitize_qid() to extract the bare Q-ID via regex (Q\d+) at both extraction points in the resolver.

Test plan

All 227 existing tests pass
New regression test test_incomplete_metadata_does_not_overwrite_recovery_status passes
_sanitize_qid verified against real LLM output patterns (bare IDs, URLs, trailing garbage)
Manual test: submit an episode whose page lacks an audio URL, confirm agent recovery succeeds and transcription starts
Manual test: run resolver on episode with Wikidata-enriched entities, confirm no varchar overflow

🤖 Generated with Claude Code

The incomplete-metadata code path called fail_step() before episode.save(). Recovery runs synchronously inside fail_step via signal: the agent sets status=TRANSCRIBING, but the scraper's subsequent save() overwrites it back to FAILED from a stale local object. Split the shared save() into success/failure branches so the failure path saves before fail_step(), matching the pattern every other pipeline step already uses.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
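
The ordering problem described in this commit can be reproduced without Django. The sketch below is a simplified stand-in, not the project's code: DB, Episode, recover() and the two scrape_* functions are hypothetical names, with a dict playing the role of the database row and a direct function call playing the role of the synchronous signal.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the episode's database row.
DB = {"status": "scraping"}


@dataclass
class Episode:
    status: str

    def save(self):
        # Writes this object's in-memory status, stale or not, to the "DB".
        DB["status"] = self.status


def recover():
    # Stand-in for agent recovery: loads a fresh object and advances the status.
    fresh = Episode(status=DB["status"])
    fresh.status = "transcribing"
    fresh.save()


def fail_step(episode):
    # Stand-in for fail_step(): recovery runs synchronously before it returns.
    recover()


def scrape_buggy():
    DB["status"] = "scraping"
    episode = Episode(status=DB["status"])
    episode.status = "failed"
    fail_step(episode)  # recovery writes "transcribing"
    episode.save()      # the stale local object writes "failed" over it
    return DB["status"]


def scrape_fixed():
    DB["status"] = "scraping"
    episode = Episode(status=DB["status"])
    episode.status = "failed"
    episode.save()      # persist the failure first
    fail_step(episode)  # recovery's write is now the last one
    return DB["status"]


print(scrape_buggy())  # failed
print(scrape_fixed())  # transcribing
```

The only difference between the two functions is whether save() runs before or after fail_step(), which is the reordering the commit message describes.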

The resolver saves the LLM's wikidata_id response directly to the DB. When the LLM returns noisy values like "Q172}]} Explanation: ", it exceeds Entity.wikidata_id max_length=20. Add _sanitize_qid() to extract the bare Q-ID via regex at both extraction points. Update documentation to cover both pipeline fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes two pipeline failure modes that can stall ingestion: (1) scraper failure path overwriting a recovery-updated Episode.status, and (2) resolver crashes caused by malformed/oversized LLM wikidata_id values.

Changes:

Scraper: save Episode fields in separate success/failure branches, ensuring the failure branch persists before fail_step() (so synchronous recovery can’t be overwritten).
Resolver: add _sanitize_qid() and apply it at both wikidata_id extraction points to store only a bare Q\d+ ID.
Tests/docs: add a scraper regression test plus plan/feature/session docs and a changelog entry.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


| File | Description |
| --- | --- |
| episodes/tests/test_scraper.py | Adds regression test ensuring scraper doesn’t overwrite a recovery-updated status. |
| episodes/scraper.py | Reorders/pivots persistence so failure state is saved before `fail_step()` emits synchronous recovery signal. |
| episodes/resolver.py | Introduces `_sanitize_qid()` and sanitizes LLM `wikidata_id` before persisting to DB. |
| doc/sessions/2026-03-23-*-planning-session.md | Planning transcript for scraper overwrite bug. |
| doc/sessions/2026-03-23-*-implementation-session.md | Implementation transcript covering both fixes. |
| doc/plans/2026-03-23-fix-scraper-recovery-status-overwrite.md | Plan document describing both bug timelines and fixes. |
| doc/features/2026-03-23-fix-scraper-recovery-status-overwrite.md | Feature doc summarizing changes and verification. |
| CHANGELOG.md | Adds a “Fixed” entry referencing the plan/feature/transcripts. |


episodes/resolver.py

Address PR review: add SanitizeQidTests (bare QID, full URL, trailing garbage, empty string, no match) and an integration test that mocks the resolution provider to return a noisy wikidata_id and verifies the Entity is saved with the sanitized bare Q-ID.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


rafacm and others added 2 commits March 23, 2026 16:56

rafacm changed the title ~~Fix scraper save overwriting recovery status~~ Fix pipeline bugs: scraper recovery overwrite + resolver wikidata_id overflow Mar 23, 2026

rafacm requested a review from Copilot March 23, 2026 16:14

Copilot started reviewing on behalf of rafacm March 23, 2026 16:15

Copilot AI reviewed Mar 23, 2026


rafacm requested a review from Copilot March 23, 2026 16:26

Copilot started reviewing on behalf of rafacm March 23, 2026 16:26

Copilot AI reviewed Mar 23, 2026


rafacm merged commit f59227f into main Mar 23, 2026
5 checks passed

rafacm deleted the fix/scraper-recovery-status-overwrite branch March 23, 2026 16:29

rafacm merged 3 commits into main from fix/scraper-recovery-status-overwrite