Fix pipeline bugs: scraper recovery overwrite + resolver wikidata_id overflow #85

Merged
rafacm merged 3 commits into main from fix/scraper-recovery-status-overwrite on Mar 23, 2026

@rafacm (Owner) commented Mar 23, 2026

Summary

  • Bug 1 (scraper): When LLM extraction returns incomplete metadata, fail_step() triggers agent recovery synchronously via signal. The agent succeeds and saves status=TRANSCRIBING, but the scraper's subsequent episode.save() overwrites it back to FAILED from a stale local object. The transcriber then sees status=failed and exits early — the pipeline gets stuck.

  • Fix 1: Split the shared episode.save() into success/failure branches so the failure path saves before fail_step(), matching the pattern every other pipeline step already uses.

  • Bug 2 (resolver): The resolving step crashes with "value too long for type character varying(20)" because the LLM returns malformed wikidata_id values (e.g., "Q172}]} Explanation: ") that exceed Entity.wikidata_id max_length=20.

  • Fix 2: Add _sanitize_qid() to extract the bare Q-ID via regex (Q\d+) at both extraction points in the resolver.
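
The ordering problem behind Bug 1 and Fix 1 can be reproduced without Django. The following is a minimal sketch, not the actual scraper code: `DB`, `Episode`, `recover`, `fail_step`, `scrape_buggy`, `scrape_fixed`, and `run` are hypothetical stand-ins, and it assumes that `save()` writes every field of the local object and that recovery operates on a fresh copy of the row.

```python
# Hypothetical in-memory stand-in for the episodes table.
DB = {}


class Episode:
    """Toy model whose save() writes all local fields, like a full-row save."""

    def __init__(self, pk, status="scraping"):
        self.pk = pk
        self.status = status

    def save(self):
        DB[self.pk] = {"status": self.status}


def recover(pk):
    # Stand-in for the recovery agent: loads a fresh copy and advances it.
    fresh = Episode(pk, DB[pk]["status"])
    fresh.status = "transcribing"
    fresh.save()


def fail_step(episode):
    # Assumption: recovery runs synchronously inside fail_step.
    recover(episode.pk)


def scrape_buggy(episode):
    episode.status = "failed"
    fail_step(episode)   # recovery stores "transcribing"
    episode.save()       # stale local object writes "failed" back


def scrape_fixed(episode):
    episode.status = "failed"
    episode.save()       # persist the failure first
    fail_step(episode)   # recovery's write is now the last one


def run(scrape):
    """Run one scrape variant on a fresh episode and return the stored status."""
    DB.clear()
    episode = Episode(pk=1)
    episode.save()
    scrape(episode)
    return DB[1]["status"]
```

In this sketch `run(scrape_buggy)` leaves the stored status at `"failed"`, while `run(scrape_fixed)` leaves it at `"transcribing"`, because the recovery write is no longer followed by a save from the stale object.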

Test plan

  • All 227 existing tests pass
  • New regression test test_incomplete_metadata_does_not_overwrite_recovery_status passes
  • _sanitize_qid verified against real LLM output patterns (bare IDs, URLs, trailing garbage)
  • Manual test: submit an episode whose page lacks an audio URL, confirm agent recovery succeeds and transcription starts
  • Manual test: run resolver on episode with Wikidata-enriched entities, confirm no varchar overflow

🤖 Generated with Claude Code

rafacm and others added 2 commits March 23, 2026 16:56
The incomplete-metadata code path called fail_step() before episode.save().
Recovery runs synchronously inside fail_step via signal — the agent sets
status=TRANSCRIBING, but the scraper's subsequent save() overwrites it back
to FAILED from a stale local object. Split the shared save() into
success/failure branches so the failure path saves before fail_step(),
matching the pattern every other pipeline step already uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The resolver saves the LLM's wikidata_id response directly to the DB.
When the LLM returns noisy values like "Q172}]} Explanation: ", it
exceeds Entity.wikidata_id max_length=20. Add _sanitize_qid() to
extract the bare Q-ID via regex at both extraction points.

Update documentation to cover both pipeline fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rafacm changed the title from "Fix scraper save overwriting recovery status" to "Fix pipeline bugs: scraper recovery overwrite + resolver wikidata_id overflow" on Mar 23, 2026
@rafacm requested a review from Copilot on March 23, 2026 16:14
Copilot AI left a comment

Pull request overview

Fixes two pipeline failure modes that can stall ingestion: (1) scraper failure path overwriting a recovery-updated Episode.status, and (2) resolver crashes caused by malformed/oversized LLM wikidata_id values.

Changes:

  • Scraper: save Episode fields in separate success/failure branches, ensuring the failure branch persists before fail_step() (so synchronous recovery can’t be overwritten).
  • Resolver: add _sanitize_qid() and apply it at both wikidata_id extraction points to store only a bare Q\d+ ID.
  • Tests/docs: add a scraper regression test plus plan/feature/session docs and a changelog entry.
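
For illustration, the Q-ID extraction described above could look roughly like the sketch below. Only the `Q\d+` pattern comes from this PR; the function name `sanitize_qid`, the use of `re.search`, and the empty-string fallback when nothing matches are assumptions, and the real `_sanitize_qid()` may differ.

```python
import re

# Pattern taken from the PR description: a bare Wikidata Q-ID.
_QID_RE = re.compile(r"Q\d+")


def sanitize_qid(raw):
    """Return the first bare Q-ID found in raw.

    Assumption: an empty string is returned when raw is empty or
    contains no Q-ID; the actual helper may choose another fallback.
    """
    if not raw:
        return ""
    match = _QID_RE.search(raw)
    return match.group(0) if match else ""


# Inputs modeled on the patterns named in the test plan.
examples = [
    "Q172",                                # bare ID
    "https://www.wikidata.org/wiki/Q172",  # full URL
    "Q172}]} Explanation: ",               # trailing garbage
]
cleaned = [sanitize_qid(value) for value in examples]
```

Extracting the match rather than truncating to 20 characters keeps the stored value a valid identifier instead of a clipped fragment of the LLM's output.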

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| episodes/tests/test_scraper.py | Adds regression test ensuring scraper doesn’t overwrite a recovery-updated status. |
| episodes/scraper.py | Reorders/pivots persistence so failure state is saved before fail_step() emits synchronous recovery signal. |
| episodes/resolver.py | Introduces _sanitize_qid() and sanitizes LLM wikidata_id before persisting to DB. |
| doc/sessions/2026-03-23-*-planning-session.md | Planning transcript for scraper overwrite bug. |
| doc/sessions/2026-03-23-*-implementation-session.md | Implementation transcript covering both fixes. |
| doc/plans/2026-03-23-fix-scraper-recovery-status-overwrite.md | Plan document describing both bug timelines and fixes. |
| doc/features/2026-03-23-fix-scraper-recovery-status-overwrite.md | Feature doc summarizing changes and verification. |
| CHANGELOG.md | Adds a “Fixed” entry referencing the plan/feature/transcripts. |

Address PR review: add SanitizeQidTests (bare QID, full URL, trailing
garbage, empty string, no match) and an integration test that mocks
the resolution provider to return a noisy wikidata_id and verifies
the Entity is saved with the sanitized bare Q-ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
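
The integration test described in this commit could be shaped roughly as follows. This is a sketch under stated assumptions, not the test added in the PR: `ENTITIES`, `resolve_and_store`, the provider's `resolve` method, and the response dict layout are all hypothetical, and `sanitize_qid` is a stand-in for the resolver's `_sanitize_qid()`.

```python
import re
from unittest.mock import Mock

# Hypothetical in-memory stand-in for the Entity table.
ENTITIES = {}


def sanitize_qid(raw):
    # Stand-in for _sanitize_qid(); the "" fallback is an assumption.
    match = re.search(r"Q\d+", raw or "")
    return match.group(0) if match else ""


def resolve_and_store(provider, name):
    """Ask the provider for a wikidata_id and store only the sanitized value."""
    result = provider.resolve(name)  # hypothetical provider API
    qid = sanitize_qid(result.get("wikidata_id", ""))
    ENTITIES[name] = qid
    return qid


# Mock the resolution provider so it returns the noisy value from the bug report.
provider = Mock()
provider.resolve.return_value = {"wikidata_id": "Q172}]} Explanation: "}

stored = resolve_and_store(provider, "Example Entity")
```

The mocked provider keeps the test independent of any real LLM call while still exercising the path from raw response to stored value.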

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


@rafacm merged commit f59227f into main on Mar 23, 2026
5 checks passed
@rafacm deleted the fix/scraper-recovery-status-overwrite branch on March 23, 2026 16:29

2 participants