Conversation
The incomplete-metadata code path called fail_step() before episode.save(). Recovery runs synchronously inside fail_step via signal — the agent sets status=TRANSCRIBING, but the scraper's subsequent save() overwrites it back to FAILED from a stale local object. Split the shared save() into success/failure branches so the failure path saves before fail_step(), matching the pattern every other pipeline step already uses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resolver saves the LLM's wikidata_id response directly to the DB. When the LLM returns noisy values like "Q172}]} Explanation: ", it exceeds Entity.wikidata_id max_length=20. Add _sanitize_qid() to extract the bare Q-ID via regex at both extraction points. Update documentation to cover both pipeline fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes two pipeline failure modes that can stall ingestion: (1) scraper failure path overwriting a recovery-updated Episode.status, and (2) resolver crashes caused by malformed/oversized LLM wikidata_id values.
Changes:
- Scraper: save `Episode` fields in separate success/failure branches, ensuring the failure branch persists before `fail_step()` (so synchronous recovery can't be overwritten).
- Resolver: add `_sanitize_qid()` and apply it at both `wikidata_id` extraction points to store only a bare `Q\d+` ID.
- Tests/docs: add a scraper regression test plus plan/feature/session docs and a changelog entry.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| episodes/tests/test_scraper.py | Adds regression test ensuring scraper doesn’t overwrite a recovery-updated status. |
| episodes/scraper.py | Reorders persistence so the failure state is saved before fail_step() emits the synchronous recovery signal. |
| episodes/resolver.py | Introduces _sanitize_qid() and sanitizes LLM wikidata_id before persisting to DB. |
| doc/sessions/2026-03-23-*-planning-session.md | Planning transcript for scraper overwrite bug. |
| doc/sessions/2026-03-23-*-implementation-session.md | Implementation transcript covering both fixes. |
| doc/plans/2026-03-23-fix-scraper-recovery-status-overwrite.md | Plan document describing both bug timelines and fixes. |
| doc/features/2026-03-23-fix-scraper-recovery-status-overwrite.md | Feature doc summarizing changes and verification. |
| CHANGELOG.md | Adds a “Fixed” entry referencing the plan/feature/transcripts. |
Address PR review: add SanitizeQidTests (bare QID, full URL, trailing garbage, empty string, no match) and an integration test that mocks the resolution provider to return a noisy wikidata_id and verifies the Entity is saved with the sanitized bare Q-ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
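The five cases named in that commit map directly onto a small test class. A self-contained sketch of what `SanitizeQidTests` might look like (the inline `_sanitize_qid` here is an illustrative stand-in so the example runs on its own; the project's tests would import the real function from `episodes/resolver.py`):

```python
import re
import unittest

# Stand-in for episodes.resolver._sanitize_qid so this sketch is
# self-contained; the project's tests import the real function.
def _sanitize_qid(raw):
    match = re.search(r"Q\d+", raw or "")
    return match.group(0) if match else None

class SanitizeQidTests(unittest.TestCase):
    def test_bare_qid(self):
        self.assertEqual(_sanitize_qid("Q172"), "Q172")

    def test_full_url(self):
        self.assertEqual(
            _sanitize_qid("https://www.wikidata.org/wiki/Q172"), "Q172"
        )

    def test_trailing_garbage(self):
        self.assertEqual(_sanitize_qid("Q172}]} Explanation: "), "Q172")

    def test_empty_string(self):
        self.assertIsNone(_sanitize_qid(""))

    def test_no_match(self):
        self.assertIsNone(_sanitize_qid("no identifier here"))

if __name__ == "__main__":
    unittest.main()
```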
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
Summary
Summary

Bug 1 (scraper): When LLM extraction returns incomplete metadata, `fail_step()` triggers agent recovery synchronously via signal. The agent succeeds and saves `status=TRANSCRIBING`, but the scraper's subsequent `episode.save()` overwrites it back to `FAILED` from a stale local object. The transcriber then sees `status=failed` and exits early, so the pipeline gets stuck.

Fix 1: Split the shared `episode.save()` into success/failure branches so the failure path saves before `fail_step()`, matching the pattern every other pipeline step already uses.

Bug 2 (resolver): The resolving step crashes with `"value too long for type character varying(20)"` because the LLM returns malformed `wikidata_id` values (e.g., `"Q172}]} Explanation: "`) that exceed `Entity.wikidata_id` `max_length=20`.

Fix 2: Add `_sanitize_qid()` to extract the bare Q-ID via regex (`Q\d+`) at both extraction points in the resolver.

Test plan

- `test_incomplete_metadata_does_not_overwrite_recovery_status` passes
- `_sanitize_qid` verified against real LLM output patterns (bare IDs, URLs, trailing garbage)

🤖 Generated with Claude Code
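The save-ordering at the heart of Fix 1 can be modeled without Django at all. In this framework-free sketch, `Episode`, `fail_step()`, the in-memory `DB`, and both `scrape_*` functions are illustrative stand-ins for the real model, signal handler, and scraper; it only demonstrates why the failure branch must persist before the recovery signal fires:

```python
# Framework-free model of the Fix 1 ordering; Episode, fail_step, and
# the scrape_* functions are illustrative stand-ins, not the real code.
DB = {"status": "SCRAPING"}  # pretend persistent store

class Episode:
    def __init__(self):
        self.status = DB["status"]  # local copy, can go stale

    def save(self):
        DB["status"] = self.status  # whole-object write, like Model.save()

def fail_step(episode):
    # Recovery runs synchronously, like a Django signal handler:
    # the agent re-reads the row, repairs it, and saves.
    recovered = Episode()
    recovered.status = "TRANSCRIBING"
    recovered.save()

def scrape_buggy(episode):
    # Old shared-save path: fail_step() first, save() last, so the
    # stale local object clobbers the recovery write.
    episode.status = "FAILED"
    fail_step(episode)
    episode.save()  # overwrites TRANSCRIBING back to FAILED

def scrape_fixed(episode):
    # Fixed failure branch: persist FAILED first, then signal, so the
    # recovery write is the last one to land.
    episode.status = "FAILED"
    episode.save()
    fail_step(episode)

DB["status"] = "SCRAPING"
scrape_buggy(Episode())
print(DB["status"])   # FAILED -- pipeline stuck

DB["status"] = "SCRAPING"
scrape_fixed(Episode())
print(DB["status"])   # TRANSCRIBING -- recovery survives
```

The same hazard explains why the real fix splits the shared `save()` rather than re-fetching the row: any unconditional save of a stale instance after a synchronous signal can silently undo the handler's work.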