fix(rag): clean PAGE BREAK pollution from titles + /health/deep endpoint #83
Open
thefiredev-cloud wants to merge 2 commits into main
Resolves Blocker #3 root cause (autonomous-followup #39).

## Migration 0050: PAGE BREAK title cleanup

The 2026-04-24 CA county audit found 3 counties (Alpine, Fresno, Marin) returning 0 hits for "adult cardiac arrest" despite their mapped agencies having 862-2478 fully-embedded chunks each.

Root cause: a PDF extraction artifact polluted protocol_title with "--- PAGE BREAK ---" markers (68-72% of corpus for those agencies). The 5113ea9 short-term fix (× 0.3 demotion) couldn't surface real content because the polluted ratio dropped legitimate chunks below the 0.3 threshold.

Surgical SQL cleanup of protocol_title (and numeric-prefix protocol_number) across 15 CA agencies: strips "--- PAGE BREAK ---" markers without touching content or embeddings. Total: 4,644 rows updated. UNK-* prefixes preserved (those should stay demoted by design).

Backup: backups.manus_protocol_chunks_0050 captures pre-update state for fully-reversible rollback. Retain ≥30 days for #39 ingestion follow-up.

## Result

- CA county audit: 55/58 (94.8%) → 58/58 (100%) pass rate.
- Tests: 4942/4994 pass / 52 skipped / 0 fail (unchanged).
- Typecheck: clean.
- Lint: 0 errors / 56 warnings (-5 from --fix).

Affected agencies (15):

- Santa Barbara (99.5% pollution), Coastal Valleys (97%), Kern (92%), Marin (72%), Mountain-Valley (68%), San Luis Obispo (32%), Central CA (31%), Northern CA (28%), Riverside (27%), Imperial (24%), Sierra-Sacramento (23%), Inland Counties (18%), North Coast (14%), Monterey (6%), Yolo (3%).

## /health/deep endpoint

CLAUDE.md mandated `/health/deep` for external monitoring but the route didn't exist (404). Added as alias to healthHandler with quick=false forced. Critical services (DB, Supabase, Redis) plus Claude + Gemini embeddings now verified end-to-end on every deep probe.

## CLAUDE.md agent framework

Documented the 6 custom subagents (pg-rag-engineer, pg-ios-shipper, pg-backend-guardian, pg-schema-steward, pg-ux-frontend, pg-release-orchestrator) and 6 slash commands (/pg-rag-audit, /pg-ship-ios, /pg-schema-check, /pg-backend-health, /pg-county-expand, /pg-fix-blocker) added to .claude/ for this workspace. Frameworks: Karpathy guidelines + Superpowers process discipline + VoltAgent specialists. The .claude/ directory is gitignored, so agents live locally.

## Follow-up

- Long-term root fix at ingestion time (#39): chunker should skip PAGE BREAK markers entirely instead of preserving them in titles.
- Demotion logic in server/_core/rag/scoring.ts becomes a no-op for these 15 agencies but stays as a guard for future ingestion regressions.
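For the `/health/deep` alias described above, a minimal sketch of the idea might look like the following. Only the name `healthHandler`, the `quick=false` forcing, and the five services come from the description; the types, the `handleHealth`/`handleHealthDeep` wrappers, the service identifiers, and the assumption that quick mode probes only the critical services are hypothetical. The stub records which services would be probed rather than contacting them.

```typescript
// Illustrative types only; the real healthHandler signature is not shown in this PR.
interface HealthOptions {
  quick: boolean;
}

interface HealthReport {
  mode: "quick" | "deep";
  checked: string[];
}

// Service names taken from the PR description; the identifiers are made up.
const CRITICAL_SERVICES = ["db", "supabase", "redis"];
const DEEP_ONLY_SERVICES = ["claude", "gemini-embeddings"];

// Hypothetical stand-in for healthHandler: lists what would be probed.
// Assumption: quick mode covers only the critical services.
function healthHandler(options: HealthOptions): HealthReport {
  const checked = options.quick
    ? [...CRITICAL_SERVICES]
    : [...CRITICAL_SERVICES, ...DEEP_ONLY_SERVICES];
  return { mode: options.quick ? "quick" : "deep", checked };
}

// Regular route: the caller may choose quick or full mode via the query string.
function handleHealth(query: Record<string, string | undefined>): HealthReport {
  return healthHandler({ quick: query.quick !== "false" });
}

// Deep alias: ignores the query and always forces quick=false,
// so every probe exercises all five services.
function handleHealthDeep(_query: Record<string, string | undefined>): HealthReport {
  return healthHandler({ quick: false });
}
```

Forcing the flag inside the alias, rather than relying on the monitor to pass it, means an external probe cannot accidentally downgrade itself to a shallow check.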
✅ Deploy Preview for protocol-guide ready!
This was referenced Apr 24, 2026
Long-term fix for #39 / #84 root cause. Prevents the data-quality regression that PR #83 had to clean up surgically.

extractTitle()'s fallback path scanned the first 10 non-empty lines for a title-like candidate, filtering pure-numeric and "page N" lines but NOT "--- PAGE BREAK ---" sentinels. PDFs joined page-by-page with "\n\n--- PAGE BREAK ---\n\n" separators (per protocol-extractor.ts:74 and ocr-pdf-extractor.ts) leaked the marker into the chunk title for agencies whose PDFs lack a strong first-line title pattern.

Outcome of the previous regression:

- 4,644 chunks across 15 CA agencies acquired "--- PAGE BREAK ---" in protocol_title, tripping the × 0.3 quality demotion in server/_core/rag/scoring.ts and dropping legitimate cardiac-arrest content below the 0.3 similarity threshold for 3 counties.

This patch ensures future ingestions skip the marker line entirely, so the title falls through to either (a) a real subsequent line or (b) the "Protocol N" default. The chunk content keeps the marker (it's useful for downstream parsers and search-vector tokenization); only the title is protected.
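The fallback behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `extractTitle()` code: the function name `extractTitleFallback`, its parameters, and the exact regular expressions are assumptions, and the primary title-pattern path is omitted.

```typescript
// Sentinel the extractors insert between pages (per the description above).
const PAGE_BREAK_SENTINEL = "--- PAGE BREAK ---";

// Hypothetical sketch of the fallback path only.
// `protocolIndex` is an assumed parameter used for the "Protocol N" default.
function extractTitleFallback(text: string, protocolIndex: number): string {
  // Take the first 10 non-empty lines as title candidates.
  const candidates = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, 10);

  for (const line of candidates) {
    if (/^\d+$/.test(line)) continue;          // pure-numeric line
    if (/^page\s+\d+$/i.test(line)) continue;  // "page N" line
    if (line === PAGE_BREAK_SENTINEL) continue; // the new check: skip the sentinel
    return line; // first remaining line is treated as the title
  }

  // No usable candidate: fall through to the default title.
  return `Protocol ${protocolIndex}`;
}
```

Note that the sketch only filters the title candidates; the input text is never modified, which matches the stated intent that chunk content keeps the marker.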
Summary
Resolves Blocker #3 root cause permanently. CA county audit pass rate 55/58 (94.8%) → 58/58 (100%).
Background
The 5113ea9 short-term fix (× 0.3 quality demotion in rerank) couldn't surface real content for Alpine, Fresno, and Marin because 68–72% of their corpus was page-break-polluted. The demotion pushed legitimate chunks below the 0.3 similarity threshold → 0 results.
The root cause is the upstream PDF extractor preserving "--- PAGE BREAK ---" markers as protocol_title prefixes. This PR's migration strips that pollution surgically without touching the `content` or `embedding` fields, letting legitimate chunks score correctly.
When #39 (ingestion-time fix) lands, the demotion regex in `server/_core/rag/scoring.ts` becomes a no-op for these 15 agencies but stays as a guard for future regressions.
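To make the interaction between the demotion and the threshold concrete, here is a small sketch. The 0.3 factor and 0.3 threshold come from the text; the `Chunk` shape, the function names, the regex, and the assumption that the threshold is applied to the demoted similarity are hypothetical, since the actual code in `server/_core/rag/scoring.ts` is not shown here. Under those assumptions, a similarity of at most 1.0 multiplied by 0.3 can never exceed 0.3, so nearly every polluted chunk is filtered out regardless of how relevant it is.

```typescript
const DEMOTION_FACTOR = 0.3;       // from the PR text: × 0.3 quality demotion
const SIMILARITY_THRESHOLD = 0.3;  // from the PR text: 0.3 similarity threshold

// Assumed pattern; the real demotion regex is not shown in this PR.
const PAGE_BREAK_RE = /---\s*PAGE BREAK\s*---/;

// Hypothetical chunk shape for illustration.
interface Chunk {
  protocolTitle: string;
  similarity: number; // assumed to lie in [0, 1]
}

// Demote chunks whose title carries the page-break artifact.
function rerankScore(chunk: Chunk): number {
  return PAGE_BREAK_RE.test(chunk.protocolTitle)
    ? chunk.similarity * DEMOTION_FACTOR
    : chunk.similarity;
}

// Assumption: the threshold is applied to the score after demotion.
function survivesThreshold(chunk: Chunk): boolean {
  return rerankScore(chunk) >= SIMILARITY_THRESHOLD;
}

// A strong match with a polluted title is demoted well below the threshold...
const polluted: Chunk = {
  protocolTitle: "--- PAGE BREAK --- Adult Cardiac Arrest",
  similarity: 0.85,
};

// ...while the same chunk with a cleaned title keeps its original score.
const cleaned: Chunk = { protocolTitle: "Adult Cardiac Arrest", similarity: 0.85 };

const pollutedSurvives = survivesThreshold(polluted);
const cleanedSurvives = survivesThreshold(cleaned);
```

With titles cleaned, the regex no longer matches for these agencies, which is why the demotion becomes a no-op there while still catching any future polluted ingestion.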
Verification
```
pnpm check → 0 errors
pnpm lint → 0 errors / 56 warnings
pnpm test → 4942 pass / 52 skip / 0 fail
ca-county-audit.mjs → 58/58 (100%)
```
Test plan
Affected agencies