fix(rag): clean PAGE BREAK pollution from titles + /health/deep endpoint #83

Open
thefiredev-cloud wants to merge 2 commits into main from fix/blocker-3-root-cause-page-break-cleanup

@thefiredev-cloud
Owner

Summary

Resolves Blocker #3 root cause permanently. CA county audit pass rate 55/58 (94.8%) → 58/58 (100%).

  • Migration 0050: surgical SQL cleanup of `protocol_title` / `protocol_number` PAGE BREAK pollution across 15 CA agencies (4,644 rows). Already applied to prod via Supabase MCP. Backup table `backups.manus_protocol_chunks_0050` retained for ≥30 days.
  • `/health/deep` endpoint: mandated by CLAUDE.md but missing. Added as an alias with `quick=false` forced.
  • Lint `--fix`: 5 unused-import warnings cleaned.
  • Agent framework docs: 6 custom subagents + 6 slash commands documented in CLAUDE.md (`.claude/` is gitignored, agents live locally).

Background

The 5113ea9 short-term fix (× 0.3 quality demotion in rerank) couldn't surface real content for Alpine, Fresno, Marin because their corpus was 68–72% page-break-polluted. The demotion pushed legitimate chunks below the 0.3 similarity threshold → 0 results.

The root cause is the upstream PDF extractor, which preserves "--- PAGE BREAK ---" markers as `protocol_title` prefixes. This PR's migration strips that pollution surgically without touching the `content` or `embedding` fields, so legitimate chunks score correctly.

When #39 (ingestion-time fix) lands, the demotion regex in `server/_core/rag/scoring.ts` becomes a no-op for these 15 agencies but stays as a guard for future regressions.
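To make the failure mode concrete, here is a minimal sketch of the arithmetic, assuming the demotion is a plain multiplier applied before the threshold check. The names `rerankScore`, `survivesThreshold`, and the `Chunk` shape are hypothetical and are not taken from `server/_core/rag/scoring.ts`; only the 0.3 factor, the 0.3 threshold, and the marker string come from this PR.

```typescript
// Hypothetical constants mirroring the numbers quoted in this PR.
const DEMOTION_FACTOR = 0.3; // multiplier applied to chunks with polluted titles
const SIMILARITY_THRESHOLD = 0.3; // results scoring below this are dropped
const PAGE_BREAK_MARKER = "--- PAGE BREAK ---";

// Assumed minimal chunk shape; the real row has more fields.
interface Chunk {
  protocolTitle: string;
  similarity: number; // raw similarity in [0, 1]
}

// Illustrative rerank step: demote any chunk whose title carries the marker.
function rerankScore(chunk: Chunk): number {
  const isPolluted = chunk.protocolTitle.includes(PAGE_BREAK_MARKER);
  return isPolluted ? chunk.similarity * DEMOTION_FACTOR : chunk.similarity;
}

function survivesThreshold(chunk: Chunk): boolean {
  return rerankScore(chunk) >= SIMILARITY_THRESHOLD;
}

const polluted: Chunk = {
  protocolTitle: "--- PAGE BREAK --- Cardiac Arrest",
  similarity: 0.82,
};
const cleaned: Chunk = { protocolTitle: "Cardiac Arrest", similarity: 0.82 };

console.log(survivesThreshold(polluted)); // false: 0.82 × 0.3 ≈ 0.246 is below 0.3
console.log(survivesThreshold(cleaned)); // true: 0.82 is above 0.3
```

Under these assumptions, any polluted chunk with a raw similarity below 1.0 lands under the threshold, which is consistent with the 0-result behavior described above for agencies whose corpus was mostly polluted.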

Verification

```
pnpm check → 0 errors
pnpm lint → 0 errors / 56 warnings
pnpm test → 4942 pass / 52 skip / 0 fail
ca-county-audit.mjs → 58/58 (100%)
```

Test plan

  • Verify migration is fully reversible via `backups.manus_protocol_chunks_0050`
  • CA county audit shows 100% pass rate on prod
  • Existing test suite unchanged (4942 pass)
  • Smoke-test `/health/deep` once Railway redeploys this branch
  • Manual sanity check on Alpine / Fresno / Marin via prod search UI

Affected agencies

| agency_id | name | rows cleaned |
|---|---|---|
| 2709 | Santa Barbara | 184 |
| 2611 | Coastal Valleys | 296 |
| 2717 | Kern County | 12 |
| 2714 | Marin County | 1778 |
| 2616 | Mountain-Valley | 586 |
| 2708 | San Luis Obispo | 60 |
| 2610 | Central California | 648 |
| 2617 | Northern California | 161 |
| 2706 | Riverside | 136 |
| 2700 | Imperial | 237 |
| 2621 | Sierra-Sacramento | 198 |
| 2613 | Inland Counties | 196 |
| 2705 | North Coast | 141 |
| 2703 | Monterey | 8 |
| 2713 | Yolo | 3 |
| total | | 4,644 |

Resolves Blocker #3 root cause (autonomous-followup #39).

## Migration 0050: PAGE BREAK title cleanup

The 2026-04-24 CA county audit found 3 counties (Alpine, Fresno, Marin)
returning 0 hits for "adult cardiac arrest" despite their mapped
agencies having 862-2478 fully-embedded chunks each. Root cause:
PDF extraction artifact polluted protocol_title with "--- PAGE BREAK ---"
markers (68-72% of corpus for those agencies). The 5113ea9 short-term
fix (× 0.3 demotion) couldn't surface real content because the polluted
ratio dropped legitimate chunks below the 0.3 threshold.

Surgical SQL cleanup of protocol_title (and numeric-prefix protocol_number)
across 15 CA agencies — strips "--- PAGE BREAK ---" markers without
touching content or embeddings. Total: 4,644 rows updated. UNK-* prefixes
preserved (those should stay demoted by design).

Backup: backups.manus_protocol_chunks_0050 captures pre-update state for
fully-reversible rollback. Retain ≥30 days for #39 ingestion follow-up.
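
The migration itself is SQL and is not reproduced in this message. As a
rough illustration of the per-row transformation only, the sketch below
strips the marker from a title string. The function name
`cleanProtocolTitle` and the regex are assumptions, and the sketch does
not model the UNK-* handling or the protocol_number cleanup.

```typescript
// Assumed pattern for the marker; tolerates variable spacing around the words.
const PAGE_BREAK_PATTERN = /-{3}\s*PAGE BREAK\s*-{3}/g;

// Hypothetical helper: remove every marker occurrence, then normalize whitespace.
// Content and embeddings are never touched; only the title string is rewritten.
function cleanProtocolTitle(title: string): string {
  return title
    .replace(PAGE_BREAK_PATTERN, " ")
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanProtocolTitle("--- PAGE BREAK --- Adult Cardiac Arrest"));
console.log(cleanProtocolTitle("Adult Cardiac Arrest"));
```

A title that consists of nothing but the marker comes back as an empty
string in this sketch; how the real migration treats that case is not
stated here.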

## Result

CA county audit: 55/58 (94.8%) → 58/58 (100%) pass rate.
Tests: 4942/4994 pass / 52 skipped / 0 fail (unchanged).
Typecheck: clean. Lint: 0 errors / 56 warnings (-5 from --fix).

Affected agencies (15):
- Santa Barbara (99.5% pollution), Coastal Valleys (97%),
  Kern (92%), Marin (72%), Mountain-Valley (68%),
  San Luis Obispo (32%), Central CA (31%), Northern CA (28%),
  Riverside (27%), Imperial (24%), Sierra-Sacramento (23%),
  Inland Counties (18%), North Coast (14%), Monterey (6%), Yolo (3%).

## /health/deep endpoint

CLAUDE.md mandated `/health/deep` for external monitoring but the route
didn't exist (404). Added as alias to healthHandler with quick=false
forced. Critical services (DB, Supabase, Redis) plus Claude + Gemini
embeddings now verified end-to-end on every deep probe.
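
A framework-free sketch of the alias pattern, under stated assumptions:
the request/response shapes, the stand-in `healthHandler` body, the
quick-by-default behavior, and the check names are all hypothetical.
Only the idea of delegating to healthHandler with quick=false forced
comes from this change.

```typescript
// Minimal shapes so the sketch runs without a web framework.
interface HealthRequest {
  query: Record<string, string | undefined>;
}
interface HealthReport {
  mode: "quick" | "deep";
  checks: string[];
}
type Handler = (req: HealthRequest) => HealthReport;

// Stand-in for the real healthHandler. Assumes probes are quick unless
// the caller passes quick=false; the check lists are placeholders.
const healthHandler: Handler = (req) => {
  const quick = req.query.quick !== "false";
  return quick
    ? { mode: "quick", checks: ["db"] }
    : { mode: "deep", checks: ["db", "supabase", "redis", "claude", "gemini-embeddings"] };
};

// The alias: same handler, but quick=false overrides whatever the caller sent.
const deepHealthHandler: Handler = (req) =>
  healthHandler({ ...req, query: { ...req.query, quick: "false" } });

console.log(deepHealthHandler({ query: { quick: "true" } }).mode);
```

Forcing the flag inside the alias means a monitor hitting /health/deep
cannot accidentally downgrade to a quick probe via a query parameter.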

## CLAUDE.md agent framework

Documented the 6 custom subagents (pg-rag-engineer, pg-ios-shipper,
pg-backend-guardian, pg-schema-steward, pg-ux-frontend,
pg-release-orchestrator) and 6 slash commands (/pg-rag-audit,
/pg-ship-ios, /pg-schema-check, /pg-backend-health, /pg-county-expand,
/pg-fix-blocker) added to .claude/ for this workspace. Frameworks:
Karpathy guidelines + Superpowers process discipline + VoltAgent
specialists. The .claude/ directory is gitignored — agents live locally.

## Follow-up

- Long-term root fix at ingestion time (#39): chunker should skip
  PAGE BREAK markers entirely instead of preserving them in titles.
- Demotion logic in server/_core/rag/scoring.ts becomes a no-op for these
  15 agencies but stays as a guard for future ingestion regressions.
@netlify

netlify Bot commented Apr 24, 2026

Deploy Preview for protocol-guide ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 3472350 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/protocol-guide/deploys/69ebfd37f7116b000851719e |
| 😎 Deploy Preview | https://deploy-preview-83--protocol-guide.netlify.app |

Long-term fix for #39 / #84 root cause. Prevents the data-quality
regression that PR #83 had to clean up surgically.

extractTitle()'s fallback path scanned the first 10 non-empty lines for
a title-like candidate, filtering pure-numeric and "page N" lines but
NOT "--- PAGE BREAK ---" sentinels. PDFs joined page-by-page with
"\n\n--- PAGE BREAK ---\n\n" separators (per protocol-extractor.ts:74
and ocr-pdf-extractor.ts) leaked the marker into the chunk title for
agencies whose PDFs lack a strong first-line title pattern.

Outcome of the previous regression:
- 4,644 chunks across 15 CA agencies acquired "--- PAGE BREAK ---"
  in protocol_title, tripping the × 0.3 quality demotion in
  server/_core/rag/scoring.ts and dropping legitimate cardiac-arrest
  content below the 0.3 similarity threshold for 3 counties.

This patch ensures future ingestions skip the marker line entirely,
so the title falls through to either (a) a real subsequent line or
(b) "Protocol N" default. The chunk content keeps the marker (it's
useful for downstream parsers and search-vector tokenization) — only
the title is protected.
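
The fallback described above can be sketched roughly as follows. This
is not the actual extractTitle() code: the function name
`extractFallbackTitle`, its signature, and the exact regexes are
assumptions, and the real "title-like" heuristics are likely stricter.

```typescript
// Assumed sentinel pattern: a line that is nothing but the page-break marker.
const PAGE_BREAK_SENTINEL = /^-{3}\s*PAGE BREAK\s*-{3}$/i;

// Hypothetical fallback: scan the first 10 non-empty lines for a usable title.
function extractFallbackTitle(text: string, protocolIndex: number): string {
  const candidates = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, 10);

  for (const line of candidates) {
    if (/^\d+$/.test(line)) continue; // pure-numeric line
    if (/^page\s+\d+$/i.test(line)) continue; // "page N" line
    if (PAGE_BREAK_SENTINEL.test(line)) continue; // the check this patch adds
    return line; // first remaining line is taken as the title
  }
  return `Protocol ${protocolIndex}`; // default when nothing qualifies
}

const sample = "--- PAGE BREAK ---\n\n12\nPage 3\nAdult Cardiac Arrest\nBody text";
console.log(extractFallbackTitle(sample, 7)); // "Adult Cardiac Arrest"
```

The input text is left untouched, so the marker stays in the chunk
content and only the title selection skips it.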