TF-IDF match threshold: known content-type sensitivity #95

@tuirk

What this is

A note on a known sensitivity in app/src/app/api/compile/match/route.ts:172, not a prescribed fix. Anyone who wants to take it on should treat the rest of this as context, not a spec.

The shape of the problem

The TF-IDF cosine floor at Phase 3 (Match) is currently a single hardcoded 0.15. The value is the right floor for some source shapes and the wrong floor for others, and the right answer for mixed corpora isn't obvious.

What "0.15" does:

  • For short web-heavy sources (blog posts, READMEs, articles in the 2-10K char range), cosine similarities against existing wiki pages sit comfortably in the 0.2-0.5 band. 0.15 sits below that band, so it catches real candidates and leaves precision to the triage LLM afterward.
  • For long rigor-heavy sources (academic papers, technical reports, multi-thousand-word dense essays), the same calculation produces similarities that compress downward, into the 0.08-0.25 range. The 0.15 floor sits inside the legitimate-match band, so real matches start landing below it. The "EXPLAINING THE FAVORITE-LONGSHOT BIAS" case in the inline comment at match/route.ts:162-170 is exactly this: a 58K-char paper against a 20-page corpus, with correct matches at 0.208 / 0.185 / 0.137 / 0.118. All four fall under what used to be the 0.3 floor (which is what triggered the recalibration to 0.15), and two of them still fall under 0.15.
  • The lower the floor, the more triage LLM calls per source; the slice(0, 3) cap keeps the worst case at sourceCount × 3 triage calls. Triage is the cheapest LLM call site in the pipeline (thinking_budget=0, max_output_tokens=512), so the cost of being too permissive is bounded and visible in the daily-cap accounting. The cost of being too strict is invisible — real matches silently never reach triage.
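
To make the floor-then-cap behavior concrete, here is a minimal sketch of the selection step as this issue describes it. It is not the code in match/route.ts: the names MatchCandidate, selectCandidates and maxTriageCalls are hypothetical, the >= comparison is an assumption, and the five scores are invented for illustration. Only the 0.15 floor and the slice(0, 3) cap come from the text above.

```typescript
// Hypothetical candidate shape; the real type in route.ts may differ.
interface MatchCandidate {
  pageId: string;
  similarity: number; // raw TF-IDF cosine against one existing wiki page
}

const TFIDF_FLOOR = 0.15; // the current hardcoded floor
const TOP_K = 3; // mirrors the slice(0, 3) cap

// Drop candidates under the floor, rank best-first, keep at most topK.
function selectCandidates(
  candidates: MatchCandidate[],
  floor: number = TFIDF_FLOOR,
  topK: number = TOP_K,
): MatchCandidate[] {
  return candidates
    .filter((c) => c.similarity >= floor) // assumption: inclusive comparison
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}

// Upper bound on triage calls for a session: every source fills its cap.
function maxTriageCalls(sourceCount: number, topK: number = TOP_K): number {
  return sourceCount * topK;
}

// Invented scores for a short web source, for illustration only.
const illustrativeCandidates: MatchCandidate[] = [
  { pageId: "page-a", similarity: 0.42 },
  { pageId: "page-b", similarity: 0.09 },
  { pageId: "page-c", similarity: 0.31 },
  { pageId: "page-d", similarity: 0.17 },
  { pageId: "page-e", similarity: 0.22 },
];

const selected = selectCandidates(illustrativeCandidates);
console.log(selected.map((c) => c.pageId), maxTriageCalls(4));
```

Four of the five invented candidates clear the floor, but the cap keeps only the top three, which is why lowering the floor alone cannot push a source past three triage calls.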

Why "just make it content-type-aware" isn't trivially the answer

The naive fix is a length branch — short content gets 0.15, long content gets something lower. That works for homogeneous sessions (a batch of all-academic-papers, or all-blog-posts). It doesn't cleanly work for the realistic case, which is mixed content in the same compile session and the same wiki.

Three things make mixed content hard:

  1. The wiki page corpus is shared. When a short web source and a long academic source both run TF-IDF against the same set of existing pages, the similarities they produce live on different scales. A 0.18 from the short source and a 0.18 from the long source don't mean the same thing — but the comparison is happening against the same pages. Length-branching the source's threshold doesn't normalize across this; it just makes each source's recall criterion source-shape-aware.

  2. The threshold isn't the only knob. slice(0, 3) caps the candidate set per source; raising the cap on long-source paths is another way to reach the same outcome. The right answer might be "lower floor AND larger top-K for long content" or "same floor but a different ranking signal entirely" (BM25, character-n-gram cosine, something embedding-based using the same all-MiniLM-L6-v2 model that already runs locally).

  3. Mixed sessions surface a calibration question we don't currently answer. What's the "right" recall for a session that has 3 blog posts and 1 paper? Should the paper's matches and the posts' matches use comparable floors so triage cost is even, or different floors so each source individually surfaces its real candidates regardless of comparative cost?
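
A small sketch of what "lower floor AND larger top-K for long content" could look like, and what it does to the triage ceiling in a mixed session. Everything here is an assumption made for illustration: the MatchPolicy type, the policyForSource and sessionTriageCeiling helpers, the 20000-char boundary, the 0.1 floor and the top-K of 5 are placeholders, not calibrated values and not existing code.

```typescript
// Hypothetical per-source policy: the floor and the candidate cap travel together.
interface MatchPolicy {
  floor: number;
  topK: number;
}

// Placeholder numbers; none of these have been calibrated.
const LONG_SOURCE_CHARS = 20000;
const SHORT_POLICY: MatchPolicy = { floor: 0.15, topK: 3 };
const LONG_POLICY: MatchPolicy = { floor: 0.1, topK: 5 };

// Length branch: pick a policy from the source's character count.
function policyForSource(charCount: number): MatchPolicy {
  return charCount >= LONG_SOURCE_CHARS ? LONG_POLICY : SHORT_POLICY;
}

// Worst-case triage calls for a session: each source fills its own cap.
function sessionTriageCeiling(sourceCharCounts: number[]): number {
  return sourceCharCounts.reduce(
    (total, chars) => total + policyForSource(chars).topK,
    0,
  );
}

// A mixed session: three blog-post-sized sources and one 58K-char paper.
const mixedSession = [4000, 6500, 9000, 58000];
console.log(sessionTriageCeiling(mixedSession));
```

Under these placeholder numbers the mixed session's ceiling is 3 + 3 + 3 + 5 = 14 triage calls instead of 12, so the paper gets more room without changing anything for the posts. That is the "different floors per source" answer to the question in item 3; it does nothing about the scale mismatch in item 1.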

Directions worth thinking about, none of them mandates

  • Length-based branch. Cheapest landing. Pick a LONG_SOURCE_CHARS boundary (the existing extract-route profile already classifies short/medium/long), pick a LONG_TFIDF_FLOOR. Doesn't address the cross-source scale problem but reasonably catches the long-form recall miss.
  • Profile-based branch. extractions.profile is already computed and persisted. A profile → threshold map is more principled than a raw length check and would let medium-entity-heavy vs medium-concept-heavy get different floors if that turns out to matter.
  • Normalize the similarity scale, not the threshold. Map raw cosine into a z-score or quantile against the corpus's own similarity distribution per source. Same floor everywhere, but expressed in a unit that's comparable across source shapes. Heavier change; more principled.
  • Different ranking signal for long-form. TF-IDF rewards rare-token overlap; for dense vocabulary-rich material the rare tokens that matter (entity names, technical terms) get drowned out by the rest. An embedding-based candidate ranker for the long path (against the Chroma collection that already exists) might recall the right pages without a threshold tuning loop. Bigger change again.
  • Live with the current behavior and a documented gotcha. Pin a note in CLAUDE.md and docs/spec/02-pipeline.md (the next time those get updated) so anyone wondering why their academic-paper compile produced a thin source-summary-only result has a thread to pull on. Lowest-effort, doesn't solve the underlying issue, may be the right answer if other priorities outweigh this.
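
For the "normalize the similarity scale" direction, a minimal sketch of the z-score variant follows. It assumes the matcher has, for one source, the raw cosine against every wiki page. The function names, the 1.5 z-floor and both score arrays are hypothetical and uncalibrated; the second array is simply the first scaled by 0.3 to imitate a long source's compressed scores.

```typescript
// Z-score of each similarity relative to this source's own distribution
// over all wiki pages (population standard deviation).
function zScores(similarities: number[]): number[] {
  const n = similarities.length;
  if (n === 0) return [];
  const mean = similarities.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    similarities.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / n;
  const std = Math.sqrt(variance);
  // A flat distribution carries no ranking signal; report zeros.
  if (std < 1e-12) return similarities.map(() => 0);
  return similarities.map((x) => (x - mean) / std);
}

const Z_FLOOR = 1.5; // hypothetical floor, expressed in standard deviations

// Indices of pages whose normalized score clears the floor, best first.
function indicesAboveZFloor(
  similarities: number[],
  zFloor: number = Z_FLOOR,
): number[] {
  return zScores(similarities)
    .map((z, index) => ({ z, index }))
    .filter(({ z }) => z >= zFloor)
    .sort((a, b) => b.z - a.z)
    .map(({ index }) => index);
}

// Invented scores: one clear match among five unrelated pages.
const shortSourceScores = [0.45, 0.05, 0.06, 0.04, 0.05, 0.05];
// The same pattern compressed by 0.3, as a long source might produce.
const longSourceScores = shortSourceScores.map((s) => s * 0.3);

console.log(indicesAboveZFloor(shortSourceScores));
console.log(indicesAboveZFloor(longSourceScores));
```

Because z-scores are unchanged by uniform scaling, both sources surface page 0, even though the long source's best raw score (0.135) would miss the raw 0.15 floor. Real compression is unlikely to be a clean uniform scaling, and with a corpus of around 20 pages the mean and standard deviation are noisy estimates, so this shows the idea rather than evidence that it works.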

What's known vs unknown

  • Known: The threshold is content-shape-sensitive in practice. 0.15 came from recalibrating against long-form; the recalibration didn't fix the problem for all content shapes, it shifted which shape it's wrong for.
  • Known: The cost of being too permissive is bounded and observable (triage call count × triage price, capped at sourceCount × 3). The cost of being too strict is silent missed matches.
  • Unknown: What the right LONG_TFIDF_FLOOR actually is. Needs calibration against a long-form corpus the way 0.15 was calibrated against the prediction-markets one. The favourite-longshot-bias paper case suggests something in the 0.08-0.12 range would be safe, but one data point isn't a calibration.
  • Unknown: Whether the problem is better solved at the threshold layer at all, or whether the ranking signal itself needs to change for long-form material.
  • Unknown: How mixed-content sessions should be reasoned about. The current code doesn't acknowledge them as a distinct shape; whatever fix lands here should at minimum name what it does for mixed sessions, even if "treat each source independently" is the chosen non-answer.
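
As a starting point for the calibration the first Unknown asks for, here is a sketch that sweeps candidate floors against a set of known-correct match scores and reports recall at each. The four scores are the ones quoted from the inline comment; recallAtFloor and the list of floors to try are hypothetical, and a real calibration would need many more labeled sources than this one paper.

```typescript
// Fraction of known-correct match scores that clear a given floor.
function recallAtFloor(correctScores: number[], floor: number): number {
  if (correctScores.length === 0) return 0;
  const kept = correctScores.filter((score) => score >= floor).length;
  return kept / correctScores.length;
}

// Correct-match scores from the favorite-longshot-bias paper case.
const longshotPaperScores = [0.208, 0.185, 0.137, 0.118];

// Floors to compare: the old floor, the current one, and the suggested range.
const floorsToTry = [0.3, 0.15, 0.12, 0.08];

const sweep = floorsToTry.map((floor) => ({
  floor,
  recall: recallAtFloor(longshotPaperScores, floor),
}));

console.log(sweep);
// recall per floor: 0.3 -> 0, 0.15 -> 0.5, 0.12 -> 0.75, 0.08 -> 1
```

On this single data point, a floor at the top of the suggested range (0.12) would still drop the 0.118 match, so even a rough calibration probably wants the floor chosen from the observed distribution of correct-match scores rather than picked by eye.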

Files in scope (for whoever picks this up)

  • app/src/app/api/compile/match/route.ts — the floor, the inline comment, the per-source candidate filter
  • nlp-service/routers/extraction.py (/extract/tfidf-overlap) is where the cosine computation lives; if normalization changes, it changes here
  • app/src/lib/db.ts — would need a getProfileForSource(source_id) helper for the profile-based direction

Metadata

Assignees: no one assigned

Labels: enhancement (new feature or request), up-for-grabs (maintainer isn't actively working on this; open for anyone to pick up)
