Audit date: 2026-05-22
Corpus size at audit: thousands of sessions accumulated over months of multi-agent use
Trigger: First end-to-end "use Smriti as a user would" eval since the daemon work began. Goal was to find out what's actually shippable, not just what passes unit tests.
This is a single tracking issue covering five queries' worth of findings, the methodology used to evaluate them, and the proposed fixes ranked by complexity. Split into sub-issues if/when work starts on individual findings.
Why this matters now
The v0.8.0 daemon (#71) will make Smriti capture much more data automatically. If retrieval quality is shaky today on a manually-curated corpus, it gets worse, not better, when the daemon is silently filling the DB. We need to know what's wrong before we 5x the data volume.
Methodology
For each query, we tracked five dimensions:
| Dimension | Question |
| --- | --- |
| Precision | Are the retrieved hits actually about what was asked? |
| Recall | Did obvious-relevant sessions get missed? |
| Synthesis fidelity | Does the LLM output reflect what's in the sources, or does it hallucinate? |
| Latency | How long does the call take end-to-end? |
| Ground-truth check | Pick a specific claim per output, verify it against data we can independently confirm |
Ground truths used (so hallucinations could be detected):
The 42-process pile-up: 13,449 CPU-minutes total, oldest from Wednesday, fix was lockf -t 0 /tmp/smriti-ingest.lock
Daemon design rejected three options during pre-impl smoke tests: chokidar (fires 0 events under Bun), socket-bind single-instance (silently steals connections), long-lived DB connection (Bun segfault, 6.8 GB RSS peak)
QMD upstream: fork is 49 commits behind, hits include 004714a / 3b7e065 / d045a8b / e36ab96
Current Claude session ID: 4a283f66-575d-47db-864e-9c77f9e0f07b
Test suite
| # | Query | Command |
| --- | --- | --- |
| 1 | "42 stuck processes that consumed 9 CPU days" with synthesis | smriti recall "..." --limit 5 --synthesize |
| 2 | Temporal drift on QMD | smriti drift "qmd" |
| 3 | RAG ask on a specific decision | smriti ask "how does the daemon enforce single-instance, and why" |
| 4 | Project-scoped list | smriti list --project smriti --limit 10 |
| 5 | BM25 exact match | smriti search "lockf" --limit 8 |
Picked to cover: BM25-only retrieval (test 5), semantic retrieval (test 1's recall layer), LLM synthesis (tests 1 + 3), temporal aggregation (test 2), metadata-only listing (test 4). Each test stresses a different layer.
Findings
🔴 P0 — Observer-session crowding in retrieval
Severity: Affects every search/recall query. The single biggest product issue.
Evidence: In test 5 (BM25 for "lockf"), 6 of 8 hits were "Hello memory agent, you are continuing to observe..." sessions, all scoring 0.137–0.152. The primary session containing the actual lockf-and-daemon-design story scored 0.136 at hit #6 — below the observer noise. Same pattern in tests 1, 2, 3.
These observer sessions are created by claude-mem's plugin: they record summaries of other sessions' work. They're dense, well-formatted, and contain heavy term overlap with the primary sessions they observe. BM25 and vector retrieval both reward this density. The user almost always wants the primary source, not the observer's summary of it.
Impact: Users searching for their own work get someone else's notes about their own work. Confusing in good cases, misleading in bad cases (observer summaries can lag the primary or contain interpretation drift).
Possible fixes (ranked by complexity):
Quick: Filter observer sessions by default; expose --include-observers for explicit opt-in. Detect via agent name (claude-mem) or session title prefix. ~20 LOC in src/search/index.ts.
Medium: Add a session_type column (primary | observer | derived) on smriti_session_meta, populate at ingest time. Ranker applies a configurable boost/penalty based on type. ~100 LOC + a migration.
Architectural: Treat observer sessions as annotations on primary sessions, not as first-class searchables. They'd surface only when expanding the result for a primary session. ~weeks of work.
Recommended: ship quick fix in v0.8.1 (filter-by-default), revisit medium fix in v0.9.x.
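As a rough illustration of the quick fix, the sketch below filters observer sessions out of a hit list unless the caller opts in. The `SearchHit` shape, the function names, and the exact title prefix are assumptions made for this example; the real types in src/search/index.ts may look different.

```typescript
// Hypothetical hit shape; the real Smriti types may differ.
interface SearchHit {
  sessionId: string;
  agent: string; // e.g. "claude-mem" for observer sessions
  title: string;
  score: number;
}

// Assumed detection heuristics, based on the two signals named above:
// the agent name and the session title prefix.
const OBSERVER_AGENT = "claude-mem";
const OBSERVER_TITLE_PREFIX = "Hello memory agent";

function isObserverSession(hit: SearchHit): boolean {
  return hit.agent === OBSERVER_AGENT || hit.title.startsWith(OBSERVER_TITLE_PREFIX);
}

// Drop observer sessions unless the caller opts in
// (the --include-observers case).
function filterObservers(hits: SearchHit[], includeObservers = false): SearchHit[] {
  return includeObservers ? hits : hits.filter((h) => !isObserverSession(h));
}
```

One design point to keep in mind: the filter should probably run before the --limit truncation, otherwise a query whose top hits are mostly observers would return fewer results than requested.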
🔴 P0 — Synthesis hallucinates from stale-draft data
Severity: Confidence-of-wrong-answer is the worst failure mode. Has shipped already (synthesis is a daily-use feature).
Evidence: Test 1's synthesis output included this in <next_steps>:
"Address cross-platform file system monitoring with chokidar abstraction."
We explicitly rejected chokidar in the daemon PRD's "Three pre-impl smoke-test findings" section — it fires zero events under Bun 1.3.6 in our test. But earlier drafts of the PRD (now superseded) recommended chokidar. Synthesis pulled from the older draft and confidently presented it as a next step.
For a tool whose explicit pitch is "team learning from each other's coding sessions," this is the exact opposite of the intended UX: rather than surfacing the lesson ("we tried chokidar, it didn't work, here's why"), it surfaces the rejected suggestion as if it were current.
Impact: Decisions can be silently reverted via synthesis. Users acting on a synthesized "next step" can re-introduce a bug we already learned to avoid.
Possible fixes:
1. Recency weighting in the ranker. When two sources discuss the same topic, prefer the newer one. Today's RRF fusion is content-blind to date. ~50 LOC in searchVec / searchFTS to add a recency boost.
2. Contradiction detection during synthesis. smriti recall --check-conflicts exists (#67, "Contradiction detection in recall results") but isn't run by default for --synthesize. Wire conflict detection into the synthesis prompt: if conflicts exist among sources, the prompt should surface "decisions evolved" rather than averaging them.
3. Explicit decisions ledger. When a decision is marked as superseding an earlier one (via category decision/superseded or explicit linking), synthesis treats the newer as canonical. Big design change; out of scope for v0.8.x.
4. Synthesis source-citation discipline. Make the synthesis prompt require each claim to cite a specific source ID, and have the post-processor verify the cited source actually contains the claim. Catches the worst form of hallucination at output time. ~moderate work.
Recommended: ship #1 (recency weighting) in v0.8.1, ship #2 (auto-run check-conflicts under synthesize) in v0.9.0.
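One way the recency weighting could look is a multiplicative boost applied after RRF fusion, decaying exponentially with source age. This is a sketch only: the `RankedHit` shape, the 30-day half-life, and the 0.5 maximum boost are illustrative assumptions, not values taken from Smriti or QMD.

```typescript
// Hypothetical fused-hit shape; createdAt is assumed to be available per hit.
interface RankedHit {
  id: string;
  rrfScore: number; // fused score before any date awareness
  createdAt: Date;
}

const MS_PER_DAY = 86400000;

// Exponential decay: a brand-new source gets (1 + maxBoost),
// a source exactly halfLifeDays old gets (1 + maxBoost / 2).
function recencyMultiplier(createdAt: Date, now: Date, halfLifeDays: number, maxBoost: number): number {
  const ageDays = Math.max(0, (now.getTime() - createdAt.getTime()) / MS_PER_DAY);
  return 1 + maxBoost * Math.pow(0.5, ageDays / halfLifeDays);
}

// Re-rank fused hits so that, at similar relevance, newer sources win.
function applyRecencyBoost(hits: RankedHit[], now: Date, halfLifeDays = 30, maxBoost = 0.5): RankedHit[] {
  return hits
    .map((h) => ({ ...h, rrfScore: h.rrfScore * recencyMultiplier(h.createdAt, now, halfLifeDays, maxBoost) }))
    .sort((a, b) => b.rrfScore - a.rrfScore);
}
```

A multiplicative boost only reorders hits that were already close in relevance, so a superseded PRD draft would drop below the current one without old sessions vanishing from results entirely.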
🟡 P1 — Citation UX is broken in two specific ways
Severity: Affects every smriti ask output. Hurts trust in answers that are otherwise accurate.
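Evidence (test 3): two distinct bugs.
Date parsing failure: every citation shows Invalid Date. Likely a new Date(undefined) somewhere in the format layer whose result is then stringified (for example via toLocaleDateString() or string interpolation), which renders as "Invalid Date". Note that calling toISOString() on an invalid date would throw a RangeError instead of producing this string, so the culprit is probably a locale or string conversion.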
Indistinguishable titles: all five citations share the same title prefix from the observer-session prompt. Without dereferencing each session_id, the user can't tell what they're looking at.
Possible fixes:
Date bug: locate the date-formatting call (likely in src/format.ts or wherever smriti ask builds its citations), guard against undefined/null, fall back to "unknown date" or the message's actual timestamp from created_at. ~5 LOC.
Title bug: when a session title would collide with N other sessions, append a distinguishing suffix (date + first-line snippet, or session-id-short). Or: stop using the prompt as the title for observer sessions; derive a title from the observed work instead.
The title bug is partially the same problem as P0 (observer crowding) — fixing P0 may fix this implicitly.
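The date guard could be as small as the sketch below. `formatCitationDate` and its parameters are hypothetical names for this example: `primary` stands for whatever date field the citation currently reads, and `fallback` for the message's created_at.

```typescript
// Returns a display date for a citation and never the string "Invalid Date".
// Both inputs may be missing or unparseable.
function formatCitationDate(
  primary: string | number | Date | null | undefined,
  fallback?: string | number | Date | null,
): string {
  for (const candidate of [primary, fallback]) {
    if (candidate === null || candidate === undefined) continue;
    const d = candidate instanceof Date ? candidate : new Date(candidate);
    // An invalid Date has a NaN timestamp; skip it and try the next candidate.
    if (!Number.isNaN(d.getTime())) {
      return d.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
    }
  }
  return "unknown date";
}
```

Checking `getTime()` for NaN before formatting is what keeps both failure modes out of the output: the "Invalid Date" string and the RangeError that `toISOString()` throws on an invalid date.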
🟡 P1 — smriti drift doesn't show evolution of thinking
Severity: A whole command provides little value beyond what smriti list --project ... --limit N already does.
Evidence (test 2): Asked smriti drift "qmd". Got back a narrative that says "they started doing X, then turned to Y, now Z" — a generic temporal arc. The narrative talks about files (server.ts, queue.ts) but not decisions or turning points. A useful drift for "qmd" would surface: "Mar 12: QMD treated as black-box dependency. Apr 4: discovered fork was 49 commits behind. May 19: decided to track upstream rather than diverge, started Smriti daemon design." Today's output returns dates and filenames where it should return narrative beats.
Topic-vs-keyword confusion also hit hard: drift on "qmd" returned 10 sessions that mention qmd but where qmd wasn't the topic.
Possible fixes:
Better synthesis prompt for drift. Today's prompt presumably says "summarize the evolution"; should say "identify 3-5 turning points and what changed at each." Free improvement.
Topic vs keyword filter: only include sessions where the keyword appears in the title or in a session-level category tag, not anywhere in the content. Cuts the corpus for drift dramatically.
Add a decision-marker hint: surface sessions categorized decision/* more heavily in drift output. We already have the category system; drift just doesn't use it.
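The second and third fixes could be combined in the session-selection step that feeds the drift prompt. The sketch below assumes a `SessionMeta` shape with session-level category tags; the field names and the selection policy (decisions first when over the limit, then chronological order) are illustrative, not the current implementation.

```typescript
// Hypothetical metadata shape; field names are assumptions for this sketch.
interface SessionMeta {
  id: string;
  title: string;
  categories: string[]; // session-level tags such as "decision/architecture"
  startedAt: Date;
}

// Topic filter: the keyword must appear in the title or a category tag,
// not merely somewhere in the session content.
function isAboutTopic(s: SessionMeta, keyword: string): boolean {
  const k = keyword.toLowerCase();
  return s.title.toLowerCase().includes(k) || s.categories.some((c) => c.toLowerCase().includes(k));
}

function isDecision(s: SessionMeta): boolean {
  return s.categories.some((c) => c.startsWith("decision/"));
}

// Keep on-topic sessions, prefer decision-tagged ones when over the limit,
// then return them oldest-first so the prompt sees a timeline.
function selectDriftSessions(sessions: SessionMeta[], keyword: string, limit = 10): SessionMeta[] {
  const onTopic = sessions.filter((s) => isAboutTopic(s, keyword));
  const picked = [...onTopic]
    .sort((a, b) => Number(isDecision(b)) - Number(isDecision(a)))
    .slice(0, limit);
  return picked.sort((a, b) => a.startedAt.getTime() - b.startedAt.getTime());
}
```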
🟡 P1 — BM25 dynamic range is squashed
Severity: Ranking is barely meaningful for short specific queries.
Evidence (test 5): Searching for "lockf" returned 8 hits scoring 0.128–0.152 — a 0.024 spread. For a term that appears 1 time in some sessions and 15+ times in others, the BM25 score differences should be 10x larger.
Likely the score normalization (the 1 / (1 + |bm25|) step in QMD's searchFTS) is mapping the natural range too aggressively. Could also be a chunking effect: if BM25 runs per-chunk and chunks have uniform size, term-frequency normalization within chunks washes out per-document signal.
Possible fixes:
Investigate first: log raw BM25 scores from FTS5 before normalization to confirm the spread is bigger than the visible one. ~10 minutes of instrumentation.
Adjust normalization: a less-aggressive transform like 1 - 1/(1 + 0.1*|bm25|) widens the range. Tune empirically.
Expose raw scores via --raw-scores flag for diagnostics — never user-facing but useful when debugging.
Probably an upstream QMD concern more than a Smriti one. File against tobi/qmd if confirmed.
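Before touching the normalization, it may help to work the numbers. The sketch below assumes the transform really is 1 / (1 + |bm25|), as hypothesized above, inverts it to recover the raw magnitudes behind the observed 0.128 and 0.152 scores, and then applies the proposed alternative with the 0.1 factor.

```typescript
// Current transform, as hypothesized above.
const normalizeCurrent = (raw: number): number => 1 / (1 + Math.abs(raw));

// Proposed transform; k = 0.1 is the value suggested in the fix list.
const normalizeProposed = (raw: number, k = 0.1): number => 1 - 1 / (1 + k * Math.abs(raw));

// Inverse of the current transform: recovers |bm25| from a displayed score.
const recoverRaw = (score: number): number => 1 / score - 1;

const observedScores = [0.128, 0.152]; // lowest and highest hit in test 5
const rawMagnitudes = observedScores.map(recoverRaw);
const proposedScores = rawMagnitudes.map((r) => normalizeProposed(r));

console.log({ rawMagnitudes, proposedScores });
```

Under that assumption the raw magnitudes come out at roughly 5.6 to 6.8, and the proposed transform spreads them over roughly 0.047 instead of 0.024, so the 0.1 factor alone only about doubles the range. Two things are worth checking during instrumentation: whether the raw spread is itself narrow (which would point at chunking instead of normalization), and the direction of the mapping, since 1 / (1 + |bm25|) gives smaller scores to larger magnitudes while SQLite FTS5's bm25() reports better matches as more negative values.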
🟡 P2 — "Ingested but not searchable" UX gap
Severity: Confusing edge case. Affects sessions during the gap between ingest and embed.
Evidence: The current Claude session (4a283f66-...) appears in smriti list --project smriti --limit 10 (test 4) but doesn't appear in any of the semantic recall tests (1, 3). It's been ingested as messages but hasn't been chunked + embedded yet. The user has no way to know which state a session is in.
Possible fixes:
Status column in smriti list: add a vector_state column (embedded | pending | failed). User can see at a glance which sessions are full-search-ready.
Auto-embed after ingest in the daemon: when a flush completes, kick off a small qmd embed --batch <project> for the newly-written chunks. Adds latency to the post-ingest path but closes the UX gap.
🟡 P2 — Silent failure when Ollama is unreachable
Severity: User runs --synthesize, gets raw search output, doesn't realize synthesis silently no-op'd.
Evidence: Before starting Ollama, smriti recall "..." --synthesize returned only the raw recall hits, no synthesis section, no error message. The synthesis call presumably timed out or refused-connection'd, and the catch handler just swallowed the failure.
Possible fixes:
Health-check Ollama before synthesis: probe http://127.0.0.1:11434/api/tags (or whatever) and if it fails, print a clear "Ollama unreachable at <host>, returning raw recall hits. Run ollama serve to enable synthesis." Continue without throwing. ~10 LOC.
Cache the last-known-good model: if synthesis works once, remember the model+host. On next failure, surface "synthesis failed (was working at <time>); Ollama may have crashed."
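A minimal sketch of the health check follows. It probes Ollama's /api/tags endpoint with a short timeout and degrades to raw hits with a visible message. The function names, the 2-second timeout, and the `synthesize` callback are assumptions for illustration; the callback stands in for whatever Smriti calls today.

```typescript
// Message printed when synthesis is skipped.
function unreachableMessage(host: string): string {
  return `Ollama unreachable at ${host}, returning raw recall hits. Run "ollama serve" to enable synthesis.`;
}

// Probe the model-list endpoint with a short timeout. Never throws.
async function ollamaReachable(host: string, timeoutMs = 2000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(`${host}/api/tags`, { signal: controller.signal });
    return res.ok;
  } catch {
    return false; // refused connection, DNS failure, or timeout
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical wrapper around the existing synthesis call.
async function synthesizeOrWarn(host: string, synthesize: () => Promise<string>): Promise<string | null> {
  if (!(await ollamaReachable(host))) {
    console.error(unreachableMessage(host)); // visible, but does not abort the command
    return null; // caller prints the raw recall hits only
  }
  return synthesize();
}
```

Returning `null` instead of throwing keeps the current behaviour of showing raw hits, while the stderr message removes the silent part of the failure.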
Repro recipe (regression check for future releases)
These five tests should be runnable on the corpus before each release. Recommend they be added to docs/internal/eval-suite.md along with this issue's ground truths.
# Before each release, on a real corpus:
smriti recall "the 42 stuck smriti ingest processes that consumed 9 CPU days" --limit 5 --synthesize
smriti drift "qmd"
smriti ask "what did we decide about how the daemon enforces single-instance, and why"
smriti list --project smriti --limit 10
smriti search "lockf" --limit 8
Score each on the five-dimension rubric above. If precision drops or hallucinations appear, block the release.
What I'd actually do, in order
BM25 range investigation. Time-boxed: spend an hour on it; if it's an upstream-QMD issue, file there.
Drift + "ingested-but-not-searchable" + auto-embed-after-ingest. Larger features; queue for v0.9.x.
Do not block v0.8.0 on any of these. The daemon is a write-side feature; nothing here is a regression vs. the current state. But every line of this issue gets more important as v0.8.0 starts capturing more data without user effort.
Acceptance: how we'd know quality is fixed
Run the same five queries on the same corpus three releases later. The bar:
Test 5 (BM25 "lockf"): primary sessions occupy ≥3 of the top 5 hits
Test 1 (synthesize): no claims contradict known ground truths
Test 3 (ask): citations have real dates and distinguishable titles
Test 2 (drift): output surfaces ≥3 named decisions with dates
All tests: synthesis latency under 30s (today's worst was 87s)
When all five pass on a fresh corpus, this issue closes.
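The first bar lends itself to an automated check. The sketch below assumes each hit has already been labelled as observer or primary (for example via the observer detection discussed under the first P0); the `LabelledHit` shape and function name are hypothetical.

```typescript
// Hypothetical labelled hit, in rank order.
interface LabelledHit {
  sessionId: string;
  isObserver: boolean;
}

// Test 5 bar: primary sessions occupy at least minPrimary of the top topN hits.
function passesPrimaryBar(hits: LabelledHit[], topN = 5, minPrimary = 3): boolean {
  const primaries = hits.slice(0, topN).filter((h) => !h.isObserver).length;
  return primaries >= minPrimary;
}
```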