Skip to content

fix(cli): make streamed query totalApprox a meaningful lower bound#112

Merged
sroussey merged 4 commits into
mainfrom
claude/wonderful-hypatia-HTYiA
May 27, 2026
Merged

fix(cli): make streamed query totalApprox a meaningful lower bound#112
sroussey merged 4 commits into
mainfrom
claude/wonderful-hypatia-HTYiA

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

Summary

Addresses two HIGH review findings on the streaming CLI query path added in #111.

  • H1collectPage in src/cli/queries/_streamMatches.ts stopped at exactly offset + limit matches and returned that as total, so consumers reported totalApprox.atLeast = offset + limit, a constant. The "≥ N" UX (rendered by TableRenderer.ts) was therefore meaningless — it always showed the page end.
  • H2CikQuery.ts instead drained up to MAX_FUZZY_MATCHES = 1000 and reported a real count, so the two streaming query surfaces exposed totalApprox with different meanings, and each declared its own cap.

Decisions applied

  • Single shared soft cap. MAX_FUZZY_MATCHES = 1000 is now exported from src/cli/queries/_streamMatches.ts (moved out of CikQuery.ts). Both collectPage (as the default maxScan) and queryCiks reference it, so there is one source of truth and both surfaces report totalApprox with identical semantics.
  • totalApprox stays table-only. No change to renderJson / CSV output. Limitation: JSON and CSV consumers still receive only the numeric total (a lower bound when the cap fires) without the "this is approximate" signal; surfacing it there is out of scope for this fix.

Changes per file

  • _streamMatches.ts — Rewrote collectPage(iter, offset, limit, maxScan = MAX_FUZZY_MATCHES) to count every yielded match, retain only rows in [offset, offset + limit) (O(limit) memory), stop at the cap (exhausted: false) or when the iterator drains (exhausted: true), and return total = matched. Exported the shared MAX_FUZZY_MATCHES constant and updated docstrings.
  • CikQuery.ts — Now imports the shared MAX_FUZZY_MATCHES instead of declaring its own. Added a comment explaining why CikQuery buffers all matches up to the cap (ranking reorders the whole set) while collectPage buffers only the window. The empty-needle fast path still never sets totalApprox (unchanged).
  • EntityQuery.ts / FilingQuery.ts — Fixed stale comments that claimed the stream "stops after offset + limit matches". QueryResult.total docstring clarified. The six consumers keep the exhausted ? {...} : {..., totalApprox} pattern, which becomes correct now that total is real. EntityQuery's post-collect sort behavior is unchanged (still sorts the window only); only its comment was corrected.
  • Facts / Offering / Crowdfunding / Person Query — no code change needed; pattern already correct (verified).
  • TableRenderer.ts — no change; confirmed the "≥ N" branch reads correctly now that total is meaningful.
  • _streamMatches.test.ts (new) — unit tests for collectPage: full count below cap, cap fires (total === maxScan, exhausted: false), offset past match set, empty iterator, O(limit) window, and default-cap behavior.
  • EntityQuery.test.ts / FilingQuery.test.ts / PersonQuery.test.ts — added assertions that a streamed search reports the FULL match count (not offset + limit) with totalApprox undefined when the stream drains. CikQuery's existing "empty needle never sets totalApprox" test is preserved untouched.

Test plan

  • CI: collectPage unit tests pass (full count, cap, offset-past, empty, O(limit), default cap)
  • CI: Entity/Filing/Person streamed-search tests assert full total + undefined totalApprox
  • CI: existing CikQuery tests still pass (including empty-needle-no-totalApprox)
  • CI: typecheck/lint clean

Note: tests and the build were not run in this environment (no checkout, no toolchain). Implementation was done carefully against the source read at main; CI must validate.


Generated by Claude Code

sroussey added 2 commits May 27, 2026 01:13
collectPage now counts every yielded match (up to a shared soft cap)
instead of stopping at offset+limit, so total / totalApprox.atLeast is a
real lower bound rather than a constant page-end value. Both collectPage
and queryCiks now share the single MAX_FUZZY_MATCHES cap exported from
_streamMatches so the two streaming query surfaces report totalApprox
with identical semantics.
…omments

Updates EntityQuery/FilingQuery comments that claimed collectPage stops
after offset+limit matches, and adds tests asserting total equals the
full match count with totalApprox undefined when the stream drains.
@sroussey sroussey self-assigned this May 27, 2026
bun test shares module state across files in a worker, and
FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call
EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME /
SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry.
The registry has no unregister API, so once a sibling runs first those
tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its
`has(SEC_DB_FOLDER) === false` precondition and (b) drove
createCikNameBulkWriter() down the SQLite fast path -> getDb() opened
./sec-db/edgar.sqlite and prepared an INSERT against a non-existent
cik_names table, throwing a bun:sqlite prepare error.

This was latent since #111 and only surfaced now because adding
_streamMatches.test.ts in this PR shifted bun's test-file execution
order so the index task tests run before this one in the same worker.

Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres
value in beforeEach so the writer always routes to the repository
(in-memory) writer regardless of any leaked config, and drop the
order-dependent has(SEC_DB_FOLDER) assertion.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the streaming CLI query pagination/counting semantics so total becomes a meaningful lower bound (rather than a constant offset+limit) and aligns the behavior across streaming query surfaces by sharing a single soft cap constant.

Changes:

  • Reworked collectPage() to count matches up to a shared soft cap while retaining only the requested [offset, offset+limit) window (O(limit) memory), and exported MAX_FUZZY_MATCHES.
  • Updated CikQuery to import/use the shared cap and documented why it must buffer-and-sort the whole capped match set.
  • Added/updated unit tests to assert streamed searches report full totals when the iterator drains (and omit totalApprox in that case).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/task/ciknames/FetchAllCikNamesTask.test.ts Pins SEC_DB_TYPE in tests to avoid leaked registry config affecting bulk-writer selection.
src/cli/queries/PersonQuery.test.ts Adds assertion that drained streaming searches return full total and no totalApprox.
src/cli/queries/FilingQuery.ts Updates streaming-path comment to reflect new collectPage semantics.
src/cli/queries/FilingQuery.test.ts Adds assertion that drained streaming searches return full total and no totalApprox.
src/cli/queries/EntityQuery.ts Clarifies QueryResult.total/totalApprox semantics and updates streaming-path comments.
src/cli/queries/EntityQuery.test.ts Adds regression test that streamed totals are full-count (not offset+limit).
src/cli/queries/CikQuery.ts Imports shared MAX_FUZZY_MATCHES and documents buffering rationale for ranking.
src/cli/queries/_streamMatches.ts Exports shared cap and rewrites collectPage() to count matches up to cap while window-buffering.
src/cli/queries/_streamMatches.test.ts New unit tests covering collectPage() totals, cap behavior, and default-cap behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +91 to 95
matched++;
if (matched >= maxScan) {
// Hit the soft cap: there may be more matches we never counted.
return { rows: window, total: matched, exhausted: false };
}
Comment thread src/cli/queries/EntityQuery.ts Outdated
/** The match count we got to before stopping. */
/** The match count we got to before stopping (the soft cap). */
readonly atLeast: number;
/** True if the iterator drained — in which case `total` is exact. */
// API, so without this pin a leaked sqlite config would route the
// writer down the getDb() fast path and prepare an INSERT against a
// cik_names table that does not exist in the test SQLite file.
globalServiceRegistry.registerInstance(SEC_DB_TYPE, "memory" as unknown as "sqlite");
@sroussey sroussey merged commit a820d37 into main May 27, 2026
1 check passed
@sroussey sroussey deleted the claude/wonderful-hypatia-HTYiA branch May 27, 2026 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants