Skip to content

Fix bulk CIK name writer to respect SEC_DB_TYPE config#111

Merged
sroussey merged 9 commits into
mainfrom
claude/trusting-knuth-B8w2C
May 24, 2026
Merged

Fix bulk CIK name writer to respect SEC_DB_TYPE config#111
sroussey merged 9 commits into
mainfrom
claude/trusting-knuth-B8w2C

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

Summary

Fixes a regression where FetchAllCikNamesTask unconditionally wrote ~1M CIK name rows to a SQLite file via getDb(), even when SEC_DB_TYPE=postgres was configured. The system would then read from PostgreSQL, causing a silent data divergence.

Key Changes

  • New cikNameBulkWriter.ts — Introduces createCikNameBulkWriter() factory that dispatches to backend-specific fast paths:

    • SQLite: single INSERT OR REPLACE transaction (via prepared statement)
    • PostgreSQL: multi-row parameterised INSERT ... ON CONFLICT with 30k-row batching (respects 65535 bind parameter limit)
    • Fallback: repository-backed writer for tests without a real DB configured
  • Updated FetchAllCikNamesTask.ts — Replaces direct getDb() call with createCikNameBulkWriter(), making flush operations async and ensuring writer.close() is called in a finally block.

  • Hardened getDb() in db.ts — Now throws SecCliConfigurationError if SEC_DB_TYPE is not "sqlite", preventing silent data divergence. Callers must use createStorage() / repository tokens or getPgPool() for non-SQLite backends.

  • Updated setupAllDatabases.ts — Guards SQLite-specific view DDL creation behind a dbType === "sqlite" check, since getDb() now throws for other backends.

  • New test coverageFetchAllCikNamesTask.test.ts and db.test.ts verify bulk writer dispatch logic and the getDb() safety guard.

Implementation Details

The bulk writer factory checks SEC_DB_TYPE and SEC_DB_FOLDER registration to pick the right path:

  • If SEC_DB_TYPE=sqlite and SEC_DB_FOLDER is configured → SQLite fast path
  • If SEC_DB_TYPE=postgres → PostgreSQL fast path
  • Otherwise → repository fallback (covers tests and in-memory backends)

This design also serves as a safety net for the getDb() regression: if a caller mistakenly reaches into getDb() when SEC_DB_TYPE=postgres, it will fail loudly rather than silently writing to the wrong database.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

claude added 8 commits May 24, 2026 18:18
FetchAllCikNamesTask reached into getDb() directly for its ~1M-row bulk
insert fast path. getDb() returns a SQLite handle regardless of
SEC_DB_TYPE, so on Postgres backends rows silently landed in a stray
SQLite file that nothing else read from — Postgres cik_names stayed
empty and every downstream CIK-name lookup returned nothing.

- Add a CikNameBulkWriter abstraction that dispatches by SEC_DB_TYPE:
  SQLite keeps the prepared `INSERT OR REPLACE` + transaction fast path;
  Postgres uses multi-row parameterised `INSERT ... ON CONFLICT`
  (capped at 30k rows / 60k bind params per statement, all inside a
  single transaction); tests + missing-config fall through to the
  repository's putBulk against the in-memory backend.
- Make getDb() throw SecCliConfigurationError when SEC_DB_TYPE isn't
  sqlite, so the next caller can't repeat the same silent-divergence
  bug. Gate the canonical-view DDL in setupAllDatabases on
  SEC_DB_TYPE === "sqlite" for the same reason (Postgres view bootstrap
  is a follow-up).
- Tests pin: the writer falls back to the repository under the
  in-memory testing config, overwrites by PK on subsequent batches,
  and handles empty batches; getDb() throws under Postgres.
- CLAUDE.md documents the getDb() guard and the bulk-writer pattern.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
In CompanyNormalization.canonicalEndings the regex was built via a
template literal: `\b${regexp}\b`. Inside a template literal `\b` is the
backspace character (U+0008), not a word-boundary escape. The companion
regexes at the top of the file build the same shape via string
concatenation with "\\b" and work correctly.

Net effect: canonicalisations like "Acme L L C" → "Acme LLC" silently
no-op'd. Punctuated forms like "Acme L.L.C." only converged because
normalizeCompanyName strips dots first; space-separated forms never did.

Switch to string concatenation to match the working pattern elsewhere
in the file. Add a regression test that pins space-separated "L L C"
canonicalises and shares a hash id with "LLC".

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
Every `sec query` subcommand previously called getAll() / records(0) /
listAll() and filtered/sorted/paginated in JS. On the production corpus
(~1M rows in cik_names, millions of filings) every command OOM'd.

Push DB-evaluable predicates down to the storage layer and stream only
when a substring search forces JS-side filtering:

- Add `_streamMatches.ts`: cursored loop over `queryPage(criteria, ...)`
  plus a `collectPage(iter, offset, limit)` helper that stops as soon as
  it has offset+limit matches.
- Extend `QueryResult<T>` with optional `totalApprox: { atLeast,
  exhausted }`. When the iterator was capped before draining, the
  displayed total is a lower bound; the table renderer prints "≥ N" and
  a hint to narrow the filter. Pure-DB paths still return exact totals.
- FilingQuery: cik/form/single-date-bound pushed down via SearchCriteria;
  count() drives the displayed total. Substring on primary_doc_description
  and dual date bounds run in the predicate. Default ordering is filing_date DESC.
- CikQuery: empty needle now uses size() + getOffsetPage() — no scan.
  Exact/prefix/substring stream with a hard MAX_FUZZY_MATCHES = 1000
  cap; surfaces totalApprox when capped. Case-insensitive exact is still
  client-side (workglow equality is case-sensitive).
- PersonQuery: cik (source_filing_issuer_cik) pushed down; search +
  relationship substring stream.
- EntityQuery: cik → repo.get; sic/state pushed via criteria with
  count(); sorted no-filter uses queryPage with orderBy; substring on
  name streams.
- CrowdfundingQuery, OfferingQuery, FactsQuery: same shape — cik (and
  fy/portal where present) push down, substring fields stream.
- TableRenderer: renders "≥ N (streamed; narrow the filter…)" when the
  result carries `totalApprox` with exhausted=false. CLI groups/query.ts
  forwards `result.totalApprox` into the renderer.

Workglow's `query({})` rejects empty criteria with
StorageEmptyCriteriaError, so the unfiltered path uses
`getOffsetPage(offset, limit) + size()` instead. Workglow has no LIKE
operator, so any substring filter must remain client-side; the
streaming + cap pattern is the bound.

Tests: existing query tests pass unchanged; new regression test pins
that the CikQuery empty-needle case never sets `totalApprox`.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
…res config

CSV injection (M2):
TableRenderer.escapeCsvValue handled commas, quotes, and newlines but
not leading =/+/-/@/TAB/CR. Excel/Sheets/Numbers interpret cells
starting with those as formulas, opening WEBSERVICE/HYPERLINK data
exfiltration and external command paths. SEC-sourced fields can carry
arbitrary text (filer-supplied names, descriptions), so a CSV emitted
from `sec query --format csv` was shipping a known attack surface.
Prefix flagged cells with a single quote — spreadsheets render it
literally and hide the prefix; plain CSV readers see the original text
with one apostrophe.

Postgres fast-fail (M4):
assertSecCliEnvConfigured only validated the sqlite path (SEC_DB_FOLDER
+ SEC_DB_NAME). With SEC_DB_TYPE=postgres set but no PG env vars,
commands ran until pg.Pool surfaced a generic connection error
mid-execution, which buried the actual cause. Now require either
SEC_PG_URL or SEC_PG_HOST + SEC_PG_DATABASE and throw
SecCliConfigurationError up-front with an actionable message.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
SecFetchFileOutputCache built file paths via
path.join(folderPath, inputToFileName(input)) with no validation. The
inputToFileName implementations interpolate SEC-supplied values
(primary_doc, fileName, accession numbers) directly — e.g.
SecFetchAccessionDocTask builds `accessiondocs/{cik}/{accNo}-{fileName}`.
A single malformed `primary_doc` containing `../` segments would
escape SEC_RAW_DATA_FOLDER and overwrite arbitrary files on disk; an
absolute path would write anywhere.

The blast radius is small (SEC data isn't an active attacker), but
BootstrapDownloadTask.ts already validates its target dir the same
way, so this is straightforward defense-in-depth.

Add safeJoinWithinFolder() and use it in both saveOutput() and
getOutput(). Tests pin: ../ escapes throw, absolute paths throw, and
legitimate nested subdirectories still work.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
BootstrapSubmissionsTask, BootstrapCompanyFactsTask,
UpdateAllSubmissionsTask, and UpdateAllCompanyFactsTask each computed a
"what's already processed?" set/map by calling
`await processedRepo.getAll()` and iterating. On a production-scale DB
(hundreds of thousands to millions of processed rows) this materialises
every column for every row into one giant array, only to keep cik (and
last_processed for the freshness comparison).

Switch to `for await (const row of repo.records(5000))` so peak memory
is bounded to one page (5k rows) plus the final set/map. The set/map
still grows to N entries, but ~24 bytes/entry is ~25 MB for 1M CIKs vs
several hundred MB for the full row dump.

Extracted streamProcessedCikSet() into src/storage/processing/ for the
two callers (BootstrapSubmissionsTask, BootstrapCompanyFactsTask) that
need only the cik column. The UpdateAll* callers inline `records()`
because they need (cik, last_processed) pairs in a Map.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
…ering

BootstrapDownloadTask used SecFetchJob with response_type: "blob",
which materialised the entire response body in JS heap before
Bun.write() copied it to disk. SEC's submissions.zip and
companyfacts.zip archives are multi-GB; the previous path peaked at
roughly the file size in memory and OOM'd on smaller VMs.

Extract a `streamDownloadToFile()` helper that uses globalThis.fetch +
Bun.file().writer() to pipe the response body straight to disk a chunk
at a time. Peak memory is now bounded by the chunk size (Bun's default
~64KB), regardless of archive size. Progress is surfaced via the
existing context.updateProgress so multi-GB downloads aren't silent.

Tradeoff: bypasses SecFetchJob's queue/retry machinery because workglow's
FetchUrlJob buffers the body regardless of response_type. Bulk download
is a low-frequency, operator-triggered, single-URL path — SEC's 10
req/sec fair-use limit is never a concern for one download, and a
transient failure can be retried by re-running the command. Worth it
for the memory ceiling.

Tests: streaming writes match the source body across multiple chunks;
non-2xx responses throw without writing; missing content-length is
handled (totalBytes undefined, no progress percent).

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
StoreSubmissionFilingsTask called entityRepo.saveFiling() once per
filing inside a JS loop. For 10-K filers with thousands of historical
filings, that's thousands of sequential putBulk-of-one round-trips per
company. On Postgres each round-trip is ~1ms, so submission ingest for
a single high-volume CIK took multiple seconds where it should take
hundreds of milliseconds.

Switch to chunked saveFilingsBulk(500). The chunk size keeps the
per-batch payload well under PostgreSQL's bind-parameter ceiling (~30k
ish given the wide Filing row) while still amortising the round-trip
and the workglow `put` event emit across every batch.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a regression where bulk CIK-name ingestion could silently write to a SQLite file even when PostgreSQL was configured, and it adds several related scalability/security hardening improvements across ingestion, bootstrap download, caching, and CLI query pagination.

Changes:

  • Introduces a backend-dispatching CIK-name bulk writer and updates FetchAllCikNamesTask to use it (SQLite fast path, Postgres fast path, repo fallback).
  • Makes SQLite-only access via getDb() fail loudly when SEC_DB_TYPE is not sqlite, and updates DB setup logic accordingly.
  • Improves performance and safety: streamed download-to-disk for large SEC archives, streaming/approx-count query paths for CLI, batch upserts for filings, and cache path traversal protection.

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/util/db.ts Makes getDb() SQLite-only by throwing when SEC_DB_TYPE is non-sqlite to prevent silent divergence.
src/util/db.test.ts Adds test coverage for the new getDb() configuration guard.
src/task/submissions/UpdateAllSubmissionsTask.ts Streams processed-submission rows via records() to avoid materializing large tables.
src/task/submissions/StoreSubmissionFilingsTask.ts Switches to chunked bulk upserts for filings to reduce DB round-trips.
src/task/submissions/BootstrapSubmissionsTask.ts Uses a streamed CIK set builder to avoid getAll() on large processed tables.
src/task/facts/UpdateAllCompanyFactsTask.ts Streams processed-facts rows via records() to reduce memory usage.
src/task/facts/BootstrapCompanyFactsTask.ts Uses streamed CIK set builder for processed facts instead of getAll().
src/task/ciknames/FetchAllCikNamesTask.ts Uses the new bulk writer abstraction and ensures writer closure via finally.
src/task/ciknames/FetchAllCikNamesTask.test.ts Adds unit tests pinning bulk-writer fallback and overwrite semantics.
src/task/bootstrap/BootstrapDownloadTask.ts Streams large ZIP downloads directly to disk to avoid buffering multi-GB blobs in memory.
src/task/bootstrap/BootstrapDownloadTask.test.ts Adds tests for streaming download behavior and error handling.
src/storage/processing/processedCikSet.ts Adds helper to stream CIKs into a Set with bounded peak memory.
src/storage/entity/EntityRepo.ts Adds saveFilingsBulk() wrapper over repository putBulk().
src/storage/entity/cikNameBulkWriter.ts Adds backend-specific fast paths for bulk CIK-name ingestion (SQLite/Postgres/repo fallback).
src/storage/company/CompanyNormalization.ts Fixes regex word-boundary handling so canonical suffix normalization actually runs.
src/storage/company/CompanyNormalization.test.ts Adds regression tests for canonicalizing space-separated suffixes (e.g., “L L C”).
src/fetch/SecFetchFileOutputCache.ts Adds safeJoinWithinFolder to prevent cache path traversal writes/reads.
src/fetch/SecFetchFileOutputCache.test.ts Adds tests verifying traversal/absolute-path rejection and legitimate nested paths.
src/config/setupAllDatabases.ts Guards SQLite-only view DDL creation behind an explicit SEC_DB_TYPE === "sqlite" check.
src/config/EnvToDI.ts Strengthens configuration validation for sqlite vs postgres setups (clearer errors).
src/cli/queries/PersonQuery.ts Reworks query to push down DB-filterable criteria and stream predicate-only filters.
src/cli/queries/OfferingQuery.ts Streams predicate-only filters and uses count/query where possible to avoid full loads.
src/cli/queries/FilingQuery.ts Pushes down equality/range where possible; streams for substring search and dual-ended ranges.
src/cli/queries/FactsQuery.ts Uses DB count/query for exact filters; streams when substring filters are required.
src/cli/queries/EntityQuery.ts Adds streaming substring search path and supports approximate totals; adds sort validation.
src/cli/queries/CrowdfundingQuery.ts Pushes down criteria where possible; streams for substring and dual-ended date filters.
src/cli/queries/CikQuery.ts Adds empty-needle pagination fast path and caps fuzzy match collection to bound memory.
src/cli/queries/CikQuery.test.ts Adds regression test ensuring empty-needle pagination avoids scanning.
src/cli/queries/_streamMatches.ts Adds generic streaming helpers for cursor-pagination + collecting an offset/limit window.
src/cli/output/TableRenderer.ts Adds “≥ N” rendering for approximate totals and defuses CSV formula injection.
src/cli/output/TableRenderer.test.ts Adds tests for CSV formula-injection defusing and benign cases.
src/cli/groups/query.ts Plumbs totalApprox through to the table renderer for CLI output.
CLAUDE.md Documents the new SQLite-only getDb() behavior and the recommended backend-branching pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/fetch/SecFetchFileOutputCache.ts Outdated
Comment on lines +25 to +29
// `path.relative` returns ".." segments when the candidate escapes the
// base; the equality check covers the exact-base case.
const rel = path.relative(base, candidate);
if (rel.startsWith("..") || path.isAbsolute(rel)) {
throw new Error(
Comment thread src/cli/queries/EntityQuery.ts Outdated
Comment on lines +104 to +108
// No criteria but caller wants a sort — cursor-paginate so we get the
// ordering pushed down. Use queryPage with empty criteria; SQL backends
// override it with a server-side ORDER BY + LIMIT.
const page = await repo.queryPage({}, { orderBy, limit });
return { rows: [...page.items], total };
Comment thread src/cli/queries/CikQuery.ts Outdated
Comment on lines +42 to +47
* 1. **Empty needle** — `count() + getOffsetPage()`. No scan, no
* sorting. The "rank" concept doesn't apply because there's no
* target; rows come back in PK order.
* 2. **Exact match** — `query({ name }) + count({ name })`. Pushes
* equality down to the DB. Only the matching rows are loaded.
* 3. **Prefix / substring** — streams via `records()` because workglow
Comment thread src/cli/queries/CikQuery.ts Outdated
tableEmpty: !anyRowSeen,
};
if (!exhausted) {
(result as { totalApprox?: { atLeast: number; exhausted: boolean } }).totalApprox = {
Comment thread src/storage/entity/cikNameBulkWriter.ts Outdated
Comment on lines +41 to +46
// SQLite fast path needs SEC_DB_FOLDER to be configured (getDb() requires
// it). When the test harness leaves SEC_DB_TYPE=sqlite registered as the
// default but the rest of the production config is absent — e.g. in unit
// tests that exercise this task without standing up a real DB — fall
// through to the repository-backed writer instead of crashing in getDb().
if (dbType === "sqlite" && globalServiceRegistry.has(SEC_DB_FOLDER)) {
Critical:
- init wizard pushed only sqlite vars into process.env before EnvToDI;
  the new Postgres fast-fail (M4) then threw on first run for any user
  configuring Postgres via the wizard, even though .env.local was
  written correctly. Push SEC_PG_URL/HOST/PORT/USER/PASSWORD/DATABASE
  too. (src/cli/groups/init.ts:141-158)

Important:
- EntityQuery sorted-no-criteria path used queryPage({}, {orderBy,
  limit}) which is cursor-based and silently ignored `offset`, so
  `sec query entities --sort name --offset 50` always returned the
  first page. Switch to getAll({orderBy, limit, offset}), which pushes
  ORDER BY/LIMIT/OFFSET to SQL. Regression test added.
  (src/cli/queries/EntityQuery.ts:104-110)
- safeJoinWithinFolder rejected legitimate filenames like "..foo.txt"
  via a plain startsWith("..") check. Anchor on the path separator (and
  exact ".."). Regression test added.
  (src/fetch/SecFetchFileOutputCache.ts:28-39)
- cikNameBulkWriter's sqlite branch gated only on SEC_DB_FOLDER but
  getDb() requires SEC_DB_NAME too — partial test configs would crash
  in getDb() instead of falling back to the in-memory writer. Check
  both. (src/storage/entity/cikNameBulkWriter.ts:46-52)
- CikQuery JSDoc described an "Exact match — query({name})" mode that
  doesn't exist (the implementation streams for both exact and fuzzy
  because workglow equality is case-sensitive). Update the doc.
  Removed the unnecessary `result as { totalApprox?: ... }` cast in
  the same file by switching to conditional spread.
  (src/cli/queries/CikQuery.ts:40-50, 116)
- OfferingQuery `after` predicate worked by accident — null
  date_of_first_sale was filtered out via `("" < params.after)` being
  truthy. Handle null explicitly on both sides for symmetry with the
  `before` branch. (src/cli/queries/OfferingQuery.ts:57-66)
- Six query files always set totalApprox even when the stream drained
  (exhausted: true). The renderer's logic was correct, but consumers
  reading QueryResult.totalApprox couldn't trust its presence as the
  "this is a lower bound" signal. Only emit totalApprox when actually
  capped — matches CikQuery's pattern. (Entity/Filing/Person/
  Crowdfunding/Offering/FactsQuery.ts)

Minor:
- Removed dead empty `if (after && before)` block in FilingQuery
  buildCriteria; lifted comment into the function JSDoc. Also fixed
  HTML-entity-encoded comparison operators in the same comment.
- Pulled PG_MAX_ROWS_PER_STATEMENT to module scope in cikNameBulkWriter.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
@sroussey sroussey merged commit 9185947 into main May 24, 2026
1 check passed
sroussey added a commit that referenced this pull request May 27, 2026
bun test shares module state across files in a worker, and
FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call
EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME /
SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry.
The registry has no unregister API, so once a sibling runs first those
tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its
`has(SEC_DB_FOLDER) === false` precondition and (b) drove
createCikNameBulkWriter() down the SQLite fast path -> getDb() opened
./sec-db/edgar.sqlite and prepared an INSERT against a non-existent
cik_names table, throwing a bun:sqlite prepare error.

This was latent since #111 and only surfaced now because adding
_streamMatches.test.ts in this PR shifted bun's test-file execution
order so the index task tests run before this one in the same worker.

Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres
value in beforeEach so the writer always routes to the repository
(in-memory) writer regardless of any leaked config, and drop the
order-dependent has(SEC_DB_FOLDER) assertion.
sroussey added a commit that referenced this pull request May 27, 2026
)

* fix(cli): make streamed query totalApprox a meaningful lower bound

collectPage now counts every yielded match (up to a shared soft cap)
instead of stopping at offset+limit, so total / totalApprox.atLeast is a
real lower bound rather than a constant page-end value. Both collectPage
and queryCiks now share the single MAX_FUZZY_MATCHES cap exported from
_streamMatches so the two streaming query surfaces report totalApprox
with identical semantics.

* test(cli): assert streamed total is the full match count; fix stale comments

Updates EntityQuery/FilingQuery comments that claimed collectPage stops
after offset+limit matches, and adds tests asserting total equals the
full match count with totalApprox undefined when the stream drains.

* fix(test): isolate cik-name bulk writer test from leaked DB config

bun test shares module state across files in a worker, and
FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call
EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME /
SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry.
The registry has no unregister API, so once a sibling runs first those
tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its
`has(SEC_DB_FOLDER) === false` precondition and (b) drove
createCikNameBulkWriter() down the SQLite fast path -> getDb() opened
./sec-db/edgar.sqlite and prepared an INSERT against a non-existent
cik_names table, throwing a bun:sqlite prepare error.

This was latent since #111 and only surfaced now because adding
_streamMatches.test.ts in this PR shifted bun's test-file execution
order so the index task tests run before this one in the same worker.

Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres
value in beforeEach so the writer always routes to the repository
(in-memory) writer regardless of any leaked config, and drop the
order-dependent has(SEC_DB_FOLDER) assertion.

* fix(cli): fill window past cap in collectPage; simplify totalApprox; clarify test cast
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants