Fix bulk CIK name writer to respect SEC_DB_TYPE config by sroussey · Pull Request #111 · workglow-dev/sec

sroussey · 2026-05-24T18:32:36Z

Summary

Fixes a regression where FetchAllCikNamesTask unconditionally wrote ~1M CIK name rows to a SQLite file via getDb(), even when SEC_DB_TYPE=postgres was configured. The system would then read from PostgreSQL, causing a silent data divergence.

Key Changes

New cikNameBulkWriter.ts — Introduces createCikNameBulkWriter() factory that dispatches to backend-specific fast paths:
- SQLite: single INSERT OR REPLACE transaction (via prepared statement)
- PostgreSQL: multi-row parameterised INSERT ... ON CONFLICT with 30k-row batching (respects 65535 bind parameter limit)
- Fallback: repository-backed writer for tests without a real DB configured
Updated FetchAllCikNamesTask.ts — Replaces direct getDb() call with createCikNameBulkWriter(), making flush operations async and ensuring writer.close() is called in a finally block.
Hardened getDb() in db.ts — Now throws SecCliConfigurationError if SEC_DB_TYPE is not "sqlite", preventing silent data divergence. Callers must use createStorage() / repository tokens or getPgPool() for non-SQLite backends.
Updated setupAllDatabases.ts — Guards SQLite-specific view DDL creation behind a dbType === "sqlite" check, since getDb() now throws for other backends.
New test coverage — FetchAllCikNamesTask.test.ts and db.test.ts verify bulk writer dispatch logic and the getDb() safety guard.

Implementation Details

The bulk writer factory checks SEC_DB_TYPE and SEC_DB_FOLDER registration to pick the right path:

If SEC_DB_TYPE=sqlite and SEC_DB_FOLDER is configured → SQLite fast path
If SEC_DB_TYPE=postgres → PostgreSQL fast path
Otherwise → repository fallback (covers tests and in-memory backends)

This design also serves as a safety net for the getDb() regression: if a caller mistakenly reaches into getDb() when SEC_DB_TYPE=postgres, it will fail loudly rather than silently writing to the wrong database.

https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

FetchAllCikNamesTask reached into getDb() directly for its ~1M-row bulk insert fast path. getDb() returns a SQLite handle regardless of SEC_DB_TYPE, so on Postgres backends rows silently landed in a stray SQLite file that nothing else read from — Postgres cik_names stayed empty and every downstream CIK-name lookup returned nothing. - Add a CikNameBulkWriter abstraction that dispatches by SEC_DB_TYPE: SQLite keeps the prepared `INSERT OR REPLACE` + transaction fast path; Postgres uses multi-row parameterised `INSERT ... ON CONFLICT` (capped at 30k rows / 60k bind params per statement, all inside a single transaction); tests + missing-config fall through to the repository's putBulk against the in-memory backend. - Make getDb() throw SecCliConfigurationError when SEC_DB_TYPE isn't sqlite, so the next caller can't repeat the same silent-divergence bug. Gate the canonical-view DDL in setupAllDatabases on SEC_DB_TYPE === "sqlite" for the same reason (Postgres view bootstrap is a follow-up). - Tests pin: the writer falls back to the repository under the in-memory testing config, overwrites by PK on subsequent batches, and handles empty batches; getDb() throws under Postgres. - CLAUDE.md documents the getDb() guard and the bulk-writer pattern. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

In CompanyNormalization.canonicalEndings the regex was built via a template literal: `\b${regexp}\b`. Inside a template literal `\b` is the backspace character (U+0008), not a word-boundary escape. The companion regexes at the top of the file build the same shape via string concatenation with "\\b" and work correctly. Net effect: canonicalisations like "Acme L L C" → "Acme LLC" silently no-op'd. Punctuated forms like "Acme L.L.C." only converged because normalizeCompanyName strips dots first; space-separated forms never did. Switch to string concatenation to match the working pattern elsewhere in the file. Add a regression test that pins space-separated "L L C" canonicalises and shares a hash id with "LLC". https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

Every `sec query` subcommand previously called getAll() / records(0) / listAll() and filtered/sorted/paginated in JS. On the production corpus (~1M rows in cik_names, millions of filings) every command OOM'd. Push DB-evaluable predicates down to the storage layer and stream only when a substring search forces JS-side filtering: - Add `_streamMatches.ts`: cursored loop over `queryPage(criteria, ...)` plus a `collectPage(iter, offset, limit)` helper that stops as soon as it has offset+limit matches. - Extend `QueryResult<T>` with optional `totalApprox: { atLeast, exhausted }`. When the iterator was capped before draining, the displayed total is a lower bound; the table renderer prints "≥ N" and a hint to narrow the filter. Pure-DB paths still return exact totals. - FilingQuery: cik/form/single-date-bound pushed down via SearchCriteria; count() drives the displayed total. Substring on primary_doc_description and dual date bounds run in the predicate. Default ordering is filing_date DESC. - CikQuery: empty needle now uses size() + getOffsetPage() — no scan. Exact/prefix/substring stream with a hard MAX_FUZZY_MATCHES = 1000 cap; surfaces totalApprox when capped. Case-insensitive exact is still client-side (workglow equality is case-sensitive). - PersonQuery: cik (source_filing_issuer_cik) pushed down; search + relationship substring stream. - EntityQuery: cik → repo.get; sic/state pushed via criteria with count(); sorted no-filter uses queryPage with orderBy; substring on name streams. - CrowdfundingQuery, OfferingQuery, FactsQuery: same shape — cik (and fy/portal where present) push down, substring fields stream. - TableRenderer: renders "≥ N (streamed; narrow the filter…)" when the result carries `totalApprox` with exhausted=false. CLI groups/query.ts forwards `result.totalApprox` into the renderer. Workglow's `query({})` rejects empty criteria with StorageEmptyCriteriaError, so the unfiltered path uses `getOffsetPage(offset, limit) + size()` instead. Workglow has no LIKE operator, so any substring filter must remain client-side; the streaming + cap pattern is the bound. Tests: existing query tests pass unchanged; new regression test pins that the CikQuery empty-needle case never sets `totalApprox`. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

…res config CSV injection (M2): TableRenderer.escapeCsvValue handled commas, quotes, and newlines but not leading =/+/-/@/TAB/CR. Excel/Sheets/Numbers interpret cells starting with those as formulas, opening WEBSERVICE/HYPERLINK data exfiltration and external command paths. SEC-sourced fields can carry arbitrary text (filer-supplied names, descriptions), so a CSV emitted from `sec query --format csv` was shipping a known attack surface. Prefix flagged cells with a single quote — spreadsheets render it literally and hide the prefix; plain CSV readers see the original text with one apostrophe. Postgres fast-fail (M4): assertSecCliEnvConfigured only validated the sqlite path (SEC_DB_FOLDER + SEC_DB_NAME). With SEC_DB_TYPE=postgres set but no PG env vars, commands ran until pg.Pool surfaced a generic connection error mid-execution, which buried the actual cause. Now require either SEC_PG_URL or SEC_PG_HOST + SEC_PG_DATABASE and throw SecCliConfigurationError up-front with an actionable message. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

SecFetchFileOutputCache built file paths via path.join(folderPath, inputToFileName(input)) with no validation. The inputToFileName implementations interpolate SEC-supplied values (primary_doc, fileName, accession numbers) directly — e.g. SecFetchAccessionDocTask builds `accessiondocs/{cik}/{accNo}-{fileName}`. A single malformed `primary_doc` containing `../` segments would escape SEC_RAW_DATA_FOLDER and overwrite arbitrary files on disk; an absolute path would write anywhere. The blast radius is small (SEC data isn't an active attacker), but BootstrapDownloadTask.ts already validates its target dir the same way, so this is straightforward defense-in-depth. Add safeJoinWithinFolder() and use it in both saveOutput() and getOutput(). Tests pin: ../ escapes throw, absolute paths throw, and legitimate nested subdirectories still work. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

BootstrapSubmissionsTask, BootstrapCompanyFactsTask, UpdateAllSubmissionsTask, and UpdateAllCompanyFactsTask each computed a "what's already processed?" set/map by calling `await processedRepo.getAll()` and iterating. On a production-scale DB (hundreds of thousands to millions of processed rows) this materialises every column for every row into one giant array, only to keep cik (and last_processed for the freshness comparison). Switch to `for await (const row of repo.records(5000))` so peak memory is bounded to one page (5k rows) plus the final set/map. The set/map still grows to N entries, but ~24 bytes/entry is ~25 MB for 1M CIKs vs several hundred MB for the full row dump. Extracted streamProcessedCikSet() into src/storage/processing/ for the two callers (BootstrapSubmissionsTask, BootstrapCompanyFactsTask) that need only the cik column. The UpdateAll* callers inline `records()` because they need (cik, last_processed) pairs in a Map. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

…ering BootstrapDownloadTask used SecFetchJob with response_type: "blob", which materialised the entire response body in JS heap before Bun.write() copied it to disk. SEC's submissions.zip and companyfacts.zip archives are multi-GB; the previous path peaked at roughly the file size in memory and OOM'd on smaller VMs. Extract a `streamDownloadToFile()` helper that uses globalThis.fetch + Bun.file().writer() to pipe the response body straight to disk a chunk at a time. Peak memory is now bounded by the chunk size (Bun's default ~64KB), regardless of archive size. Progress is surfaced via the existing context.updateProgress so multi-GB downloads aren't silent. Tradeoff: bypasses SecFetchJob's queue/retry machinery because workglow's FetchUrlJob buffers the body regardless of response_type. Bulk download is a low-frequency, operator-triggered, single-URL path — SEC's 10 req/sec fair-use limit is never a concern for one download, and a transient failure can be retried by re-running the command. Worth it for the memory ceiling. Tests: streaming writes match the source body across multiple chunks; non-2xx responses throw without writing; missing content-length is handled (totalBytes undefined, no progress percent). https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

StoreSubmissionFilingsTask called entityRepo.saveFiling() once per filing inside a JS loop. For 10-K filers with thousands of historical filings, that's thousands of sequential putBulk-of-one round-trips per company. On Postgres each round-trip is ~1ms, so submission ingest for a single high-volume CIK took multiple seconds where it should take hundreds of milliseconds. Switch to chunked saveFilingsBulk(500). The chunk size keeps the per-batch payload well under PostgreSQL's bind-parameter ceiling (~30k ish given the wide Filing row) while still amortising the round-trip and the workglow `put` event emit across every batch. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

Copilot

Pull request overview

This PR fixes a regression where bulk CIK-name ingestion could silently write to a SQLite file even when PostgreSQL was configured, and it adds several related scalability/security hardening improvements across ingestion, bootstrap download, caching, and CLI query pagination.

Changes:

Introduces a backend-dispatching CIK-name bulk writer and updates FetchAllCikNamesTask to use it (SQLite fast path, Postgres fast path, repo fallback).
Makes SQLite-only access via getDb() fail loudly when SEC_DB_TYPE is not sqlite, and updates DB setup logic accordingly.
Improves performance and safety: streamed download-to-disk for large SEC archives, streaming/approx-count query paths for CLI, batch upserts for filings, and cache path traversal protection.

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
src/util/db.ts	Makes `getDb()` SQLite-only by throwing when `SEC_DB_TYPE` is non-sqlite to prevent silent divergence.
src/util/db.test.ts	Adds test coverage for the new `getDb()` configuration guard.
src/task/submissions/UpdateAllSubmissionsTask.ts	Streams processed-submission rows via `records()` to avoid materializing large tables.
src/task/submissions/StoreSubmissionFilingsTask.ts	Switches to chunked bulk upserts for filings to reduce DB round-trips.
src/task/submissions/BootstrapSubmissionsTask.ts	Uses a streamed CIK set builder to avoid `getAll()` on large processed tables.
src/task/facts/UpdateAllCompanyFactsTask.ts	Streams processed-facts rows via `records()` to reduce memory usage.
src/task/facts/BootstrapCompanyFactsTask.ts	Uses streamed CIK set builder for processed facts instead of `getAll()`.
src/task/ciknames/FetchAllCikNamesTask.ts	Uses the new bulk writer abstraction and ensures writer closure via `finally`.
src/task/ciknames/FetchAllCikNamesTask.test.ts	Adds unit tests pinning bulk-writer fallback and overwrite semantics.
src/task/bootstrap/BootstrapDownloadTask.ts	Streams large ZIP downloads directly to disk to avoid buffering multi-GB blobs in memory.
src/task/bootstrap/BootstrapDownloadTask.test.ts	Adds tests for streaming download behavior and error handling.
src/storage/processing/processedCikSet.ts	Adds helper to stream CIKs into a `Set` with bounded peak memory.
src/storage/entity/EntityRepo.ts	Adds `saveFilingsBulk()` wrapper over repository `putBulk()`.
src/storage/entity/cikNameBulkWriter.ts	Adds backend-specific fast paths for bulk CIK-name ingestion (SQLite/Postgres/repo fallback).
src/storage/company/CompanyNormalization.ts	Fixes regex word-boundary handling so canonical suffix normalization actually runs.
src/storage/company/CompanyNormalization.test.ts	Adds regression tests for canonicalizing space-separated suffixes (e.g., “L L C”).
src/fetch/SecFetchFileOutputCache.ts	Adds `safeJoinWithinFolder` to prevent cache path traversal writes/reads.
src/fetch/SecFetchFileOutputCache.test.ts	Adds tests verifying traversal/absolute-path rejection and legitimate nested paths.
src/config/setupAllDatabases.ts	Guards SQLite-only view DDL creation behind an explicit `SEC_DB_TYPE === "sqlite"` check.
src/config/EnvToDI.ts	Strengthens configuration validation for sqlite vs postgres setups (clearer errors).
src/cli/queries/PersonQuery.ts	Reworks query to push down DB-filterable criteria and stream predicate-only filters.
src/cli/queries/OfferingQuery.ts	Streams predicate-only filters and uses `count/query` where possible to avoid full loads.
src/cli/queries/FilingQuery.ts	Pushes down equality/range where possible; streams for substring search and dual-ended ranges.
src/cli/queries/FactsQuery.ts	Uses DB count/query for exact filters; streams when substring filters are required.
src/cli/queries/EntityQuery.ts	Adds streaming substring search path and supports approximate totals; adds sort validation.
src/cli/queries/CrowdfundingQuery.ts	Pushes down criteria where possible; streams for substring and dual-ended date filters.
src/cli/queries/CikQuery.ts	Adds empty-needle pagination fast path and caps fuzzy match collection to bound memory.
src/cli/queries/CikQuery.test.ts	Adds regression test ensuring empty-needle pagination avoids scanning.
src/cli/queries/_streamMatches.ts	Adds generic streaming helpers for cursor-pagination + collecting an offset/limit window.
src/cli/output/TableRenderer.ts	Adds “≥ N” rendering for approximate totals and defuses CSV formula injection.
src/cli/output/TableRenderer.test.ts	Adds tests for CSV formula-injection defusing and benign cases.
src/cli/groups/query.ts	Plumbs `totalApprox` through to the table renderer for CLI output.
CLAUDE.md	Documents the new SQLite-only `getDb()` behavior and the recommended backend-branching pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  // `path.relative` returns ".." segments when the candidate escapes the
+  // base; the equality check covers the exact-base case.
+  const rel = path.relative(base, candidate);
+  if (rel.startsWith("..") || path.isAbsolute(rel)) {
+    throw new Error(


+    // No criteria but caller wants a sort — cursor-paginate so we get the
+    // ordering pushed down. Use queryPage with empty criteria; SQL backends
+    // override it with a server-side ORDER BY + LIMIT.
+    const page = await repo.queryPage({}, { orderBy, limit });
+    return { rows: [...page.items], total };


+ *   1. **Empty needle** — `count() + getOffsetPage()`. No scan, no
+ *      sorting. The "rank" concept doesn't apply because there's no
+ *      target; rows come back in PK order.
+ *   2. **Exact match** — `query({ name }) + count({ name })`. Pushes
+ *      equality down to the DB. Only the matching rows are loaded.
+ *   3. **Prefix / substring** — streams via `records()` because workglow


    tableEmpty: !anyRowSeen,
  };
+  if (!exhausted) {
+    (result as { totalApprox?: { atLeast: number; exhausted: boolean } }).totalApprox = {


+  // SQLite fast path needs SEC_DB_FOLDER to be configured (getDb() requires
+  // it). When the test harness leaves SEC_DB_TYPE=sqlite registered as the
+  // default but the rest of the production config is absent — e.g. in unit
+  // tests that exercise this task without standing up a real DB — fall
+  // through to the repository-backed writer instead of crashing in getDb().
+  if (dbType === "sqlite" && globalServiceRegistry.has(SEC_DB_FOLDER)) {


Critical: - init wizard pushed only sqlite vars into process.env before EnvToDI; the new Postgres fast-fail (M4) then threw on first run for any user configuring Postgres via the wizard, even though .env.local was written correctly. Push SEC_PG_URL/HOST/PORT/USER/PASSWORD/DATABASE too. (src/cli/groups/init.ts:141-158) Important: - EntityQuery sorted-no-criteria path used queryPage({}, {orderBy, limit}) which is cursor-based and silently ignored `offset`, so `sec query entities --sort name --offset 50` always returned the first page. Switch to getAll({orderBy, limit, offset}), which pushes ORDER BY/LIMIT/OFFSET to SQL. Regression test added. (src/cli/queries/EntityQuery.ts:104-110) - safeJoinWithinFolder rejected legitimate filenames like "..foo.txt" via a plain startsWith("..") check. Anchor on the path separator (and exact ".."). Regression test added. (src/fetch/SecFetchFileOutputCache.ts:28-39) - cikNameBulkWriter's sqlite branch gated only on SEC_DB_FOLDER but getDb() requires SEC_DB_NAME too — partial test configs would crash in getDb() instead of falling back to the in-memory writer. Check both. (src/storage/entity/cikNameBulkWriter.ts:46-52) - CikQuery JSDoc described an "Exact match — query({name})" mode that doesn't exist (the implementation streams for both exact and fuzzy because workglow equality is case-sensitive). Update the doc. Removed the unnecessary `result as { totalApprox?: ... }` cast in the same file by switching to conditional spread. (src/cli/queries/CikQuery.ts:40-50, 116) - OfferingQuery `after` predicate worked by accident — null date_of_first_sale was filtered out via `("" < params.after)` being truthy. Handle null explicitly on both sides for symmetry with the `before` branch. (src/cli/queries/OfferingQuery.ts:57-66) - Six query files always set totalApprox even when the stream drained (exhausted: true). The renderer's logic was correct, but consumers reading QueryResult.totalApprox couldn't trust its presence as the "this is a lower bound" signal. Only emit totalApprox when actually capped — matches CikQuery's pattern. (Entity/Filing/Person/ Crowdfunding/Offering/FactsQuery.ts) Minor: - Removed dead empty `if (after && before)` block in FilingQuery buildCriteria; lifted comment into the function JSDoc. Also fixed HTML-entity-encoded comparison operators in the same comment. - Pulled PG_MAX_ROWS_PER_STATEMENT to module scope in cikNameBulkWriter. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu

bun test shares module state across files in a worker, and FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME / SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry. The registry has no unregister API, so once a sibling runs first those tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its `has(SEC_DB_FOLDER) === false` precondition and (b) drove createCikNameBulkWriter() down the SQLite fast path -> getDb() opened ./sec-db/edgar.sqlite and prepared an INSERT against a non-existent cik_names table, throwing a bun:sqlite prepare error. This was latent since #111 and only surfaced now because adding _streamMatches.test.ts in this PR shifted bun's test-file execution order so the index task tests run before this one in the same worker. Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres value in beforeEach so the writer always routes to the repository (in-memory) writer regardless of any leaked config, and drop the order-dependent has(SEC_DB_FOLDER) assertion.

) * fix(cli): make streamed query totalApprox a meaningful lower bound collectPage now counts every yielded match (up to a shared soft cap) instead of stopping at offset+limit, so total / totalApprox.atLeast is a real lower bound rather than a constant page-end value. Both collectPage and queryCiks now share the single MAX_FUZZY_MATCHES cap exported from _streamMatches so the two streaming query surfaces report totalApprox with identical semantics. * test(cli): assert streamed total is the full match count; fix stale comments Updates EntityQuery/FilingQuery comments that claimed collectPage stops after offset+limit matches, and adds tests asserting total equals the full match count with totalApprox undefined when the stream drains. * fix(test): isolate cik-name bulk writer test from leaked DB config bun test shares module state across files in a worker, and FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME / SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry. The registry has no unregister API, so once a sibling runs first those tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its `has(SEC_DB_FOLDER) === false` precondition and (b) drove createCikNameBulkWriter() down the SQLite fast path -> getDb() opened ./sec-db/edgar.sqlite and prepared an INSERT against a non-existent cik_names table, throwing a bun:sqlite prepare error. This was latent since #111 and only surfaced now because adding _streamMatches.test.ts in this PR shifted bun's test-file execution order so the index task tests run before this one in the same worker. Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres value in beforeEach so the writer always routes to the repository (in-memory) writer regardless of any leaked config, and drop the order-dependent has(SEC_DB_FOLDER) assertion. * fix(cli): fill window past cap in collectPage; simplify totalApprox; clarify test cast

claude added 8 commits May 24, 2026 18:18

sroussey requested a review from Copilot May 24, 2026 20:12

sroussey self-assigned this May 24, 2026

Copilot started reviewing on behalf of sroussey May 24, 2026 20:12 View session

Copilot AI reviewed May 24, 2026

View reviewed changes

sroussey merged commit 9185947 into main May 24, 2026
1 check passed

sroussey mentioned this pull request May 27, 2026

fix(cli): make streamed query totalApprox a meaningful lower bound #112

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bulk CIK name writer to respect SEC_DB_TYPE config#111

Fix bulk CIK name writer to respect SEC_DB_TYPE config#111
sroussey merged 9 commits into
mainfrom
claude/trusting-knuth-B8w2C

sroussey commented May 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sroussey commented May 24, 2026

Summary

Key Changes

Implementation Details

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants