Skip to content

fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup#118

Open
sroussey wants to merge 1 commit into
mainfrom
claude/wonderful-hypatia-j2Nma
Open

fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup#118
sroussey wants to merge 1 commit into
mainfrom
claude/wonderful-hypatia-j2Nma

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

Summary

Two independent High-priority data-integrity fixes in the insider-trading extractor and the CIK-names bulk writer.

Fix 1 — Form 144 num() whitespace + shared helpers

The local num() helper inside Form_144.storage.ts treated "" as null but a whitespace-only value (e.g. " ", "\t\n") slipped past the early-return and reached Number(" ") which returns 0. EDGAR filings have been observed with whitespace-only numeric elements; the extractor was silently fabricating a 0 for aggregate_market_value, gross_proceeds, amount_acquired, etc. and stamping it into Postgres as if the filer had actually reported zero.

  • New src/sec/forms/insider-trading/_valueHelpers.ts exports two pairs of helpers with intentionally distinct signatures (scalar vs {value}-wrapped) so call sites can't accidentally cross-wire them:
    • strScalar / numScalar — for Form 144's flat string/number leaves.
    • strWrapped / numWrapped — for OwnershipDocument's { value } leaves.
    • All four trim before testing for empty, so whitespace-only values now correctly resolve to null.
  • Form_144.storage.ts removes the local str/num and imports strScalar as str, numScalar as num. Bumps extractor_version from "1.0.0""1.1.0" so the production version-slot machinery re-runs the extractor against every previously-stored Form 144 and overwrites the fabricated zeros.
  • OwnershipDocument.storage.ts removes its local str/num and imports strWrapped as str, numWrapped as num. extractor_version is intentionally NOT bumped — that helper already trimmed before the empty-check, so behaviour is byte-for-byte identical and re-extraction would be pure churn.
  • New _valueHelpers.test.ts pins null-on-empty/whitespace, finite-only coercion, and the scalar/wrapped boundary (including {value:" "}, {value:undefined}, {}).
  • Form_144.storage.test.ts gets parallel whitespace-only regression tests for aggregateMarketValue, grossProceeds (recent sales), and amountOfSecuritiesAcquired (acquisitions).
  • OwnershipDocument.storage.test.ts gets a parallel whitespace-only transactionShares regression test.

Fix 2 — PG cikNameBulkWriter per-slice dedup

createPostgresWriter().writeBatch built one multi-row INSERT ... ON CONFLICT ("cik") DO UPDATE per 30 000-row slice. Postgres rejects a statement that names the same conflict key twice in a single INSERT (ON CONFLICT DO UPDATE command cannot affect row a second time), so a single duplicate CIK in cik-lookup-data.txt aborted the whole transaction and lost all ~1M rows for the run. The SQLite branch's INSERT OR REPLACE already swallowed in-batch duplicates with last-write-wins.

  • Per-slice Map<number, string> dedup runs after slicing and only shrinks the row set, so the existing 60 000-bind cap (PG_MAX_ROWS_PER_STATEMENT * 2) still holds — left a comment to that effect.
  • Last-write-wins ordering matches the SQLite path.
  • console.debug records the drop count when dedup actually fires.
  • PG_MAX_ROWS_PER_STATEMENT = 30_000 and the SQLite branch are untouched.
  • New regression test in FetchAllCikNamesTask.test.ts covers the duplicate-CIK case end-to-end through the repository writer (the in-memory writer it falls through to in tests still exercises the dedup ordering invariant: last value wins, row count shrinks).

Follow-up

The same Number("")===0 / Number(" ")===0 bug class is plausible in the exempt-offerings extractors (Form C/D/1-A/1-K/1-Z); they were not in scope for this PR but warrant an audit pass and, where a local num() exists, migration to the shared _valueHelpers.

Test plan

  • bun test src/sec/forms/insider-trading/_valueHelpers.test.ts — new helpers cover all empty/whitespace/non-finite branches for both scalar and wrapped shapes.
  • bun test src/sec/forms/insider-trading/Form_144.storage.test.ts — existing tests pass under the new shared helpers; new whitespace-only tests for aggregate_market_value, gross_proceeds, amount_acquired all assert null.
  • bun test src/sec/forms/insider-trading/OwnershipDocument.storage.test.ts — existing empty-element tests still pass (behaviour unchanged); new whitespace-only transactionShares test asserts null.
  • bun test src/task/ciknames/FetchAllCikNamesTask.test.ts — new duplicate-CIK test asserts the second value wins and the row count collapses to the unique-key count.
  • On a staging Postgres, re-run the Form 144 extractor against a known sample of filings whose aggregateMarketValue was previously 0 and confirm the new rows are NULL and that extractor_runs reflects 1.1.0.

Generated by Claude Code

…ate-CIK dedup

- Extract shared str/num helpers (scalar + wrapped) into _valueHelpers.ts
  so Form 144 and OwnershipDocument share the same null-on-empty/whitespace
  semantics. Form 144's previous local num() treated "" as null but a
  whitespace-only value would Number("   ")=0, fabricating a 0 in DB.
- Bump Form_144.storage extractor_version 1.0.0 -> 1.1.0 to trigger
  re-extraction. OwnershipDocument behaviour is unchanged so its version
  stays.
- In the Postgres cik_names bulk writer, dedup duplicate CIKs per slice
  (last value wins) before building the multi-row INSERT, so an in-batch
  duplicate cik no longer trips ON CONFLICT once-per-statement.
sroussey added a commit that referenced this pull request Jun 2, 2026
Plan H part 1 of 5 — split across multiple commits due to the
push-files tool size limit; logically one fix. See later commits for
Form_1_A tests, Form_1_K, Form_1_Z, and Form_C.

Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/
strWrapped) plus its unit tests. numScalar treats empty / whitespace-
only / NaN / Infinity input as null and rejects thousand-separator
strings; legitimate "0" round-trips so the regression guard holds.

NOTE: this is an inline copy of the helpers introduced by PR #118
under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges
the duplicate should be removed in favour of a single shared module.

Also switches Form_1_A.schema.ts decimal-type aliases from
Type.Number() to Type.String() (4 aliases) so the storage layer can
make the null-vs-zero decision per cell with numScalar() instead of
Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts
is updated to numScalar() every decimal field in processFinancialData
and the RegAOfferingHistory build, and the extractor_version is bumped
1.0.0 → 1.1.0 to force re-extraction.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant