fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118
Open
sroussey wants to merge 1 commit into
…ate-CIK dedup
- Extract shared str/num helpers (scalar + wrapped) into _valueHelpers.ts
so Form 144 and OwnershipDocument share the same null-on-empty/whitespace
semantics. Form 144's previous local num() treated "" as null but a
whitespace-only value would Number(" ")=0, fabricating a 0 in DB.
- Bump Form_144.storage extractor_version 1.0.0 -> 1.1.0 to trigger
re-extraction. OwnershipDocument behaviour is unchanged so its version
stays.
- In the Postgres cik_names bulk writer, dedup duplicate CIKs per slice
(last value wins) before building the multi-row INSERT, so an in-batch
duplicate cik no longer trips ON CONFLICT once-per-statement.
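The per-slice dedup in the last bullet can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the `CikNameRow` shape, the `dedupSlice` name, and the sample rows are hypothetical, and only the last-value-wins rule is taken from the description above.

```typescript
// Hypothetical row shape; the real writer's types may differ.
interface CikNameRow {
  cik: number;
  name: string;
}

// Collapse duplicate CIKs within one slice, keeping the last value seen.
// Map.set() on an existing key overwrites its value, so later rows win,
// and the result can only be the same size as the input or smaller.
function dedupSlice(rows: readonly CikNameRow[]): CikNameRow[] {
  const byCik = new Map<number, string>();
  for (const row of rows) {
    byCik.set(row.cik, row.name);
  }
  return Array.from(byCik, ([cik, name]) => ({ cik, name }));
}

// Made-up sample data: CIK 1 appears twice in the same slice.
const slice: CikNameRow[] = [
  { cik: 1, name: "ALPHA CORP (OLD NAME)" },
  { cik: 2, name: "BETA INC" },
  { cik: 1, name: "ALPHA CORP" },
];

const deduped = dedupSlice(slice);
const dropped = slice.length - deduped.length;
console.debug(`dedup dropped ${dropped} row(s)`);
```

Because the deduplicated slice never grows, any bind-parameter cap computed from the slice size before dedup still bounds the statement built afterwards.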
sroussey added a commit that referenced this pull request on Jun 2, 2026
Plan H part 1 of 5: split across multiple commits due to the push-files tool size limit; logically one fix. See later commits for Form_1_A tests, Form_1_K, Form_1_Z, and Form_C.

Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/strWrapped) plus its unit tests. numScalar treats empty / whitespace-only / NaN / Infinity input as null and rejects thousand-separator strings; legitimate "0" round-trips so the regression guard holds.

NOTE: this is an inline copy of the helpers introduced by PR #118 under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges the duplicate should be removed in favour of a single shared module.

Also switches Form_1_A.schema.ts decimal-type aliases from Type.Number() to Type.String() (4 aliases) so the storage layer can make the null-vs-zero decision per cell with numScalar() instead of Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts is updated to numScalar() every decimal field in processFinancialData and the RegAOfferingHistory build, and the extractor_version is bumped 1.0.0 → 1.1.0 to force re-extraction.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
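The `numScalar` behaviour this commit message describes can be sketched as below. It is an illustrative reconstruction under assumptions, not the module's source: the signature (`unknown` in, `number | null` out) is assumed, and the thousand-separator rejection here simply falls out of `Number("1,000")` evaluating to `NaN`.

```typescript
// Assumed signature; the real helper in _valueHelpers.ts may differ.
function numScalar(raw: unknown): number | null {
  if (raw === null || raw === undefined) return null;
  if (typeof raw === "number") return Number.isFinite(raw) ? raw : null;
  if (typeof raw !== "string") return null;

  // Trim BEFORE the empty check: Number(" ") and Number("\t\n") are both 0,
  // which is how a whitespace-only element could turn into a fabricated zero.
  const trimmed = raw.trim();
  if (trimmed === "") return null;

  // NaN and +/-Infinity are rejected; "1,000" parses to NaN and is rejected too.
  const n = Number(trimmed);
  return Number.isFinite(n) ? n : null;
}

const results = {
  empty: numScalar(""),            // null
  whitespace: numScalar(" \t\n"),  // null
  zero: numScalar("0"),            // 0 (a legitimately reported zero survives)
  thousands: numScalar("1,000"),   // null
  infinity: numScalar("Infinity"), // null
};
```

One caveat of this sketch: `Number()` also accepts forms such as `"1e3"` and `"0x10"`, and the commit message does not say whether the real helper accepts or rejects them.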
Summary
Two independent High-priority data-integrity fixes in the insider-trading extractor and the CIK-names bulk writer.
Fix 1: Form 144 `num()` whitespace + shared helpers

The local `num()` helper inside `Form_144.storage.ts` treated `""` as `null`, but a whitespace-only value (e.g. `" "`, `"\t\n"`) slipped past the early return and reached `Number(" ")`, which returns `0`. EDGAR filings have been observed with whitespace-only numeric elements; the extractor was silently fabricating a `0` for `aggregate_market_value`, `gross_proceeds`, `amount_acquired`, etc. and stamping it into Postgres as if the filer had actually reported zero.

- `src/sec/forms/insider-trading/_valueHelpers.ts` exports two pairs of helpers with intentionally distinct signatures (scalar vs `{value}`-wrapped) so call sites can't accidentally cross-wire them:
  - `strScalar` / `numScalar`: for Form 144's flat string/number leaves.
  - `strWrapped` / `numWrapped`: for OwnershipDocument's `{ value }` leaves.
  - All four helpers return `null` for empty or whitespace-only input.
- `Form_144.storage.ts` removes the local `str`/`num` and imports `strScalar as str, numScalar as num`. Bumps `extractor_version` from `"1.0.0"` → `"1.1.0"` so the production version-slot machinery re-runs the extractor against every previously-stored Form 144 and overwrites the fabricated zeros.
- `OwnershipDocument.storage.ts` removes its local `str`/`num` and imports `strWrapped as str, numWrapped as num`. `extractor_version` is intentionally NOT bumped: that helper already trimmed before the empty-check, so behaviour is byte-for-byte identical and re-extraction would be pure churn.
- `_valueHelpers.test.ts` pins null-on-empty/whitespace, finite-only coercion, and the scalar/wrapped boundary (including `{value:" "}`, `{value:undefined}`, `{}`).
- `Form_144.storage.test.ts` gets parallel whitespace-only regression tests for `aggregateMarketValue`, `grossProceeds` (recent sales), and `amountOfSecuritiesAcquired` (acquisitions).
- `OwnershipDocument.storage.test.ts` gets a parallel whitespace-only `transactionShares` regression test.

Fix 2: PG `cikNameBulkWriter` per-slice dedup

`createPostgresWriter().writeBatch` built one multi-row `INSERT ... ON CONFLICT ("cik") DO UPDATE` per 30 000-row slice. Postgres rejects a statement that names the same conflict key twice in a single INSERT (`ON CONFLICT DO UPDATE command cannot affect row a second time`), so a single duplicate CIK in `cik-lookup-data.txt` aborted the whole transaction and lost all ~1M rows for the run. The SQLite branch's `INSERT OR REPLACE` already swallowed in-batch duplicates with last-write-wins.

- `Map<number, string>` dedup runs after slicing and only shrinks the row set, so the existing 60 000-bind cap (`PG_MAX_ROWS_PER_STATEMENT * 2`) still holds; a code comment records this.
- `console.debug` records the drop count when dedup actually fires.
- `PG_MAX_ROWS_PER_STATEMENT = 30_000` and the SQLite branch are untouched.
- `FetchAllCikNamesTask.test.ts` covers the duplicate-CIK case end-to-end through the repository writer (the in-memory writer it falls through to in tests still exercises the dedup ordering invariant: last value wins, row count shrinks).

Follow-up
The same `Number("")===0` / `Number(" ")===0` bug class is plausible in the exempt-offerings extractors (Form C/D/1-A/1-K/1-Z); they were not in scope for this PR but warrant an audit pass and, where a local `num()` exists, migration to the shared `_valueHelpers`.

Test plan
- `bun test src/sec/forms/insider-trading/_valueHelpers.test.ts`: new helpers cover all empty/whitespace/non-finite branches for both scalar and wrapped shapes.
- `bun test src/sec/forms/insider-trading/Form_144.storage.test.ts`: existing tests pass under the new shared helpers; new whitespace-only tests for `aggregate_market_value`, `gross_proceeds`, `amount_acquired` all assert `null`.
- `bun test src/sec/forms/insider-trading/OwnershipDocument.storage.test.ts`: existing empty-element tests still pass (behaviour unchanged); new whitespace-only `transactionShares` test asserts `null`.
- `bun test src/task/ciknames/FetchAllCikNamesTask.test.ts`: new duplicate-CIK test asserts the second value wins and the row count collapses to the unique-key count.
- … `aggregateMarketValue` was previously `0` and confirm the new rows are `NULL` and that `extractor_runs` reflects `1.1.0`.

Generated by Claude Code