fix(cli,storage,forms): complete CSV escape + TOCTOU + zip leak + exempt-offerings num (sec)#120
Open
sroussey wants to merge 7 commits into
Open
fix(cli,storage,forms): complete CSV escape + TOCTOU + zip leak + exempt-offerings num (sec)#120sroussey wants to merge 7 commits into
sroussey wants to merge 7 commits into
Conversation
Layers two refinements on top of the line-by-line CSV defusing in #119: 1. Tighten LEADING_WS so it only strips space-like characters that spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula leads, so we no longer strip them away before the DANGEROUS_LEAD check — otherwise "\t=cmd" or "\r=cmd" would slip through as non-dangerous after the strip. 2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused. Excel re-parses every physical line of a quoted cell, including lines separated by lone CR. Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM + "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula ("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left alone. All 44 cases in TableRenderer.test.ts pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Three related correctness fixes: 1. PersonObservationRepo / CompanyObservationRepo upsertByNaturalKey had a TOCTOU window — two concurrent upserts on an empty in-memory store would both observe `size()` as 0 and assign observation_id=1 to two different natural keys. Introduce a tiny process-wide AsyncMutex in src/util/AsyncMutex.ts that serializes the INSERT branch only (the existing-row update branch is keyed by observation_id and remains lock-free). Inside the critical section we re-query the natural key before allocating an id so that a parallel upsert of the same natural key still collapses to one row and one id. 2. BootstrapDownloadTask staged the multi-GB zip into SEC_RAW_DATA_FOLDER and only removed it on the success branch. If Bun.spawn threw, or unzip exited non-zero, the archive leaked until the next bootstrap run, silently consuming disk on operator VMs. Wrap the spawn+exit-code block in try/finally and remove the zip (with `force: true`) on every exit path. 3. AsyncMutex itself: trivial queue-based serializer. Rejections in one critical section don't poison the queue because the chained tracker swallows both branches; each caller still observes its own rejection through the returned promise. Tests: - PersonObservationRepo.test.ts / CompanyObservationRepo.test.ts each gain two new cases: concurrent INSERT-on-empty must yield distinct sequential ids, and concurrent UPDATE-existing + parallel INSERT must keep the original id stable and give the insert the next id. - BootstrapDownloadTask.test.ts gains an "execute zip cleanup" describe block that stubs fetch + Bun.spawn + Bun.which and asserts the zip is removed on (a) spawn-throws, (b) unzip-exits-nonzero, and (c) the success path. All 24 affected tests pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 1 of 5 — split across multiple commits due to the push-files tool size limit; logically one fix. See later commits for Form_1_A tests, Form_1_K, Form_1_Z, and Form_C. Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/ strWrapped) plus its unit tests. numScalar treats empty / whitespace- only / NaN / Infinity input as null and rejects thousand-separator strings; legitimate "0" round-trips so the regression guard holds. NOTE: this is an inline copy of the helpers introduced by PR #118 under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges the duplicate should be removed in favour of a single shared module. Also switches Form_1_A.schema.ts decimal-type aliases from Type.Number() to Type.String() (4 aliases) so the storage layer can make the null-vs-zero decision per cell with numScalar() instead of Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts is updated to numScalar() every decimal field in processFinancialData and the RegAOfferingHistory build, and the extractor_version is bumped 1.0.0 → 1.1.0 to force re-extraction. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 2 of 5 — Form_1_A test updates. Form_1_A.test.ts: decimal-typed leaves now arrive as raw strings from the parser; assertions updated accordingly (pricePerSecurity, audit/ legal/etc. fees). New "validate financial data types and ranges" split into decimal-string vs integer-number checks. Form_1_A.storage.test.ts: three new regression tests pin the null / empty / "0" round-trip behaviour through processForm1A. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 3 of 5 — Form_1_K. Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to Type.String(). Storage (processOfferingHistory): price_per_security, aggregate_offering_price, aggregate_offering_price_holders, and estimated_net_amount now flow through numScalar(). extractor_version bumped 1.0.0 → 1.1.0 with the standard re-extract comment. Tests: parsing assertions updated for the new string contract; new storage regression tests for null/empty/"0" round-trip. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 4 of 5 — Form_1_Z. Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to Type.String(). Storage: processOfferingSummaries now flows price_per_security, portionSecuritiesSoldIssuer/Securityholders, and issuerNetProceeds through numScalar(). processCertificationSuspension's approxRecordHolders also goes through numScalar() now (with a null-check instead of an undefined-check so the row is dropped when the input is empty/whitespace). extractor_version 1.0.0 → 1.1.0. Tests: parsing assertions updated for string contract; new storage regression tests for null/empty/"0" round-trip. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 5 of 5 — Form_C and final wrap-up. Schema: 4 DECIMAL_TYPE aliases switch from Type.Number() to Type.String() — including the previously minimum-0 DECIMAL_TYPE7_2_ NONNEGATIVE. The minimum-0 invariant is no longer expressible at the schema level; that constraint becomes the extractor's responsibility (see Form_C.storage.ts numScalar() use). Storage: - processOfferingInfo now flows price, offering_amount, and maximum_offering_amount through numScalar(); priceDeterminationMethod remains a passthrough. - processAnnualReportDisclosures now flows every disclosure field through numScalar() and drops the row when the result is null, instead of writing a fabricated 0 disclosure_value. - extractor_version 1.0.0 → 1.1.0. Tests: parsing assertions updated for the new string contract on price/offeringAmount/maximumOfferingAmount/currentEmployees/ totalAssetMostRecentFiscalYear/revenueMostRecentFiscalYear. New storage regression tests cover null/empty/"0" round-trip for both offering and disclosure numerics. Whole-suite verification (bun test) passes 845/845 with the helpers, schemas, storage, and tests all in place. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bundles three independently-rooted correctness fixes — one CRITICAL CLI hardening, one HIGH-severity storage TOCTOU + a HIGH-severity disk leak, and one HIGH-severity data-correctness bug in the exempt-offerings extractors. All seven commits land on
claude/wonderful-hypatia-gdhUE; the full suite (bun test) passes 845/845.Plan F — Complete CSV escape hardening (CRITICAL)
src/cli/output/TableRenderer.tswas vulnerable to CSV / spreadsheet-formula injection via three bypasses on top of PR #119's line-by-line defusing:\tand\rare themselves dangerous formula leads, so we no longer strip them before theDANGEROUS_LEADcheck; otherwise"\t=cmd"or"\r=cmd"slipped through as non-dangerous after the strip.value.split(/(\r?\n)/)→value.split(/(\r\n|\r|\n)/)so a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused.Tests in
src/cli/output/TableRenderer.test.tsnow cover all 7 zero-width prefixes + the bare-CR scenarios + a ZWSP-then-benign negative control. 44/44 cases pass.Plan G — TOCTOU PK race + bootstrap zip leak (HIGH × 2)
src/util/AsyncMutex.ts: trivial promise-queue mutex with rejection-isolated chaining so one failing critical section doesn't poison subsequent calls.PersonObservationRepo/CompanyObservationRepoupsertByNaturalKey: wrap the INSERT branch in a process-wide staticAsyncMutexand re-check the natural key inside the critical section. Without the fix, two concurrent inserts on an empty in-memory store both observedsize()as 0 and assignedobservation_id = 1to two different natural keys. Update-of-existing rows remain lock-free.BootstrapDownloadTask: wrap theBun.spawn(unzip)+ exit-code check intry { ... } finally { rmSync(zipPath, { force: true }); }. The multi-GB staged zip used to leak wheneverBun.spawnthrew orunzipexited non-zero; now it's removed on every exit path.New tests:
PersonObservationRepo.test.ts/CompanyObservationRepo.test.ts: concurrent INSERT-on-empty must yield distinct sequential ids; concurrent UPDATE-existing + parallel INSERT must keep the original id stable.BootstrapDownloadTask.test.tsgains an "execute zip cleanup" describe block: stubs fetch +Bun.spawn+Bun.which, asserts zip removed on (a) spawn-throws, (b) unzip-exits-nonzero, (c) success.Plan H — Exempt-offerings
num()bug (HIGH × 1, broad scope)The four exempt-offering schemas (Form 1-A, 1-K, 1-Z, C) typed every decimal XML leaf as
Type.Number(). When EDGAR shipped an empty or whitespace-only element,Value.ConvertonType.Number()silently coerced it to0, which then propagated throughRegAOfferingHistory,RegAFinancialData,CrowdfundingOfferings, andCrowdfundingReportsas a legitimate-looking zero-dollar amount.src/sec/forms/_valueHelpers.tswithstrScalar/numScalar/strWrapped/numWrapped.numScalartreats empty / whitespace-only /NaN/Infinityinput asnulland rejects thousand-separator strings; legitimate"0"round-trips. Inline copy of the helpers PR fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118 introduced undersrc/sec/forms/insider-trading/_valueHelpers.ts— when fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118 merges, the duplicate should be reconciled to a single shared module.DECIMAL_TYPE*aliases fromType.Number()toType.String()in all four schemas (1-A: 4 aliases, 1-K: 2, 1-Z: 2, C: 4).Type.Integer()aliases stay untouched.numScalar(...)inForm_1_A.storage.ts(processFinancialData,processForm1Ahistory build),Form_1_K.storage.ts(processOfferingHistory),Form_1_Z.storage.ts(processOfferingSummaries,processCertificationSuspension.approxRecordHolders), andForm_C.storage.ts(processOfferingInfo,processAnnualReportDisclosures).extractor_versionbumped 1.0.0 → 1.1.0 in all four with the standard re-extract comment.*.storage.test.tscover null / empty /"0"round-trip.Plan H is intentionally split across 5 commits (parts 1/5–5/5) due to the push-files tool size limit; the diffs taken together are exactly one Plan H.
Findings closed
TableRenderer.escapeCsvValue(3 follow-up failing cases on top of fix(cli): harden CSV formula-injection escape (sec) #119:\t/\rleads, bare-CR multi-line lines, zero-width-prefix bypasses).observation_idinPersonObservationRepoandCompanyObservationRepo.Bun.spawnfailure or non-zerounzipexit.0in offering history, financial data, and crowdfunding disclosures.Reconciliation with open PRs
This PR shares scope with #118 (insider-trading
_valueHelpers) and #119 (CSV escape line-by-line defusing). Neither is closed by this PR — they should be reconciled at merge time:src/sec/forms/insider-trading/_valueHelpers.ts. This PR adds an inline copy atsrc/sec/forms/_valueHelpers.tsand uses it from the exempt-offering storage extractors. When fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118 lands, the duplicate copy here should be removed and the imports rewritten to point at the shared module (probably promoted tosrc/sec/forms/_valueHelpers.tsonce both branches merge).TableRenderer.escapeCsvValue. Plan F layers two refinements on top — tightenedLEADING_WSand\r\n | \r | \nsplit. The merge should preserve the fix(cli): harden CSV formula-injection escape (sec) #119 base structure and apply the Plan F deltas.Test plan
bun test src/cli/output/TableRenderer.test.ts— 44 cases pass.bun test src/storage/observation— Person+Company repo tests pass, including new TOCTOU regression cases.bun test src/task/bootstrap— streamDownloadToFile cases plus new zip-cleanup cases pass.bun test src/sec/forms/_valueHelpers.test.ts— 32/32 helper unit tests pass.bun test src/sec/forms/exempt-offerings— 195/195 parse + storage tests pass, including new null/empty/"0" regression guards per form.bun testwhole-suite — 845/845.https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Generated by Claude Code