Skip to content

Improve data validation, fetch resilience, and database safety#98

Merged
sroussey merged 2 commits intomainfrom
claude/sec-project-code-review-Ys1W0
May 9, 2026
Merged

Improve data validation, fetch resilience, and database safety#98
sroussey merged 2 commits intomainfrom
claude/sec-project-code-review-Ys1W0

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

@sroussey sroussey commented May 9, 2026

Summary

This PR consolidates bad-data detection logic, adds comprehensive retry handling for SEC API fetches, improves database durability, and fixes several data parsing issues across form processors.

Key Changes

Bad Data Validation

  • Centralized bad-data detection: Moved isBadPersonField() from individual form processors (Form_C, Form_D) into a shared bad-data.ts module with an expanded BAD_PERSON_FIELDS set
  • Added support for additional placeholder values ("na", "managing member", "entity", "a delaware limited liability company", etc.)
  • Exported isBadPersonField() function for consistent validation across all form types

Fetch Resilience & Retry Logic

  • Added retry mechanism to SecFetchJob: Implements exponential backoff with jitter for transient failures
    • Retries on HTTP 408, 429, 5xx errors and network-level failures (ECONNRESET, ETIMEDOUT, ENOTFOUND, etc.)
    • Respects Retry-After headers from server responses
    • Per-attempt timeout prevents hung connections from blocking queue slots indefinitely
    • Configurable via MAX_FETCH_ATTEMPTS, INITIAL_BACKOFF_MS, MAX_BACKOFF_MS, and SEC_FETCH_TIMEOUT_MS env var

File Cache Atomicity

  • Atomic cache writes in SecFetchFileOutputCache: Write to temporary file then rename to prevent truncated cache entries and interleaved writes from concurrent processes
  • Improved Blob handling: properly converts between Blob and Buffer during serialization/deserialization

Database Durability

  • Switched SQLite from unsafe to durable mode:
    • Changed journal_mode from OFF to WAL (Write-Ahead Logging)
    • Changed synchronous from 0 (OFF) to NORMAL for crash-safe writes
    • Removed locking_mode = EXCLUSIVE to allow concurrent readers
  • Added closeDb() function to properly close database connections on shutdown
  • Added resetAllDatabases() function for CLI db reset command to truncate all repositories

Data Parsing Improvements

  • Safe integer parsing: Added parseIntegerOrNull() and parseCikSafely() helpers to handle EDGAR's malformed numeric strings (whitespace, non-digit cruft)
  • Form_D: Fixed parsing of totalOfferingAmount and totalRemaining to use safe integer parsing
  • Form_C: Replaced inline isBadPersonField() with centralized version; added parseCikSafely() for portal CIK
  • Form_1_A: Added country code resolution logic to distinguish US state codes from ISO country codes

CLI & Configuration

  • Removed --concurrency flag: Workglow now manages concurrency internally
  • Improved User-Agent handling: Made SecUserAgent configurable via SEC_USER_AGENT env var with better documentation
  • Database initialization: Moved SQLite initialization to command execution phase in commands/index.ts
  • Proper shutdown: Added cleanup in sec.ts to stop task queues and close database/connection pools

Bug Fixes & Improvements

  • Form.ts: Changed _arrayPaths from Map to WeakMap keyed by constructor to prevent collisions between Form subclasses with identical name properties
  • StoreCompanyFactsTask: Fixed logic to record processed facts even when no facts data exists
  • FetchQuarterlyIndexRangeTask: Fixed quarter range calculation to include both endpoints and respect startQuarter
  • SecCachedFetchTask: Ensured response_type is consistently set before cache lookups
  • Added Apache 2.0 license headers to modified files
  • Standardized quote style (single to double quotes) in type definitions

Testing

  • Updated test expectations to remove --concurrency flag checks
  • Removed obsolete test_any.ts file

Notable Implementation Details

  • The retry logic uses AbortSignal.timeout() for per-attempt timeouts and `combineSignals

https://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ

…ng, CLI

Wire up the previously-unregistered EntityHistory and AddressHistoryJunction
repositories so EntityTemporalRepo no longer throws at construction.

Switch SQLite to WAL + synchronous=NORMAL so a crash mid-write can no longer
corrupt the database. Add closeDb()/closePgPool() to the CLI exit path.

Strip the angle brackets from the SEC User-Agent (some EDGAR endpoints 403
on RFC-5322 mailbox forms) and let deployers override it via SEC_USER_AGENT.

Make the file-output cache durable: write to a unique tmp path then atomic
rename, round-trip Blob → Buffer → Blob correctly, and only swallow ENOENT
on read. Cache freshness comment now matches the actual comparison.

Add an HTTP retry-with-Retry-After loop and a per-attempt timeout to
SecFetchJob so transient 429/5xx and hung TCP connections no longer pin a
queue slot. Honor the rate-limiter via the existing CompositeLimiter.

Key Form._arrayPaths by the subclass constructor instead of `this.name`,
removing a latent collision when two Form subclasses share a description.
Centralize the ad-hoc `isBadPersonField` allowlists into one shared
`bad-data.ts` registry, and add NaN-safe parseInt helpers in Form_C/Form_D
storage. Use the actual US state set to derive country_code in Form 1-A
instead of `length === 2`, which silently coerced GB/FR to "US".

Implement `db reset --confirm` (delete-all + setupAllDatabases) and clean up
the CLI: drop the unused `--concurrency` flag, validate CIK args, fix the
inclusive-range math in FetchQuarterlyIndexRangeTask, restore the missing
`date` field in FetchCompanyFactsTask's empty-CIK branch, drop the dead
null-check in StoreCompanyFactsTask, log per-item failures in the
fetchAndStore* glue, and move Sqlite.init() into the preAction hook so
help/version don't load the native binding.

InvestmentOfferingHistorySchema.accession_number is now 20 chars to match
real SEC accession numbers (Postgres would have rejected the old 10).

package.json: rename the misnamed `publish` script to `release`, point
`bin` at `./dist/sec.js` so it ships with the published tarball, and move
concurrently to devDependencies. Delete the leftover `test_any.ts`
scratch file.

https://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ
@sroussey sroussey self-assigned this May 9, 2026
@sroussey sroussey requested a review from Copilot May 9, 2026 22:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens SEC data ingestion by centralizing bad/placeholder-field validation, improving fetch robustness (retries + safer caching), and making database operations more durable and operationally manageable (WAL mode, reset/close helpers), along with several parsing and CLI fixes.

Changes:

  • Centralized EDGAR “bad person field” detection into a shared module and updated form processors to use it.
  • Added retry/backoff and per-attempt timeout handling for SEC fetch jobs; improved file-cache safety via atomic writes and stricter cache-miss error handling.
  • Improved operational durability: SQLite WAL/NORMAL pragmas + DB close helper, new db reset command wiring, safer parsing helpers across form processors.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
test_any.ts Removed obsolete debug/test file.
src/util/db.ts Switch SQLite pragmas to WAL/NORMAL; add closeDb().
src/types/edgar/bad-data.ts New shared bad-data detection module and isBadPersonField().
src/task/submissions/fetchAndStoreSubmission.ts Add warning log on per-item pipeline failures.
src/task/index/FetchQuarterlyIndexRangeTask.ts Fix quarter range enumeration to be inclusive and start-quarter anchored.
src/task/facts/StoreCompanyFactsTask.ts Record “processed” even when facts array is empty.
src/task/facts/FetchCompanyFactsTask.ts Preserve date on early-return path.
src/task/facts/fetchAndStoreCompanyFacts.ts Add warning log on failures.
src/storage/investment-offering/InvestmentOfferingHistorySchema.ts Increase accession number max length.
src/sec/forms/Form.ts Use WeakMap keyed by constructor to avoid array-path collisions.
src/sec/forms/exempt-offerings/Form_D.storage.ts Use shared bad-data detection; add safer numeric/CIK parsing helpers; fix amount parsing.
src/sec/forms/exempt-offerings/Form_C.storage.ts Use shared bad-data detection; add safer portal CIK parsing.
src/sec/forms/exempt-offerings/Form_1_A.storage.ts Add country code resolution logic for phone normalization.
src/sec.ts Add shutdown cleanup (stop queues, close DB/PG pool); move SQLite init out.
src/fetch/SecFetchJob.ts Add retry/backoff, Retry-After support, and per-attempt timeouts.
src/fetch/SecFetchFileOutputCache.ts Atomic cache writes; Blob/Buffer handling fixes; don’t swallow non-ENOENT errors.
src/fetch/SecCachedFetchTask.ts Ensure response_type is set consistently before cache usage.
src/config/TestingDI.ts Register new in-memory repositories for tests.
src/config/setupAllDatabases.ts Include setup for new repositories.
src/config/resetAllDatabases.ts New helper to truncate all repositories (used by CLI reset).
src/config/DefaultDI.ts Register new repositories in default DI.
src/config/Constants.ts Make SEC User-Agent configurable via env var.
src/commands/index.ts Move SQLite initialization into command preAction hook.
src/cli/groups/fetch.ts Validate CIK CLI args via helper.
src/cli/groups/db.ts Implement db reset --confirm using new reset helper.
src/cli/GlobalOptions.ts Remove concurrency flag.
src/cli/GlobalOptions.test.ts Update tests for removed concurrency flag.
src/cli/cli.integration.test.ts Update help output expectations for removed concurrency flag.
package.json Point bin to dist build output; move concurrently to devDependencies; rename publish script.
bun.lock Lockfile updates due to dependency move/version resolution changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/fetch/SecFetchJob.ts Outdated
Comment on lines +16 to +18
const MAX_FETCH_ATTEMPTS = 4;
const INITIAL_BACKOFF_MS = 1_000;
const MAX_BACKOFF_MS = 30_000;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eMAX_FETCH_ATTEMPTS, INITIAL_BACKOFF_MS, and MAX_BACKOFF_MS are now read from SEC_FETCH_MAX_ATTEMPTS / SEC_FETCH_INITIAL_BACKOFF_MS / SEC_FETCH_MAX_BACKOFF_MS via a shared readPositiveIntEnv helper that validates with /^\d+$/ and falls back to the defaults on missing/invalid input.


Generated by Claude Code

Comment thread src/fetch/SecFetchJob.ts Outdated
Comment on lines +19 to +20
const DEFAULT_TIMEOUT_MS = Number(process.env.SEC_FETCH_TIMEOUT_MS ?? 60_000);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eDEFAULT_TIMEOUT_MS now goes through readPositiveIntEnv("SEC_FETCH_TIMEOUT_MS", 60_000), which requires the env var to match /^\d+$/ and falls back to the default otherwise. Non-numeric input no longer silently disables timeouts.


Generated by Claude Code

Comment thread src/fetch/SecFetchJob.ts Outdated
Comment on lines +121 to +123
const timeoutSignal =
DEFAULT_TIMEOUT_MS > 0 ? AbortSignal.timeout(DEFAULT_TIMEOUT_MS) : undefined;
const signal = combineSignals([context.signal, timeoutSignal]);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21e — replaced AbortSignal.timeout() with a per-attempt AbortController + setTimeout, and clearTimeout runs in the finally so the timer is released on success or failure. Timers no longer accumulate across attempts.


Generated by Claude Code


function parseCikSafely(raw: string | number | undefined | null): number {
if (raw === undefined || raw === null) return 0;
const parsed = Number.parseInt(String(raw).trim(), 10);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eparseCikSafely in Form_D now requires the trimmed string to match /^\d+$/ before calling Number(), so "123abc" returns 0 instead of 123.


Generated by Claude Code

Comment thread src/sec.ts Outdated
Comment on lines 23 to 25
process.exit(1);
}
throw e;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eprocess.exit(1) is gone; the SecCliConfigurationError branch now sets process.exitCode = 1 and stores the error, the finally runs cleanup, and any non-config error is rethrown after cleanup so shutdown actually executes.


Generated by Claude Code

Comment thread src/fetch/SecFetchJob.ts Outdated
Comment on lines +116 to +120
async execute(input: Input, context: IJobExecuteContext): Promise<Output> {
let lastError: unknown;
for (let attempt = 0; attempt < MAX_FETCH_ATTEMPTS; attempt++) {
// Per-attempt timeout so a hung TCP connection cannot pin a queue slot
// forever; respects the caller's abort signal as well.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in e5bb21e — three new tests in SecFetchJob.test.ts using a loopback Bun.serve: 429 with Retry-After: 0 retries until success on the third attempt, 503 retries at least once before failing, and 404 fails fast on the first attempt. While diagnosing them I also tightened isRetriableError to detect workglow's RetryableJobError by name/retryable flag and honor its retryDate field for Retry-After (the prior instanceof Error check was unreliable across module boundaries).


Generated by Claude Code

Comment thread src/cli/groups/fetch.ts Outdated
import { runCommand } from "../runCommand";

function parseCikArg(value: string): number {
const parsed = Number.parseInt(value, 10);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eparseCikArg now requires the trimmed argument to match /^\d+$/ before calling Number(), so "123abc" is rejected with the documented error rather than silently becoming 123.


Generated by Claude Code

*/
function parseIntegerOrNull(raw: string | undefined | null): number | null {
if (raw === undefined || raw === null || raw === "") return null;
const parsed = Number.parseInt(String(raw).trim(), 10);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eparseIntegerOrNull now requires the trimmed value to match /^-?\d+$/ before parsing, so "123abc" returns null instead of 123. Also widened the input type to string | number | null | undefined so number inputs (e.g. directly from EDGAR JSON) round-trip correctly.


Generated by Claude Code

Comment on lines +23 to +28
* missing/non-numeric. EDGAR occasionally emits non-digit cruft (whitespace,
* stray punctuation) that `parseInt` would silently turn into `NaN`.
*/
function parseCikSafely(raw: string | undefined | null): number {
if (!raw) return 0;
const parsed = Number.parseInt(String(raw).trim(), 10);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eparseCikSafely in Form_C now requires the trimmed value to match /^\d+$/ before parsing, so "123abc" returns 0 instead of 123.


Generated by Claude Code

Comment on lines +8 to +23
import { US_STATE_CODE_ARRAY } from "../../../storage/address/AddressSchemaCodes";
import { CompanyRepo } from "../../../storage/company/CompanyRepo";
import { hasCompanyEnding } from "../../../storage/company/CompanyNormalization";
import { PersonRepo } from "../../../storage/person/PersonRepo";
import { PhoneRepo } from "../../../storage/phone/PhoneRepo";

const US_STATE_CODE_SET = new Set<string>(US_STATE_CODE_ARRAY.map(([code]) => code));

/**
* EDGAR's `stateOrCountry` field stores either a 2-char US state code (e.g.
* "NY") or a 2-char ISO country code (e.g. "GB"). Both are 2 characters wide,
* so the country can only be inferred from set membership, not from length.
*/
function resolveCountryCode(stateOrCountry: string | undefined | null): string | undefined {
if (!stateOrCountry) return undefined;
return US_STATE_CODE_SET.has(stateOrCountry) ? "US" : stateOrCountry;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5bb21eresolveCountryCode now trims+uppercases the input, then resolves through COUNTRY_STATE_CODE_ARRAY: US states → "US", SEC codes → ISO via the SEC_CODE_TO_ISO map (e.g. "B3""AL", "X0""GB"), already-ISO inputs pass through, and unknown codes return undefined so PhoneRepo falls back to its own defaults instead of receiving a SEC code as a regionCode.


Generated by Claude Code

Tighten CIK/integer parsers in Form_C, Form_D, and the CLI fetch group to
require /^\d+$/ — parseInt would silently accept "123abc" as 123 and store
plausible-but-wrong CIKs.

Make SecFetchJob's retry/timeout knobs env-driven (SEC_FETCH_MAX_ATTEMPTS,
SEC_FETCH_INITIAL_BACKOFF_MS, SEC_FETCH_MAX_BACKOFF_MS, SEC_FETCH_TIMEOUT_MS)
with strict validation that falls back to defaults on missing/invalid values.
Replace AbortSignal.timeout() with a cancellable AbortController +
setTimeout/clearTimeout pair so timers don't leak in a high-throughput queue.
Extract sleepWithAbort() helper that detaches its abort listener on resolve.
Detect workglow's RetryableJobError via name/`retryable` flag (instanceof
Error is unreliable across module boundaries) and honor `retryDate` for
Retry-After. Add unit tests covering 429 + Retry-After retry, 5xx retry,
and 404 fail-fast.

Resolve Form_1_A country codes through COUNTRY_STATE_CODE_ARRAY so phone
parsing receives ISO 3166-1 alpha-2 (the documented contract for
PhoneSchema.country_code) rather than SEC's "B3"/"X0"-style codes.

In sec.ts, replace process.exit(1) with process.exitCode + rethrow so the
shutdown finally block actually runs, and wrap stopQueues/closeDb/closePgPool
in Promise.allSettled so a crashing cleanup step can't mask the primary
command failure.

https://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ
@sroussey sroussey merged commit be864ea into main May 9, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants