Skip to content

feat(lib): parse worker concurrency cap — piscina worker pool#71

Merged
thewrz merged 3 commits into
mainfrom
feat/issue-33-parse-worker-pool
May 18, 2026
Merged

feat(lib): parse worker concurrency cap — piscina worker pool#71
thewrz merged 3 commits into
mainfrom
feat/issue-33-parse-worker-pool

Conversation

@thewrz
Copy link
Copy Markdown
Contributor

@thewrz thewrz commented May 17, 2026

Summary

  • Adds a piscina worker thread pool capping concurrent parse operations at cpu_count - 1 threads — prevents memory exhaustion from concurrent large DOCX payloads
  • src/lib/parse-worker.ts — worker function (parseSec / parseDocx / parseText, no DB); dev uses tsx/esm loader via execArgv, prod uses compiled .js
  • src/lib/parse-pool.ts — pool singleton, dev/prod file resolution via import.meta.url.endsWith('.ts')
  • src/api/parse.tsdispatchParse removed; replaced with parsePool.run() + zero-copy buffer transfer via transferList
  • src/api/parse.test.ts — added vi.mock('../lib/parse-pool.js') so unit tests don't spawn live worker threads

Progress reporting coarsens slightly: per-chunk signals from inside parseDocx no longer surface, but start (10%) and complete (75%) checkpoints are preserved.

Test plan

  • pnpm test src/lib/parse-pool.test.ts — pool maxThreads unit test passes
  • pnpm test — all 424 unit tests pass
  • pnpm lint — ESLint + tsc + prettier clean
  • pnpm build — TypeScript compiles without errors
  • DATABASE_URL=... pnpm test:integration — end-to-end parse path exercises POST /parse → worker → DB

Closes #33

Summary by CodeRabbit

  • New Features

    • Parsing now runs in background worker threads, improving responsiveness and CPU utilization during document processing.
  • Refactor

    • Internal parsing pipeline migrated to a worker-pool design for more efficient concurrent processing and resource management; parsing results are now validated before finalizing.
  • Tests

    • Added tests for the worker pool and parse-worker behavior to improve reliability.
  • Chores

    • Added a runtime dependency to support the new worker-based parsing.

Review Change Stack

Add piscina worker pool capping concurrent parse at cpu_count-1 threads.
Moves parseSec/parseDocx/parseText into parse-worker.ts; main thread
retains DB work and coarse progress (10%/75%). Zero-copy buffer transfer
via transferList. parse.test.ts mocks the pool to avoid live workers.

Closes #33
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Walkthrough

A Piscina worker pool is added and exported; a parse-worker module dispatches parsing by file extension and returns { tree, capabilities? }; the main parse handler now calls parsePool.run(...), validates worker output, and updates progress. Tests mock or assert pool behavior.

Changes

Worker Pool for Bounded Concurrent Parsing

Layer / File(s) Summary
Worker pool dependency and configuration
package.json, src/lib/parse-pool.ts, src/lib/parse-pool.test.ts
Adds piscina to dependencies; exports a parsePool configured to use parse-worker.ts/.js, conditionally sets execArgv for TS runtime, and sets maxThreads = Math.max(1, os.cpus().length - 1); test asserts maxThreads >= 1.
Parse worker module with file-type dispatch
src/lib/parse-worker.ts
New worker entrypoint exports WorkerInput/WorkerOutput and default parseWorker that branches on ext: validates and parses .sec, decodes and parses .txt (returns capabilities), handles .docx, and throws on unsupported extensions.
Main parse flow routed through worker pool
src/api/parse.ts, src/api/parse.test.ts
src/api/parse.ts imports parsePool and WorkerOutput, introduces workerOutputSchema (Zod), and replaces in-process parsing with parsePool.run({ buffer, ext }), validating the worker result and adjusting progress updates; tests mock parsePool.run to supply a fixed worker result.

Sequence Diagram(s)

sequenceDiagram
  participant ParseHandler
  participant ParsePool
  participant ParseWorker
  ParseHandler->>ParsePool: run({ buffer, ext })
  ParsePool->>ParseWorker: execute parseWorker({ buffer, ext })
  ParseWorker-->>ParsePool: return { tree, capabilities? }
  ParsePool-->>ParseHandler: resolve worker result
  ParseHandler->>ParseHandler: validate workerOutputSchema & persist CsiTree
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

"I hopped through threads with a careful paw,
Piscina corralled the parsing law,
Workers hum, results return in tune,
No runaway memory—just orderly room. 🥕🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(lib): parse worker concurrency cap — piscina worker pool' clearly summarizes the main change: implementing a worker pool with concurrency capping.
Linked Issues check ✅ Passed All coding requirements from issue #33 are met: piscina added to dependencies, parse-worker.ts and parse-pool.ts created with correct concurrency cap, parse.ts modified to use parsePool.run().
Out of Scope Changes check ✅ Passed All changes are directly related to implementing the piscina worker pool for parse concurrency capping as specified in issue #33; no unrelated modifications detected.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/issue-33-parse-worker-pool

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/lib/parse-pool.test.ts (1)

5-7: ⚡ Quick win

Assert the exact thread-cap formula, not just a lower bound.

This test won’t catch regressions where maxThreads exceeds Math.max(1, os.cpus().length - 1). Please assert equality to the expected value so it verifies the acceptance criterion directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/parse-pool.test.ts` around lines 5 - 7, Replace the loose lower-bound
assertion with an exact equality check: compute expected = Math.max(1,
os.cpus().length - 1) and assert
expect(parsePool.options.maxThreads).toBe(expected); ensure the test
imports/uses the same os module and references parsePool.options.maxThreads so
the test fails if the thread-cap formula changes.
src/api/parse.ts (1)

114-116: ⚡ Quick win

Type onProgress with ParseStage to remove the cast.

Use stage: ParseStage in the helper signature so stage as ParseStage is unnecessary.

As per coding guidelines: "src/**/*.{ts,tsx}: Use TypeScript strict mode with no any, no as unknown as, and no type assertions across module boundaries".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/parse.ts` around lines 114 - 116, Change the onProgress helper to
accept stage: ParseStage instead of stage: string so you can remove the type
assertion; update the signature of onProgress(stage: ParseStage, pct: number):
void and call updateJob(jobId, { stage, pct, status: 'running' }); (ensure
ParseStage is imported/available in this module and that jobId/updateJob usages
remain unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/parse.ts`:
- Around line 119-123: Replace the unsafe cast of the worker response from
parsePool.run to WorkerOutput with runtime validation using a Zod schema: define
a Zod schema matching WorkerOutput (tree, capabilities, etc.), run
parsePool.run(...) as unknown, pass the result through schema.parse or
safeParse, and use the parsed value for tree and capabilities; on schema
failure, log or throw a clear error and avoid calling onProgress('classifying',
75) with invalid data. Locate the call to parsePool.run in parse.ts and the
WorkerOutput type to model the Zod schema and ensure all downstream uses (e.g.,
onProgress) receive validated data.

In `@src/lib/parse-worker.ts`:
- Around line 25-27: The current fallback always calls parseDocx(buffer) for any
unexpected ext; add an explicit guard using the ext value (e.g., in the function
that calls parseDocx) to reject unsupported extensions instead of falling
through to DOCX parsing: check the ext (or mime) early, allow only the supported
cases (e.g., ".docx") to call parseDocx(buffer) and for any other ext throw or
return an explicit error/ rejection so the caller fails deterministically rather
than parsing as DOCX.

---

Nitpick comments:
In `@src/api/parse.ts`:
- Around line 114-116: Change the onProgress helper to accept stage: ParseStage
instead of stage: string so you can remove the type assertion; update the
signature of onProgress(stage: ParseStage, pct: number): void and call
updateJob(jobId, { stage, pct, status: 'running' }); (ensure ParseStage is
imported/available in this module and that jobId/updateJob usages remain
unchanged).

In `@src/lib/parse-pool.test.ts`:
- Around line 5-7: Replace the loose lower-bound assertion with an exact
equality check: compute expected = Math.max(1, os.cpus().length - 1) and assert
expect(parsePool.options.maxThreads).toBe(expected); ensure the test
imports/uses the same os module and references parsePool.options.maxThreads so
the test fails if the thread-cap formula changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e6e5b87f-a856-4ee9-bd43-9d19b40de982

📥 Commits

Reviewing files that changed from the base of the PR and between 85ef92a and c6a88df.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (6)
  • package.json
  • src/api/parse.test.ts
  • src/api/parse.ts
  • src/lib/parse-pool.test.ts
  • src/lib/parse-pool.ts
  • src/lib/parse-worker.ts

Comment thread src/api/parse.ts Outdated
Comment thread src/lib/parse-worker.ts Outdated
@thewrz thewrz merged commit de00f6f into main May 18, 2026
5 checks passed
@thewrz thewrz deleted the feat/issue-33-parse-worker-pool branch May 18, 2026 04:09
thewrz added a commit that referenced this pull request May 18, 2026
Status table: add 1c-iii..1c-viii and 1c-sec-i/ii rows covering PRs
#69 #70 #71 #72 #74 #75 #76. Updates 'Active development' subtitle
to reflect Phase 1c being complete.

Parsing section: add plaintext signal hardening (#70), parse-anomaly
warnings (#75), and DOCX resilience suite (#72) bullets.

MCP section: note POST /mcp rate limiting (#69).

Not Yet Built: strike completed items (DOCX cross-ref extraction in
PR #76, parse worker concurrency cap in PR #71). Add new known gap:
REST persistTree ignores extracted refs (follow-up to #53).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(security): parse worker concurrency cap — piscina, follow-up to #23

1 participant