
perf(embeddings): cross-node batching + worker pool #33

Merged
theagenticguy merged 1 commit into main from feat/embeddings-parallel-batching on Apr 27, 2026

Conversation

@theagenticguy
Owner

Summary

  • Refactors the embeddings phase from one awaited embedding per node into two stages: a job-collection pass that walks the symbol/file/community tiers in canonical order, producing {text, emitRow} records, and a dispatch loop that fires workers × batchSize embeds concurrently per wave and scatters the vectors back into the row buffer (see the sketch after this list).
  • Adds a Piscina pool of independent OnnxEmbedder workers (packages/ingestion/src/pipeline/phases/embedder-{worker,pool}.ts). Each worker holds its own ONNX session; the pool is exposed behind an Embedder-shaped facade so the phase doesn't branch. A main-thread canary OnnxEmbedder opens first so EmbedderNotSetupError keeps its class identity across the structured-clone boundary.
  • New flags: --embeddings-workers <n|auto> and --embeddings-batch-size <n> (defaults: 1 and 32 — unchanged single-threaded behaviour out of the box).
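
For concreteness, here is a minimal TypeScript sketch of the dispatch stage. EmbedJob, dispatchWaves, and the worker filename are illustrative stand-ins, not the actual source; only the wave shape (workers × batchSize) and the scatter-in-job-order behaviour come from the description above.

```ts
// Illustrative only: EmbedJob and dispatchWaves are hypothetical names for
// the shapes the PR describes; embedder-worker.js stands in for the compiled
// packages/ingestion/src/pipeline/phases/embedder-worker.ts entry point.
import Piscina from "piscina";
import { resolve } from "node:path";

interface EmbedJob {
  text: string;                             // input passed to the embedder
  emitRow: (vector: Float32Array) => void;  // writes the vector into the row buffer
}

async function dispatchWaves(
  jobs: EmbedJob[],
  workers: number,
  batchSize: number,
): Promise<void> {
  // One independent OnnxEmbedder session per worker thread.
  const pool = new Piscina({
    filename: resolve(__dirname, "embedder-worker.js"),
    maxThreads: workers,
  });
  const wave = workers * batchSize;
  for (let i = 0; i < jobs.length; i += wave) {
    // Slice this wave into per-worker batches and run them concurrently.
    const batches: EmbedJob[][] = [];
    for (let j = i; j < Math.min(i + wave, jobs.length); j += batchSize) {
      batches.push(jobs.slice(j, j + batchSize));
    }
    const vectorsPerBatch: Float32Array[][] = await Promise.all(
      batches.map((batch) => pool.run(batch.map((job) => job.text))),
    );
    // Scatter vectors back in job order, not completion order, so row
    // placement never depends on worker scheduling.
    vectorsPerBatch.forEach((vectors, b) =>
      vectors.forEach((vec, k) => batches[b][k].emitRow(vec)),
    );
  }
  await pool.destroy();
}
```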

Motivation

Real-world codehub analyze --embeddings --force --granularity symbol,file,community on a ~1,922-file AWS codebase sat at 95% CPU for 7+ minutes before the refactor. The phase was awaiting embedBatch() per node inside a single-threaded ONNX session (intraOpNumThreads: 1, graphOptimizationLevel: "disabled" — required for the graphHash determinism contract), so there was no concurrency anywhere in the stack.
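
The two knobs named here are standard onnxruntime-node session options. As a hedged sketch of that setup (openDeterministicSession is a placeholder name, not the project's openOnnxEmbedder):

```ts
// Placeholder helper illustrating the deterministic session options quoted
// above; the function name and model-path handling are assumptions.
import * as ort from "onnxruntime-node";

async function openDeterministicSession(modelPath: string): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create(modelPath, {
    intraOpNumThreads: 1,               // single-threaded kernels: stable float reductions
    graphOptimizationLevel: "disabled", // no graph rewrites: stable graphHash
  });
}
```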

Determinism

The graphHash / embeddingsHash contract is preserved:

  • Canonical tier ordering (symbol → file → community) is unchanged.
  • Rows are still sorted by (granularity, nodeId, chunkIndex) before hashing; a minimal sketch of this ordering follows the list.
  • openOnnxEmbedder()'s deterministic knobs are intact per worker — which input produces which vector is independent of which worker ran it.
  • New regression test asserts embeddingsHash at batchSize=1 equals embeddingsHash at batchSize=32.
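
A hedged sketch of that ordering contract: the sort key is quoted from the list above, while EmbeddingRow, hashRows, and the SHA-256 digest choice are assumptions for illustration.

```ts
// Hypothetical row shape and hash; only the (granularity, nodeId, chunkIndex)
// sort key and the tier order come from the determinism notes above.
import { createHash } from "node:crypto";

interface EmbeddingRow {
  granularity: "symbol" | "file" | "community";
  nodeId: string;
  chunkIndex: number;
  vector: Float32Array;
}

const TIER = { symbol: 0, file: 1, community: 2 } as const;

function hashRows(rows: EmbeddingRow[]): string {
  // Sort by (granularity, nodeId, chunkIndex), content order rather than
  // worker-completion order, before folding rows into the digest.
  const sorted = [...rows].sort(
    (a, b) =>
      TIER[a.granularity] - TIER[b.granularity] ||
      (a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0) ||
      a.chunkIndex - b.chunkIndex,
  );
  const h = createHash("sha256");
  for (const row of sorted) {
    h.update(`${row.granularity}:${row.nodeId}:${row.chunkIndex}:`);
    h.update(Buffer.from(row.vector.buffer, row.vector.byteOffset, row.vector.byteLength));
  }
  return h.digest("hex");
}
```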

Expected speedup

On an M-series laptop with --embeddings-workers auto --embeddings-batch-size 32, the 7-minute AWSQuickWork run should drop to roughly 1–2 minutes. --embeddings-int8 cuts that further.

Test plan

  • pnpm build — clean
  • pnpm --filter @opencodehub/ingestion test — 576/576 pass
  • New test: embeddings.test.ts — batchSize=1 vs batchSize=32 produce byte-identical embeddingsHash
  • codehub analyze --help surfaces --embeddings-workers and --embeddings-batch-size
  • End-to-end: run codehub analyze AWSQuickWork --embeddings --force --granularity symbol,file,community --embeddings-workers auto and confirm wall time drop + identical embeddingsHash vs a single-threaded control run

@theagenticguy force-pushed the feat/embeddings-parallel-batching branch from 282cbce to 9c762e8 on April 27, 2026 14:41
The embeddings phase was pegged to one embedding per node per await,
behind a single-threaded ONNX session — an AWSQuickWork run sat at 95%
CPU for 7+ minutes on 1,922 files.

Refactor into two stages: walk tiers once to collect (text, emitRow)
jobs in canonical order, then dispatch in fixed-size batches across a
configurable Piscina pool of OnnxEmbedder workers. Each wave fires
workers × batchSize embeds concurrently and scatters vectors back into
the row buffer. Row ordering and the embeddingsHash contract are
preserved — confirmed by a new test that asserts byte-identical hashes
across batchSize=1 vs 32.

- New flags: --embeddings-workers <n|auto>, --embeddings-batch-size <n>.
- A main-thread canary OnnxEmbedder opens before the pool so
  EmbedderNotSetupError keeps its class identity across the
  structured-clone boundary.
- HTTP backend unaffected (pool flag ignored when endpoint is set).
@theagenticguy force-pushed the feat/embeddings-parallel-batching branch from 9c762e8 to 8bbf5b8 on April 27, 2026 15:20
@theagenticguy enabled auto-merge (squash) on April 27, 2026 15:30
@theagenticguy disabled auto-merge on April 27, 2026 15:30
@theagenticguy merged commit f8454b5 into main on Apr 27, 2026
19 checks passed
@theagenticguy deleted the feat/embeddings-parallel-batching branch on April 27, 2026 15:30
@github-actions bot mentioned this pull request on Apr 27, 2026
theagenticguy added a commit that referenced this pull request on May 1, 2026
@github-actions bot mentioned this pull request on May 11, 2026
