
perf(embeddings): cross-node batching + worker pool #33

Merged
theagenticguy merged 1 commit into main from feat/embeddings-parallel-batching on Apr 27, 2026

Conversation

@theagenticguy
Owner

Summary

  • Refactors the embeddings phase from one awaited embedding per node into two stages: a job-collection pass that walks the symbol/file/community tiers in canonical order, producing {text, emitRow} records, and a dispatch loop that fires workers × batchSize embeds concurrently per wave and scatters the vectors back into the row buffer (see the sketch after this list).
  • Adds a Piscina pool of independent OnnxEmbedder workers (packages/ingestion/src/pipeline/phases/embedder-{worker,pool}.ts). Each worker holds its own ONNX session; the pool is exposed behind an Embedder-shaped facade so the phase doesn't branch. A main-thread canary OnnxEmbedder opens first so EmbedderNotSetupError keeps its class identity across the structured-clone boundary.
  • New flags: --embeddings-workers <n|auto> and --embeddings-batch-size <n> (defaults: 1 and 32 — unchanged single-threaded behaviour out of the box).
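
For concreteness, here is a minimal TypeScript sketch of the dispatch stage. EmbedJob, dispatchWaves, and the worker filename are illustrative stand-ins, not the actual source; only the wave shape (workers × batchSize) and the scatter-in-job-order behaviour come from the description above.

```ts
// Illustrative only: EmbedJob and dispatchWaves are hypothetical names for
// the shapes the PR describes; embedder-worker.js stands in for the compiled
// packages/ingestion/src/pipeline/phases/embedder-worker.ts entry point.
import Piscina from "piscina";
import { resolve } from "node:path";

interface EmbedJob {
  text: string;                             // input passed to the embedder
  emitRow: (vector: Float32Array) => void;  // writes the vector into the row buffer
}

async function dispatchWaves(
  jobs: EmbedJob[],
  workers: number,
  batchSize: number,
): Promise<void> {
  // One independent OnnxEmbedder session per worker thread.
  const pool = new Piscina({
    filename: resolve(__dirname, "embedder-worker.js"),
    maxThreads: workers,
  });
  const wave = workers * batchSize;
  for (let i = 0; i < jobs.length; i += wave) {
    // Slice this wave into per-worker batches and run them concurrently.
    const batches: EmbedJob[][] = [];
    for (let j = i; j < Math.min(i + wave, jobs.length); j += batchSize) {
      batches.push(jobs.slice(j, j + batchSize));
    }
    const vectorsPerBatch: Float32Array[][] = await Promise.all(
      batches.map((batch) => pool.run(batch.map((job) => job.text))),
    );
    // Scatter vectors back in job order, not completion order, so row
    // placement never depends on worker scheduling.
    vectorsPerBatch.forEach((vectors, b) =>
      vectors.forEach((vec, k) => batches[b][k].emitRow(vec)),
    );
  }
  await pool.destroy();
}
```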

Motivation

Real-world codehub analyze --embeddings --force --granularity symbol,file,community on a ~1,922-file AWS codebase sat at 95% CPU for 7+ minutes before the refactor. The phase was awaiting embedBatch() per node inside a single-threaded ONNX session (intraOpNumThreads: 1, graphOptimizationLevel: "disabled" — required for the graphHash determinism contract), so there was no concurrency anywhere in the stack.
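
The two knobs named here are standard onnxruntime-node session options. As a hedged sketch of that setup (openDeterministicSession is a placeholder name, not the project's openOnnxEmbedder):

```ts
// Placeholder helper illustrating the deterministic session options quoted
// above; the function name and model-path handling are assumptions.
import * as ort from "onnxruntime-node";

async function openDeterministicSession(modelPath: string): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create(modelPath, {
    intraOpNumThreads: 1,               // single-threaded kernels: stable float reductions
    graphOptimizationLevel: "disabled", // no graph rewrites: stable graphHash
  });
}
```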

Determinism

The graphHash / embeddingsHash contract is preserved:

  • Canonical tier ordering (symbol → file → community) is unchanged.
  • Rows are still sorted by (granularity, nodeId, chunkIndex) before hashing; a minimal sketch of this ordering follows the list.
  • openOnnxEmbedder()'s deterministic knobs are intact per worker — which input produces which vector is independent of which worker ran it.
  • New regression test asserts embeddingsHash at batchSize=1 equals embeddingsHash at batchSize=32.
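
A hedged sketch of that ordering contract: the sort key is quoted from the list above, while EmbeddingRow, hashRows, and the SHA-256 digest choice are assumptions for illustration.

```ts
// Hypothetical row shape and hash; only the (granularity, nodeId, chunkIndex)
// sort key and the tier order come from the determinism notes above.
import { createHash } from "node:crypto";

interface EmbeddingRow {
  granularity: "symbol" | "file" | "community";
  nodeId: string;
  chunkIndex: number;
  vector: Float32Array;
}

const TIER = { symbol: 0, file: 1, community: 2 } as const;

function hashRows(rows: EmbeddingRow[]): string {
  // Sort by (granularity, nodeId, chunkIndex), content order rather than
  // worker-completion order, before folding rows into the digest.
  const sorted = [...rows].sort(
    (a, b) =>
      TIER[a.granularity] - TIER[b.granularity] ||
      (a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0) ||
      a.chunkIndex - b.chunkIndex,
  );
  const h = createHash("sha256");
  for (const row of sorted) {
    h.update(`${row.granularity}:${row.nodeId}:${row.chunkIndex}:`);
    h.update(Buffer.from(row.vector.buffer, row.vector.byteOffset, row.vector.byteLength));
  }
  return h.digest("hex");
}
```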

Expected speedup

On an M-series laptop with --embeddings-workers auto --embeddings-batch-size 32, the 7-minute AWSQuickWork run should drop to roughly 1–2 minutes. --embeddings-int8 cuts that further.

Test plan

  • pnpm build — clean
  • pnpm --filter @opencodehub/ingestion test — 576/576 pass
  • New test: embeddings.test.ts — batchSize=1 vs batchSize=32 produce byte-identical embeddingsHash
  • codehub analyze --help surfaces --embeddings-workers and --embeddings-batch-size
  • End-to-end: run codehub analyze AWSQuickWork --embeddings --force --granularity symbol,file,community --embeddings-workers auto and confirm wall time drop + identical embeddingsHash vs a single-threaded control run

@theagenticguy force-pushed the feat/embeddings-parallel-batching branch from 282cbce to 9c762e8 on April 27, 2026 14:41
The embeddings phase was pegged to one embedding per node per await,
behind a single-threaded ONNX session — an AWSQuickWork run sat at 95%
CPU for 7+ minutes on 1,922 files.

Refactor into two stages: walk tiers once to collect (text, emitRow)
jobs in canonical order, then dispatch in fixed-size batches across a
configurable Piscina pool of OnnxEmbedder workers. Each wave fires
workers × batchSize embeds concurrently and scatters vectors back into
the row buffer. Row ordering and the embeddingsHash contract are
preserved — confirmed by a new test that asserts byte-identical hashes
across batchSize=1 vs 32.

- New flags: --embeddings-workers <n|auto>, --embeddings-batch-size <n>.
- A main-thread canary OnnxEmbedder opens before the pool so
  EmbedderNotSetupError keeps its class identity across the
  structured-clone boundary.
- HTTP backend unaffected (pool flag ignored when endpoint is set).
@theagenticguy force-pushed the feat/embeddings-parallel-batching branch from 9c762e8 to 8bbf5b8 on April 27, 2026 15:20
@theagenticguy enabled auto-merge (squash) on April 27, 2026 15:30
@theagenticguy disabled auto-merge on April 27, 2026 15:30
@theagenticguy merged commit f8454b5 into main on Apr 27, 2026
19 checks passed
@theagenticguy deleted the feat/embeddings-parallel-batching branch on April 27, 2026 15:30
@github-actions bot mentioned this pull request on Apr 27, 2026
theagenticguy added a commit that referenced this pull request on May 1, 2026
@github-actions bot mentioned this pull request on May 11, 2026
