
add public reusable for qvac-cli #2

Merged
Proletter merged 1 commit into main from qvac-cli-integration
Jan 8, 2026

Conversation

@Proletter

Collaborator

No description provided.

@Proletter Proletter merged commit 2c4ea2a into main Jan 8, 2026
maxim-smotrov added a commit that referenced this pull request Feb 28, 2026
maxim-smotrov added a commit that referenced this pull request Feb 28, 2026
maxim-smotrov added a commit that referenced this pull request Mar 1, 2026
maxim-smotrov added a commit that referenced this pull request Mar 4, 2026
maxim-smotrov added a commit that referenced this pull request Mar 4, 2026
gianni-cor added a commit that referenced this pull request Mar 5, 2026
* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Minor fixes for clarity

* Removing unused vars

* Removing unused imports

* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>
maxim-smotrov added a commit that referenced this pull request Apr 10, 2026
donriddo added a commit to donriddo/qvac that referenced this pull request Apr 17, 2026
Previous commit 979a070 reworded only my own addition (line 251) but
the block still failed at the same position because the surrounding
pre-existing message bodies still used ; as a statement separator.
Mermaid sequenceDiagram parses ; as end-of-statement, so every message
containing it broke the diagram.

Replace ; with , or a separator word across all four affected lines
(block #1 lines 251, 256, 266 and block #2 line 296) so the finetune
and pause flow diagrams render on GitHub.
gianni-cor added a commit that referenced this pull request Apr 17, 2026
…ightsProvider (#1494)

* chore[bc]: remove BaseInference inheritance and WeightsProvider from LLM addon

Replace class inheritance with composable utilities from @qvac/infer-base@0.4.0:
- createJobHandler() for single-job lifecycle management
- exclusiveRunQueue() for run serialization
- Direct shard streaming via bare-fs instead of WeightsProvider

Constructor now takes { files: { model: string[], projectionModel?: string }, config, logger, opts }
instead of { loader, diskPath, modelName, projectionModel } + config.

All finetune, media, and filtered logger functionality preserved.

* fix: correct FinetuneProgress and finetune terminal handling in output callback

FinetuneProgress must call updateStats(data.stats), not updateOutput(data).
Finetune terminal JobEnded must call ended(data) as result, not updateStats.

* fix: update all LLM examples and model-loading test to new constructor shape

Update 13 examples and sharded model test to use files: { model: [...] } pattern.
Remove FilesystemDL dependency from all examples and tests.

* fix: update sharded model test to download shards to disk first

The network loader test used the old loader-based constructor.
Rewritten to download shards via HttpDL to disk, then pass absolute paths.

* fix: update LLM benchmark tooling to new constructor shape

* fix: update LLM perf benchmark sweep and judge to new constructor shape

* docs: update LLM README, finetuning, and afriquegemma docs for new constructor

* fix: update LLM prepare-prompts and verify-prompts to new constructor

* fix: update LLM finetuning unit tests to new constructor and exclusiveRunQueue

* docs: update LLM architecture, data-flows, finetuning, README sharded contract

* docs: align LLM finetuning docs and mobile README with new constructor

* chore[bc]: address PR #1494 review findings and bump to 0.15.0

Bumps `@qvac/llm-llamacpp` to `0.15.0` per the addon-changelog
process — minor bump on a pre-1.0 package signals the breaking
constructor change to consumers using semver ranges. Adds the
matching `0.15.0` block to `CHANGELOG.md` documenting the new
single-object constructor with `files`, the removal of
`BaseInference` + `WeightsProvider`, the dropped `destroy()`
method, the dependency churn, and every behaviour change in this
release.

Hardens the JS layer based on the review:

- Constructor now throws a clear `TypeError` when `files` /
  `files.model` is missing or empty, instead of crashing with an
  opaque "cannot read properties of undefined" later.
- `_runInternal` now throws "Addon not initialized. Call load()
  first." when invoked before `load()`, matching `finetune()` and
  the diffusion addon.
- `_load()` wraps `_streamShards` + `addon.activate()` in a
  try/catch that best-effort-unloads the partially-initialized
  native instance and resets `this.addon = null` so a subsequent
  `load()` does not leak a zombie addon.
- `createJobHandler({ cancel })` closure uses optional chaining so
  a stale `response.cancel()` after `unload()` is a no-op rather
  than a `TypeError`.
- `unload()` sets `this.addon = null` after `addon.unload()`, so
  the new `if (!this.addon)` guard in `_runInternal` is also
  effective post-unload.
- `pause()` and `cancel()` re-add the defensive `?.cancel` check.
- The `_load()` primary-path selection now picks the first entry
  matching the shard regex, replacing the fragile `[length - 1]`
  index. This stays compatible with the documented sharded order
  (`tensors.txt` first, shards second) and with the non-sharded
  single-file path; an inline comment explains the contract.
- The `_handleAddonOutputEvent` error log line now passes the
  `Error` object directly so loggers can format the full stack.
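
The constructor check and the run-before-load guard from the list above can be sketched roughly as follows. This is an illustration only, not the addon's code: the class name `ValidatedModel` and the `TypeError` wording are assumptions, and only the "Addon not initialized. Call load() first." message is taken from the commit text.

```javascript
'use strict'

// Hypothetical stand-in for the addon class; not the real LlmLlamacpp.
class ValidatedModel {
  constructor ({ files, config, logger, opts } = {}) {
    // Fail fast with a clear TypeError instead of an opaque
    // "cannot read properties of undefined" later on.
    if (!files || !Array.isArray(files.model) || files.model.length === 0) {
      throw new TypeError('files.model must be a non-empty array of paths')
    }
    this.files = files
    this.config = config
    this.logger = logger
    this.opts = opts
    this.addon = null // set by load(), reset to null by unload()
  }

  async _runInternal (prompt) {
    // Effective both before load() and after unload(), since both leave
    // this.addon as null.
    if (!this.addon) {
      throw new Error('Addon not initialized. Call load() first.')
    }
    return this.addon.runJob(prompt)
  }
}
```

Because `unload()` resets `this.addon` to `null`, the same guard covers the post-unload case without a separate flag.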

Drops dead `_isSuppressedNoResponseLog` /
`_createFilteredLogger` / `_originalLogger` plumbing. Those
existed to swallow `'No response found for job'` warnings emitted
by the old `BaseInference._jobToResponse` Map; the new
`createJobHandler`-based architecture cannot emit that message,
so the filter, the wrapped logger, and the `_originalLogger`
indirection are all gone. The user-supplied logger is now used
directly.

Restores JSDoc on every `FinetuneOptions` field in `index.d.ts`,
including default values (`numberOfEpochs = 1`,
`learningRate = 1e-4`, `batchSize = 128`, …) so IDE tooltips show
them without needing to read `docs/finetuning.md`.

* refactor: move LLM C++ event normalization into addon.js

Per the team-2 task doc (`TD-ADDON-INTERFACE-LLM-EMBED-SD.md`,
LLM section): "Move event name normalization from `index.js`
`_addonOutputCallback` into `addon.js` `LlamaInterface` — the
native binding wrapper should own the mapping from raw C++ events
to Output / Error / JobEnded / FinetuneProgress."

Adds `mapAddonEvent(rawEvent, data, error, state)` as a free
export from `addon.js`, co-located with `LlamaInterface`. The
function normalizes the C++-mangled event vocabulary into one of
`Output` / `Error` / `JobEnded` / `FinetuneProgress`, including:

- TPS-shaped runtime stats → JobEnded with `backendDevice`
  mapped from `0/1` to `'cpu'/'gpu'`.
- Finetune terminal payloads (`{op:'finetune', status, stats?}`)
  → JobEnded carrying the finetune payload, and arms the skip
  flag so the trailing TPS stats from the finetune are not
  dispatched as a fresh inference terminal.
- `finetune_progress` payloads → FinetuneProgress.
- Anything else with an `Error`-flavored event name → Error.
- String payloads → Output.

`LlmLlamacpp._addonOutputCallback` becomes a thin shim that
imports `mapAddonEvent`, hands it the per-instance state object
(now `this._addonEventState = { skipNextRuntimeStats }` instead
of the bare `_skipNextRuntimeStats` field), and forwards the
mapped event to `_handleAddonOutputEvent`.

Stateful flag lives on the model so unit tests can still poke at
it via `model._addonEventState.skipNextRuntimeStats`. Updated all
9 references in `test/unit/finetuning.test.js`. All 31 unit
tests still pass; lint and dts checks clean.
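
A simplified sketch of the mapping described above, assuming a return shape of `{ type, data, error }` and `null` for dropped events. The detection rules (a `TPS` key for runtime stats, a `type: 'finetune_progress'` field, an event name containing `Error`) are guesses made for illustration; the real `mapAddonEvent` in `addon.js` may differ.

```javascript
'use strict'

// Simplified sketch; detection rules and the return shape are assumptions.
function mapAddonEvent (rawEvent, data, error, state) {
  const isObject = data !== null && typeof data === 'object'

  if (isObject && data.op === 'finetune') {
    // Finetune terminal: arm the flag so the trailing TPS stats are skipped.
    state.skipNextRuntimeStats = true
    return { type: 'JobEnded', data, error: null }
  }

  if (isObject && data.type === 'finetune_progress') {
    return { type: 'FinetuneProgress', data, error: null }
  }

  if (isObject && 'TPS' in data) {
    if (state.skipNextRuntimeStats) {
      state.skipNextRuntimeStats = false
      return null // dropped: belongs to the finetune that just ended
    }
    // Translate the numeric device flag (0/1) into 'cpu'/'gpu'.
    const backendDevice = data.backendDevice === 1 ? 'gpu' : 'cpu'
    return { type: 'JobEnded', data: { ...data, backendDevice }, error: null }
  }

  if (typeof rawEvent === 'string' && rawEvent.includes('Error')) {
    return { type: 'Error', data, error }
  }

  if (typeof data === 'string') {
    return { type: 'Output', data, error: null }
  }

  return { type: rawEvent, data, error } // default fall-through
}
```

Passing `state` in as a parameter keeps the function free of instance references, which is what lets unit tests drive it with a plain object.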

Also fixes the misleading JSDoc on `LlamaInterface.loadWeights`:
the native binding reads the JS property name `chunk` (verified
in `qvac-lib-inference-addon-cpp/JsBlobsStream.hpp::appendBlob`,
lines 41–42 and 66–67), not `contents`. The C++ local variable
is named `contents`, which is what the proposal text was
referencing — but the on-the-wire JS property name is `chunk`
and the JS layer call sites are correct.

* fix: address PR #1494 second-round review findings

1. `test/integration/http-loader.js` no longer extends
   `@qvac/dl-base`. The base class was only providing a `close()`
   shim around `_close()`, and the package's devDependencies no
   longer list `@qvac/dl-base` after the loader-removal refactor.
   The helper now stands on its own — `getStream()` and `close()`
   are the only methods the sharded model-loading test calls, so
   the rest of the BaseDL surface (including the unused
   `getFileSize` and `list`) is dropped. Removes the dangling
   require that would break a clean install of this package and
   block the sharded test in CI.

2. `examples/multiModal.js` no longer passes `content: imageFilePath`
   on the second `media` message. The native binding only accepts
   `Uint8Array` payloads on `media` messages — file paths were
   silently broken after the loader removal. The example now
   reuses the same `imageBuffer` for both inferences and uses a
   different prompt on the second one to keep the example
   pedagogically distinct.

3. `index.d.ts` `AddonMessage` now exposes the optional
   `generationParams?: GenerationParams` field. The runtime path
   in `LlmLlamacpp._runInternal` already serializes this field
   onto every text message it forwards through `addon.runJob`,
   but the published transport type omitted it — IDE consumers
   building their own message-shaped payloads would lose the
   per-call overrides. The field documents that it is forwarded
   from `RunOptions.generationParams` and is the canonical way
   to vary sampling per request without re-loading the model.

* fix: extract pickPrimaryGgufPath, restore multiModal example, fix docs

- Extract shard-picker logic into named pickPrimaryGgufPath() with unit
  tests documenting the contract (tensors.txt-first ordering, single-file
  fallback). Move SHARD_REGEX inside the function.
- Revert multiModal.js to original: first inference uses Uint8Array,
  second uses string path. Both C++ code paths work. Remove false comment
  claiming file paths are not supported.
- Restore stripped JSDoc on FinetuneValidationSplit.fraction and
  FinetuneValidationDataset.path in index.d.ts.
- Fix docs/architecture.md and docs/data-flows-detailed.md: 4 occurrences
  incorrectly said "last" shard is the primary path; actual code picks
  the first shard regex match.
- Hardcode shard filenames in model-loading integration test instead of
  generating them via regex.
- Add network streaming capability loss note to CHANGELOG.
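
The shard-picker contract (first regex match wins, single-file fallback) might look like the sketch below. The shard pattern assumes the `-00001-of-00005.gguf` naming convention and the fallback to the first entry is a guess; neither is confirmed by the commit text.

```javascript
'use strict'

// Sketch only; the regex and the fallback rule are assumptions.
function pickPrimaryGgufPath (modelPaths) {
  // Kept inside the function, as the commit describes.
  const SHARD_REGEX = /-\d{5}-of-\d{5}\.gguf$/
  const firstShard = modelPaths.find((p) => SHARD_REGEX.test(p))
  // Non-sharded models carry a single file, which is then the primary path.
  return firstShard || modelPaths[0]
}

// Documented sharded order: tensors.txt first, shards second.
const sharded = [
  '/models/demo.tensors.txt',
  '/models/demo-00001-of-00002.gguf',
  '/models/demo-00002-of-00002.gguf'
]
console.log(pickPrimaryGgufPath(sharded))
```

Selecting by pattern rather than by position means the result does not change if the caller appends extra files after the shards.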

* fix: correct version in architecture.md and remove stale dl-filesystem benchmark dep

- docs/architecture.md header: v0.14.3 → v0.15.0 to match package.json
- benchmarks/performance/package.json: remove @qvac/dl-filesystem (no
  longer used after FilesystemDL references were removed from all
  benchmark JS files)

* fix: align _hasActiveResponse clearing with embed pattern

Remove the synchronous clear in _handleAddonOutputEvent on JobEnded/Error.
The .finally() on response.await() already clears the flag when the response
promise settles, and exclusiveRunQueue serializes _runInternal so the next
call cannot race the current one. Matches the embed addon's pattern, where
.finally() is the sole clear path outside of unload().
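
As a rough illustration of the single clear path, assuming a response object that exposes an `await()` method returning a promise (the `ResponseTracker` class is hypothetical):

```javascript
'use strict'

// Hypothetical helper; only the _hasActiveResponse name comes from the commit.
class ResponseTracker {
  constructor () {
    this._hasActiveResponse = false
  }

  // `response` is any object exposing await(): Promise (assumed shape).
  track (response) {
    this._hasActiveResponse = true
    // Sole clear path: runs whether the promise resolves or rejects.
    return response.await().finally(() => {
      this._hasActiveResponse = false
    })
  }
}
```

Since `finally()` runs on both resolution and rejection, no separate clear is needed in the JobEnded or Error branches.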

* fix: throw on second load(), log rejected responses, add mapAddonEvent unit test

- load(): throw if already loaded. Caller must unload() first. Aligns
  with the team consensus (Yury/Gianfranco/Gustavo) — silent reload
  masks caller bugs. unload() already clears configLoaded.
- _runInternal / finetune: replace silent `finalized.catch(() => {})`
  with a warn-level log so rejected responses are not swallowed when
  the caller does not await.
- test/unit/map-addon-event.test.js: new unit test covering TPS stats
  mapping + backendDevice translation, skipNextRuntimeStats dropping,
  finetune terminal + skip-flag arming, finetune_progress, Error event,
  string-as-token Output, and default fall-through.
- CHANGELOG 0.15.0: document the load() throw.

* fix: restore JSDoc on run() that was dropped during BaseInference removal

The JSDoc documenting run()'s prompt and runOptions parameters was
accidentally removed during the BaseInference removal refactor when
run() was split into run() + _runInternal(). Restore it on the public
run() method, and reference the full RunOptions type (which already
documents prefill / generationParams / cacheKey / saveCacheToDisk in
index.d.ts) so the docs stay authoritative in one place.

* fix: migrate afriquegemma-edge-cases test to new addon constructor

The afriquegemma-edge-cases.test.js file came in via the upstream/main
merge but still used the pre-refactor constructor shape:
  new LlmLlamacpp({ loader, modelName, diskPath, ... }, config)
with a FilesystemDL loader. All 7 tests in the file are now migrated to:
  new LlmLlamacpp({ files: { model: [path.join(dirPath, modelName)] },
                    config, logger, opts })
Removed FilesystemDL import and all loader.close() calls. Added
isMobile skip flag matching the pattern in afriquegemma-translation.

Caught by the qvac-staff-code-reviewer agent as a "merge brought in a
new consumer of the old API" — restore-the-class issue across the family.

* fix: make load() idempotent when already loaded

Second load() on an already-loaded instance returns immediately instead
of throwing. Matches the ReadyResource pattern used elsewhere in QVAC:
open/load is idempotent; explicit unload() is required to swap weights.

CHANGELOG updated.

* test: regenerate mobile integration auto.cjs

Integration test files were touched during the refactor and the
generated mobile harness was not regenerated. `npm run test:mobile:generate`
output committed so `validate-mobile-tests.js` passes.

* doc: document missing breaking changes from BaseInference removal

Address feedback to report all breaking changes from the BaseInference
refactor, not just the constructor shape:

- getState() narrows from {configLoaded, weightsLoaded, destroyed}
  to {configLoaded} only
- LlmLlamacpp public methods removed: downloadWeights, unpause, stop,
  status, destroy, getApiDefinition (destroy was already mentioned;
  other five were missing)
- load() takes no arguments (was (closeLoader, onDownloadProgress))
- Type exports removed from index.d.ts: ReportProgressCallback,
  Loader, DownloadWeightsOptions, DownloadResult

Also fix the stale (0.15.0) version marker in the AFTER code block.

* fix: address lifecycle, validation, and CI-surface review findings

- load() now runs through `this._run()` so concurrent calls on the same
  instance serialize instead of racing past the `configLoaded` guard.
  Two overlapping loads could previously both allocate a native addon
  and clobber `this.addon`, leaking one native handle.
- Constructor now validates each `files.model` entry with
  `path.isAbsolute()` and applies the same check to the optional
  `files.projectionModel` (which previously had no validation at all).
  Relative paths are rejected at construction time instead of bubbling
  up from bare-fs / native load.
- `pickPrimaryGgufPath` is now declared in `index.d.ts` so the TS
  surface matches the CommonJS export at `index.js`.
- Add `test:unit` and `test:unit:generate` scripts that run the JS
  unit tests under `test/unit/*.test.js` via brittle + bare. Wire
  `test:unit` into `test:all` and into the PR workflow's ts-checks
  job so `map-addon-event.test.js`, `pick-primary-gguf-path.test.js`,
  and the pre-existing `finetuning.test.js` all run on every PR.
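
The load() serialization from the first item can be sketched with a promise-chain queue. Everything here is hypothetical: `createExclusiveQueue` stands in for whatever `exclusiveRunQueue()` from `@qvac/infer-base` actually does, and `QueuedModel` with its `nativeHandlesCreated` counter exists only to make the race observable.

```javascript
'use strict'

// Hypothetical promise-chain queue; the real exclusiveRunQueue() may be
// implemented differently.
function createExclusiveQueue () {
  let tail = Promise.resolve()
  return function run (task) {
    const result = tail.then(() => task())
    // Keep the chain alive even when a task rejects.
    tail = result.catch(() => {})
    return result
  }
}

class QueuedModel {
  constructor () {
    this._run = createExclusiveQueue()
    this.configLoaded = false
    this.nativeHandlesCreated = 0
  }

  load () {
    return this._run(async () => {
      if (this.configLoaded) return // already loaded: nothing to do
      // Simulated asynchronous native load.
      await new Promise((resolve) => setTimeout(resolve, 10))
      this.nativeHandlesCreated++
      this.configLoaded = true
    })
  }
}
```

With the queue in place, two overlapping `load()` calls allocate one handle: the second task starts only after the first has set `configLoaded`, so it returns early.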

* doc: add CHANGELOG entries for load() serialization and absolute-path validation

* fix[ci]: run test:unit via run-lint-and-unit-tests action

Replace my hand-rolled test:unit step (which invoked `bare` in a job
that never installs it) with the existing run-lint-and-unit-tests
external action. Same pattern qvac-lib-infer-onnx and ocr-onnx already
use. The action installs bare globally and runs
`npm run test:unit --if-present`.

Also chain test:unit into the `test` script for local dev convenience,
matching the standalone-repo precedent (qvac-lib-inference-addon-base,
qvac-lib-dl-filesystem, etc.).

* doc: fix mermaid parsing errors in architecture.md and finetuning.md

architecture.md:159 — mermaid classDiagram uses { } as class-body
delimiters; the inline destructured-object syntax in the constructor
signature broke parsing. Replace with the canonical named type
`LlmLlamacppArgs` from index.d.ts so the class diagram renders.

finetuning.md:251 — sequence-diagram message contained `(_run)` and
`_hasActiveResponse` where the leading underscore was being
interpreted as mermaid italic-open, and slashes in
`validationSplit/useEvalDatasetForValidation/evalDatasetPath` made
the message ambiguous. Reword to use prose-style commas and drop the
leading-underscore identifiers.

Reported by maxim-smotrov.

* chore[ci]: rename step to reflect what the action actually runs

The run-lint-and-unit-tests action runs `npm run lint` and
`npm run test:unit` (and installs bare in between). The step name
"Run JavaScript tests" hides the lint half. Rename to
"Run lint and unit tests" and update the step id accordingly.

* fix: readme, finetune lifecycle, multimodal type

README quickstart, sharded, and OCR examples now use `path.resolve('./models')`
so the resulting `files.model` entries and `files.projectionModel` are
absolute. The refactored constructor rejects relative paths, which meant
the README snippets threw `TypeError` when copied verbatim.
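
A minimal sketch of the absolute-path pattern, using Node's `path` module and placeholder file names (`model.gguf`, `mmproj.gguf`); the README's actual file names are not reproduced here.

```javascript
'use strict'
const path = require('path')

// Directory and file names are placeholders.
const modelDir = path.resolve('./models')
const files = {
  model: [path.join(modelDir, 'model.gguf')],
  projectionModel: path.join(modelDir, 'mmproj.gguf')
}

// Mirrors the constructor's check: every entry must be absolute.
const allAbsolute = [...files.model, files.projectionModel]
  .every((p) => path.isAbsolute(p))
console.log(allAbsolute) // true
```

`path.resolve()` anchors the directory to the current working directory, so the result depends on where the script is launched from.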

`finetune()` moves the `!this.addon` readiness check and the
`_checkpointSaveDir` assignment inside the `this._run(...)` closure,
matching the pattern `run()` uses via `_runInternal`. If `unload()` is
already queued ahead of `finetune()`, the guard now runs after
`unload()` nulls `this.addon` instead of before, so the caller gets the
intended "Call load() first." error rather than a null-dereference
crash inside the queued body.

`UserMediaMessage.content` widens from `Uint8Array` to `Uint8Array | string`.
The C++ layer has always accepted both (raw bytes go through `parseMedia`;
string paths go through `loadMedia` in LlamaModel.cpp), and the OCR /
multimodal examples exercise the string-path form. The d.ts was
inadvertently narrower than the runtime contract.

* fix: preserve LogMsg event name in mapAddonEvent

Native `JsLogMsgOutputHandler` emits log events whose payload is a
plain string (`js::String::create(env, logMsg)`). The old mapping had
a generic `typeof rawData === 'string'` fallback that remapped every
string-payload event to `Output`, so any native LogMsg was quietly
pushed into the job output stream instead of the logger. The
`_handleAddonOutputEvent` branch that routes `LogMsg` to
`this.logger.info()` was therefore unreachable.

Check the `LogMsg` event name before the string-to-Output fallback so
log messages keep their type and reach the logger. Add a unit test
covering the precedence.
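
The precedence fix comes down to check order. A sketch, assuming the raw event name contains the substring `LogMsg` and that mapped events are `{ type, data }` objects (`classifyStringEvent` is a hypothetical name):

```javascript
'use strict'

// Sketch of the ordering only; the return shape is an assumption.
function classifyStringEvent (rawEvent, rawData) {
  // Event name wins: native log lines must reach the logger.
  if (String(rawEvent).includes('LogMsg')) {
    return { type: 'LogMsg', data: rawData }
  }
  // Fallback: any other string payload is treated as generated output.
  if (typeof rawData === 'string') {
    return { type: 'Output', data: rawData }
  }
  return { type: rawEvent, data: rawData }
}
```

Swapping the two `if` blocks reproduces the old bug, where every string payload, log lines included, was mapped to `Output`.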

* doc: restore class JSDoc, method JSDoc, and media-separation comments

Restore documentation that the refactor dropped but whose content is
still accurate against the refactored code:

- Class-level JSDoc on LlmLlamacpp describing what the class does.
- Short JSDoc on pause(), cancel(), and unload() explaining each method's
  purpose, including how pause() saves a resumable checkpoint and how
  cancel() wipes it so the next finetune() starts fresh.
- Inline comments in _runInternal explaining the media/text separation:
  binary blobs go into promptMessages as type: 'media' entries in order,
  then the JSON text payload carries empty-content placeholders for each
  media item so tokenization can align.

* doc: shorten pickPrimaryGgufPath JSDoc in d.ts to a single line

Declaration-file JSDoc surfaces in IDE hover tooltips, so multi-paragraph
prose is noise. Trim to a one-liner covering the only behavior the type
hover needs to convey. The "exported for unit testing" rationale is
dropped since consumers do not need it on the type surface.

* doc: trim verbose comments added during the refactor

Tighten comments this PR introduced that drifted into over-explanation.
Leave pre-existing comments as-is.

- addon.js mapAddonEvent JSDoc: drop the multi-paragraph prose about
  C++ event naming and stateful ordering; keep the one-sentence
  contract plus the param block.
- index.js pickPrimaryGgufPath JSDoc: replace the multi-paragraph
  explanation of the caller's shard-list contract with a single-line
  summary citing the C++ regex contract.
- index.js class header on LlmLlamacpp: reduce to a single purpose line.
- index.js constructor block: shorten the lazy-deref rationale and the
  _addonEventState comment to one line each.
- index.js _addonOutputCallback: reduce the three-line comment
  pointing at addon.js to a single line. The detailed rationale is
  already in addon.js mapAddonEvent JSDoc.
- index.js media-separation comment: restore the one-line wording that
  already existed on main; earlier revision expanded it into three
  lines unnecessarily.

* doc: drop narration comment on _addonOutputCallback

The comment said "Event-name normalization lives in addon.js
(mapAddonEvent)", but the very next line imports and calls
mapAddonEvent — the code already tells the reader where event mapping
lives. Remove the line so the code speaks for itself.

* doc: restore FinetuneOptions JSDoc to pre-refactor forms

The refactor commit unintentionally rephrased FinetuneOptions JSDoc
lines that the refactor itself did not change. Revert those fields back
to main's original wording so the diff only carries structural changes
tied to the interface migration.

* doc: restore pre-refactor load/createAddon logs and JSDoc

The refactor commit silently dropped the _load() progress logs ('Creating
addon with configuration', 'Activating addon'), the 'Error during model
load' error log, and the JSDoc block on _createAddon(). Put them back so
the refactor only changes what needs to change.

* chore: drop unused 'test' script, inline into 'test:all'

The 'test' alias was only consumed by 'test:all', and neither was
referenced in CI workflows or the README. 'test:all' ran test:unit
twice because it called both test:unit and the 'test' alias. Remove
'test' and rewrite 'test:all' to run test:unit, test:integration, and
test:cpp directly.

* doc: correct pre-refactor constructor marker to <= 0.15.x

0.15.x still used the old (args, config) constructor shape; the old
example applies to any 0.15.x caller, not just 0.14.x. Align the
CHANGELOG marker with the PR body.

* test: run AfriqueGemma tests on mobile, matching main

The backmerge of upstream/main carried a stale 'skip: isMobile' from
the pre-refactor translation test into the six new translation tests
and the edge-cases migration. Main's a570189 deliberately dropped
the mobile skip; restore that intent. The isMobile constant is
unused after this and dropped.

* doc, test: fix _createAddon JSDoc and cover string-path media content

_createAddon() JSDoc referenced 'configurationParams.settings' and
omitted 'projectionPath'. The actual shape built in _load() is
{ path, projectionPath, config }; align the JSDoc with that.

UserMediaMessage.content widened to Uint8Array | string earlier in
this PR but no integration test exercised the string-path branch.
Add one elephant-image test that passes the absolute path as
message content, exercising the loadMedia(string) path through the
JS-to-C++ handoff.

* build: promote @qvac/logging to runtime dependency

index.js calls require('@qvac/logging') at runtime, so it belongs under
dependencies, not devDependencies. Previously it worked only because
another runtime dep pulled it in transitively — fragile for publish
and can break under stricter package managers.

* doc: finish finetuning.md mermaid fix

Previous commit 979a070 reworded only my own addition (line 251) but
the block still failed at the same position because the surrounding
pre-existing message bodies still used ; as a statement separator.
Mermaid sequenceDiagram parses ; as end-of-statement, so every message
containing it broke the diagram.

Replace ; with , or a separator word across all four affected lines
(block #1 lines 251, 256, 266 and block #2 line 296) so the finetune
and pause flow diagrams render on GitHub.

* fix: move addon construction into crash-safe try block

_createAddon() was outside the try so a synchronous throw in
require('./binding') or binding.createInstance() would leave
this.addon set to a partial native handle and never reach the
cleanup path. Route addon construction through the same try the
shard-streaming and activate() calls use.
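
A sketch of the crash-safe shape, with the addon factory injected so the failure paths can be exercised. `LoadSketch` and the injected factory are hypothetical; the real `_load()` also streams shards, which is omitted here.

```javascript
'use strict'

class LoadSketch {
  constructor (createAddon) {
    this._createAddon = createAddon // hypothetical factory injected for the demo
    this.addon = null
  }

  async load () {
    try {
      // Construction sits inside the try, so a synchronous throw here
      // reaches the same cleanup path as a failed activate().
      this.addon = this._createAddon()
      await this.addon.activate()
    } catch (err) {
      if (this.addon) {
        // Best-effort unload of the partially initialized instance.
        try { await this.addon.unload() } catch (_) {}
      }
      this.addon = null
      throw err
    }
  }
}
```

Either failure leaves `this.addon` as `null`, so a later `load()` starts from a clean state.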

---------

Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
DmitryMalishev added a commit that referenced this pull request Apr 29, 2026
CI run 25074595106 confirmed the two-phase test-side drain
(commit f26f561) is sufficient for the upstream `OutputCallBackJs`
UAF on every platform: linux-x64/-arm64, darwin-arm64,
android-arm64, ios-arm64 all pass.

Only `win32-x64-integration-tests` still fails, and it does so for
a completely different upstream issue: the first
`js_create_double` call inside an `OutputCallBackJs` callback
returns 0.0 on win32-x64 (clang-cl + bare-runtime + V8) regardless
of the input. Subsequent calls in the same handle scope are
correct. The bug zeros out the highest-confidence value on every
classify() call, breaks the sort order, and trips
`meal_1.jpg "sorted desc [0]>=[1]"` (CI runs 24851301107,
24891210942, 24897445066, 24900278513, 25002820522, 25062157099,
25070800838, 25074595106).

There is no test-side workaround for this one. Sleeps don't help
because it isn't a lifecycle race. Other addons accidentally dodge
it for the reasons enumerated in the comment block at the top of
`AddonJs.hpp` (first emitted number is naturally 0; tests assert
only typeof / !isNaN; first number never asserted on; or no
numbers emitted at all). Our 3-class triage assertions cover none
of those, so the bug remains visible in CI.

Fix: restore the local C++ "burn one" workaround that was removed
in commit 7ccb9f5. A throwaway `js_create_double(env, 0.0,
&dummy)` call at the top of `JsClassifyOutputHandler`'s lambda
consumes the broken first slot; the per-element `Number::create`
calls that follow produce the correct value at index 0. The
throwaway value is never wired into the result array; cost is one
ephemeral js_number per classify() call.

The asymmetry between issues #1 (test-side sleep is enough) and
#2 (needs C++ workaround) is now documented at the top of
AddonJs.hpp -- including the CI runs that surfaced each, why the
test-side approach worked for one and not the other, and the
explicit rationale ("removed once upstream marshalling layer is
patched") for revisiting both.

Local validation on win32-x64:
- `bare-make build` clean.
- `npm run test:integration` 14/14 tests, 140/140 asserts (was
  failing on `meal_1.jpg sorted desc [0]>=[1]` before this).

Expected CI behaviour after this commit:

- Linux x64/arm64, Darwin arm64, Android arm64, iOS arm64 should
  keep passing (this commit doesn't touch their code paths).
- win32-x64 should now pass: the burn-one consumes the broken
  first slot and every per-element confidence marshalls correctly.

File: packages/qvac-lib-infer-ggml-classification/addon/src/addon/AddonJs.hpp
Made-with: Cursor
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer calls JSON.stringify on the full PCM
  Int16Array on every chunk-streaming event; it emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.
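
A rough sketch of the summary-only logging described in the last nit. The function name `summarizeChunk` and the event shape are assumptions for illustration; the field names are the ones listed above.

```javascript
// Hypothetical helper: summarize a streaming chunk instead of serializing it.
function summarizeChunk (event) {
  const { outputArray, sampleRate, chunkIndex, isLast, sentenceChunk } = event
  return JSON.stringify({
    sampleRate,
    chunkIndex,
    isLast,
    sentenceChunk,
    // Log only the length; the samples themselves never reach the log line.
    outputArrayLen: outputArray ? outputArray.length : 0
  })
}

const summary = summarizeChunk({
  outputArray: new Int16Array(24000), // one second of audio at 24 kHz
  sampleRate: 24000,
  chunkIndex: 0,
  isLast: false,
  sentenceChunk: 'Hello.'
})
```

The log line stays a fixed, small size regardless of how many samples the chunk carries.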

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request May 11, 2026
…#1983)

* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)

New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed
by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg).  API-compatible
with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream
consumers can swap backends without touching orchestration code.

## Scope

* First iteration.  Supports Chatterbox **English** only.  Chatterbox
  multilingual, LavaSR enhancer, Supertonic engine, and streaming are
  out of scope and remain in `@qvac/tts-onnx`.  They'll land alongside
  the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg
  registry (`ports/tts-cpp`, baseline `2026-04-21`).  No ONNX Runtime
  dependency.

## JS surface

* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as
  `ONNXTTS`:  `run` / `runStream` / `runStreaming` / `reload` /
  `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` +
  `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` /
  `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`,
  `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for
  the upcoming streaming flags (`streamChunkTokens`,
  `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`,
  `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim
  from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000**
  to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both
  packages are loaded in the same Bare process.
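
As an illustration of the `files` resolution described above, here is a minimal sketch. The helper name `resolveModelPaths`, the error message, and the plain `/` join are assumptions; the package's actual implementation may differ.

```javascript
// Hypothetical sketch of the path resolution described above.
const DEFAULT_T3 = 'chatterbox-t3-turbo.gguf'
const DEFAULT_S3GEN = 'chatterbox-s3gen.gguf'

function resolveModelPaths (files) {
  const { modelDir, t3Model, s3genModel } = files
  // Join with '/' for illustration; real code would likely use a path module.
  const inDir = (name) => (modelDir ? `${modelDir.replace(/\/+$/, '')}/${name}` : undefined)
  const resolved = {
    t3Model: t3Model || inDir(DEFAULT_T3),
    s3genModel: s3genModel || inDir(DEFAULT_S3GEN)
  }
  if (!resolved.t3Model || !resolved.s3genModel) {
    throw new Error('files must provide modelDir or both t3Model and s3genModel')
  }
  return resolved
}

const defaults = resolveModelPaths({ modelDir: './models' })
const overridden = resolveModelPaths({ modelDir: './models', t3Model: '/tmp/custom-t3.gguf' })
```

Explicit `t3Model` / `s3genModel` values take priority, and `modelDir` only fills in whichever path was left out.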

## Native addon

* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` —
  `IModel` + `IModelCancel` implementation.  First-iteration strategy:
  assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output
  path, call it synchronously, then parse the resulting 16-bit mono
  PCM wav back into `std::vector<int16_t>` for the JS handler.
  Consequences: every job re-loads the model (~700 ms + inference
  time), no mid-synthesis cancellation, no streaming.  The follow-up
  milestone replaces this with a persistent, struct-based API once
  qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++
  config bridging (same string-map pattern as `@qvac/tts-onnx`) and the
  `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing
  `createInstance` / `runJob` / `reload` / `activate` / `cancel` /
  `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob`
  / `reload` wrappers that register a `JsAudioOutputHandler` emitting
  `{ outputArray: Int16Array, sampleRate: number }` to JS.
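
The "parse the resulting 16-bit mono PCM wav" step can be pictured with the sketch below. It is written in JavaScript purely for illustration (the addon does this in C++), and both `parsePcm16Wav` and `buildPcm16Wav` are hypothetical helpers that assume a plain RIFF/WAVE layout.

```javascript
// Hypothetical sketch: read 16-bit mono PCM samples out of a RIFF/WAVE buffer.
function parsePcm16Wav (bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const tag = (off) => String.fromCharCode(bytes[off], bytes[off + 1], bytes[off + 2], bytes[off + 3])
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE') throw new Error('not a RIFF/WAVE file')

  let sampleRate = 0
  let offset = 12
  // Each chunk is a 4-byte id, a 4-byte little-endian size, then the payload.
  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset)
    const size = view.getUint32(offset + 4, true)
    const body = offset + 8
    if (id === 'fmt ') {
      const format = view.getUint16(body, true)
      const channels = view.getUint16(body + 2, true)
      const bits = view.getUint16(body + 14, true)
      if (format !== 1 || channels !== 1 || bits !== 16) throw new Error('expected 16-bit mono PCM')
      sampleRate = view.getUint32(body + 4, true)
    } else if (id === 'data') {
      const count = Math.min(size, bytes.byteLength - body) >> 1
      const samples = new Int16Array(count)
      for (let i = 0; i < count; i++) samples[i] = view.getInt16(body + i * 2, true)
      return { sampleRate, samples }
    }
    offset = body + size + (size & 1) // chunks are padded to even sizes
  }
  throw new Error('no data chunk found')
}

// Build a minimal WAV with a 44-byte header so the parser has something to read.
function buildPcm16Wav (samples, sampleRate) {
  const bytes = new Uint8Array(44 + samples.length * 2)
  const view = new DataView(bytes.buffer)
  const put = (off, s) => { for (let i = 0; i < s.length; i++) bytes[off + i] = s.charCodeAt(i) }
  put(0, 'RIFF')
  view.setUint32(4, 36 + samples.length * 2, true)
  put(8, 'WAVE')
  put(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size
  view.setUint16(20, 1, true) // PCM
  view.setUint16(22, 1, true) // mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * 2, true) // byte rate
  view.setUint16(32, 2, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  put(36, 'data')
  view.setUint32(40, samples.length * 2, true)
  samples.forEach((s, i) => view.setInt16(44 + i * 2, s, true))
  return bytes
}

const wav = buildPcm16Wav([1000, -1000], 24000)
const decoded = parsePcm16Wav(wav)
```

Walking the chunk list instead of assuming a fixed 44-byte header keeps the parser tolerant of extra chunks that some writers emit before `data`.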

## Build / registry

* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)`
  and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape
  matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough)
  plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg.
  NOTE: the baseline pin here is inherited from
  `@qvac/transcription-whispercpp` and **must be bumped** to a commit
  that contains the `tts-cpp` port once that registry PR lands.  A
  follow-up commit will update it.

## Tests & examples

* Integration + unit test files for Chatterbox English are copied
  verbatim from `@qvac/tts-onnx` with only mechanical renames
  (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`,
  `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`).  Some
  paths in `test/integration/addon.test.js` still import Supertonic /
  LavaSR helpers that don't exist in this package — those test blocks
  will fail fast when the file loads, which is expected until those
  backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus
  shared `wav-helper.js` + `pcm-chunk-player.js`.

## What's not in this PR (known gaps)

* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes
  will land in a single documentation pass once the registry + fork
  commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a
  qvac-registry-vcpkg commit that ships `tts-cpp` (pending the
  registry PR).
* Actual `npm run build` requires the registry and fork commits to be
  on `main` of their respective upstream repos.

* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit

Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg
at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that
adds the `tts-cpp` port.  Paired with the `qvac-tts` library already
pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp
@ 0fe4a521618cc30358040b29d75d4261b31cbb60).

Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry
PR lands upstream.

* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper

Second pass over @qvac/tts-ggml after the build started passing: prune
everything that only made sense for the ONNX-era multi-engine scope and
adapt the remaining Chatterbox-English bits to the GGUF + file-path
reference-audio contract.  Restores `test/mobile/` so the Android build
has something to point at.

## C++

* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment
  contained `**/` which closed the block comment early and broke the
  build.  Rewrote as a `//` comment.

## Examples

* `examples/chatterbox-tts.js` — rewrite for v0 contract: single
  `<text>` argv, `files: { modelDir }` pointing at the two GGUFs,
  `referenceAudio` is now a wav **path** (addon passes it to
  `--reference-audio`) instead of a Float32Array.  Drops
  english/multilingual arg and the CHATTERBOX_VARIANT switch that
  picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` +
  `examples/pcm-chunk-player.js`.  The v0 addon re-loads the model
  per `run()` call — exposing streaming would mislead.  Both come
  back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it
  runs without extra args.

## Tests

### Kept as-is (engine-agnostic)

* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`

### Mechanical fixes

* `test/unit/tts.error.test.js` — fix error-code assertions to the
  tts-ggml range (`13001–14000`); was still checking the
  `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js` — fix stale
  `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the
  stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the
  non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine
  cleanup.

### Rewritten

* `test/unit/chatterbox.inference.test.js` — drop tests that asserted
  the old ONNX file shape (`tokenizer / speechEncoder / embedTokens /
  conditionalDecoder / languageModel`), the removed `engine` detection
  and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`).
  New tests cover: `modelDir` derives the two GGUF paths; explicit
  `t3Model` / `s3genModel` override the defaults.  The mocked-binding
  run/reload/cancel flow stays.
* `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English
  only.  Ensures the GGUFs are present, runs the short sentence set
  through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and
  (on darwin only) runs a whisper-based WER check via the existing
  `runWhisper` util.  Drops the Chatterbox-multilingual block + every
  Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract:
  `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a
  file path that falls back to `test/reference-audio/jfk.wav` (or the
  mobile test-asset when `global.assetPaths` is present).  No more
  WAV decode / resample on the JS side.
* `test/utils/downloadModel.js` — trim from 1007 LoC to 280.  Drops
  the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie
  downloaders.  Keeps the shared HTTP/curl infrastructure and
  `ensureWhisperModel` (still used by the integration WER check).
  `ensureChatterboxModels` is now **check-only**: it verifies
  `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally
  and, if missing, prints the exact commands for generating them
  from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts.
  Once the GGUFs land on a canonical HuggingFace repo we'll wire up
  download URLs here.
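
A check-only guard of the kind described for `ensureChatterboxModels` might look like the sketch below. The name `checkChatterboxModels`, its signature, the injected `exists` callback, and the printed hint are all hypothetical; the real helper prints concrete conversion commands that are not reproduced here.

```javascript
// Hypothetical sketch of a check-only model guard; `exists` is injected so the
// example runs without touching the real filesystem.
const REQUIRED_GGUFS = ['chatterbox-t3-turbo.gguf', 'chatterbox-s3gen.gguf']

function checkChatterboxModels (modelDir, exists) {
  const missing = REQUIRED_GGUFS
    .map((name) => `${modelDir}/${name}`)
    .filter((file) => !exists(file))
  if (missing.length === 0) return { success: true, missing }
  // No download is attempted; the caller decides whether to skip or fail.
  console.log(`Missing GGUFs:\n  ${missing.join('\n  ')}`)
  console.log('Generate them with the conversion scripts, then re-run.')
  return { success: false, missing }
}

const onDisk = new Set(['./models/chatterbox-t3-turbo.gguf'])
const check = checkChatterboxModels('./models', (file) => onDisk.has(file))
```

Returning a result object instead of throwing lets a test decide to skip rather than fail when the models are absent.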

## Scripts

* `scripts/ensure-chatterbox.js` — simplify to a single invocation
  against `./models/`.  Drops the variant / language matrix that the
  ONNX downloader needed.
* `scripts/ensure-models.js` — now a thin alias to
  `ensure-chatterbox.js`.  Drops the Supertonic + LavaSR orchestration.

## Mobile

* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs,
  testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.

## Gitignore

* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp`
  (produced by the top-level `configure_file(...)` calls) and
  `build_*/` dirs (bare-make convention).

## Verified locally

* `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean.
* `npm run test:unit` — 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."`
  produces a 24 kHz wav as expected.

* Add streaming support

* Update ggml backend to use separate ggml repo

* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)

Upstream chatterbox.cpp renamed the package + namespace + target from
qvac-tts to tts-cpp and tightened the library boundary; pick up the
new artefacts here:

- find_package(qvac-tts-cpp CONFIG REQUIRED)
    -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts  -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions,
  SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h>
    -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside
  the code changes.

vcpkg wiring:
- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg
  commit bc30b0b (ports/tts-cpp renamed and repointed at
  chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that
  carries the rename + namespace + install(EXPORT) changes).

Verified with a cold bare-make generate + bare-make build against the
new port, and the addon's existing unit + integration test suites.

Made-with: Cursor

* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline

Picks up the round-3 review-fix wave landed on the tts-cpp port:

  e673182  scrub stale patches/ refs from README                (N10)
  8ba10a6  drop unreachable TTS_CPP_GGML_LIB_PREFIX block        (N8)
  4b5d2d7  mirror N1-N7 fixes from chatterbox.cpp source-of-truth
            - N1 supertonic alive-registry guard against freed-backend
              gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
            - N2 drop dead g_sink_* state, soften log_set docstring
            - N3 Turbo BPE try/catch (exception-safe Engine ctor)
            - N4 STFT cancel checkpoint + tighter Engine::cancel() doc
            - N5 document s3gen_preload/unload refcount semantics
            - N6 drop dead cached_text_lc Supertonic shim
            - N7 fix misleading "no copy" view-vs-copy log wording

Plus the integrated-port-only round-2 fixes that landed earlier:

  fa0d490  close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML
            now defaults ON; bundled-without-patches hard-errors at
            configure time with a pointer at the ggml-speech vcpkg
            port.
  ae34c58  README rewritten for integrated/vcpkg context.
  a2f2dd6  top-level qvac-ext-lib-whisper.cpp README points at the
            tts-cpp/ subtree (alongside parakeet-cpp/).

Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine /
EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is
backward-compatible: the new port adds Engine::backend_name(),
MTL-variant fields on EngineOptions (language / cfg_weight / min_p /
exaggeration), and a separate tts_cpp::supertonic::Engine class, but
nothing this consumer was already calling has changed.

Edits:

  packages/tts-ggml/vcpkg.json
    - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.

  packages/tts-ggml/vcpkg-configuration.json
    - default-registry baseline: bc30b0b (April 2026 fork-only state)
      -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.  The new baseline
      pulls in the post-tetherto-merge state (parakeet-cpp port at
      932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new
      tts-cpp port (16b91af) on the developer's GustavoA1604
      registry fork.

Smoke-test plan: after running `vcpkg install` against the new
baseline, the tts-cpp port's vcpkg_from_github resolves at
GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the
upstream PR merges.  ChatterboxModel should build and synthesize
identically; expanding to Multilingual + Supertonic flows is the
follow-up commit on the package side.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add chatterbox multilingual and supertonic

* Add mobile integration tests

* tts-ggml: drop clang-19 pin in linux-clang toolchain

The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary
names) since the package's first commit (0a2c978).  Linux CI hadn't
exercised this path before — the new on-pr-tts-ggml.yml -> integration
matrix is the first time it does, and it fails on every linux runner
(ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's
"detect_compiler" step because none of the GH-hosted images ship a
`clang-19` symlink:

  Detecting compiler hash for triplet x64-linux...
  error: while detecting compiler information:
  ...
  CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127
  (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE=
  .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...

Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/
toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so
each runner picks up its image's default clang (clang-15 on
ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship).
The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake
is honoured by every reasonable clang version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add C++ tests and coverage; fix linux build

* tts-ggml: address PR review feedback

Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.
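
The deferred-load contract from item #4 can be sketched as below. This is a JavaScript illustration of the shape only: `DeferredModel`, `loadFn`, and `modelPath` are hypothetical names, and a promise by itself does not move work to a worker thread the way the native JsAsyncTask::run wrapper is described as doing.

```javascript
// Hypothetical sketch of the deferred-load contract; not the addon's code.
class DeferredModel {
  constructor (config, loadFn) {
    // Constructor stays cheap: validate and return, no file parsing here.
    if (!config || typeof config.modelPath !== 'string') {
      throw new TypeError('config.modelPath must be a string')
    }
    this.config = config
    this.loadFn = loadFn
    this.loadPromise = null
    this.loaded = false
  }

  // activate() starts the expensive load at most once and returns a promise,
  // so repeated callers share the same in-flight load.
  activate () {
    if (!this.loadPromise) {
      this.loadPromise = Promise.resolve()
        .then(() => this.loadFn(this.config.modelPath))
        .then(() => { this.loaded = true })
    }
    return this.loadPromise
  }
}

let loadCalls = 0
const model = new DeferredModel({ modelPath: './models/example.gguf' }, async () => { loadCalls++ })
const loadedAfterConstruct = model.loaded // false: nothing parsed yet
const ready = Promise.all([model.activate(), model.activate()])
```

Callers await the promise returned by activate() before running a job, which mirrors how tts.js is described as awaiting the activation promise.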

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer calls JSON.stringify on the full PCM
  Int16Array on every chunk-streaming event; it emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: unblock CI integration tests on every desktop runner

Four independent failures, one per platform:

1. linux-x64 / linux-arm64: addon load crashed at
   `libomp.so.5: cannot open shared object file`.  tts-cpp's binary is
   built with clang under the linux-clang toolchain and links against
   libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being
   apt-installed.  Add `libomp5` so libomp.so.5 is on the loader path.

2. darwin-arm64: convert-models.sh aborted at line 200 with
   `hf_args[@]: unbound variable`.  macOS's system bash is 3.2, which
   treats `"${arr[@]}"` as an unbound-variable error when the array is
   empty under `set -u`; with HF_TOKEN unset we hit it on every fresh
   runner.  Use
   the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six
   call sites and add a header comment so the next maintainer doesn't
   accidentally regress.

3. darwin-x64: pip install bombed building `llvmlite` from source
   because the macos-15-large runner has no LLVM 15 development
   install.  Root cause: librosa pulls in numba 0.65+, which stopped
   shipping darwin-x86_64 wheels for Python 3.12.  Pin Python to 3.11
   in the Setup Python step; 3.11 has prebuilt wheels for the entire
   numba/llvmlite/librosa stack on darwin-x64 and is fine for every
   other converter dependency.

4. windows-2022: ChatterboxModel::load threw
   `vk::createInstance: ErrorIncompatibleDriver`.  Root cause: the
   addon's index.js::_validateConfig defaults `useGPU = true` when
   neither useGPU nor nGpuLayers is specified, so the test ran with
   n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance ->
   ErrorIncompatibleDriver on the runner's no-Vulkan-driver image.
   runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'`
   (set on the no-GPU matrix entries) and forces useGPU=false on
   exactly those runners; the other test runners (chatterbox-mtl,
   gpu-smoke, multiple-runs) already had this guard.
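
The test-side guard from item 4 could be sketched like this. `resolveUseGPU` is a hypothetical helper that takes the environment as a parameter so the example is self-contained; in the test runner the equivalent input would be `process.env`.

```javascript
// Hypothetical sketch of the test-side GPU guard.
function resolveUseGPU (env, requested) {
  // NO_GPU=true (set on the no-GPU CI matrix entries) always wins.
  if (env.NO_GPU === 'true') return false
  // Otherwise keep the caller's explicit choice, defaulting to GPU.
  return requested === undefined ? true : requested
}

const onCpuRunner = resolveUseGPU({ NO_GPU: 'true' }, undefined)
const onGpuRunner = resolveUseGPU({}, undefined)
const explicitCpu = resolveUseGPU({}, false)
```

Comparing against the string `'true'` matters because environment variables are always strings, so `NO_GPU=false` would otherwise read as truthy.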

Also documents the `mesa-vulkan-drivers` apt package (already pulled
in) as the software ICD that lets the Vulkan-built prebuild's runtime
backend probe enumerate at least one device on linux runners.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)

Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:

  SyntaxError: assets/testAssets/chatterbox-s3gen.gguf:
    Cannot create a string longer than 0x1fffffe8 characters

Root cause: Metro's bundler reads every asset under
`test/mobile/testAssets/` via `Buffer.toString()`.  V8's max string
length is 0x1fffffe8 (~512 MiB).  chatterbox-s3gen.gguf is ~1 GiB even
with --quant q4_0 because the s3gen converter only quantizes attention
weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight
tensors quantized" in the converter log).

Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the
limit) on mobile.  Mobile Chatterbox tests degrade cleanly to
`t.pass('Skipped: Chatterbox GGUFs not available')` via the existing
`ensureChatterboxModels` helper -- it already returns
{ success: false } when the GGUFs aren't on disk.
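
The size arithmetic behind this decision can be checked with a short sketch. The helper `fitsInV8String` is hypothetical, and it treats the file's byte size as a rough proxy for the length of the string Metro would build, which is an approximation.

```javascript
// Limit quoted in the Metro error message.
const V8_MAX_STRING_LENGTH = 0x1fffffe8 // 536870888 characters
const MiB = 1024 * 1024

// Assumption: byte size approximates the resulting string length.
function fitsInV8String (sizeBytes) {
  return sizeBytes <= V8_MAX_STRING_LENGTH
}

const limitMiB = V8_MAX_STRING_LENGTH / MiB // just under 512
const s3genFits = fitsInV8String(1024 * MiB) // ~1 GiB chatterbox-s3gen.gguf
const supertonicFits = fitsInV8String(125 * MiB) // ~125 MiB supertonic.gguf
```

With the sizes quoted above, the s3gen file is roughly twice the limit while supertonic.gguf uses about a quarter of it.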

Cache key bumped to v2 so existing v1 cache entries (which include
the chatterbox files) are evicted on the next run.

Bundling Chatterbox on mobile requires either:
  - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the
    JS-string read is skipped (then the s3gen file can flow through the
    bundle as a raw asset), or
  - pushing the chatterbox GGUFs to the device via `adb push` outside
    the bundle and surfacing the path through downloadModel.js's
    existing ANDROID_CANDIDATE_DIRS fallback.

Both are outside the scope of this PR; documented inline above the
cache step for the next maintainer.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump hash of vcpkg

* Consume vcpkg from tetherto repository

* Fix integration tests failures in all platforms

* Further fix tests

* fix: Make useGPU flag more meaningful (#1953)

* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts

* add gpu smoke test

* resolve comments

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

* Update dependencies after monorepo directory changes

* Further drop qvac-lib- prefix

* Add CHANGELOG.md

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
lauripiisang added a commit that referenced this pull request May 14, 2026
…ax_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.
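
The `max_tokens` fallback from the fourth bullet might be handled along these lines. `resolveMaxOutputTokens`, the warning text, and the rule that `max_output_tokens` takes precedence when both fields are present are assumptions for illustration.

```javascript
// Hypothetical sketch of the max_output_tokens / max_tokens fallback.
function resolveMaxOutputTokens (body, logger) {
  if (typeof body.max_output_tokens === 'number') return body.max_output_tokens
  if (typeof body.max_tokens === 'number') {
    // Still accepted so existing clients keep working, but nudge them.
    logger.warn('max_tokens is not a /v1/responses field; use max_output_tokens')
    return body.max_tokens
  }
  return undefined
}

const warnings = []
const logger = { warn: (msg) => warnings.push(msg) }
const fromSpecField = resolveMaxOutputTokens({ max_output_tokens: 256, max_tokens: 64 }, logger)
const fromFallback = resolveMaxOutputTokens({ max_tokens: 64 }, logger)
```

Only the fallback path logs, so clients that already send the spec field see no extra noise.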

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.
lauripiisang added a commit that referenced this pull request May 14, 2026
#2030)

* QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store

Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id},
GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL
store, X-QVAC-Stub: responses-volatile header, and startup banner.
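
For readers unfamiliar with the LRU+TTL combination, a minimal sketch follows. `ResponseStore`, its constructor options, and the injectable `now` clock are hypothetical and only illustrate the general technique, not the store shipped in this PR.

```javascript
// Hypothetical sketch of an LRU + TTL store built on Map insertion order.
class ResponseStore {
  constructor ({ maxEntries, ttlMs, now = Date.now }) {
    this.maxEntries = maxEntries
    this.ttlMs = ttlMs
    this.now = now
    this.entries = new Map() // id -> { value, expiresAt }
  }

  set (id, value) {
    this.entries.delete(id) // re-inserting moves the id to the newest slot
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs })
    while (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is least recently used.
      this.entries.delete(this.entries.keys().next().value)
    }
  }

  get (id) {
    const entry = this.entries.get(id)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id)
      return undefined
    }
    // Refresh recency on read without extending the TTL.
    this.entries.delete(id)
    this.entries.set(id, entry)
    return entry.value
  }
}

let clock = 0
const store = new ResponseStore({ maxEntries: 2, ttlMs: 1000, now: () => clock })
store.set('resp_1', 'first')
store.set('resp_2', 'second')
store.get('resp_1') // touch resp_1 so resp_2 becomes the eviction candidate
store.set('resp_3', 'third')
```

The entry cap bounds how many responses are held, and the TTL bounds how long any single one survives.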

* fix: align Responses streaming with finalized response and add usage stats

- Approach (b): always include the assistant `message` item in `response.output[0]`,
  even when tool calls are present, so the streamed item tree matches `response.completed`.
- Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and
  the finalized `output[]`, fixing client-side accumulation by `item_id`.
- Use distinct `output_index` per tool call (1..n) and set `item_id` on
  `response.function_call_arguments.delta`/`.done` to the function-call item id
  (was the OpenAI `call_id`, causing collisions and wrong wiring).
- Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can
  satisfy tool calls instead of hanging in `requires_action` with no payload.
- Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
- Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
- Recognise `function_call_output` (-> `tool` role) and `function_call`
  (-> synthesized assistant `<tool_call>` content) in
  `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained
  tool round-trips actually carry through `previous_response_id`.
- Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
- Surface real `usage.output_tokens` from `result.stats.generatedTokens`
  (Responses + chat.completions, blocking + streaming); fall back to word count
  when stats are missing. `input_tokens` stays 0 with an inline note that the SDK
  does not expose a prompt-token count today.
- Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.
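
The usage fallback in the second-to-last bullet can be sketched as follows. `buildUsage` and the `total_tokens` field are assumptions for illustration; only the `generatedTokens` source, the word-count fallback, and the zero `input_tokens` come from the text above.

```javascript
// Hypothetical sketch of the usage fallback described above.
function buildUsage (result, outputText) {
  const generated = result && result.stats && result.stats.generatedTokens
  const outputTokens = Number.isInteger(generated)
    ? generated
    // Fallback: whitespace-separated word count as a rough estimate.
    : outputText.split(/\s+/).filter(Boolean).length
  const inputTokens = 0 // no prompt-token count is exposed today
  return {
    input_tokens: inputTokens,
    output_tokens: outputTokens,
    total_tokens: inputTokens + outputTokens
  }
}

const withStats = buildUsage({ stats: { generatedTokens: 42 } }, 'ignored text')
const withoutStats = buildUsage({}, 'three short words')
```

Word count is only an estimate of token count, so the fallback value should be read as approximate.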

Tests: extend `responses.test.ts` and `translate.test.ts`; add
`responses-streaming.test.ts` driving the new exported `writeStreamingResponse` /
`writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.

* test[skiplog]: stabilize Responses chain e2e for tiny reasoning model

Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M
has room for both its <think> block and the actual answer. The test
exercises previous_response_id chain wiring; it should not depend on
sampling luck or the model's reasoning length.

* fix: walk previous_response_id chain so multi-turn keeps grandparent history

Each StoredResponse.inputItems only carries that turn's NEW input
(`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of
depth >= 3 silently lost the grandparent turn:

  resp_1 input "A" -> output "X"        (stored: ["A"])
  resp_2 prev=resp_1 input "B"           history sent: [A, X, B]
                                         (stored: ["B"])
  resp_3 prev=resp_2 input "C"           history sent: [B, Y, C]  -- A and X gone

historyPrefixFromStoredResponse now walks the chain via
responseObject.previous_response_id when given a resolver, prepending
earlier turns oldest-first. Cap depth at 32 to bound work and protect
against pathological cycles. Routes pass `(id) => store.get(id)` as the
resolver. Legacy single-step callers still work unchanged when the
resolver is omitted.
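
The chain walk can be sketched as below, under simplifying assumptions: `historyPrefix`, the flat `previousResponseId` field, and the `inputItems` / `outputItems` arrays are hypothetical stand-ins for the real stored-response shape.

```javascript
// Hypothetical sketch of walking a previous_response_id chain oldest-first.
function historyPrefix (stored, resolve, maxDepth = 32) {
  const turns = []
  const seen = new Set()
  let current = stored
  // Collect newest-first, stopping at the depth cap, a missing parent, or a cycle.
  while (current && turns.length < maxDepth && !seen.has(current.id)) {
    seen.add(current.id)
    turns.push(current)
    const parentId = current.previousResponseId
    current = parentId && resolve ? resolve(parentId) : undefined
  }
  // Reverse so the oldest turn comes first, then emit input followed by output.
  return turns.reverse().flatMap((turn) => [...turn.inputItems, ...turn.outputItems])
}

const responses = new Map([
  ['resp_1', { id: 'resp_1', previousResponseId: null, inputItems: ['A'], outputItems: ['X'] }],
  ['resp_2', { id: 'resp_2', previousResponseId: 'resp_1', inputItems: ['B'], outputItems: ['Y'] }]
])
const resolver = (id) => responses.get(id)

const full = historyPrefix(responses.get('resp_2'), resolver) // prefix for resp_3
const legacy = historyPrefix(responses.get('resp_2')) // no resolver: single step
const capped = historyPrefix(responses.get('resp_2'), resolver, 1)
```

With the resolver, the prefix for the third turn is A, X, B, Y, so appending the new input "C" restores the grandparent turn that the diagram above shows being lost.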

Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth
  cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the
  word and recovers it -- would silently fail before this fix.

* fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.
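
The `sentinel` knob from the first item might look roughly like this. Only the names `endSSE` and `sentinel` come from the text above; the default value, the function body, and the `createFakeResponse` helper are assumptions for illustration.

```javascript
// Sketch: body and default are assumptions, not the actual implementation.
function endSSE(res, options = {}) {
  const { sentinel = true } = options;
  if (sentinel) {
    res.write('data: [DONE]\n\n'); // chat-completions style terminator
  }
  res.end();
}

// Hypothetical stand-in for a ServerResponse that records what was written.
function createFakeResponse() {
  return {
    chunks: [],
    ended: false,
    write(chunk) { this.chunks.push(chunk); },
    end() { this.ended = true; }
  };
}

const chatRes = createFakeResponse();
endSSE(chatRes); // keeps the sentinel

const responsesRes = createFakeResponse();
endSSE(responsesRes, { sentinel: false }); // stream ends on response.completed
```

Defaulting to `true` keeps existing callers unchanged, so only the Responses route has to opt out.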

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.

* fix: address Simon review nits (stream error sentinel, input_items after cursor)

Two items surfaced post-rebase:

1. sendError gained an opt-in { sseSentinel: false } so callers inside an
   active stream can suppress the trailing `data: [DONE]\n\n` after the
   `response.error` SSE event. Responses streaming error path now passes
   it, closing the gap that the happy path already handled (response.completed
   already used endSSE({ sentinel: false })).

2. GET /v1/responses/:id/input_items now reads the `after` cursor from the
   query string in addition to `limit`. Previously, a spec-compliant client
   paging with `after` re-fetched page 1 forever, even though the store
   already implemented the cursor.
   Added a store-level pagination test that walks all pages by `last_id`.
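
Cursor pagination of the kind described in item 2 can be sketched like this. The page fields (`data`, `has_more`, `last_id`) follow the list shape mentioned above; `listInputItems`, `walkAllPages`, the default limit, and the unknown-cursor behaviour are assumptions.

```javascript
// Hypothetical store-level pager; not the actual store code.
function listInputItems(items, { after, limit = 20 } = {}) {
  const size = Math.max(1, limit); // avoid an empty page that never advances
  let start = 0;
  if (after != null) {
    const index = items.findIndex((item) => item.id === after);
    if (index === -1) throw new RangeError(`unknown cursor: ${after}`);
    start = index + 1; // resume right after the cursor item
  }
  const data = items.slice(start, start + size);
  return {
    data,
    has_more: start + size < items.length,
    last_id: data.length > 0 ? data[data.length - 1].id : null
  };
}

// Walks every page by feeding `last_id` back in as the `after` cursor.
function walkAllPages(items, limit) {
  const seen = [];
  let after;
  for (;;) {
    const page = listInputItems(items, { after, limit });
    seen.push(...page.data);
    if (!page.has_more) return seen;
    after = page.last_id;
  }
}
```

If the route ignored `after`, every call in `walkAllPages` would return the first page with `has_more: true`, which is the endless loop the fix removes.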
Proletter added a commit that referenced this pull request May 24, 2026
…oval bot

The tier-based approval check previously defaulted to tier 1 if no
`tier2` label was present, treating tier 1 as an implicit fallback.
Per QVAC-18190 the `verified` label landing in #1997 should also be
the explicit tier-1 marker, so reviewers can see in the bot comment
*why* a PR is at tier 1 (because it's verified, vs. because no tier
label is set).

Behaviour matrix (matches QVAC-18613 acceptance criteria):

| Labels on PR    | Tier  | Bot comment "Tier source"                    |
|-----------------|-------|----------------------------------------------|
| neither         | tier1 | default (no tier label applied)              |
| `verified` only | tier1 | `verified` label applied (explicit tier 1)   |
| `tier2` only    | tier2 | `tier2` label applied                        |
| both            | tier2 | `tier2` label applied (overrides `verified`) |

Precedence rationale: `tier2` wins when both labels are present.
Stricter requirements take priority on conflict. The matrix above
is preserved verbatim in the bot comment so the reviewer can verify
the active rule without reading code.
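
The matrix and precedence rule can be expressed as a small pure function. This is a sketch: `hasVerified`, `hasTier2`, and `tierSource` are named in this message, while `resolveTier` and its return shape are hypothetical.

```javascript
// Hypothetical helper mirroring the behaviour matrix; not the bot's code.
function resolveTier(labels) {
  const hasVerified = labels.includes('verified');
  const hasTier2 = labels.includes('tier2');

  if (hasTier2) {
    // Stricter requirements win when both labels are present.
    return {
      tier: 'tier2',
      tierSource: hasVerified
        ? '`tier2` label applied (overrides `verified`)'
        : '`tier2` label applied'
    };
  }
  if (hasVerified) {
    return { tier: 'tier1', tierSource: '`verified` label applied (explicit tier 1)' };
  }
  return { tier: 'tier1', tierSource: 'default (no tier label applied)' };
}
```

Checking `tier2` first is what makes the both-labels row resolve to tier 2 without a separate branch.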

Backwards compatibility: PRs with neither label keep the prior
behaviour exactly (tier 1 default, same requirements computation).
PRs with only `tier2` keep the prior behaviour exactly. The only
new code paths are the `verified`-only and both-labels cases.

Implementation:
- New `hasVerified` / `hasTier2` locals.
- New `tierSource` string threaded through
  `checkTierRequirements` and `comment` so the bot output mentions
  `verified` per AC #2.
- Bot comment gains a `**Tier source:**` line right under
  `**PR Tier:**` so the source is visible without expanding logs.

Validation plan:
- Marked draft pending #1997 (label-gate fan-out) merge so the
  `verified` label is actually applied in the wild.
- Test matrix walkthrough on a throwaway PR with each of the 4
  label combinations + `/review` comment to trigger the worker.
- Confirm the bot comment shows the expected "Tier source" string
  in each case and the requirementsMet decision is unchanged from
  the previous logic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Minor fixes for clarity

* Removing unused vars

* Removing unused imports

* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>
Proletter pushed a commit that referenced this pull request May 24, 2026
CI run 25074595106 confirmed the two-phase test-side drain
(commit f6d1d5d) is sufficient for the upstream `OutputCallBackJs`
UAF on every platform: linux-x64/-arm64, darwin-arm64,
android-arm64, ios-arm64 all pass.

Only `win32-x64-integration-tests` still fails, and it does so for
a completely different upstream issue: the first
`js_create_double` call inside an `OutputCallBackJs` callback
returns 0.0 on win32-x64 (clang-cl + bare-runtime + V8) regardless
of the input. Subsequent calls in the same handle scope are
correct. The bug zeros out the highest-confidence value on every
classify() call, breaks the sort order, and trips
`meal_1.jpg "sorted desc [0]>=[1]"` (CI runs 24851301107,
24891210942, 24897445066, 24900278513, 25002820522, 25062157099,
25070800838, 25074595106).

There is no test-side workaround for this one. Sleeps don't help
because it isn't a lifecycle race. Other addons accidentally dodge
it for the reasons enumerated in the comment block at the top of
`AddonJs.hpp` (first emitted number is naturally 0; tests assert
only typeof / !isNaN; first number never asserted on; or no
numbers emitted at all). Our 3-class triage assertions cover none
of those, so the bug remains visible in CI.

Fix: restore the local C++ "burn one" workaround that was removed
in commit efbd683. A throwaway `js_create_double(env, 0.0,
&dummy)` call at the top of `JsClassifyOutputHandler`'s lambda
consumes the broken first slot; the per-element `Number::create`
calls that follow produce the correct value at index 0. The
throwaway value is never wired into the result array; cost is one
ephemeral js_number per classify() call.

The asymmetry between issues #1 (test-side sleep is enough) and
#2 (needs C++ workaround) is now documented at the top of
AddonJs.hpp -- including the CI runs that surfaced each, why the
test-side approach worked for one and not the other, and the
explicit rationale ("removed once upstream marshalling layer is
patched") for revisiting both.

Local validation on win32-x64:
- `bare-make build` clean.
- `npm run test:integration` 14/14 tests, 140/140 asserts (was
  failing on `meal_1.jpg sorted desc [0]>=[1]` before this).

Expected CI behaviour after this commit:

- Linux x64/arm64, Darwin arm64, Android arm64, iOS arm64 should
  keep passing (this commit doesn't touch their code paths).
- win32-x64 should now pass: the burn-one consumes the broken
  first slot and every per-element confidence marshalls correctly.

File: packages/qvac-lib-infer-ggml-classification/addon/src/addon/AddonJs.hpp
Made-with: Cursor
Proletter pushed a commit that referenced this pull request May 24, 2026
…ightsProvider (#1494)

* chore[bc]: remove BaseInference inheritance and WeightsProvider from LLM addon

Replace class inheritance with composable utilities from @qvac/infer-base@0.4.0:
- createJobHandler() for single-job lifecycle management
- exclusiveRunQueue() for run serialization
- Direct shard streaming via bare-fs instead of WeightsProvider

Constructor now takes { files: { model: string[], projectionModel?: string }, config, logger, opts }
instead of { loader, diskPath, modelName, projectionModel } + config.

All finetune, media, and filtered logger functionality preserved.

* fix: correct FinetuneProgress and finetune terminal handling in output callback

FinetuneProgress must call updateStats(data.stats), not updateOutput(data).
Finetune terminal JobEnded must call ended(data) as result, not updateStats.

* fix: update all LLM examples and model-loading test to new constructor shape

Update 13 examples and sharded model test to use files: { model: [...] } pattern.
Remove FilesystemDL dependency from all examples and tests.

* fix: update sharded model test to download shards to disk first

The network loader test used the old loader-based constructor.
Rewritten to download shards via HttpDL to disk, then pass absolute paths.

* fix: update LLM benchmark tooling to new constructor shape

* fix: update LLM perf benchmark sweep and judge to new constructor shape

* docs: update LLM README, finetuning, and afriquegemma docs for new constructor

* fix: update LLM prepare-prompts and verify-prompts to new constructor

* fix: update LLM finetuning unit tests to new constructor and exclusiveRunQueue

* docs: update LLM architecture, data-flows, finetuning, README sharded contract

* docs: align LLM finetuning docs and mobile README with new constructor

* chore[bc]: address PR #1494 review findings and bump to 0.15.0

Bumps `@qvac/llm-llamacpp` to `0.15.0` per the addon-changelog
process — minor bump on a pre-1.0 package signals the breaking
constructor change to consumers using semver ranges. Adds the
matching `0.15.0` block to `CHANGELOG.md` documenting the new
single-object constructor with `files`, the removal of
`BaseInference` + `WeightsProvider`, the dropped `destroy()`
method, the dependency churn, and every behaviour change in this
release.

Hardens the JS layer based on the review:

- Constructor now throws a clear `TypeError` when `files` /
  `files.model` is missing or empty, instead of crashing with an
  opaque "cannot read properties of undefined" later.
- `_runInternal` now throws "Addon not initialized. Call load()
  first." when invoked before `load()`, matching `finetune()` and
  the diffusion addon.
- `_load()` wraps `_streamShards` + `addon.activate()` in a
  try/catch that best-effort-unloads the partially-initialized
  native instance and resets `this.addon = null` so a subsequent
  `load()` does not leak a zombie addon.
- `createJobHandler({ cancel })` closure uses optional chaining so
  a stale `response.cancel()` after `unload()` is a no-op rather
  than a `TypeError`.
- `unload()` sets `this.addon = null` after `addon.unload()`, so
  the new `if (!this.addon)` guard in `_runInternal` is also
  effective post-unload.
- `pause()` and `cancel()` re-add the defensive `?.cancel` check.
- The `_load()` primary-path selection now picks the first entry
  matching the shard regex, replacing the fragile `[length - 1]`
  index. This stays compatible with the documented sharded order
  (`tensors.txt` first, shards second) and with the non-sharded
  single-file path; an inline comment explains the contract.
- The `_handleAddonOutputEvent` error log line now passes the
  `Error` object directly so loggers can format the full stack.
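
The lifecycle guards in the list above can be sketched with a stand-in class. Everything here is illustrative: `ModelSketch`, the injected `createAddon` factory, and the fake addon methods are assumptions, not the `LlmLlamacpp` implementation.

```javascript
// Sketch with a hypothetical `createAddon` factory standing in for the native binding.
class ModelSketch {
  constructor(createAddon) {
    this._createAddon = createAddon;
    this.addon = null;
  }

  async load() {
    this.addon = this._createAddon();
    try {
      await this.addon.loadWeights();
      await this.addon.activate();
    } catch (err) {
      // Best-effort cleanup so a later load() does not leak a half-built addon.
      try { await this.addon.unload(); } catch { /* ignore cleanup failure */ }
      this.addon = null;
      throw err;
    }
  }

  async run(prompt) {
    if (!this.addon) throw new Error('Addon not initialized. Call load() first.');
    return this.addon.runJob(prompt);
  }

  async cancel() {
    // Optional chaining turns a stale cancel after unload() into a no-op.
    await this.addon?.cancel?.();
  }

  async unload() {
    if (!this.addon) return;
    await this.addon.unload();
    this.addon = null; // makes the guard in run() effective post-unload
  }
}
```

Resetting `this.addon` to `null` in both the failure path and `unload()` means one guard covers "never loaded", "load failed", and "already unloaded".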

Drops dead `_isSuppressedNoResponseLog` /
`_createFilteredLogger` / `_originalLogger` plumbing. Those
existed to swallow `'No response found for job'` warnings emitted
by the old `BaseInference._jobToResponse` Map; the new
`createJobHandler`-based architecture cannot emit that message,
so the filter, the wrapped logger, and the `_originalLogger`
indirection are all gone. The user-supplied logger is now used
directly.

Restores JSDoc on every `FinetuneOptions` field in `index.d.ts`,
including default values (`numberOfEpochs = 1`,
`learningRate = 1e-4`, `batchSize = 128`, …) so IDE tooltips show
them without needing to read `docs/finetuning.md`.

* refactor: move LLM C++ event normalization into addon.js

Per the team-2 task doc (`TD-ADDON-INTERFACE-LLM-EMBED-SD.md`,
LLM section): "Move event name normalization from `index.js`
`_addonOutputCallback` into `addon.js` `LlamaInterface` — the
native binding wrapper should own the mapping from raw C++ events
to Output / Error / JobEnded / FinetuneProgress."

Adds `mapAddonEvent(rawEvent, data, error, state)` as a free
export from `addon.js`, co-located with `LlamaInterface`. The
function normalizes the C++-mangled event vocabulary into one of
`Output` / `Error` / `JobEnded` / `FinetuneProgress`, including:

- TPS-shaped runtime stats → JobEnded with `backendDevice`
  mapped from `0/1` to `'cpu'/'gpu'`.
- Finetune terminal payloads (`{op:'finetune', status, stats?}`)
  → JobEnded carrying the finetune payload, and arms the skip
  flag so the trailing TPS stats from the finetune are not
  dispatched as a fresh inference terminal.
- `finetune_progress` payloads → FinetuneProgress.
- Anything else with an `Error`-flavored event name → Error.
- String payloads → Output.
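
A rough sketch of such a mapping, to make the ordering and the skip flag concrete. The detection rules are assumptions (a numeric `TPS` field marks runtime stats, `type: 'finetune_progress'` marks progress payloads, an event name containing `Error` marks errors), as is the pass-through default; the real `mapAddonEvent` may detect these differently.

```javascript
// Illustrative only: payload detection rules below are assumptions.
function mapAddonEventSketch(rawEvent, data, error, state) {
  const isObject = data !== null && typeof data === 'object';

  if (isObject && data.op === 'finetune' && typeof data.status === 'string') {
    // Finetune terminal: arm the flag so the trailing stats are dropped.
    state.skipNextRuntimeStats = true;
    return { type: 'JobEnded', data };
  }
  if (isObject && data.type === 'finetune_progress') {
    return { type: 'FinetuneProgress', data };
  }
  if (isObject && typeof data.TPS === 'number') {
    if (state.skipNextRuntimeStats) {
      state.skipNextRuntimeStats = false;
      return null; // not a fresh inference terminal
    }
    const backendDevice = data.backendDevice === 1 ? 'gpu' : 'cpu';
    return { type: 'JobEnded', data: { ...data, backendDevice } };
  }
  if (String(rawEvent).includes('Error')) {
    return { type: 'Error', data, error };
  }
  if (typeof data === 'string') {
    return { type: 'Output', data };
  }
  return { type: rawEvent, data }; // assumed default: pass through unchanged
}
```

Passing `state` in as an argument keeps the function free of instance state, which is what lets it be unit-tested without constructing a model.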

`LlmLlamacpp._addonOutputCallback` becomes a thin shim that
imports `mapAddonEvent`, hands it the per-instance state object
(now `this._addonEventState = { skipNextRuntimeStats }` instead
of the bare `_skipNextRuntimeStats` field), and forwards the
mapped event to `_handleAddonOutputEvent`.

Stateful flag lives on the model so unit tests can still poke at
it via `model._addonEventState.skipNextRuntimeStats`. Updated all
9 references in `test/unit/finetuning.test.js`. All 31 unit
tests still pass; lint and dts checks clean.

Also fixes the misleading JSDoc on `LlamaInterface.loadWeights`:
the native binding reads the JS property name `chunk` (verified
in `qvac-lib-inference-addon-cpp/JsBlobsStream.hpp::appendBlob`,
lines 41–42 and 66–67), not `contents`. The C++ local variable
is named `contents`, which is what the proposal text was
referencing — but the on-the-wire JS property name is `chunk`
and the JS layer call sites are correct.

* fix: address PR #1494 second-round review findings

1. `test/integration/http-loader.js` no longer extends
   `@qvac/dl-base`. The base class was only providing a `close()`
   shim around `_close()`, and the package's devDependencies no
   longer list `@qvac/dl-base` after the loader-removal refactor.
   The helper now stands on its own — `getStream()` and `close()`
   are the only methods the sharded model-loading test calls, so
   the rest of the BaseDL surface (including the unused
   `getFileSize` and `list`) is dropped. Removes the dangling
   require that would break a clean install of this package and
   block the sharded test in CI.

2. `examples/multiModal.js` no longer passes `content: imageFilePath`
   on the second `media` message. The native binding only accepts
   `Uint8Array` payloads on `media` messages — file paths were
   silently broken after the loader removal. The example now
   reuses the same `imageBuffer` for both inferences and uses a
   different prompt on the second one to keep the example
   pedagogically distinct.

3. `index.d.ts` `AddonMessage` now exposes the optional
   `generationParams?: GenerationParams` field. The runtime path
   in `LlmLlamacpp._runInternal` already serializes this field
   onto every text message it forwards through `addon.runJob`,
   but the published transport type omitted it — IDE consumers
   building their own message-shaped payloads would lose the
   per-call overrides. The field documents that it is forwarded
   from `RunOptions.generationParams` and is the canonical way
   to vary sampling per request without re-loading the model.

* fix: extract pickPrimaryGgufPath, restore multiModal example, fix docs

- Extract shard-picker logic into named pickPrimaryGgufPath() with unit
  tests documenting the contract (tensors.txt-first ordering, single-file
  fallback). Move SHARD_REGEX inside the function.
- Revert multiModal.js to original: first inference uses Uint8Array,
  second uses string path. Both C++ code paths work. Remove false comment
  claiming file paths are not supported.
- Restore stripped JSDoc on FinetuneValidationSplit.fraction and
  FinetuneValidationDataset.path in index.d.ts.
- Fix docs/architecture.md and docs/data-flows-detailed.md: 4 occurrences
  incorrectly said "last" shard is the primary path; actual code picks
  the first shard regex match.
- Hardcode shard filenames in model-loading integration test instead of
  generating them via regex.
- Add network streaming capability loss note to CHANGELOG.
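
The shard-picking contract from the first bullet can be sketched as below. The regex assumes llama.cpp's split naming (`<name>-00001-of-00005.gguf`), and falling back to the first entry for a non-sharded model is also an assumption; the real `pickPrimaryGgufPath` and its `SHARD_REGEX` may differ.

```javascript
// Sketch only: regex and fallback are assumptions, not the shipped function.
function pickPrimaryGgufPathSketch(modelFiles) {
  // Kept inside the function, as the commit describes for SHARD_REGEX.
  const SHARD_REGEX = /-\d{5}-of-\d{5}\.gguf$/;
  const firstShard = modelFiles.find((file) => SHARD_REGEX.test(file));
  // Non-sharded models have no match; use the single listed file.
  return firstShard ?? modelFiles[0];
}

// Sharded order: tensors.txt first, shards second.
const sharded = [
  '/models/big.tensors.txt',
  '/models/big-00001-of-00002.gguf',
  '/models/big-00002-of-00002.gguf'
];
const primary = pickPrimaryGgufPathSketch(sharded);
```

Selecting by pattern rather than by position is what removes the dependency on the `[length - 1]` index.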

* fix: correct version in architecture.md and remove stale dl-filesystem benchmark dep

- docs/architecture.md header: v0.14.3 → v0.15.0 to match package.json
- benchmarks/performance/package.json: remove @qvac/dl-filesystem (no
  longer used after FilesystemDL references were removed from all
  benchmark JS files)

* fix: align _hasActiveResponse clearing with embed pattern

Remove the synchronous clear in _handleAddonOutputEvent on JobEnded/Error.
The .finally() on response.await() already clears the flag when the response
promise settles, and exclusiveRunQueue serializes _runInternal so the next
call cannot race the current one. Matches the embed addon's pattern, where
.finally() is the sole clear path outside of unload().

* fix: throw on second load(), log rejected responses, add mapAddonEvent unit test

- load(): throw if already loaded. Caller must unload() first. Aligns
  with the team consensus (Yury/Gianfranco/Gustavo) — silent reload
  masks caller bugs. unload() already clears configLoaded.
- _runInternal / finetune: replace silent `finalized.catch(() => {})`
  with a warn-level log so rejected responses are not swallowed when
  the caller does not await.
- test/unit/map-addon-event.test.js: new unit test covering TPS stats
  mapping + backendDevice translation, skipNextRuntimeStats dropping,
  finetune terminal + skip-flag arming, finetune_progress, Error event,
  string-as-token Output, and default fall-through.
- CHANGELOG 0.15.0: document the load() throw.

* fix: restore JSDoc on run() that was dropped during BaseInference removal

The JSDoc documenting run()'s prompt and runOptions parameters was
accidentally removed during the BaseInference removal refactor when
run() was split into run() + _runInternal(). Restore it on the public
run() method, and reference the full RunOptions type (which already
documents prefill / generationParams / cacheKey / saveCacheToDisk in
index.d.ts) so the docs stay authoritative in one place.

* fix: migrate afriquegemma-edge-cases test to new addon constructor

The afriquegemma-edge-cases.test.js file came in via the upstream/main
merge but still used the pre-refactor constructor shape:
  new LlmLlamacpp({ loader, modelName, diskPath, ... }, config)
with a FilesystemDL loader. All 7 tests in the file are now migrated to:
  new LlmLlamacpp({ files: { model: [path.join(dirPath, modelName)] },
                    config, logger, opts })
Removed FilesystemDL import and all loader.close() calls. Added
isMobile skip flag matching the pattern in afriquegemma-translation.

Caught by the qvac-staff-code-reviewer agent as a "merge brought in a
new consumer of the old API" — restore-the-class issue across the family.

* fix: make load() idempotent when already loaded

Second load() on an already-loaded instance returns immediately instead
of throwing. Matches the ReadyResource pattern used elsewhere in QVAC:
open/load is idempotent; explicit unload() is required to swap weights.

CHANGELOG updated.

* test: regenerate mobile integration auto.cjs

Integration test files were touched during the refactor and the
generated mobile harness was not regenerated. `npm run test:mobile:generate`
output committed so `validate-mobile-tests.js` passes.

* doc: document missing breaking changes from BaseInference removal

Address feedback to report all breaking changes from the BaseInference
refactor, not just the constructor shape:

- getState() narrows from {configLoaded, weightsLoaded, destroyed}
  to {configLoaded} only
- LlmLlamacpp public methods removed: downloadWeights, unpause, stop,
  status, destroy, getApiDefinition (destroy was already mentioned;
  other five were missing)
- load() takes no arguments (was (closeLoader, onDownloadProgress))
- Type exports removed from index.d.ts: ReportProgressCallback,
  Loader, DownloadWeightsOptions, DownloadResult

Also fix the stale (0.15.0) version marker in the AFTER code block.

* fix: address lifecycle, validation, and CI-surface review findings

- load() now runs through `this._run()` so concurrent calls on the same
  instance serialize instead of racing past the `configLoaded` guard.
  Two overlapping loads could previously both allocate a native addon
  and clobber `this.addon`, leaking one native handle.
- Constructor now validates each `files.model` entry with
  `path.isAbsolute()` and applies the same check to the optional
  `files.projectionModel` (which previously had no validation at all).
  Relative paths are rejected at construction time instead of bubbling
  up from bare-fs / native load.
- `pickPrimaryGgufPath` is now declared in `index.d.ts` so the TS
  surface matches the CommonJS export at `index.js`.
- Add `test:unit` and `test:unit:generate` scripts that run the JS
  unit tests under `test/unit/*.test.js` via brittle + bare. Wire
  `test:unit` into `test:all` and into the PR workflow's ts-checks
  job so `map-addon-event.test.js`, `pick-primary-gguf-path.test.js`,
  and the pre-existing `finetuning.test.js` all run on every PR.
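
The constructor-time path checks from the second bullet could look like this. `path.isAbsolute()` is the Node-style API named above; the function name `validateFiles` and the error messages are hypothetical.

```javascript
const path = require('node:path');

// Hypothetical validator; messages and name are illustrative.
function validateFiles(files) {
  if (!files || !Array.isArray(files.model) || files.model.length === 0) {
    throw new TypeError('files.model must be a non-empty array of absolute paths');
  }
  for (const entry of files.model) {
    if (typeof entry !== 'string' || !path.isAbsolute(entry)) {
      throw new TypeError(`files.model entry is not an absolute path: ${entry}`);
    }
  }
  const projection = files.projectionModel;
  if (projection !== undefined &&
      (typeof projection !== 'string' || !path.isAbsolute(projection))) {
    throw new TypeError(`files.projectionModel is not an absolute path: ${projection}`);
  }
}

// Usage: resolve relative locations before constructing the model.
validateFiles({ model: [path.resolve('models', 'model.gguf')] });
```

Failing at construction time gives the caller an error that names the offending entry instead of a later filesystem failure.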

* doc: add CHANGELOG entries for load() serialization and absolute-path validation

* fix[ci]: run test:unit via run-lint-and-unit-tests action

Replace my hand-rolled test:unit step (which invoked `bare` in a job
that never installs it) with the existing run-lint-and-unit-tests
external action. Same pattern qvac-lib-infer-onnx and ocr-onnx already
use. The action installs bare globally and runs
`npm run test:unit --if-present`.

Also chain test:unit into the `test` script for local dev convenience,
matching the standalone-repo precedent (qvac-lib-inference-addon-base,
qvac-lib-dl-filesystem, etc.).

* doc: fix mermaid parsing errors in architecture.md and finetuning.md

architecture.md:159 — mermaid classDiagram uses { } as class-body
delimiters; the inline destructured-object syntax in the constructor
signature broke parsing. Replace with the canonical named type
`LlmLlamacppArgs` from index.d.ts so the class diagram renders.

finetuning.md:251 — sequence-diagram message contained `(_run)` and
`_hasActiveResponse` where the leading underscore was being
interpreted as mermaid italic-open, and slashes in
`validationSplit/useEvalDatasetForValidation/evalDatasetPath` made
the message ambiguous. Reword to use prose-style commas and drop the
leading-underscore identifiers.

Reported by maxim-smotrov.

* chore[ci]: rename step to reflect what the action actually runs

The run-lint-and-unit-tests action runs `npm run lint` and
`npm run test:unit` (and installs bare in between). The step name
"Run JavaScript tests" hides the lint half. Rename to
"Run lint and unit tests" and update the step id accordingly.

* fix: readme, finetune lifecycle, multimodal type

README quickstart, sharded, and OCR examples now use `path.resolve('./models')`
so the resulting `files.model` entries and `files.projectionModel` are
absolute. The refactored constructor rejects relative paths, which meant
the README snippets threw `TypeError` when copied verbatim.

`finetune()` moves the `!this.addon` readiness check and the
`_checkpointSaveDir` assignment inside the `this._run(...)` closure,
matching the pattern `run()` uses via `_runInternal`. If `unload()` is
already queued ahead of `finetune()`, the guard now runs after
`unload()` nulls `this.addon` instead of before, so the caller gets the
intended "Call load() first." error rather than a null-dereference
crash inside the queued body.
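
Why the guard has to live inside the queued closure can be shown with a small stand-in queue. `createRunQueue` and `QueuedModel` are hypothetical; they only imitate the serialization behaviour described here and are not the `exclusiveRunQueue` utility.

```javascript
// Hypothetical serial queue: each task starts only after the previous one settles.
function createRunQueue() {
  let tail = Promise.resolve();
  return function run(task) {
    const result = tail.then(() => task());
    tail = result.catch(() => {}); // keep the chain alive after a rejection
    return result;
  };
}

class QueuedModel {
  constructor() {
    this._run = createRunQueue();
    this.addon = { name: 'native-handle' }; // pretend load() already ran
  }

  unload() {
    return this._run(async () => { this.addon = null; });
  }

  finetune() {
    return this._run(async () => {
      // Checked when the task actually runs, i.e. after any queued unload().
      if (!this.addon) throw new Error('Addon not initialized. Call load() first.');
      return 'finetune started';
    });
  }
}
```

If the check ran before `this._run(...)`, a `finetune()` issued right after `unload()` would pass the guard while the addon still existed and then fail later inside the queued body.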

`UserMediaMessage.content` widens from `Uint8Array` to `Uint8Array | string`.
The C++ layer has always accepted both (raw bytes go through `parseMedia`;
string paths go through `loadMedia` in LlamaModel.cpp), and the OCR /
multimodal examples exercise the string-path form. The d.ts was
inadvertently narrower than the runtime contract.

* fix: preserve LogMsg event name in mapAddonEvent

Native `JsLogMsgOutputHandler` emits log events whose payload is a
plain string (`js::String::create(env, logMsg)`). The old mapping had
a generic `typeof rawData === 'string'` fallback that remapped every
string-payload event to `Output`, so any native LogMsg was quietly
pushed into the job output stream instead of the logger. The
`_handleAddonOutputEvent` branch that routes `LogMsg` to
`this.logger.info()` was therefore unreachable.

Check the `LogMsg` event name before the string-to-Output fallback so
log messages keep their type and reach the logger. Add a unit test
covering the precedence.

* doc: restore class JSDoc, method JSDoc, and media-separation comments

Restore documentation that the refactor dropped but whose content is
still accurate against the refactored code:

- Class-level JSDoc on LlmLlamacpp describing what the class does.
- Short JSDoc on pause(), cancel(), and unload() explaining each method's
  purpose, including how pause() saves a resumable checkpoint and how
  cancel() wipes it so the next finetune() starts fresh.
- Inline comments in _runInternal explaining the media/text separation:
  binary blobs go into promptMessages as type: 'media' entries in order,
  then the JSON text payload carries empty-content placeholders for each
  media item so tokenization can align.

* doc: shorten pickPrimaryGgufPath JSDoc in d.ts to a single line

Declaration-file JSDoc surfaces in IDE hover tooltips, so multi-paragraph
prose is noise. Trim to a one-liner covering the only behavior the type
hover needs to convey. The "exported for unit testing" rationale is
dropped since consumers do not need it on the type surface.

* doc: trim verbose comments added during the refactor

Tighten comments this PR introduced that drifted into over-explanation.
Leave pre-existing comments as-is.

- addon.js mapAddonEvent JSDoc: drop the multi-paragraph prose about
  C++ event naming and stateful ordering; keep the one-sentence
  contract plus the param block.
- index.js pickPrimaryGgufPath JSDoc: replace the multi-paragraph
  explanation of the caller's shard-list contract with a single-line
  summary citing the C++ regex contract.
- index.js class header on LlmLlamacpp: reduce to a single purpose line.
- index.js constructor block: shorten the lazy-deref rationale and the
  _addonEventState comment to one line each.
- index.js _addonOutputCallback: reduce the three-line comment
  pointing at addon.js to a single line. The detailed rationale is
  already in addon.js mapAddonEvent JSDoc.
- index.js media-separation comment: restore the one-line wording that
  already existed on main; earlier revision expanded it into three
  lines unnecessarily.

* doc: drop narration comment on _addonOutputCallback

The comment said "Event-name normalization lives in addon.js
(mapAddonEvent)", but the very next line imports and calls
mapAddonEvent — the code already tells the reader where event mapping
lives. Remove the line so the code speaks for itself.

* doc: restore FinetuneOptions JSDoc to pre-refactor forms

The refactor commit unintentionally rephrased FinetuneOptions JSDoc
lines that the refactor itself did not change. Revert those fields back
to main's original wording so the diff only carries structural changes
tied to the interface migration.

* doc: restore pre-refactor load/createAddon logs and JSDoc

The refactor commit silently dropped the _load() progress logs ('Creating
addon with configuration', 'Activating addon'), the 'Error during model
load' error log, and the JSDoc block on _createAddon(). Put them back so
the refactor only changes what needs to change.

* chore: drop unused 'test' script, inline into 'test:all'

The 'test' alias was only consumed by 'test:all', and neither was
referenced in CI workflows or the README. 'test:all' ran test:unit
twice because it called both test:unit and the 'test' alias. Remove
'test' and rewrite 'test:all' to run test:unit, test:integration, and
test:cpp directly.

* doc: correct pre-refactor constructor marker to <= 0.15.x

0.15.x still used the old (args, config) constructor shape; the old
example applies to any 0.15.x caller, not just 0.14.x. Align the
CHANGELOG marker with the PR body.

* test: run AfriqueGemma tests on mobile, matching main

The backmerge of upstream/main carried a stale 'skip: isMobile' from
the pre-refactor translation test into the six new translation tests
and the edge-cases migration. Main's c1cc8c0 deliberately dropped
the mobile skip; restore that intent. The isMobile constant is
unused after this change, so it is dropped as well.

* doc, test: fix _createAddon JSDoc and cover string-path media content

_createAddon() JSDoc referenced 'configurationParams.settings' and
omitted 'projectionPath'. The actual shape built in _load() is
{ path, projectionPath, config }; align the JSDoc with that.

UserMediaMessage.content was widened to Uint8Array | string earlier in
this PR, but no integration test exercised the string-path branch.
Add one elephant-image test that passes the absolute path as
message content, exercising the loadMedia(string) path through the
JS-to-C++ handoff.

* build: promote @qvac/logging to runtime dependency

index.js calls require('@qvac/logging') at runtime, so it belongs under
dependencies, not devDependencies. Previously it worked only because
another runtime dep pulled it in transitively — fragile for publish
and can break under stricter package managers.

* doc: finish finetuning.md mermaid fix

Previous commit 979a070 reworded only my own addition (line 251) but
the block still failed at the same position because the surrounding
pre-existing message bodies still used ; as a statement separator.
Mermaid sequenceDiagram parses ; as end-of-statement, so every message
containing it broke the diagram.

Replace ; with , or a separator word across all four affected lines
(block #1 lines 251, 256, 266 and block #2 line 296) so the finetune
and pause flow diagrams render on GitHub.

* fix: move addon construction into crash-safe try block

_createAddon() was outside the try so a synchronous throw in
require('./binding') or binding.createInstance() would leave
this.addon set to a partial native handle and never reach the
cleanup path. Route addon construction through the same try the
shard-streaming and activate() calls use.
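
A minimal sketch of the pattern described above, not the actual `_load()`
code. The `Loader` class, the injected `createAddon` factory, and the
`destroy()` cleanup hook are hypothetical names used only for illustration:

```javascript
// Hypothetical loader: only `addon` and `activate()` are named in the commit.
class Loader {
  constructor (createAddon) {
    this._createAddon = createAddon
    this.addon = null
  }

  async load () {
    try {
      // Construction sits inside the try, so a synchronous throw here
      // reaches the same cleanup path as a failed activate().
      this.addon = this._createAddon()
      await this.addon.activate()
    } catch (err) {
      await this._cleanup()
      throw err
    }
  }

  async _cleanup () {
    const addon = this.addon
    this.addon = null
    // destroy() is an assumed cleanup hook on the native handle.
    if (addon && typeof addon.destroy === 'function') await addon.destroy()
  }
}
```

With this shape, a caller that sees `load()` reject can rely on
`this.addon` being `null` rather than a half-built handle.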

---------

Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Proletter pushed a commit that referenced this pull request May 24, 2026
…#1983)

* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)

New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed
by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg).  API-compatible
with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream
consumers can swap backends without touching orchestration code.

## Scope

* First iteration.  Supports Chatterbox **English** only.  Chatterbox
  multilingual, LavaSR enhancer, Supertonic engine, and streaming are
  out of scope and remain in `@qvac/tts-onnx`.  They'll land alongside
  the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg
  registry (`ports/tts-cpp`, baseline `2026-04-21`).  No ONNX Runtime
  dependency.

## JS surface

* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as
  `ONNXTTS`:  `run` / `runStream` / `runStreaming` / `reload` /
  `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` +
  `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` /
  `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`,
  `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for
  the upcoming streaming flags (`streamChunkTokens`,
  `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`,
  `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim
  from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000**
  to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both
  packages are loaded in the same Bare process.
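
The `files` resolution described above could look roughly like the
sketch below. The two default file names come from this commit message;
the function name `resolveChatterboxFiles`, its return shape, and the
error text are assumptions, not the package's actual API:

```javascript
const path = require('path')

// Default GGUF names as listed in the commit message.
const DEFAULT_T3 = 'chatterbox-t3-turbo.gguf'
const DEFAULT_S3GEN = 'chatterbox-s3gen.gguf'

// Hypothetical helper: explicit paths win, otherwise derive from modelDir.
function resolveChatterboxFiles (files = {}) {
  const { modelDir, t3Model, s3genModel } = files
  const fromDir = (name) => (modelDir ? path.join(modelDir, name) : undefined)
  const resolved = {
    t3Model: t3Model || fromDir(DEFAULT_T3),
    s3genModel: s3genModel || fromDir(DEFAULT_S3GEN)
  }
  if (!resolved.t3Model || !resolved.s3genModel) {
    throw new Error('files.modelDir or both explicit GGUF paths are required')
  }
  return resolved
}
```

For example, `resolveChatterboxFiles({ modelDir: './models' })` derives
both paths, while passing `t3Model` alongside `modelDir` overrides only
that one file.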

## Native addon

* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` —
  `IModel` + `IModelCancel` implementation.  First-iteration strategy:
  assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output
  path, call it synchronously, then parse the resulting 16-bit mono
  PCM wav back into `std::vector<int16_t>` for the JS handler.
  Consequences: every job re-loads the model (~700 ms + inference
  time), no mid-synthesis cancellation, no streaming.  The follow-up
  milestone replaces this with a persistent, struct-based API once
  qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++
  config bridging (same string-map pattern as `@qvac/tts-onnx`) and the
  `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing
  `createInstance` / `runJob` / `reload` / `activate` / `cancel` /
  `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob`
  / `reload` wrappers that register a `JsAudioOutputHandler` emitting
  `{ outputArray: Int16Array, sampleRate: number }` to JS.

## Build / registry

* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)`
  and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape
  matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough)
  plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg.
  NOTE: the baseline pin here is inherited from
  `@qvac/transcription-whispercpp` and **must be bumped** to a commit
  that contains the `tts-cpp` port once that registry PR lands.  A
  follow-up commit will update it.

## Tests & examples

* Integration + unit test files for Chatterbox English are copied
  verbatim from `@qvac/tts-onnx` with only mechanical renames
  (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`,
  `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`).  Some
  paths in `test/integration/addon.test.js` still import Supertonic /
  LavaSR helpers that don't exist in this package — those test blocks
  will fail fast when the file loads, which is expected until those
  backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus
  shared `wav-helper.js` + `pcm-chunk-player.js`.

## What's not in this PR (known gaps)

* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes
  will land in a single documentation pass once the registry + fork
  commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a
  qvac-registry-vcpkg commit that ships `tts-cpp` (pending the
  registry PR).
* Actual `npm run build` requires the registry and fork commits to be
  on `main` of their respective upstream repos.

* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit

Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg
at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that
adds the `tts-cpp` port.  Paired with the `qvac-tts` library already
pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp
@ 0fe4a521618cc30358040b29d75d4261b31cbb60).

Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry
PR lands upstream.

* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper

Second pass over @qvac/tts-ggml after the build started passing: prune
everything that only made sense for the ONNX-era multi-engine scope and
adapt the remaining Chatterbox-English bits to the GGUF + file-path
reference-audio contract.  Restores `test/mobile/` so the Android build
has something to point at.

## C++

* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment
  contained `**/` which closed the block comment early and broke the
  build.  Rewrote as a `//` comment.

## Examples

* `examples/chatterbox-tts.js` — rewrite for v0 contract: single
  `<text>` argv, `files: { modelDir }` pointing at the two GGUFs,
  `referenceAudio` is now a wav **path** (addon passes it to
  `--reference-audio`) instead of a Float32Array.  Drops
  english/multilingual arg and the CHATTERBOX_VARIANT switch that
  picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` +
  `examples/pcm-chunk-player.js`.  The v0 addon re-loads the model
  per `run()` call — exposing streaming would mislead.  Both come
  back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it
  runs without extra args.

## Tests

### Kept as-is (engine-agnostic)

* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`

### Mechanical fixes

* `test/unit/tts.error.test.js` — fix error-code assertions to the
  tts-ggml range (`13001–14000`); was still checking the
  `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js` — fix stale
  `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the
  stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the
  non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine
  cleanup.

### Rewritten

* `test/unit/chatterbox.inference.test.js` — drop tests that asserted
  the old ONNX file shape (`tokenizer / speechEncoder / embedTokens /
  conditionalDecoder / languageModel`), the removed `engine` detection
  and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`).
  New tests cover: `modelDir` derives the two GGUF paths; explicit
  `t3Model` / `s3genModel` override the defaults.  The mocked-binding
  run/reload/cancel flow stays.
* `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English
  only.  Ensures the GGUFs are present, runs the short sentence set
  through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and
  (on darwin only) runs a whisper-based WER check via the existing
  `runWhisper` util.  Drops the Chatterbox-multilingual block + every
  Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract:
  `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a
  file path that falls back to `test/reference-audio/jfk.wav` (or the
  mobile test-asset when `global.assetPaths` is present).  No more
  WAV decode / resample on the JS side.
* `test/utils/downloadModel.js` — trim from 1007 LoC to 280.  Drops
  the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie
  downloaders.  Keeps the shared HTTP/curl infrastructure and
  `ensureWhisperModel` (still used by the integration WER check).
  `ensureChatterboxModels` is now **check-only**: it verifies
  `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally
  and, if missing, prints the exact commands for generating them
  from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts.
  Once the GGUFs land on a canonical HuggingFace repo we'll wire up
  download URLs here.

## Scripts

* `scripts/ensure-chatterbox.js` — simplify to a single invocation
  against `./models/`.  Drops the variant / language matrix that the
  ONNX downloader needed.
* `scripts/ensure-models.js` — now a thin alias to
  `ensure-chatterbox.js`.  Drops the Supertonic + LavaSR orchestration.

## Mobile

* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs,
  testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.

## Gitignore

* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp`
  (produced by the top-level `configure_file(...)` calls) and
  `build_*/` dirs (bare-make convention).

## Verified locally

* `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean.
* `npm run test:unit` — 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."`
  produces a 24 kHz wav as expected.

* Add streaming support

* Update ggml backend to use separate ggml repo

* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)

Upstream chatterbox.cpp renamed the package + namespace + target from
qvac-tts to tts-cpp and tightened the library boundary; pick up the
new artefacts here:

- find_package(qvac-tts-cpp CONFIG REQUIRED)
    -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts  -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions,
  SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h>
    -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside
  the code changes.

vcpkg wiring:
- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg
  commit bc30b0b (ports/tts-cpp renamed and repointed at
  chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that
  carries the rename + namespace + install(EXPORT) changes).

Verified with a cold bare-make generate + bare-make build against the
new port, and the addon's existing unit + integration test suites.

Made-with: Cursor

* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline

Picks up the round-3 review-fix wave landed on the tts-cpp port:

  e673182  scrub stale patches/ refs from README                (N10)
  8ba10a6  drop unreachable TTS_CPP_GGML_LIB_PREFIX block        (N8)
  4b5d2d7  mirror N1-N7 fixes from chatterbox.cpp source-of-truth
            - N1 supertonic alive-registry guard against freed-backend
              gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
            - N2 drop dead g_sink_* state, soften log_set docstring
            - N3 Turbo BPE try/catch (exception-safe Engine ctor)
            - N4 STFT cancel checkpoint + tighter Engine::cancel() doc
            - N5 document s3gen_preload/unload refcount semantics
            - N6 drop dead cached_text_lc Supertonic shim
            - N7 fix misleading "no copy" view-vs-copy log wording

Plus the integrated-port-only round-2 fixes that landed earlier:

  fa0d490  close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML
            now defaults ON; bundled-without-patches hard-errors at
            configure time with a pointer at the ggml-speech vcpkg
            port.
  ae34c58  README rewritten for integrated/vcpkg context.
  a2f2dd6  top-level qvac-ext-lib-whisper.cpp README points at the
            tts-cpp/ subtree (alongside parakeet-cpp/).

Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine /
EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is
backward-compatible: the new port adds Engine::backend_name(),
MTL-variant fields on EngineOptions (language / cfg_weight / min_p /
exaggeration), and a separate tts_cpp::supertonic::Engine class, but
nothing this consumer was already calling has changed.

Edits:

  packages/tts-ggml/vcpkg.json
    - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.

  packages/tts-ggml/vcpkg-configuration.json
    - default-registry baseline: bc30b0b (April 2026 fork-only state)
      -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.  The new baseline
      pulls in the post-tetherto-merge state (parakeet-cpp port at
      932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new
      tts-cpp port (16b91af) on the developer's GustavoA1604
      registry fork.

Smoke-test plan: after running `vcpkg install` against the new
baseline, the tts-cpp port's vcpkg_from_github resolves at
GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the
upstream PR merges.  ChatterboxModel should build and synthesize
identically; expanding to Multilingual + Supertonic flows is the
follow-up commit on the package side.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add chatterbox multilingual and supertonic

* Add mobile integration tests

* tts-ggml: drop clang-19 pin in linux-clang toolchain

The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary
names) since the package's first commit (0a2c978).  Linux CI hadn't
exercised this path before — the new on-pr-tts-ggml.yml -> integration
matrix is the first time it does, and it fails on every linux runner
(ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's
"detect_compiler" step because none of the GH-hosted images ship a
`clang-19` symlink:

  Detecting compiler hash for triplet x64-linux...
  error: while detecting compiler information:
  ...
  CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127
  (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE=
  .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...

Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/
toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so
each runner picks up its image's default clang (clang-15 on
ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship).
The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake
is honoured by every reasonable clang version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add C++ tests and coverage; fix linux build

* tts-ggml: address PR review feedback

Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.
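
The cancel-reset pattern from item #6 lives in C++, but its control flow
can be sketched in JavaScript for illustration. The class below is
hypothetical and single-threaded; the real models use their own
synchronization:

```javascript
// Illustrative rendering of the cancel-reset pattern; not the addon's code.
class CancellableModel {
  constructor (synthesizeImpl) {
    this._synthesizeImpl = synthesizeImpl
    this._cancelRequested = false
  }

  cancel () {
    this._cancelRequested = true
  }

  synthesize (text) {
    // Fast-fail if a cancel was requested for the current job.
    if (this._cancelRequested) throw new Error('cancelled')
    return this._synthesizeImpl(text)
  }

  process (text) {
    // Reset per call so a stale cancel does not poison the next run.
    this._cancelRequested = false
    return this.synthesize(text)
  }
}
```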

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM
  Int16Array on every chunk-streaming event; emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.
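
A sketch of what such a summary could look like. The field names are the
ones listed above; the helper name `summarizeTtsOutput` and the event
shape are assumptions rather than the actual `ttsOutputDebugString`
implementation:

```javascript
// Hypothetical helper: logs the PCM length instead of the samples themselves.
function summarizeTtsOutput (event) {
  const { outputArray, sampleRate, chunkIndex, isLast, sentenceChunk } = event
  return JSON.stringify({
    sampleRate,
    chunkIndex,
    isLast,
    sentenceChunk,
    outputArrayLen: outputArray ? outputArray.length : 0
  })
}
```

The log line stays a fixed small size no matter how many samples a chunk
carries.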

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: unblock CI integration tests on every desktop runner

Four independent failures, one per platform:

1. linux-x64 / linux-arm64: addon load crashed at
   `libomp.so.5: cannot open shared object file`.  tts-cpp's binary is
   built with clang under the linux-clang toolchain and links against
   libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being
   apt-installed.  Add `libomp5` so libomp.so.5 is on the loader path.

2. darwin-arm64: convert-models.sh aborted at line 200 with
   `hf_args[@]: unbound variable`.  macOS's system bash is 3.2, which
   under `set -u` treats expanding an empty array via `"${arr[@]}"` as
   an unbound-variable error; with HF_TOKEN unset we hit it on every
   fresh runner.  Use the `${arr[@]+"${arr[@]}"}` idiom
   (defined-or-nothing) at all six call sites and add a header comment
   so the next maintainer doesn't accidentally regress it.

3. darwin-x64: pip install failed while building `llvmlite` from source
   because the macos-15-large runner has no LLVM 15 development
   install.  Root cause: librosa pulls in numba 0.65+, which stopped
   shipping darwin-x86_64 wheels for Python 3.12.  Pin Python to 3.11
   in the Setup Python step; 3.11 has prebuilt wheels for the entire
   numba/llvmlite/librosa stack on darwin-x64 and is fine for every
   other converter dependency.

4. windows-2022: ChatterboxModel::load threw
   `vk::createInstance: ErrorIncompatibleDriver`.  Root cause: the
   addon's index.js::_validateConfig defaults `useGPU = true` when
   neither useGPU nor nGpuLayers is specified, so the test ran with
   n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance ->
   ErrorIncompatibleDriver on the runner's no-Vulkan-driver image.
   runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'`
   (set on the no-GPU matrix entries) and forces useGPU=false on
   exactly those runners; the other test runners (chatterbox-mtl,
   gpu-smoke, multiple-runs) already had this guard.
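
The guard from item 4 can be sketched as below. The `NO_GPU === 'true'`
check and the `useGPU` option come from the description above; the helper
name `applyNoGpuGuard` and the injectable `env` parameter are assumptions
added so the sketch is testable:

```javascript
// Hypothetical helper mirroring the NO_GPU guard used by the test runners.
function applyNoGpuGuard (options = {}, env = process.env) {
  if (env.NO_GPU === 'true') {
    // Force CPU so the default useGPU = true never reaches the Vulkan probe.
    return { ...options, useGPU: false }
  }
  return options
}
```

On runners without the variable set, the options pass through untouched
and the addon's own default still applies.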

Also documents the `mesa-vulkan-drivers` apt package (already pulled
in) as the software ICD that lets the Vulkan-built prebuild's runtime
backend probe enumerate at least one device on linux runners.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)

Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:

  SyntaxError: assets/testAssets/chatterbox-s3gen.gguf:
    Cannot create a string longer than 0x1fffffe8 characters

Root cause: Metro's bundler reads every asset under
`test/mobile/testAssets/` via `Buffer.toString()`.  V8's max string
length is 0x1fffffe8 (~512 MiB).  chatterbox-s3gen.gguf is ~1 GiB even
with --quant q4_0 because the s3gen converter only quantizes attention
weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight
tensors quantized" in the converter log).
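
A pre-bundle check along these lines could flag such assets early. This
is a sketch, not part of the PR: the helper names are hypothetical, and
the limit is the value quoted in the build error, which may differ on
other runtimes:

```javascript
const fs = require('fs')
const path = require('path')

// Limit quoted in the Metro error above; an assumption for other runtimes.
const V8_MAX_STRING_LENGTH = 0x1fffffe8

// A UTF-8 decode yields at most one UTF-16 code unit per input byte, so the
// byte size is an upper bound on the string length. Files above the limit
// are therefore at risk; files at or below it cannot overflow.
function mayExceedStringLimit (sizeBytes, limit = V8_MAX_STRING_LENGTH) {
  return sizeBytes > limit
}

// Hypothetical scan of an asset directory for files that are at risk.
function findOversizedAssets (dir, limit = V8_MAX_STRING_LENGTH) {
  return fs.readdirSync(dir).filter((name) =>
    mayExceedStringLimit(fs.statSync(path.join(dir, name)).size, limit))
}
```

With the sizes quoted in this commit, a ~1 GiB file is flagged and a
~125 MiB file is not.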

Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the
limit) on mobile.  Mobile Chatterbox tests degrade cleanly to
`t.pass('Skipped: Chatterbox GGUFs not available')` via the existing
`ensureChatterboxModels` helper -- it already returns
{ success: false } when the GGUFs aren't on disk.

Cache key bumped to v2 so existing v1 cache entries (which include
the chatterbox files) are evicted on the next run.

Bundling Chatterbox on mobile requires either:
  - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the
    JS-string read is skipped (then the s3gen file can flow through the
    bundle as a raw asset), or
  - pushing the chatterbox GGUFs to the device via `adb push` outside
    the bundle and surfacing the path through downloadModel.js's
    existing ANDROID_CANDIDATE_DIRS fallback.

Both are outside the scope of this PR; documented inline above the
cache step for the next maintainer.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump hash of vcpkg

* Consume vcpkg from tetherto repository

* Fix integration tests failures in all platforms

* Further fix tests

* fix: Make useGPU flag more meaningful (#1953)

* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts

* add gpu smoke test

* resolve comments

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

* Update dependencies after monorepo directory changes

* Further drop qvac-lib- prefix

* Add CHANGELOG.md

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
Proletter pushed a commit that referenced this pull request May 24, 2026
#2030)

* QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store

Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id},
GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL
store, X-QVAC-Stub: responses-volatile header, and startup banner.
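
For readers unfamiliar with the LRU+TTL combination, a minimal sketch
follows. It is not the store shipped in this commit: the class name,
default values, and the injectable `now` clock are assumptions
(`maxEntries` is the only option name taken from the PR discussion):

```javascript
// Minimal LRU+TTL store sketch; a Map keeps keys in insertion order.
class ResponseStore {
  constructor ({ maxEntries = 100, ttlMs = 60_000, now = Date.now } = {}) {
    this.maxEntries = maxEntries
    this.ttlMs = ttlMs
    this.now = now
    this.entries = new Map()
  }

  set (id, value) {
    this.entries.delete(id)
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs })
    // Evict the least recently used entries once over capacity.
    while (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value)
    }
  }

  get (id) {
    const entry = this.entries.get(id)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id) // expired: drop lazily on read
      return undefined
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(id)
    this.entries.set(id, entry)
    return entry.value
  }

  delete (id) {
    return this.entries.delete(id)
  }
}
```

Because everything lives in process memory, a restart loses all stored
responses, which is the volatility the stub header is meant to signal.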

* fix: align Responses streaming with finalized response and add usage stats

- Approach (b): always include the assistant `message` item in `response.output[0]`,
  even when tool calls are present, so the streamed item tree matches `response.completed`.
- Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and
  the finalized `output[]`, fixing client-side accumulation by `item_id`.
- Use distinct `output_index` per tool call (1..n) and set `item_id` on
  `response.function_call_arguments.delta`/`.done` to the function-call item id
  (was the OpenAI `call_id`, causing collisions and wrong wiring).
- Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can
  satisfy tool calls instead of hanging in `requires_action` with no payload.
- Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
- Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
- Recognise `function_call_output` (-> `tool` role) and `function_call`
  (-> synthesized assistant `<tool_call>` content) in
  `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained
  tool round-trips actually carry through `previous_response_id`.
- Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
- Surface real `usage.output_tokens` from `result.stats.generatedTokens`
  (Responses + chat.completions, blocking + streaming); fall back to word count
  when stats are missing. `input_tokens` stays 0 with an inline note that the SDK
  does not expose a prompt-token count today.
- Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.
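
The usage fallback described in the list above might be sketched as
follows. `stats.generatedTokens` is the field named in this commit; the
helper name `buildUsage` and the exact word-splitting rule are
assumptions:

```javascript
// Hypothetical helper: prefer real token stats, fall back to a word count.
function buildUsage (text, stats) {
  const hasStats = Boolean(stats) && Number.isFinite(stats.generatedTokens)
  const outputTokens = hasStats
    ? stats.generatedTokens
    : text.trim().split(/\s+/).filter(Boolean).length
  return {
    input_tokens: 0, // the SDK does not expose a prompt-token count today
    output_tokens: outputTokens
  }
}
```

The word count is only a rough stand-in for tokens, so clients should
treat `output_tokens` as approximate whenever stats are missing.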

Tests: extend `responses.test.ts` and `translate.test.ts`; add
`responses-streaming.test.ts` driving the new exported `writeStreamingResponse` /
`writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.

* test[skiplog]: stabilize Responses chain e2e for tiny reasoning model

Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M
has room for both its <think> block and the actual answer. The test
exercises previous_response_id chain wiring; it should not depend on
sampling luck or the model's reasoning length.

* fix: walk previous_response_id chain so multi-turn keeps grandparent history

Each StoredResponse.inputItems only carries that turn's NEW input
(`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of
depth >= 3 silently lost the grandparent turn:

  resp_1 input "A" -> output "X"        (stored: ["A"])
  resp_2 prev=resp_1 input "B"           history sent: [A, X, B]
                                         (stored: ["B"])
  resp_3 prev=resp_2 input "C"           history sent: [B, Y, C]  -- A and X gone

historyPrefixFromStoredResponse now walks the chain via
responseObject.previous_response_id when given a resolver, prepending
earlier turns oldest-first. Cap depth at 32 to bound work and protect
against pathological cycles. Routes pass `(id) => store.get(id)` as the
resolver. Legacy single-step callers still work unchanged when the
resolver is omitted.
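
A sketch of the chain walk, under assumptions: the stored-record shape
(`inputItems`, `outputItems`, `responseObject.previous_response_id`) and
the helper names are illustrative, and only the depth cap of 32 and the
optional resolver come from the description above:

```javascript
// Walk previous_response_id links, returning turns oldest-first.
function collectChain (stored, resolve, maxDepth = 32) {
  const chain = [stored]
  let current = stored
  // The depth cap bounds the work and also stops pathological cycles.
  while (resolve && chain.length < maxDepth) {
    const prevId = current.responseObject.previous_response_id
    if (!prevId) break
    const prev = resolve(prevId)
    if (!prev) break // evicted or unknown ancestor: stop here
    chain.unshift(prev)
    current = prev
  }
  return chain
}

// Each turn contributes its own new input followed by its output.
function historyPrefix (stored, resolve, maxDepth) {
  return collectChain(stored, resolve, maxDepth)
    .flatMap((turn) => [...turn.inputItems, ...turn.outputItems])
}
```

Omitting the resolver reproduces the legacy single-step behaviour, where
only the given response's own turn is returned.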

Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth
  cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the
  word and recovers it -- would silently fail before this fix.

* fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.
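
The `sentinel` knob from the first item can be sketched as below. This is
a plain-JavaScript approximation of the TypeScript helper, assuming `res`
only needs `write()` and `end()`:

```javascript
// Sketch of the opt-out: chat-completions keeps the default sentinel,
// Responses calls endSSE(res, { sentinel: false }).
function endSSE (res, { sentinel = true } = {}) {
  if (sentinel) res.write('data: [DONE]\n\n')
  res.end()
}
```

Defaulting `sentinel` to `true` keeps every existing caller unchanged, so
only the Responses route has to opt out.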

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.

* fix: address Simon review nits (stream error sentinel, input_items after cursor)

Two issues surfaced post-rebase:

1. sendError gained an opt-in { sseSentinel: false } so callers inside an
   active stream can suppress the trailing `data: [DONE]\n\n` after the
   `response.error` SSE event. Responses streaming error path now passes
   it, closing the gap that the happy path already handled (response.completed
   already used endSSE({ sentinel: false })).

2. GET /v1/responses/:id/input_items now reads the `after` cursor from the
   query string in addition to `limit`. Spec-compliant pagination would have
   re-fetched page 1 forever; the store already implemented the cursor.
   Added a store-level pagination test that walks all pages by `last_id`.
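
Cursor pagination by `after` and `last_id` can be sketched like this. The
function name, the default `limit`, and the choice to return an empty
page for an unknown cursor are assumptions; the list field names follow
the OpenAI list-object convention:

```javascript
// Hypothetical pager over an in-memory array of items with an `id` field.
function listInputItems (items, { after, limit = 20 } = {}) {
  let start = 0
  if (after !== undefined) {
    const idx = items.findIndex((item) => item.id === after)
    // Assumption: an unknown cursor yields an empty page rather than page 1.
    start = idx === -1 ? items.length : idx + 1
  }
  const data = items.slice(start, start + limit)
  return {
    object: 'list',
    data,
    first_id: data.length ? data[0].id : null,
    last_id: data.length ? data[data.length - 1].id : null,
    has_more: start + data.length < items.length
  }
}
```

A client walks all pages by passing each page's `last_id` as the next
`after` until `has_more` is `false`; ignoring `after` is what caused page
1 to be returned on every request.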
olyasir added a commit that referenced this pull request Jun 3, 2026
Only vla-ggml needs the GGML_BACKEND_DL build (Vulkan + HIP multi-backend).
Revert classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml and
translation-nmtcpp to their pre-PR registry baselines so they stay on the
cache-warm non-DL qvac-fabric (8828.0.2#1) and build fast.

Previously, pinning all six consumers to the DL fabric (#2) made every
linux-x64 prebuild cold-compile GGML_CPU_ALL_VARIANTS from source; under
runner contention those builds blew past the 6h job limit and were
cancelled. Scoping DL to vla-ggml leaves only one consumer cold-building
the DL fabric, and matches where the multi-backend GPU need actually is.
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request Jun 8, 2026
Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
  an unreadable lock), so a legitimate multi-minute cold start no longer loses
  its lock after 30s and spawns a duplicate runner/serve (tetherto#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
  a request racing close() can't silently re-add a consumer / spawn a runner (tetherto#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
  keep the record instead of dropping it — dropping stranded a live serve with
  no registry trace. We only reap once it answers as ours, or drop once its pid
  dies (tetherto#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
  reuse an auto-allocated serve on a different port, and distinct pins don't
  collide (tetherto#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
  reconnect, so diagnostics/external clients see the real serve after recovery (tetherto#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
  transparent if the SDK ever switches input shapes (tetherto#8).
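The spawn-lock rule in the first bullet above (steal only from a dead owner, with an mtime fallback when the lock is unreadable) could look roughly like the sketch below. The `LockInfo` shape, the `staleAfterMs` parameter, and both function names are hypothetical; only the decision logic mirrors the commit text.

```typescript
// Hypothetical lock shape: the owner's pid (null if the lock file could not be
// read or parsed) plus the lock file's last-modified time in milliseconds.
interface LockInfo { ownerPid: number | null; mtimeMs: number }

// Signal 0 checks for existence/permission without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it.
    return (err as { code?: string }).code === 'EPERM';
  }
}

function shouldStealLock(
  lock: LockInfo,
  nowMs: number,
  staleAfterMs: number,
  alive: (pid: number) => boolean = isPidAlive,
): boolean {
  if (lock.ownerPid !== null) {
    // A readable lock is stolen only when its owner is gone, however old it is.
    return !alive(lock.ownerPid);
  }
  // Unreadable lock: fall back to age, since there is no pid to probe.
  return nowMs - lock.mtimeMs > staleAfterMs;
}
```

The point of the design is that age alone no longer evicts a live owner, so a cold start that legitimately takes several minutes keeps its lock.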

Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
  liveness; document the long-lived-sentinel/wrapper pattern and fix the
  misleading "the script doesn't have to stay running" note (tetherto#2).
- Reconcile version wording: README/changelog now describe managed mode as
  unreleased (package is 0.1.0); docs-site integration page documents managed
  mode + the async overload (tetherto#7).

Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).
simon-iribarren added a commit that referenced this pull request Jun 10, 2026
* feat[api]: add managed mode to @qvac/ai-sdk-provider (QVAC-19900)

Add `mode: 'managed'` so the provider can synthesize an ephemeral
qvac.config.json from a model-constant list, spawn and supervise
`qvac serve` on a free port, and tear it down on host exit. External
mode is unchanged and stays synchronous; the managed supervisor is
lazily dynamic-imported so external-mode users pay no startup cost.

@qvac/cli becomes an optional peer dependency.

* fix: resolve @qvac/cli via main entry when its exports block package.json (QVAC-19900)

The published @qvac/cli ships a string `exports` field ("./dist/index.js"),
which makes the `./package.json` subpath non-resolvable
(ERR_PACKAGE_PATH_NOT_EXPORTED). Managed mode relied on resolving
`@qvac/cli/package.json` to locate the bin, so it would fail to find the CLI
on a clean install. Fall back to resolving the package main entry, which for
@qvac/cli is the same file as the `qvac` bin.

* doc: update ai-sdk provider agent setup after queue (QVAC-19900)

* QVAC-19900 feat[api]: per-model config for managed mode

Managed mode `models` now accepts spec objects ({ name, config, preload,
default }) alongside bare constant names, so callers can set per-model serve
options — notably `ctx_size` and `reasoning_budget` — that coding agents like
OpenCode require. The synthesized qvac.config.json carries the config block,
honors explicit `preload`/`default`, and validates names inside spec objects.

Exports the new `QvacManagedModel` type and documents per-model config plus a
managed-mode OpenCode example in the README.

* QVAC-19900 feat[api]: shared idle-reaped managed serve daemon

Rework managed mode from a per-provider supervisor into a shared,
self-cleaning serve daemon so it is robust standalone and usable by any
tool, not just a single session.

- Reuse via a fleet key (model set + per-model config + host) keyed in a
  cross-process registry under ~/.qvac/managed-serves/; createQvac attaches
  to a matching healthy serve instead of cold-starting a duplicate.
- A detached runner owns the qvac serve child and reaps it once no consumer
  process has been alive for serveIdleTimeout (default 5m). Liveness, not
  request traffic, is the signal, so it works for tools that hit baseURL
  directly (OpenCode/Cline/Aider).
- close() now detaches (deregisters the consumer) instead of killing; a
  shared serve survives until its last user is gone.
- Sweep only reaps dead/orphaned serves, never a healthy serve a live
  process owns (fixes a second session SIGKILLing a downloading serve).
- Respawn-on-failure: fetch re-resolves and retries once on ECONNREFUSED.
- reuse:false (or a pinned servePort) yields a private serve reaped as soon
  as its owner exits.
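One way to derive a fleet key from the model set, per-model config, and host is sketched below. The real key's fields, encoding, and length are not shown in this commit, so the `FleetInput` shape, the sorted-key serialization, the order-insensitive treatment of models, and the 16-character truncated SHA-256 are all assumptions.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical input shape for illustration only.
interface FleetInput {
  host: string;
  models: { name: string; config?: Record<string, unknown> }[];
}

// Serialize with sorted object keys so property order never changes the key.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return '[' + value.map(stableStringify).join(',') + ']';
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + stableStringify(obj[k])).join(',') + '}';
  }
  // JSON.stringify(undefined) yields undefined at runtime; map it to null.
  return JSON.stringify(value) ?? 'null';
}

function fleetKey(input: FleetInput): string {
  // Assumption: the model *set* matters, not the order the caller listed it in.
  const models = [...input.models]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map((m) => ({ name: m.name, config: m.config ?? {} }));
  return createHash('sha256')
    .update(stableStringify({ host: input.host, models }))
    .digest('hex')
    .slice(0, 16);
}
```

Two callers that ask for the same models with the same config on the same host get the same key and can attach to one serve; changing any per-model option yields a different key and therefore a separate serve.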

Refactor into serve-process.ts (spawn/health/stop), registry.ts,
fleet-key.ts, runner.ts; remove supervisor.ts and pid-tracker.ts. Add
reuse and serveIdleTimeout options. Rewrite tests and add reuse/idle-reap
end-to-end coverage; document the shared lifecycle in the README.

* QVAC-19900 fix: reject duplicate model names in managed mode

Each managed model maps to a single serve alias keyed by its name, so a
repeated name silently overwrote the earlier entry — and could drop its
`default: true`. Reject duplicates up front with DuplicateManagedModelError
instead of resolving them ambiguously. Addresses PR review feedback.
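A minimal sketch of the up-front duplicate check, assuming models arrive either as bare names or as spec objects with a `name` field. The `ManagedModel` type and `assertUniqueModelNames` are illustrative names, and the local error class only stands in for the package's `DuplicateManagedModelError`.

```typescript
// Hypothetical model entry: a bare constant name or a spec object.
type ManagedModel = string | { name: string; default?: boolean };

// Stand-in for the package's DuplicateManagedModelError.
class DuplicateModelError extends Error {}

// Throw on the first repeated name instead of letting a later entry
// silently overwrite an earlier one. Returns names in input order.
function assertUniqueModelNames(models: ManagedModel[]): string[] {
  const seen = new Set<string>();
  for (const m of models) {
    const name = typeof m === 'string' ? m : m.name;
    if (seen.has(name)) {
      throw new DuplicateModelError(`duplicate managed model name: ${name}`);
    }
    seen.add(name);
  }
  return [...seen];
}
```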

* QVAC-19900 fix[api]: address managed-mode self-review findings

- Per-instance consumer markers (<pid>.<rand>) so two providers in one
  process sharing a fleet key don't deregister each other on close (A).
- Restrict respawn retry to ECONNREFUSED so an in-flight completion is
  never blindly replayed on ECONNRESET/EPIPE (C).
- Health-check the recorded baseURL before SIGTERM-ing an orphaned serve,
  guarding against killing a recycled pid (D).
- Use dirname() instead of a posix-only regex for ephemeral config cleanup (E).
- Fold serveBinPath into the fleet key so distinct local builds don't share
  a serve (G).
- Export managed error classes + QvacManagedErrorCode for instanceof checks (H).
- Reject more than one explicit default: true (I).
- Deregister the consumer if resolveServe throws (F); drop dead
  firstConsumerPid runner param (J).
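The retry restriction in item (C) can be illustrated with the sketch below. It assumes the failure surfaces the way Node's built-in `fetch` reports connection errors, as an error whose `cause` carries a system error `code`; the helper names and the `respawn` callback are hypothetical.

```typescript
// Walk a short `cause` chain looking for a string system error code.
function errorCode(err: unknown): string | undefined {
  let cur: unknown = err;
  for (let depth = 0; depth < 5 && cur !== null && typeof cur === 'object'; depth++) {
    const code = (cur as { code?: unknown }).code;
    if (typeof code === 'string') return code;
    cur = (cur as { cause?: unknown }).cause;
  }
  return undefined;
}

// Only a refused connection proves the request never reached a serve.
// ECONNRESET/EPIPE can happen mid-request, so replaying could duplicate work.
function shouldRespawn(err: unknown): boolean {
  return errorCode(err) === 'ECONNREFUSED';
}

// Retry exactly once, and only after the (hypothetical) respawn step succeeds.
async function fetchWithRespawn<T>(
  attempt: () => Promise<T>,
  respawn: () => Promise<void>,
): Promise<T> {
  try {
    return await attempt();
  } catch (err) {
    if (!shouldRespawn(err)) throw err;
    await respawn();
    return attempt();
  }
}
```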

Tests: per-instance markers, health-gated orphan sweep (kills serving
orphan, spares non-serving stranger pid), fleet-key serveBinPath sensitivity,
multiple-default rejection. README updated.

* QVAC-19900 fix[api]: address managed-mode lifecycle review (round 2)

Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
  an unreadable lock), so a legitimate multi-minute cold start no longer loses
  its lock after 30s and spawns a duplicate runner/serve (#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
  a request racing close() can't silently re-add a consumer / spawn a runner (#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
  keep the record instead of dropping it — dropping stranded a live serve with
  no registry trace. We only reap once it answers as ours, or drop once its pid
  dies (#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
  reuse an auto-allocated serve on a different port, and distinct pins don't
  collide (#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
  reconnect, so diagnostics/external clients see the real serve after recovery (#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
  transparent if the SDK ever switches input shapes (#8).

Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
  liveness; document the long-lived-sentinel/wrapper pattern and fix the
  misleading "the script doesn't have to stay running" note (#2).
- Reconcile version wording: README/changelog now describe managed mode as
  unreleased (package is 0.1.0); docs-site integration page documents managed
  mode + the async overload (#7).

Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).

* docs: use canonical qvac.tether.io URL in ai-sdk-provider README

* QVAC-19900 feat[api]: public model catalog + catalog-id aliases in managed mode

Add `models.qvacCatalog`, a public models.dev-style catalog that maps
friendly ids (`qwen3.5-9b`) to the SDK constant the serve loads
(`QWEN3_5_9B_MULTIMODAL_Q4_K_M`), so the id a user picks from models.dev
resolves end-to-end with no translation layer in front of the serve.

Managed mode now accepts catalog ids as model names: the synthesized
serve config keys the alias by the friendly id while `model` resolves to
the underlying SDK constant, so the serve answers `qwen3.5-9b` directly.
Bare SDK constants keep working unchanged. A drift unit test fails CI if
any catalog constant disappears from the generated SDK catalog.

* QVAC-19900 feat[api]: process-group serve teardown + closeOnParentExit

Harden managed-mode lifecycle so a managed serve never leaks its `bare`
inference worker or outlives the process that owns it.

- Process-group teardown: spawn `qvac serve` detached (its own group) and,
  when stopServe must escalate past the grace window, SIGKILL the whole
  group. A plain SIGKILL of the serve pid never cascades to the grandchild
  bare worker, so previously a wedged serve orphaned the worker. The
  graceful SIGTERM is still sent to the serve process only, so a healthy
  serve orchestrates its own shutdown and releases the global worker lock
  (no stale lock left behind); the group SIGKILL is the wedged-path fallback.

- `closeOnParentExit` option: for a daemon-style host whose sole job is to
  keep a managed serve alive for a parent process (e.g. an editor/agent
  plugin). The provider watches its parent pid and, the moment the parent
  exits (on POSIX we are reparented to init, ppid → 1), closes itself —
  deregistering the consumer so the runner reaps the serve — and exits.
  Without it a hard-killed parent would leave a reparented host alive,
  keeping its consumer marker forever so the serve was never reaped.
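Both mechanisms above reduce to small POSIX primitives, sketched here under stated assumptions. The commit names `parentIsGone` but does not show its signature, so the two-argument form below is hypothetical, as is `killProcessGroup`. The sketch relies on the child having been spawned with `detached: true`, which on non-Windows platforms makes it the leader of a new process group, so its pid doubles as the group id.

```typescript
// Hypothetical signature: compare the ppid captured at startup with the current one.
// After the parent dies the process is reparented, so the value changes.
function parentIsGone(initialPpid: number, currentPpid: number): boolean {
  return currentPpid !== initialPpid;
}

// POSIX only: a negative pid targets the whole process group, so the serve
// and any grandchild worker in that group are signalled together.
// Returns false if the group no longer exists or cannot be signalled.
function killProcessGroup(groupLeaderPid: number): boolean {
  try {
    process.kill(-groupLeaderPid, 'SIGKILL');
    return true;
  } catch {
    return false;
  }
}
```

A host using this would poll `process.ppid` on an interval, call its own close path when `parentIsGone` returns true, and reserve `killProcessGroup` for the case where the graceful SIGTERM has not ended the serve within the grace window.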

Tests: a stubborn-grandchild fake serve proves group teardown reaps the
worker; `parentIsGone` unit-tests the parent-watch decision.

* QVAC-19900 fix: keep managed serve lifecycle correct under close() race and crash-respawn

- Undo the consumer re-registration when close() wins the race against an
  in-flight fetch retry: resolveServe re-adds the marker after close() removed
  it, which would keep the shared serve warm until the process exits.
- Preserve live consumer markers when sweepServes reaps a crashed/orphaned
  serve, so a respawned runner inherits the still-alive sessions instead of
  idle-reaping the fresh serve out from under them.
- docs: bump managed-mode ctx_size examples to 32768 for agent-sized prompts.

* QVAC-19900 fix: rename reresolve result to resolved for clarity in managed fetch

* QVAC-19900 mod: collapse redundant sync/async registry teardown helpers

removeConsumer/removeConsumerSync and removeRecord/removeRecordSync were a
confusing sync/async mirror: the async removeConsumer was only ever called right
after the sync one (a guaranteed no-op), and the removeRecord pair was really two
teardown semantics under near-identical names. Marker/record teardown is a single
unlink/rm, cheap enough to be synchronous everywhere — including process 'exit'
handlers where async can't run — so collapse each pair into one sync function.
No behaviour change; addresses review feedback on #2408.

* QVAC-19900 mod: trim verbose comments in managed registry

Tighten the sync-rationale comments on removeRecord/removeConsumer and drop a
stale, broken leftover comment above ensureDirSync. Keeps the non-obvious intent
(why sync, preserveConsumers semantics) without the narration.

* QVAC-19900 mod: drop unused DEFAULT_SERVE_BIN and ephemeralConfigName

Both were dead: DEFAULT_SERVE_BIN was never imported (serve-process spawns the
resolved CLI path verbatim) and ephemeralConfigName was an unused helper
(writeEphemeralConfig uses a fixed name inside an mkdtemp dir). Removing the
latter also drops the now-unused randomBytes import.
Zbig9000 added a commit to Zbig9000/qvac that referenced this pull request Jun 18, 2026
…r, registry-baseline note)

- README: correct the addon-resolution description -- loadAddon prefers the
  INSTALLED package (specifier) and falls back to the monorepo source tree, not
  the other way round; also drop the inaccurate "pairwise covering array" /
  "current tree" wording. (GustavoA1604)
- coload-smoke-ggml.yml: state explicitly in the header that the PR run
  validates the published REGISTRY BASELINE, not the PR's own diff (the PR's
  freshly-built change is guarded by the Phase-1 prebuild symbol gate). (ogad tetherto#2)
- verify-prebuild-symbols.mjs: drop the dead `bare_/napi_` filter in the export
  check -- it's unreachable: control only reaches it after isEngineSymbol matched
  an engine prefix, so a symbol can never also start with bare_/napi_. (ogad tetherto#4)
  Drop the trailing comma in the opts literal. (ogad trivia)
GustavoA1604 added a commit that referenced this pull request Jun 19, 2026
… + co-load harness) (#2548)

* infra[notask]: add multi-addon ggml co-load CI gates + harness

Catch the class of bug where native ggml addons pass single-addon CI but
crash when co-loaded into one process (the @qvac/tts-ggml@0.2.1 dlopen
SIGABRT) -- which only surfaced in SDK e2e, where the consumer loads ~10
ggml addons at bootstrap.

Phase 1 (cheap, deterministic; would have caught 0.2.1):
- scripts/verify-prebuild-symbols.mjs + .github/actions/verify-prebuild-symbols,
  wired into reusable-prebuilds.yml: fail a prebuild when an addon module has
  unresolvable engine symbols (ggml_/gguf_/llama_/whisper_/...) not provided by
  a co-located DT_NEEDED, and warn on leaked engine exports. No device needed.
- Replace the log-only Bare.on('unhandledRejection') swallow (the v0.2.1
  false-green) across 12 addon mobile runners with a beforeExit -> exit(1)
  handler so a dlopen failure can never report green.

Phase 2 (desktop co-load):
- packages/ggml-coload-smoke: require()s several @qvac ggml addons into ONE
  Bare process; combination-matrix generator (all / per-stack / cross-stack /
  per-PR changed-addon focus).
- coload-smoke-ggml.yml runs the matrix on ggml-addon PRs.

Phase 3 (mobile + export hygiene):
- coload-smoke-mobile-ggml.yml: verified-gated Android Device Farm co-load,
  reusing the SDK consumer bundle via a new optional `plugins` subset input on
  test-android-sdk.yml (no-op for normal SDK e2e).
- diffusion-cpp: add symbols.map + --exclude-libs,ALL (it shipped neither and
  leaked ggml_* exports) -- the export hygiene the Phase 1 gate enforces.

Co-authored-by: Cursor <cursoragent@cursor.com>

* infra[notask]: run desktop co-load on github-hosted runner (fork-PR safe)

The pull_request event does not pass secrets to fork PRs, so secrets.PAT_TOKEN
was empty and checkout failed. Released @Qvac addons resolve from public npm
without auth, so run the desktop co-load on ubuntu-latest with the default
token instead of the self-hosted qvac runner + PAT.

* fix[skiplog]: export bare_register_module_v0 on macOS for diffusion-cpp

The new APPLE symbol-export block only exported _bare_get_module_name_v0,
which hid _bare_register_module_v0 and napi_register_module_v1. On darwin
the addon then failed to load with `dlsym(napi_register_module_v1): symbol
not found` (SIGABRT, exit 134), breaking run-integration-tests/test-darwin-*.
Export both Bare entrypoints, matching tts-onnx / transcription-parakeet /
tts-ggml.

* infra[skiplog]: address review on verify-prebuild-symbols

- enginePrefixes: add `sd` / `stable_diffusion` so the code default matches the
  documented default (header line 38) and the gate actually checks
  diffusion-cpp's stable-diffusion engine symbols. The word-boundary matching
  (name === p || startsWith(p + '_')) keeps `sd` from matching unrelated roots
  such as `sdl_`.
- symbolsOf: strip the leading underscore from Mach-O symbols so isEngineSymbol
  (and the bare_*/napi_* export allowlist) match on darwin/ios -- previously
  both checks were a silent no-op for every Mach-O binary.
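The two fixes above combine as in this sketch. The matching expression is the one quoted in the commit; the prefix list is a partial, assumed subset of the script's defaults, and `normalizeSymbol` is an illustrative name for the Mach-O underscore stripping.

```typescript
// Partial, assumed prefix list; the script's full default set is not shown here.
const ENGINE_PREFIXES = ['ggml', 'gguf', 'llama', 'whisper', 'sd', 'stable_diffusion'];

// Mach-O symbol tables prepend an underscore to C symbol names; strip it so
// the same prefix checks work on darwin/ios as on ELF platforms.
function normalizeSymbol(name: string, isMachO: boolean): string {
  return isMachO && name.startsWith('_') ? name.slice(1) : name;
}

// Word-boundary match: the prefix must be the whole name or be followed by '_',
// which keeps `sd` from matching unrelated roots such as `sdl_`.
function isEngineSymbol(name: string, prefixes: string[] = ENGINE_PREFIXES): boolean {
  return prefixes.some((p) => name === p || name.startsWith(p + '_'));
}
```

Without the normalization step, a Mach-O symbol such as `_ggml_init` matches no prefix, which is why both checks were previously a no-op on Apple binaries.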

Thanks @GustavoA1604.

* infra[skiplog]: simplify diffusion-cpp export hygiene to --exclude-libs,ALL

Match the 9-addon majority -- every other ggml addon relies on
--exclude-libs,ALL alone -- instead of the parakeet/tts-onnx symbols.map +
Apple -exported_symbol pattern. --exclude-libs,ALL alone hides the statically
linked ggml on Linux, which is the real gap diffusion-cpp had (it was the only
addon missing it). Drops the redundant symbols.map (version-script) and the
Apple -exported_symbol block; the latter also makes c77f0c9 moot, since that
fix only patched the over-tight Apple export list this removes.

* infra[skiplog]: address review on verify-prebuild-symbols (Mach-O scope + soname note)

- UND hard-fail is now ELF-only (Linux/Android). On Mach-O an unresolved
  engine symbol is reported as a WARNING, since neededOf can't enumerate
  Mach-O DT_NEEDED providers and Apple addons statically link the engine
  today -- this avoids a false-positive hard-fail should an addon ever ship
  co-located engine .dylibs on macOS. (review #3)
- Document the exportsBySoname basename-vs-soname latent fragility: the index
  is keyed by on-disk filename but DT_NEEDED holds sonames; they coincide for
  qvac's backend libs today. (review #4)

* test[skiplog]: add CI self-test proving the prebuild gate catches tts-ggml@0.2.1

Runs verify-prebuild-symbols.mjs against the published @qvac/tts-ggml@0.2.1
(the build #2502 reverted) and asserts it hard-fails, plus the good 0.2.0 and
asserts it passes -- straight off the npm artifacts, public-npm only so it runs
on fork PRs. Gives a shareable, durable demonstration that the gate catches the
Android dlopen regression, and guards the gate from silently rotting.

* infra[skiplog]: harden mobile co-load fork-PR handling + gate readelf guard

- coload-smoke-mobile-ggml.yml: the Device Farm job needs base-repo secrets,
  which a pull_request from a fork cannot read. Gate the run to workflow_dispatch
  or a `verified` PR from an in-repo branch, and drop the now-unnecessary
  PAT_TOKEN from the combos-only checkout -- fork PRs would otherwise fail at
  checkout (empty token) and again at the Device Farm step. Mirrors the desktop
  co-load fork-PR fix; fork PRs are covered by the desktop co-load.
- verify-prebuild-symbols.mjs: fail loudly (tooling error, exit 2) if readelf is
  unavailable while ELF binaries are present, instead of silently skipping
  DT_NEEDED provider resolution and over-flagging genuinely-resolved engine
  imports as unresolved (a false-positive hard fail).

* infra[skiplog]: address round-2 review (README resolution, dead filter, registry-baseline note)

- README: correct the addon-resolution description -- loadAddon prefers the
  INSTALLED package (specifier) and falls back to the monorepo source tree, not
  the other way round; also drop the inaccurate "pairwise covering array" /
  "current tree" wording. (GustavoA1604)
- coload-smoke-ggml.yml: state explicitly in the header that the PR run
  validates the published REGISTRY BASELINE, not the PR's own diff (the PR's
  freshly-built change is guarded by the Phase-1 prebuild symbol gate). (ogad #2)
- verify-prebuild-symbols.mjs: drop the dead `bare_/napi_` filter in the export
  check -- it's unreachable: control only reaches it after isEngineSymbol matched
  an engine prefix, so a symbol can never also start with bare_/napi_. (ogad #4)
  Drop the trailing comma in the opts literal. (ogad trivia)

* infra[skiplog]: add fleet dry-run job to gate self-test (de-risk fleet-wide activation)

Addresses the blocking review item (ogad #1): the gate runs inside
reusable-prebuilds.yml, which all 14 prebuild-*.yml callers reference by local
path (uses: ./...), so the UND hard-fail activates for every ggml/onnx addon on
its next publish the moment this merges -- not gradually. The new fleet-dryrun
job packs every currently-published addon, extracts its android-arm64 prebuild,
runs the gate, and asserts 0 hard-fails, proving no shipping addon is blocked on
merge. Public npm only (no secrets); packs `latest` so it also tracks the live
fleet on every gate-touching PR.

* infra[skiplog]: make fleet dry-run informational + step-summary (it caught a live tts-ggml regression)

The fleet dry-run found a REAL hard-fail: @qvac/tts-ggml@0.3.0 (current npm
`latest`, published today) re-introduced the exact 0.2.1 Android UND regression
(ggml_backend_is_cpu / ggml_get_type_traits_cpu unresolved, no DT_NEEDED
provider). 0.2.4 is clean -> 0.3.0 is a regression. The other 11 ggml/onnx
addons pass.

That is the gate doing its job, but it's an addon bug (tracked separately), not
a defect in this PR -- so the dry-run is now informational: it no longer fails
the workflow, instead it writes a per-addon result table (with versions) to the
step summary and emits warning annotations. The enforcing gate stays per-addon
in reusable-prebuilds.yml, where it would correctly block tts-ggml's next
publish until the UND is fixed, leaving the other addons unaffected.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: GustavoA1604 <54457676+GustavoA1604@users.noreply.github.com>