Word-level Whisper STT: atWord() helper, dedupe deps, word-aligned showcase #16
Argo today knows scene boundaries (from narration.mark()) and per-clip
durations (from the wav header), but doesn't know what word is being
spoken at any given video time. Phoneme-level estimates from upstream
TTS engines drift, so the source of truth is the rendered audio
itself — transcribe it back with Whisper and read the per-word
timestamps.
Pattern borrowed from hyperframes' website-to-hyperframes pipeline,
which renders TTS then runs whisper STT on the result for the same
reason. Once Argo's renderComposition is the canonical bridge between
recordings and hyperframes-style scenes, exposing word-level timestamps
lets compositions sync to spoken content the same way hyperframes' own
renderer does.
Why now (downstream surfaces this unblocks):
* Composition sync: GSAP tweens fire on specific spoken words
* Karaoke / per-word captions in subtitles.ts
* Surgical narration edits in the preview UI
* Deterministic head-trim instead of estimated
Architecture — two files for two consumers:
.argo/<demo>/.scene-transcripts.json (private, scene-relative)
Read during recording so demo scripts can call
`narration.wordTiming('hero')` and schedule effects on words.
Timestamps start near 0 within each scene since placement
offsets aren't known until after alignment.
.argo/<demo>/narration.transcript.json (public, recording-absolute)
Documented artifact for post-pipeline consumers (subtitles,
compositions, preview). Timestamps are over the full mixed
`narration-aligned.wav` time base. Keyed by scene name with the
same shape as the scene-relative file:
    {
      "version": 1,
      "model": "Xenova/whisper-base.en",
      "scenes": {
        "hero": [{ "text": "Your", "start": 0.07, "end": 0.33 }, ...]
      }
    }
.argo/<demo>/clips/<sha>.<model>.transcript.json (per-clip cache)
Same SHA as the audio clip so a clip cache hit rides along, but
folds the Whisper model id into the filename so swapping models
doesn't bust the (much more expensive) audio cache.
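Both JSON artifacts share the word shape shown above. A consumer-side type plus the cache-naming rule looks roughly like this — a sketch only; the type names and the `clipTranscriptPath` helper (including how the model id is slugged) are illustrative, not the actual source:

```ts
import { join } from 'node:path';

// Shape of narration.transcript.json / .scene-transcripts.json as documented above.
interface WordTiming { text: string; start: number; end: number }
interface NarrationTranscript {
  version: 1;
  model: string;            // e.g. "Xenova/whisper-base.en"
  language?: string;
  scenes: Record<string, WordTiming[]>;
}

// Per-clip cache: reuse the audio clip's SHA but fold the model id into the
// filename, so swapping Whisper models never invalidates the audio cache.
function clipTranscriptPath(clipsDir: string, audioSha: string, modelId: string): string {
  const modelSlug = modelId.replace(/[^A-Za-z0-9._-]+/g, '-'); // "Xenova/whisper-base.en" → "Xenova-whisper-base.en"
  return join(clipsDir, `${audioSha}.${modelSlug}.transcript.json`);
}
```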
Engine choice: `@huggingface/transformers` was already in the tree
for Kokoro — Whisper rides on the same ONNX runtime, no new deps. The
default `Xenova/whisper-base.en` model exports cross-attentions, which
transformers.js requires for word-level timestamps; the
`onnx-community/*` variants don't and fail at runtime with
"Model outputs must contain cross attentions" (caught during smoke
test, comment in transcribe.ts records this).
ffmpeg resamples the 24kHz Argo WAVs to the 16kHz Float32 PCM Whisper
expects, entirely in memory. transformers.js' Node build lacks
AudioContext, so its `read_audio()` doesn't work; ffmpeg is already a
hard dep for export, and this route adds ~50ms per clip vs. writing a
JS resampler.
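A hedged sketch of that decode-then-transcribe path, using the transformers.js ASR pipeline's word-timestamp output; function names and argument plumbing here are illustrative, not the actual transcribe.ts:

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { pipeline } from '@huggingface/transformers';

const execFileAsync = promisify(execFile);

// Decode a 24kHz Argo WAV to the 16kHz mono Float32 PCM Whisper expects,
// entirely in memory (ffmpeg writes raw f32le samples to stdout).
async function decodeForWhisper(wavPath: string): Promise<Float32Array> {
  const { stdout } = await execFileAsync(
    'ffmpeg',
    ['-i', wavPath, '-f', 'f32le', '-ac', '1', '-ar', '16000', 'pipe:1'],
    { encoding: 'buffer', maxBuffer: 64 * 1024 * 1024 },
  );
  const bytes = new Uint8Array(stdout); // copy out of Node's shared buffer pool for alignment
  return new Float32Array(bytes.buffer);
}

async function transcribeWav(wavPath: string) {
  // whisper-base.en exports cross-attentions, which word-level timestamps require.
  const asr = await pipeline('automatic-speech-recognition', 'Xenova/whisper-base.en');
  const audio = await decodeForWhisper(wavPath);
  const result: any = await asr(audio, { return_timestamps: 'word' });
  return result.chunks.map((c: any) => ({
    text: c.text.trim(),
    start: c.timestamp[0],
    end: c.timestamp[1],
  }));
}
```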
Surfaces:
config: `tts.transcribe?: boolean | { model?, language? }` — off
by default, opt-in for v0.38.0
pipeline: per-clip transcribe runs after generateClips() and is
cached; aggregate file written post-align with placement
offsets folded in
fixture: `narration.wordTiming(scene): WordTiming[]` returns
scene-relative timestamps loaded from
ARGO_TRANSCRIPT_PATH (set by record.ts)
cache: per-clip transcript files reuse audio SHA, key off model id
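For reference, the opt-in from the config surface above would look something like this in a demo config — a minimal sketch; the file name, the `engine` key, and everything other than `tts.transcribe` are assumptions about the surrounding config shape:

```ts
// argo.config.ts (hypothetical surrounding shape)
export default {
  tts: {
    engine: 'kokoro',
    // Off by default in v0.38.0. `true` picks the default
    // Xenova/whisper-base.en; the object form overrides model/language.
    transcribe: { model: 'Xenova/whisper-base.en', language: 'en' },
  },
};

// Demo-side (the fixture surface above), during recording:
//   const words = narration.wordTiming('hero'); // scene-relative WordTiming[]
```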
Best-effort: transcription failures warn but never fail the pipeline —
demos without transcripts work exactly as before.
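Concretely, the per-clip step is wrapped so a Whisper failure only warns — a sketch continuing the `transcribeWav`/`WordTiming` sketches above; the wrapper name and log wording are illustrative:

```ts
// Best-effort: a transcription failure must never fail the pipeline.
async function tryTranscribe(clipPath: string): Promise<WordTiming[] | undefined> {
  try {
    return await transcribeWav(clipPath);
  } catch (err) {
    console.warn(`transcription failed for ${clipPath}; continuing without word timings`, err);
    return undefined; // demo renders exactly as it would without transcripts
  }
}
```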
Smoke-verified end to end:
* One-clip transcribeWav: 13 words from a 7s clip in 4.3s
(cold start, model load) — warm runs ~1s/clip
* 11-clip integration: all per-clip transcripts written, second run
cache-hits in 4ms (zero re-transcription, zero model load)
* Aggregate JSON shape matches the design (scene-keyed, version,
model, optional language)
* 649 unit tests pass
Roadmap (not in this commit):
* v0.38.1 — word-level VTT in src/subtitles.ts; preview UI shows
word chips per scene with click-to-seek
* v0.38.2 — renderComposition injects window.__wordTiming[scene]
so compositions can sync GSAP tweens to spoken words; default
flips to on once accuracy is validated across engines
* Future: Kokoro-native path (the model produces phoneme durations
internally as part of non-autoregressive synthesis but kokoro-js
doesn't expose them) — pure perf optimization, public artifact
shape stays the same
Memory note `project_word_level_stt.md` (updated): this lands the
v0.38.0 candidate sketched there. Whisper-first chosen over
Kokoro-native for universality (works for cloud engines too) and
zero new deps. Hybrid path stays open for a v0.38.x optimization.
The word-level STT commit (143f390) added @huggingface/transformers ^4.2.0 as a direct dep so transcribe.ts can run Whisper. kokoro-js@1.2.1 pins its own ^3.5.1 internally, so npm installed two copies — and two copies of onnxruntime-node (1.24.3 alongside 1.21.0). Both register a native onnxruntime_binding.node in the same Node process, and whichever runs inference second segfaults on conflicting symbol tables.

Symptom: argo pipeline showcase died at "▸ hero (generating...)" with exit 139 the moment a recording-side import touched dist/index.js (pipeline.ts → generate.ts → transcribe.ts pulled the v4 chain in eagerly at module init, before any transcribe call). Reproduced in isolation by loading transformers v4 then kokoro-js (exit 139) or the reverse order (exit 134, SIGABRT) — having both ORTs in-process is broken regardless of order.

Fix: package.json `overrides` forces kokoro-js's transformers to dedupe to the top-level ^4.2.0. Single ORT (1.24.3) in the process tree.

Verified: kokoro-js@1.2.1 was authored against transformers v3, but only its TextSplitterStream (used by tts.stream()) is missing in v4 — Argo's KokoroEngine only calls tts.generate(), so the production path is safe. The wrapper's stream() method (src/tts/engines/kokoro.ts) would break under the override; flag for follow-up if anyone ever wires streaming.

Verified end-to-end:
* npm test — 649/649 pass
* Bare Kokoro generate — used to exit 139, now 0
* argo pipeline showcase with tts.transcribe: true — TTS + Whisper both run in one process; full 163s video exports cleanly
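The dedupe itself is a few lines of package.json — abridged sketch only; exact ranges and the rest of the manifest are per the commit, not reproduced here:

```json
{
  "dependencies": {
    "@huggingface/transformers": "^4.2.0",
    "kokoro-js": "^1.2.1"
  },
  "overrides": {
    "kokoro-js": {
      "@huggingface/transformers": "^4.2.0"
    }
  }
}
```

With the override in place, npm resolves kokoro-js's transitive @huggingface/transformers to the hoisted ^4.2.0 copy, so only one onnxruntime-node native ever loads in the process.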
The 143f390 STT commit makes per-word timestamps available via narration.wordTiming(scene). Useful, but every consumer site ends up writing the same "find word, compute remaining ms, schedule effect" boilerplate.

atWord(scene, target) returns ms-from-now until `target` is spoken, or null when the word is missing / already past / the transcript isn't loaded — callers fall back to whatever scheduling they were using before.

    const t = narration.atWord('camera', 'Spotlight');
    if (t !== null) await page.waitForTimeout(t);
    spotlight(page, '#hero-button');

Matching is case-insensitive and strips trailing punctuation (Whisper keeps periods/commas attached). occurrence: 2 picks the second hit; afterMs skips earlier matches — useful when a word repeats.

Showcase demo updated to anchor effects on spoken words instead of even-paced beats:

* camera scene — 1:1 mapping, every cue named in narration:
  Spotlight → spotlight
  dim → dimAround
  focus → focusRing
  highlight → cursor focus
  zoom → zoomTo
  motion → motion-blur ring
  confetti → showConfetti
* voiceover scene — five engines named in narration get word-precision dim cues:
  cochro → engine-kokoro (Whisper hears "Kokoro")
  hugging → engine-transformers
  opening → engine-openai (Whisper hears "OpenAI")
  11 → engine-elevenlabs
  MLX → engine-mlx
  Gemini and Sarvam aren't named in narration; they fire in the gaps with sensible spacing so all eight cards still light up.

Authoring gap surfaced (worth a CLAUDE.md note in a follow-up): anchor words must come from the actual transcript, not the manifest text. Phonetic spellings used to fix Kokoro pronunciation propagate into Whisper's output, so "Kokoro" becomes "cochro" and "OpenAI" becomes the phrase "opening eye". Authors should peek at the transcript JSON before picking anchors. atWord intentionally does an exact (normalized) match — fuzzy/phonetic matching is a separate design decision.

Showcase config opts in with `tts.transcribe: true`. Video re-exported under the new alignment.
Pass over the prior commit per /simplify review.
* Drop atWord's `occurrence` and `afterMs` options — speculative
surface, no callers exercise them. Add back when a real demo needs
them, not before.
* Read transcript words directly inside atWord instead of going
through wordTiming(), which eagerly defensive-copies every word.
The public wordTiming() keeps the copy for caller safety.
* Extract a private `normalizeWord()` helper at module scope —
target normalization moves out of the per-word loop, the regex
only appears once.
* Trim the JSDoc to the single non-obvious bit (Whisper-vs-text
spelling).
* Hoist showcase's `wait` / `cameraWait` lambdas into a single
`waitForWord(scene, word, fb)` defined once at the top of the
test. The two scene-baked variants were near-duplicates.
* Drop the over-explanatory comments on each scene; keep only the
Whisper-vs-text caveat (one line) and the unnamed-engine gap-fill
note.
* Remove the stale "falls back to even-paced beats" comment — the
fallback is the per-call `fb` ms, not derived beats.
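Net shape of the helper after this pass, roughly — a sketch only; the explicit `transcript` and `sceneStartMs` parameters stand in for fixture state the real atWord() reads internally, and only normalizeWord/atWord mirror names from the pass:

```ts
interface WordTiming { text: string; start: number; end: number }

const normalizeWord = (w: string) =>
  w.toLowerCase().replace(/[.,!?;:'"]+$/, ''); // Whisper keeps trailing punctuation attached

function atWord(
  transcript: Record<string, WordTiming[]> | undefined,
  sceneStartMs: number, // epoch ms at which the scene's narration started
  scene: string,
  target: string,
): number | null {
  const words = transcript?.[scene];    // read the raw words — no defensive copy here
  if (!words) return null;
  const wanted = normalizeWord(target); // normalize the target once, outside the loop
  const hit = words.find((w) => normalizeWord(w.text) === wanted);
  if (!hit) return null;
  const remaining = sceneStartMs + hit.start * 1000 - Date.now();
  return remaining > 0 ? remaining : null; // already spoken → null
}

// Showcase-side helper hoisted in the same pass (sketch of its use in the demo;
// `narration` and `page` come from the test fixture, fb is the fallback ms):
// const waitForWord = async (scene: string, word: string, fb: number) => {
//   const t = narration.atWord(scene, word);
//   await page.waitForTimeout(t ?? fb);
// };
```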
No behaviour change. 649/649 tests pass.
Summary

Builds on the v0.38.0 candidate (143f390) to make word-level narration alignment actually shippable.

* package.json `overrides` dedupes @huggingface/transformers so kokoro-js shares the top-level v4.2 install. The prior commit added v4.2 directly (for Whisper) on top of kokoro-js's pinned v3.5 — npm installed both, two onnxruntime-node natives ended up in one process, and the second one to run inference segfaulted. Reproduced cleanly in isolation; fixed by the single-ORT dedupe.
* narration.atWord(scene, target) returns ms-until-spoken (or null when missed/missing) so demos can schedule effects on words instead of even-paced beats.
* demos/showcase.demo.ts camera and voiceover scenes now anchor every cue to its spoken word. Camera maps 1:1 (the narration enumerates all seven effects by name); voiceover anchors five named engines and lets the unnamed two fall in the gaps.
Why the segfault matters

Without the override, any downstream user enabling tts.transcribe: true (the v0.38 opt-in) hits the same crash on the first Kokoro generate. The crash also fires through static imports alone — i.e. the moment anything touches dist/index.js, the v4 chain loads eagerly and the v3 chain breaks on first inference. So this is a release-blocker for the feature 143f390 added.
Authoring gap surfaced

Whisper transcribes audio, not text. Phonetic spellings used to fix Kokoro pronunciation propagate into transcripts, so "Kokoro" becomes "cochro" and "OpenAI" comes out "opening eye". atWord() does an exact (normalized) match by design — anchor words must come from the actual transcript, not the manifest. The comment in showcase.demo.ts explains the gap; a CLAUDE.md note belongs in a follow-up if we want to make this official guidance.
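Until that note lands, the quickest way to pick safe anchors is to read the public artifact directly — a sketch; the showcase path follows the artifact layout documented above:

```ts
import { readFileSync } from 'node:fs';

// Print what Whisper actually heard for a scene before choosing anchor words.
const transcript = JSON.parse(
  readFileSync('.argo/showcase/narration.transcript.json', 'utf8'),
);
const heard = transcript.scenes.voiceover
  .map((w: { text: string }) => w.text)
  .join(' ');
console.log(heard); // e.g. "... cochro ... hugging ... opening eye ..."
```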
Test plan

* npm test — 649/649 pass
* npx argo pipeline showcase end-to-end with tts.transcribe: true — full 163s export, no segfault
* Both isolation load orders (transformers→kokoro and kokoro→transformers) succeed under the override
* narration.transcript.json — word timings present for all 11 scenes
* Showcase cues fire on their anchor words in the rendered video