Word-level Whisper STT: atWord() helper, dedupe deps, word-aligned showcase #16
Argo today knows scene boundaries (from narration.mark()) and per-clip
durations (from the wav header), but doesn't know what word is being
spoken at any given video time. Phoneme-level estimates from upstream
TTS engines drift, so the source of truth is the rendered audio
itself — transcribe it back with Whisper and read the per-word
timestamps.
Pattern borrowed from hyperframes' website-to-hyperframes pipeline,
which renders TTS then runs whisper STT on the result for the same
reason. Once Argo's renderComposition is the canonical bridge between
recordings and hyperframes-style scenes, exposing word-level timestamps
lets compositions sync to spoken content the same way hyperframes' own
renderer does.
Why now (downstream surfaces this unblocks):
* Composition sync: GSAP tweens fire on specific spoken words
* Karaoke / per-word captions in subtitles.ts
* Surgical narration edits in the preview UI
* Deterministic head-trim instead of estimated
Architecture — two files for two consumers:
.argo/<demo>/.scene-transcripts.json (private, scene-relative)
Read during recording so demo scripts can call
`narration.wordTiming('hero')` and schedule effects on words.
Timestamps start near 0 within each scene since placement
offsets aren't known until after alignment.
.argo/<demo>/narration.transcript.json (public, recording-absolute)
Documented artifact for post-pipeline consumers (subtitles,
compositions, preview). Timestamps are over the full mixed
`narration-aligned.wav` time base. Keyed by scene name with the
same shape as the scene-relative file:
    {
      "version": 1,
      "model": "Xenova/whisper-base.en",
      "scenes": {
        "hero": [{ "text": "Your", "start": 0.07, "end": 0.33 }, ...]
      }
    }
.argo/<demo>/clips/<sha>.<model>.transcript.json (per-clip cache)
Same SHA as the audio clip so a clip cache hit rides along, but
folds the Whisper model id into the filename so swapping models
doesn't bust the (much more expensive) audio cache.
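Both JSON artifacts share the word shape shown above. A consumer-side type plus the cache-naming rule looks roughly like this — a sketch only; the type names and the `clipTranscriptPath` helper (including how the model id is slugged) are illustrative, not the actual source:

```ts
import { join } from 'node:path';

// Shape of narration.transcript.json / .scene-transcripts.json as documented above.
interface WordTiming { text: string; start: number; end: number }
interface NarrationTranscript {
  version: 1;
  model: string;            // e.g. "Xenova/whisper-base.en"
  language?: string;
  scenes: Record<string, WordTiming[]>;
}

// Per-clip cache: reuse the audio clip's SHA but fold the model id into the
// filename, so swapping Whisper models never invalidates the audio cache.
function clipTranscriptPath(clipsDir: string, audioSha: string, modelId: string): string {
  const modelSlug = modelId.replace(/[^A-Za-z0-9._-]+/g, '-'); // "Xenova/whisper-base.en" → "Xenova-whisper-base.en"
  return join(clipsDir, `${audioSha}.${modelSlug}.transcript.json`);
}
```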
Engine choice: `@huggingface/transformers` was already in the tree
for Kokoro — Whisper rides on the same ONNX runtime, no new deps. The
default `Xenova/whisper-base.en` model exports cross-attentions, which
transformers.js requires for word-level timestamps; the
`onnx-community/*` variants don't and fail at runtime with
"Model outputs must contain cross attentions" (caught during smoke
test, comment in transcribe.ts records this).
ffmpeg resamples the 24kHz Argo WAVs to the 16kHz Float32 PCM Whisper
expects, entirely in memory. transformers.js' Node build lacks
AudioContext, so its `read_audio()` doesn't work; ffmpeg is already a
hard dep for export, and this route adds ~50ms per clip vs. writing a
JS resampler.
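A hedged sketch of that decode-then-transcribe path, using the transformers.js ASR pipeline's word-timestamp output; function names and argument plumbing here are illustrative, not the actual transcribe.ts:

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { pipeline } from '@huggingface/transformers';

const execFileAsync = promisify(execFile);

// Decode a 24kHz Argo WAV to the 16kHz mono Float32 PCM Whisper expects,
// entirely in memory (ffmpeg writes raw f32le samples to stdout).
async function decodeForWhisper(wavPath: string): Promise<Float32Array> {
  const { stdout } = await execFileAsync(
    'ffmpeg',
    ['-i', wavPath, '-f', 'f32le', '-ac', '1', '-ar', '16000', 'pipe:1'],
    { encoding: 'buffer', maxBuffer: 64 * 1024 * 1024 },
  );
  const bytes = new Uint8Array(stdout); // copy out of Node's shared buffer pool for alignment
  return new Float32Array(bytes.buffer);
}

async function transcribeWav(wavPath: string) {
  // whisper-base.en exports cross-attentions, which word-level timestamps require.
  const asr = await pipeline('automatic-speech-recognition', 'Xenova/whisper-base.en');
  const audio = await decodeForWhisper(wavPath);
  const result: any = await asr(audio, { return_timestamps: 'word' });
  return result.chunks.map((c: any) => ({
    text: c.text.trim(),
    start: c.timestamp[0],
    end: c.timestamp[1],
  }));
}
```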
Surfaces:
config: `tts.transcribe?: boolean | { model?, language? }` — off
by default, opt-in for v0.38.0
pipeline: per-clip transcribe runs after generateClips() and is
cached; aggregate file written post-align with placement
offsets folded in
fixture: `narration.wordTiming(scene): WordTiming[]` returns
scene-relative timestamps loaded from
ARGO_TRANSCRIPT_PATH (set by record.ts)
cache: per-clip transcript files reuse audio SHA, key off model id
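For reference, the opt-in from the config surface above would look something like this in a demo config — a minimal sketch; the file name, the `engine` key, and everything other than `tts.transcribe` are assumptions about the surrounding config shape:

```ts
// argo.config.ts (hypothetical surrounding shape)
export default {
  tts: {
    engine: 'kokoro',
    // Off by default in v0.38.0. `true` picks the default
    // Xenova/whisper-base.en; the object form overrides model/language.
    transcribe: { model: 'Xenova/whisper-base.en', language: 'en' },
  },
};

// Demo-side (the fixture surface above), during recording:
//   const words = narration.wordTiming('hero'); // scene-relative WordTiming[]
```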
Best-effort: transcription failures warn but never fail the pipeline —
demos without transcripts work exactly as before.
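Concretely, the per-clip step is wrapped so a Whisper failure only warns — a sketch continuing the `transcribeWav`/`WordTiming` sketches above; the wrapper name and log wording are illustrative:

```ts
// Best-effort: a transcription failure must never fail the pipeline.
async function tryTranscribe(clipPath: string): Promise<WordTiming[] | undefined> {
  try {
    return await transcribeWav(clipPath);
  } catch (err) {
    console.warn(`transcription failed for ${clipPath}; continuing without word timings`, err);
    return undefined; // demo renders exactly as it would without transcripts
  }
}
```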
Smoke-verified end to end:
* One-clip transcribeWav: 13 words from a 7s clip in 4.3s
(cold start, model load) — warm runs ~1s/clip
* 11-clip integration: all per-clip transcripts written, second run
cache-hits in 4ms (zero re-transcription, zero model load)
* Aggregate JSON shape matches the design (scene-keyed, version,
model, optional language)
* 649 unit tests pass
Roadmap (not in this commit):
* v0.38.1 — word-level VTT in src/subtitles.ts; preview UI shows
word chips per scene with click-to-seek
* v0.38.2 — renderComposition injects window.__wordTiming[scene]
so compositions can sync GSAP tweens to spoken words; default
flips to on once accuracy is validated across engines
* Future: Kokoro-native path (the model produces phoneme durations
internally as part of non-autoregressive synthesis but kokoro-js
doesn't expose them) — pure perf optimization, public artifact
shape stays the same
Memory note `project_word_level_stt.md` (updated): this lands the
v0.38.0 candidate sketched there. Whisper-first chosen over
Kokoro-native for universality (works for cloud engines too) and
zero new deps. Hybrid path stays open for a v0.38.x optimization.
The word-level STT commit (143f390) added @huggingface/transformers ^4.2.0 as a direct dep so transcribe.ts can run Whisper. kokoro-js@1.2.1 pins its own ^3.5.1 internally, so npm installed two copies — and two copies of onnxruntime-node (1.24.3 alongside 1.21.0). Both register a native onnxruntime_binding.node in the same Node process, and whichever runs inference second segfaults on conflicting symbol tables.

Symptom: argo pipeline showcase died at "▸ hero (generating...)" with exit 139 the moment a recording-side import touched dist/index.js (pipeline.ts → generate.ts → transcribe.ts pulled the v4 chain in eagerly at module init, before any transcribe call). Reproduced in isolation by loading transformers v4 then kokoro-js (exit 139) or the reverse order (exit 134, SIGABRT) — having both ORTs in-process is broken regardless of order.

Fix: package.json `overrides` forces kokoro-js's transformers to dedupe to the top-level ^4.2.0. Single ORT (1.24.3) in the process tree.

Verified: kokoro-js@1.2.1 was authored against transformers v3, but only its TextSplitterStream (used by tts.stream()) is missing in v4 — Argo's KokoroEngine only calls tts.generate(), so the production path is safe. The wrapper's stream() method (src/tts/engines/kokoro.ts) would break under the override; flag for follow-up if anyone ever wires streaming.

Verified end-to-end:
* npm test — 649/649 pass
* Bare Kokoro generate — used to exit 139, now 0
* argo pipeline showcase with tts.transcribe: true — TTS + Whisper both run in one process; full 163s video exports cleanly
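The dedupe itself is a few lines of package.json — abridged sketch only; exact ranges and the rest of the manifest are per the commit, not reproduced here:

```json
{
  "dependencies": {
    "@huggingface/transformers": "^4.2.0",
    "kokoro-js": "^1.2.1"
  },
  "overrides": {
    "kokoro-js": {
      "@huggingface/transformers": "^4.2.0"
    }
  }
}
```

With the override in place, npm resolves kokoro-js's transitive @huggingface/transformers to the hoisted ^4.2.0 copy, so only one onnxruntime-node native ever loads in the process.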
The 143f390 STT commit makes per-word timestamps available via narration.wordTiming(scene). Useful, but every consumer site ends up writing the same "find word, compute remaining ms, schedule effect" boilerplate.

atWord(scene, target) returns ms-from-now until `target` is spoken, or null when the word is missing / already past / the transcript isn't loaded — callers fall back to whatever scheduling they were using before.

    const t = narration.atWord('camera', 'Spotlight');
    if (t !== null) await page.waitForTimeout(t);
    spotlight(page, '#hero-button');

Matching is case-insensitive and strips trailing punctuation (Whisper keeps periods/commas attached). occurrence: 2 picks the second hit; afterMs skips earlier matches — useful when a word repeats.

Showcase demo updated to anchor effects on spoken words instead of even-paced beats:

* camera scene — 1:1 mapping, every cue named in narration:
  Spotlight → spotlight
  dim → dimAround
  focus → focusRing
  highlight → cursor focus
  zoom → zoomTo
  motion → motion-blur ring
  confetti → showConfetti
* voiceover scene — five engines named in narration get word-precision dim cues:
  cochro → engine-kokoro (Whisper hears "Kokoro")
  hugging → engine-transformers
  opening → engine-openai (Whisper hears "OpenAI")
  11 → engine-elevenlabs
  MLX → engine-mlx
  Gemini and Sarvam aren't named in narration; they fire in the gaps with sensible spacing so all eight cards still light up.

Authoring gap surfaced (worth a CLAUDE.md note in a follow-up): anchor words must come from the actual transcript, not the manifest text. Phonetic spellings used to fix Kokoro pronunciation propagate into Whisper's output, so "Kokoro" becomes "cochro" and "OpenAI" becomes the phrase "opening eye". Authors should peek at the transcript JSON before picking anchors. atWord intentionally does an exact (normalized) match — fuzzy/phonetic matching is a separate design decision.

Showcase config opts in with `tts.transcribe: true`. Video re-exported under the new alignment.
Pass over the prior commit per /simplify review.
* Drop atWord's `occurrence` and `afterMs` options — speculative
surface, no callers exercise them. Add back when a real demo needs
them, not before.
* Read transcript words directly inside atWord instead of going
through wordTiming(), which eagerly defensive-copies every word.
The public wordTiming() keeps the copy for caller safety.
* Extract a private `normalizeWord()` helper at module scope —
target normalization moves out of the per-word loop, the regex
only appears once.
* Trim the JSDoc to the single non-obvious bit (Whisper-vs-text
spelling).
* Hoist showcase's `wait` / `cameraWait` lambdas into a single
`waitForWord(scene, word, fb)` defined once at the top of the
test. The two scene-baked variants were near-duplicates.
* Drop the over-explanatory comments on each scene; keep only the
Whisper-vs-text caveat (one line) and the unnamed-engine gap-fill
note.
* Remove the stale "falls back to even-paced beats" comment — the
fallback is the per-call `fb` ms, not derived beats.
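Net shape of the helper after this pass, roughly — a sketch only; the explicit `transcript` and `sceneStartMs` parameters stand in for fixture state the real atWord() reads internally, and only normalizeWord/atWord mirror names from the pass:

```ts
interface WordTiming { text: string; start: number; end: number }

const normalizeWord = (w: string) =>
  w.toLowerCase().replace(/[.,!?;:'"]+$/, ''); // Whisper keeps trailing punctuation attached

function atWord(
  transcript: Record<string, WordTiming[]> | undefined,
  sceneStartMs: number, // epoch ms at which the scene's narration started
  scene: string,
  target: string,
): number | null {
  const words = transcript?.[scene];    // read the raw words — no defensive copy here
  if (!words) return null;
  const wanted = normalizeWord(target); // normalize the target once, outside the loop
  const hit = words.find((w) => normalizeWord(w.text) === wanted);
  if (!hit) return null;
  const remaining = sceneStartMs + hit.start * 1000 - Date.now();
  return remaining > 0 ? remaining : null; // already spoken → null
}

// Showcase-side helper hoisted in the same pass (sketch of its use in the demo;
// `narration` and `page` come from the test fixture, fb is the fallback ms):
// const waitForWord = async (scene: string, word: string, fb: number) => {
//   const t = narration.atWord(scene, word);
//   await page.waitForTimeout(t ?? fb);
// };
```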
No behaviour change. 649/649 tests pass.
Summary

Builds on the v0.38.0 candidate (143f390) to make word-level narration alignment actually shippable.

* package.json `overrides` dedupes @huggingface/transformers so kokoro-js shares the top-level v4.2 install. The prior commit added v4.2 directly (for Whisper) on top of kokoro-js's pinned v3.5 — npm installed both, two onnxruntime-node natives ended up in one process, and the second one to run inference segfaulted. Reproduced cleanly in isolation; fixed by the single-ORT dedupe.
* narration.atWord(scene, target) returns ms-until-spoken (or null when missed/missing) so demos can schedule effects on words instead of even-paced beats.
* demos/showcase.demo.ts camera and voiceover scenes now anchor every cue to its spoken word. Camera maps 1:1 (the narration enumerates all seven effects by name); voiceover anchors five named engines and lets the unnamed two fall in the gaps.
Why the segfault matters

Without the override, any downstream user enabling tts.transcribe: true (the v0.38 opt-in) hits the same crash on the first Kokoro generate. The crash also fires through static imports alone — i.e. the moment anything touches dist/index.js, the v4 chain loads eagerly and the v3 chain breaks on first inference. So this is a release-blocker for the feature 143f390 added.
Authoring gap surfaced

Whisper transcribes audio, not text. Phonetic spellings used to fix Kokoro pronunciation propagate into transcripts, so "Kokoro" becomes "cochro" and "OpenAI" comes out "opening eye". atWord() does an exact (normalized) match by design — anchor words must come from the actual transcript, not the manifest. The comment in showcase.demo.ts explains the gap; a CLAUDE.md note belongs in a follow-up if we want to make this official guidance.
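Until that note lands, the quickest way to pick safe anchors is to read the public artifact directly — a sketch; the showcase path follows the artifact layout documented above:

```ts
import { readFileSync } from 'node:fs';

// Print what Whisper actually heard for a scene before choosing anchor words.
const transcript = JSON.parse(
  readFileSync('.argo/showcase/narration.transcript.json', 'utf8'),
);
const heard = transcript.scenes.voiceover
  .map((w: { text: string }) => w.text)
  .join(' ');
console.log(heard); // e.g. "... cochro ... hugging ... opening eye ..."
```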
Test plan

* npm test — 649/649 pass
* npx argo pipeline showcase end-to-end with tts.transcribe: true — full 163s export, no segfault
* Both isolation load orders (transformers→kokoro and kokoro→transformers) succeed under the override
* narration.transcript.json — word timings present for all 11 scenes
* Showcase cues fire on their anchor words in the rendered video