Word-level Whisper STT: atWord() helper, dedupe deps, word-aligned showcase #16

Merged
shreyaskarnik merged 4 commits into main from feat/word-level-stt
May 7, 2026

Conversation

@shreyaskarnik
Owner

Summary

Builds on the v0.38.0 candidate (143f390) to make word-level narration
alignment actually shippable.

  • fix(deps): package.json overrides dedupes
    @huggingface/transformers so kokoro-js shares the top-level v4.2
    install. The prior commit added v4.2 directly (for Whisper) on top of
    kokoro-js's pinned v3.5 — npm installed both, two onnxruntime-node
    natives ended up in one process, and the second one to run inference
    segfaulted. Reproduced cleanly in isolation; fixed by single-ORT
    dedupe.
  • feat(narration): narration.atWord(scene, target) returns
    ms-until-spoken (or null when missed/missing) so demos can schedule
    effects on words instead of even-paced beats.
  • demo polish: demos/showcase.demo.ts camera and voiceover scenes
    now anchor every cue to its spoken word. Camera maps 1:1 (the
    narration enumerates all seven effects by name); voiceover anchors
    five named engines and lets the unnamed two fall in the gaps.

Why the segfault matters

Without the override, any downstream user enabling tts.transcribe: true (the v0.38 opt-in) hits the same crash on the first Kokoro
generate. The crash also fires through static imports alone — i.e. the
moment anything touches dist/index.js, the v4 chain loads eagerly and
the v3 chain breaks on first inference. So this is a release-blocker
for the feature 143f390 added.

Authoring gap surfaced

Whisper transcribes audio, not text. Phonetic spellings used to fix
Kokoro pronunciation propagate into transcripts, so "Kokoro" becomes
"cochro" and "OpenAI" comes out "opening eye". atWord() does an exact
(normalized) match by design — anchor words must come from the actual
transcript, not the manifest. Comment in showcase.demo.ts explains the
gap; a CLAUDE.md note belongs in a follow-up if we want to make this
official guidance.

Test plan

  • npm test — 649/649 pass
  • npx argo pipeline showcase end-to-end with tts.transcribe: true — full 163s export, no segfault
  • Bare-repro tests for both load orders (transformers→kokoro and
    kokoro→transformers) succeed under the override
  • Inspect narration.transcript.json — word timings present for
    all 11 scenes
  • Manual playback review: confirm camera/voiceover effects land on
    their anchor words in the rendered video

Argo today knows scene boundaries (from narration.mark()) and per-clip
durations (from the wav header), but doesn't know what word is being
spoken at any given video time. Phoneme-level estimates from upstream
TTS engines drift, so the source of truth is the rendered audio
itself — transcribe it back with Whisper and read the per-word
timestamps.

Pattern borrowed from hyperframes' website-to-hyperframes pipeline,
which renders TTS then runs whisper STT on the result for the same
reason. Once Argo's renderComposition is the canonical bridge between
recordings and hyperframes-style scenes, exposing word-level timestamps
lets compositions sync to spoken content the same way hyperframes' own
renderer does.

Why now (downstream surfaces this unblocks):
  * Composition sync: GSAP tweens fire on specific spoken words
  * Karaoke / per-word captions in subtitles.ts
  * Surgical narration edits in the preview UI
  * Deterministic head-trim instead of estimated

Architecture — two files for two consumers:

  .argo/<demo>/.scene-transcripts.json      (private, scene-relative)
    Read during recording so demo scripts can call
    `narration.wordTiming('hero')` and schedule effects on words.
    Timestamps start near 0 within each scene since placement
    offsets aren't known until after alignment.

  .argo/<demo>/narration.transcript.json    (public, recording-absolute)
    Documented artifact for post-pipeline consumers (subtitles,
    compositions, preview). Timestamps are over the full mixed
    `narration-aligned.wav` time base. Keyed by scene name with the
    same shape as the scene-relative file:

      {
        "version": 1,
        "model": "Xenova/whisper-base.en",
        "scenes": {
          "hero": [{ "text": "Your", "start": 0.07, "end": 0.33 }, ...]
        }
      }
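
The artifact shape above can be expressed as TypeScript types for consumers; a minimal sketch (type names are illustrative, not Argo's exported identifiers):

```typescript
// Hedged sketch: types matching the public transcript artifact's JSON shape.
// Names (WordTiming, NarrationTranscript) are illustrative.
interface WordTiming {
  text: string;
  start: number; // seconds, over the full narration-aligned.wav time base
  end: number;
}

interface NarrationTranscript {
  version: number;
  model: string;
  language?: string; // present only when configured
  scenes: Record<string, WordTiming[]>;
}

const sample: NarrationTranscript = {
  version: 1,
  model: "Xenova/whisper-base.en",
  scenes: {
    hero: [{ text: "Your", start: 0.07, end: 0.33 }],
  },
};
```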

  .argo/<demo>/clips/<sha>.<model>.transcript.json   (per-clip cache)
    Same SHA as the audio clip so a clip cache hit rides along, but
    folds the Whisper model id into the filename so swapping models
    doesn't bust the (much more expensive) audio cache.

Engine choice: `@huggingface/transformers` was already in the tree
for Kokoro — Whisper rides on the same ONNX runtime, no new deps. The
default `Xenova/whisper-base.en` model exports cross-attentions, which
transformers.js requires for word-level timestamps; the
`onnx-community/*` variants don't and fail at runtime with
"Model outputs must contain cross attentions" (caught during smoke
test, comment in transcribe.ts records this).

ffmpeg resamples 24kHz Argo WAVs to 16kHz Whisper-format Float32 in
memory. transformers.js' Node build lacks AudioContext so its
`read_audio()` doesn't work; ffmpeg is already a hard dep for export
and this adds ~50ms per clip vs. writing a JS resampler.
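
The in-memory resample step can be sketched roughly as follows (this is not Argo's actual transcribe.ts; function names and error handling are illustrative, and `-f f32le -ac 1 -ar 16000` is standard ffmpeg usage for raw mono float32 output):

```typescript
// Hedged sketch: decode a 24kHz WAV to the 16kHz mono Float32Array
// Whisper expects, via ffmpeg's stdout, without touching disk.
import { spawn } from "node:child_process";

// Reinterpret a little-endian f32 byte stream as samples.
function pcmToFloat32(buf: Buffer): Float32Array {
  return new Float32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4);
}

async function resampleTo16k(wavPath: string): Promise<Float32Array> {
  const ff = spawn("ffmpeg", [
    "-i", wavPath,
    "-f", "f32le",  // raw float32, little-endian
    "-ac", "1",     // mono
    "-ar", "16000", // Whisper's expected sample rate
    "pipe:1",
  ]);
  const chunks: Buffer[] = [];
  for await (const chunk of ff.stdout) chunks.push(chunk as Buffer);
  // Exit-code checking elided in this sketch.
  return pcmToFloat32(Buffer.concat(chunks));
}
```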

Surfaces:

  config:    `tts.transcribe?: boolean | { model?, language? }` — off
             by default, opt-in for v0.38.0
  pipeline:  per-clip transcribe runs after generateClips() and is
             cached; aggregate file written post-align with placement
             offsets folded in
  fixture:   `narration.wordTiming(scene): WordTiming[]` returns
             scene-relative timestamps loaded from
             ARGO_TRANSCRIPT_PATH (set by record.ts)
  cache:     per-clip transcript files reuse audio SHA, key off model id

Best-effort: transcription failures warn but never fail the pipeline —
demos without transcripts work exactly as before.
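
The best-effort contract amounts to a warn-and-continue wrapper around each transcription call; a minimal sketch (the helper name is illustrative):

```typescript
// Hedged sketch: transcription errors are logged, never thrown, so the
// pipeline proceeds and downstream treats a missing transcript as
// "feature off". Not Argo's actual code; the name tryTranscribe is made up.
async function tryTranscribe<T>(
  label: string,
  run: () => Promise<T>,
): Promise<T | null> {
  try {
    return await run();
  } catch (err) {
    console.warn(`transcription skipped for ${label}:`, err);
    return null;
  }
}
```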

Smoke-verified end to end:
  * One-clip transcribeWav: 13 words from a 7s clip in 4.3s
    (cold start, model load) — warm runs ~1s/clip
  * 11-clip integration: all per-clip transcripts written, second run
    cache-hits in 4ms (zero re-transcription, zero model load)
  * Aggregate JSON shape matches the design (scene-keyed, version,
    model, optional language)
  * 649 unit tests pass

Roadmap (not in this commit):
  * v0.38.1 — word-level VTT in src/subtitles.ts; preview UI shows
    word chips per scene with click-to-seek
  * v0.38.2 — renderComposition injects window.__wordTiming[scene]
    so compositions can sync GSAP tweens to spoken words; default
    flips to on once accuracy is validated across engines
  * Future: Kokoro-native path (the model produces phoneme durations
    internally as part of non-autoregressive synthesis but kokoro-js
    doesn't expose them) — pure perf optimization, public artifact
    shape stays the same

Memory note `project_word_level_stt.md` (updated): this lands the
v0.38.0 candidate sketched there. Whisper-first chosen over
Kokoro-native for universality (works for cloud engines too) and
zero new deps. Hybrid path stays open for a v0.38.x optimization.
The word-level STT commit (143f390) added @huggingface/transformers
^4.2.0 as a direct dep so transcribe.ts can run Whisper. kokoro-js@1.2.1
pins its own ^3.5.1 internally, so npm installed two copies — and two
copies of onnxruntime-node (1.24.3 alongside 1.21.0). Both register a
native onnxruntime_binding.node in the same Node process, and whichever
runs inference second segfaults on conflicting symbol tables.

Symptom: argo pipeline showcase died at "▸ hero (generating...)" with
exit 139 the moment a recording-side import touched dist/index.js
(pipeline.ts → generate.ts → transcribe.ts pulled the v4 chain in
eagerly at module init, before any transcribe call). Reproduced in
isolation by loading transformers v4 then kokoro-js (exit 139) or the
reverse order (exit 134, SIGABRT) — having both ORTs in-process is
broken regardless of order.

Fix: package.json `overrides` forces kokoro-js's transformers to dedupe
to the top-level ^4.2.0. Single ORT (1.24.3) in the process tree.
Verified: kokoro-js@1.2.1 was authored against transformers v3 but only
its TextSplitterStream (used by tts.stream()) is missing in v4 — Argo's
KokoroEngine only calls tts.generate(), so the production path is safe.
The wrapper's stream() method (src/tts/engines/kokoro.ts) would break
under the override; flag for follow-up if anyone ever wires streaming.
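
As a minimal package.json fragment, the override looks roughly like this (version specs per the text above; the `$` syntax is npm's reference to the top-level dependency's version, and the real file has more entries):

```json
{
  "dependencies": {
    "@huggingface/transformers": "^4.2.0",
    "kokoro-js": "1.2.1"
  },
  "overrides": {
    "kokoro-js": {
      "@huggingface/transformers": "$@huggingface/transformers"
    }
  }
}
```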

Verified end-to-end:
  * npm test — 649/649 pass
  * Bare Kokoro generate — used to exit 139, now 0
  * argo pipeline showcase with tts.transcribe: true — TTS + Whisper
    both run in one process; full 163s video exports cleanly

The 143f390 STT commit makes per-word timestamps available via
narration.wordTiming(scene). Useful, but every consumer site ends up
writing the same "find word, compute remaining ms, schedule effect"
boilerplate. atWord(scene, target) returns ms-from-now until `target`
is spoken, or null when the word is missing / already past / the
transcript isn't loaded — callers fall back to whatever scheduling
they were using before.

  const t = narration.atWord('camera', 'Spotlight');
  if (t !== null) await page.waitForTimeout(t);
  spotlight(page, '#hero-button');

Matching is case-insensitive and strips trailing punctuation (Whisper
keeps periods/commas attached). occurrence: 2 picks the second hit;
afterMs skips earlier matches — useful when a word repeats.
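
The matching rule can be sketched as follows (this is an illustration of the normalize-then-exact-compare behavior described above, not the shipped implementation; `msUntilWord` and its signature are made up):

```typescript
// Hedged sketch: lowercase, strip trailing punctuation Whisper leaves
// attached, then exact-compare. Returns ms from "now" until the word is
// spoken, or null when missing / already past.
interface Word { text: string; start: number; end: number } // seconds

const normalizeWord = (w: string) => w.toLowerCase().replace(/[.,!?;:]+$/, "");

function msUntilWord(
  words: Word[],
  target: string,
  nowMs: number,
): number | null {
  const wanted = normalizeWord(target);
  for (const w of words) {
    const startMs = w.start * 1000;
    if (startMs >= nowMs && normalizeWord(w.text) === wanted) {
      return startMs - nowMs;
    }
  }
  return null; // caller falls back to its previous scheduling
}
```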

Showcase demo updated to anchor effects on spoken words instead of
even-paced beats:

  * camera scene — 1:1 mapping, every cue named in narration:
      Spotlight  → spotlight     dim       → dimAround
      focus      → focusRing     highlight → cursor focus
      zoom       → zoomTo        motion    → motion-blur ring
      confetti   → showConfetti

  * voiceover scene — five engines named in narration get
    word-precision dim cues:
      cochro   → engine-kokoro          (Whisper hears "Kokoro")
      hugging  → engine-transformers
      opening  → engine-openai          (Whisper hears "OpenAI")
      11       → engine-elevenlabs
      MLX      → engine-mlx
    Gemini and Sarvam aren't named in narration; they fire in the
    gaps with sensible spacing so all eight cards still light up.

Authoring gap surfaced (worth a CLAUDE.md note in a follow-up):
anchor words must come from the actual transcript, not the manifest
text. Phonetic spellings used to fix Kokoro pronunciation propagate
into Whisper's output, so "Kokoro" becomes "cochro" and "OpenAI"
becomes the phrase "opening eye". Authors should peek at the transcript JSON
before picking anchors. atWord intentionally does an exact (normalized)
match — fuzzy/phonetic matching is a separate design decision.

Showcase config opts in with `tts.transcribe: true`. Video re-exported
under the new alignment.

A cleanup pass over the prior commit per the /simplify review.

  * Drop atWord's `occurrence` and `afterMs` options — speculative
    surface, no callers exercise them. Add back when a real demo needs
    them, not before.
  * Read transcript words directly inside atWord instead of going
    through wordTiming(), which eagerly defensive-copies every word.
    The public wordTiming() keeps the copy for caller safety.
  * Extract a private `normalizeWord()` helper at module scope —
    target normalization moves out of the per-word loop, the regex
    only appears once.
  * Trim the JSDoc to the single non-obvious bit (Whisper-vs-text
    spelling).
  * Hoist showcase's `wait` / `cameraWait` lambdas into a single
    `waitForWord(scene, word, fb)` defined once at the top of the
    test. The two scene-baked variants were near-duplicates.
  * Drop the over-explanatory comments on each scene; keep only the
    Whisper-vs-text caveat (one line) and the unnamed-engine gap-fill
    note.
  * Remove the stale "falls back to even-paced beats" comment — the
    fallback is the per-call `fb` ms, not derived beats.

No behaviour change. 649/649 tests pass.
@shreyaskarnik shreyaskarnik merged commit 2e37846 into main May 7, 2026
4 checks passed
@shreyaskarnik shreyaskarnik deleted the feat/word-level-stt branch May 7, 2026 19:32