Fix Voxtral Realtime runner flush #18339
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18339

Note: links to docs will display an error until the docs builds have been completed.

❌ 1 Awaiting Approval, 2 Cancelled Jobs, 2 Unrelated Failures as of commit 579bb82 with merge base eb92cec.

- CANCELLED JOBS: the following jobs were cancelled. Please retry.
- BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
I realize now that `--delay-tokens` is a configurable model export variable, and as such we probably want to read this from exported model data instead of the tokenizer file. Moved to draft for this reason.
Pull request overview
This PR updates the Voxtral Realtime example runner's streaming `flush()` behavior to avoid duplicate trailing tokens: the stream now ends by draining a finite padded audio tail (similar to vLLM) rather than by switching into post-audio text-only decoding.
Changes:
- Split transcription config into offline vs. streaming configs and update CLI flagging accordingly.
- Add `delay_tokens` as exported model constant metadata and use it to pad trailing silence during streaming flush (see the metadata-reading sketch after this list).
- Update runner/docs to reflect the new audio-conditioned flush path.
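For orientation, here is a minimal sketch of how a runner could read such an exported constant method, assuming the `executorch::extension::Module` API; only the `delay_tokens` method name comes from this PR, while the helper and its fallback behavior are hypothetical, not the actual runner code:

```cpp
#include <cstdint>

#include <executorch/extension/module/module.h>

using ::executorch::extension::Module;

// Hypothetical helper: read the exported `delay_tokens` constant method,
// falling back to a caller-provided default when the model was exported
// without this metadata (e.g., an older .pte file).
int64_t read_delay_tokens(Module& module, int64_t fallback) {
  auto result = module.execute("delay_tokens");
  if (!result.ok() || result->empty()) {
    return fallback;
  }
  return result->at(0).toInt();
}
```

Reading the value from the exported program (rather than the tokenizer file) keeps the runner in sync with whatever delay the model was actually exported with, which matches the draft-comment reasoning above.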
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | Splits config types; adds delay_tokens_; updates streaming flush/decode signatures/docs. |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | Reads delay_tokens; removes text-only flush; pads silence tail and drains via normal streaming path. |
| examples/models/voxtral_realtime/main.cpp | Renames CLI token-limit flag to --offline_max_new_tokens; enforces it is offline-only; adds streaming config helper. |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Exports delay_tokens as a constant method via metadata. |
| examples/models/voxtral_realtime/model.md | Updates documentation of streaming flush() semantics. |
| examples/models/voxtral_realtime/README.md | Updates CLI docs for the renamed offline-only max token flag. |
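The offline/streaming config split described in the table might look roughly like the following; the struct and field names here are guesses for illustration, and the actual types in `voxtral_realtime_runner.h` and the flag handling in `main.cpp` will differ:

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative only: separate offline and streaming transcription configs,
// so token limits apply solely to the offline path (names are assumptions).
struct OfflineTranscribeConfig {
  int64_t max_new_tokens = 256;  // bounded decode for offline transcription
};

struct StreamingTranscribeConfig {
  // No token cap: the stream ends by draining the padded audio tail.
};

// Hypothetical CLI validation mirroring the "offline-only" flag rule.
void validate_flags(bool streaming, bool offline_max_new_tokens_set) {
  if (streaming && offline_max_new_tokens_set) {
    throw std::runtime_error(
        "--offline_max_new_tokens is only valid without --streaming");
  }
}
```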
```cpp
int64_t pad_to =
    (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
    right_lookahead;
```
In `flush()`, `pad_to` is computed using `ceil(remaining / step)`. When the stream ends with fewer than `stft_right_lookahead_` samples after the last full step (i.e., `audio_len = N*step + r` with `0 < r < right_lookahead`), `remaining` becomes `step + r`, so `ceil(remaining / step)` evaluates to 2 and you end up padding an extra full step of silence. That adds an unintended extra decode step (beyond the last real audio step + lookahead + delay) and can reintroduce spurious trailing tokens.

Consider computing the number of pending audio steps based on `(remaining - right_lookahead)` (clamped), rounding that up to a whole number of steps while ensuring at least 1 step, then adding `delay_tokens_` steps on top. This avoids over-padding by a full step when only the lookahead is partially missing. A worked numeric check follows the suggested change below.
```diff
- int64_t pad_to =
-     (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
-     right_lookahead;
+ // Compute the number of pending audio steps excluding the right lookahead,
+ // ensuring at least one step, then add the model's delay steps.
+ const int64_t pending_audio =
+     std::max<int64_t>(remaining - right_lookahead, 0);
+ int64_t pending_audio_steps =
+     (pending_audio + step - 1) / step; // ceil(pending_audio / step)
+ if (pending_audio_steps < 1) {
+   pending_audio_steps = 1;
+ }
+ const int64_t total_steps = pending_audio_steps + right_pad_audio_steps;
+ const int64_t pad_to = total_steps * step + right_lookahead;
```
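To make the over-padding concrete, here is a small standalone check with made-up numbers (the `step`, `right_lookahead`, and `remaining` values are illustrative, not the model's real frame sizes) comparing the original formula against the suggested one:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative values only: suppose each audio step is 8 samples, the
  // STFT right lookahead is 3 samples, and the model needs 2 delay steps.
  const int64_t step = 8;
  const int64_t right_lookahead = 3;
  const int64_t right_pad_audio_steps = 2;  // derived from delay_tokens
  // Stream ends 2 samples into the lookahead region: remaining = step + r.
  const int64_t remaining = step + 2;  // r = 2, with 0 < r < right_lookahead

  // Original: ceil(remaining / step) counts the partial lookahead as a
  // whole extra step.
  const int64_t pad_orig =
      (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
      right_lookahead;

  // Suggested: exclude the lookahead before rounding up to whole steps.
  const int64_t pending_audio =
      std::max<int64_t>(remaining - right_lookahead, 0);
  int64_t pending_steps = (pending_audio + step - 1) / step;
  if (pending_steps < 1) {
    pending_steps = 1;
  }
  const int64_t pad_fixed =
      (pending_steps + right_pad_audio_steps) * step + right_lookahead;

  // Prints pad_orig=35 pad_fixed=27: one full silent step (8 samples) less.
  std::printf("pad_orig=%lld pad_fixed=%lld\n",
              static_cast<long long>(pad_orig),
              static_cast<long long>(pad_fixed));
  return 0;
}
```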
Thanks @mattjcly for the contribution. The bug fix looks correct.
@pytorchbot cherry-pick --onto release/1.2 -c release |
Cherry picking #18339: the cherry pick PR is at #18387 (cherry picked from commit 4c51461). The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
### Summary

`VoxtralRealtimeRunner` was outputting excessive duplicate tokens/gibberish on stream flush. For an audio file where I say "The weather is clear today", run like

```
voxtral_realtime_runner \
  --model_path model.pte \
  --tokenizer_path tekken.json \
  --preprocessor_path preprocessor.pte \
  --streaming \
  --audio_path audio.wav
```

I would get output: `The weather is clear todayoday.</s>`

I also experienced this with periods and in many other circumstances, with repeating tokens at the end of the stream.

Upon investigating vLLM (Mistral's recommended inference runner for [Voxtral Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602#vllm-recommended)), I observed that vLLM finishes the stream by closing the streaming input and draining model-defined right-padding audio, whereas the ExecuTorch `flush()` finished by switching into post-audio text-only decoding after audio ended. See the vLLM reference: https://github.com/vllm-project/vllm/blob/2f9f946/vllm/model_executor/models/voxtral_realtime.py#L239-L270.

Therefore, apply similar logic here by converting the model-defined transcription delay into a finite number of trailing silent streaming steps to properly conclude the stream; a rough sketch of this draining approach follows.
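As a hypothetical sketch only: the function shape, its parameters, and the `feed_silence`/`decode_step` hooks below are stand-ins for the runner's real streaming members and signatures, which differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical sketch of the audio-conditioned flush: instead of switching
// to text-only decoding once audio ends, pad the tail with silent steps and
// drain them through the normal streaming path. `feed_silence` and
// `decode_step` stand in for the runner's real streaming hooks.
void flush_sketch(int64_t buffered_samples,
                  int64_t step,
                  int64_t right_lookahead,
                  int64_t delay_tokens,
                  const std::function<void(int64_t)>& feed_silence,
                  const std::function<void()>& decode_step) {
  // Pending real-audio steps, excluding the partially-missing lookahead.
  const int64_t pending_audio =
      std::max<int64_t>(buffered_samples - right_lookahead, 0);
  int64_t pending_steps = (pending_audio + step - 1) / step;
  if (pending_steps < 1) {
    pending_steps = 1;
  }
  // delay_tokens extra silent steps give the model room to emit its
  // delayed transcription before the stream ends.
  const int64_t total_steps = pending_steps + delay_tokens;
  for (int64_t i = 0; i < total_steps; ++i) {
    feed_silence(step);  // append one step's worth of zero samples
    decode_step();       // run one normal streaming decode step
  }
}
```

The key property of this shape is that the flush path reuses the same streaming decode step as live audio, so the model sees trailing silence rather than an abrupt switch to text-only decoding, which is what lets it emit its delayed tail text without duplicates.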
After this change, the same command outputs:

```
The weather is clear today.
```

as I would expect. No `</s>`, because, like vLLM, the stream ends by naturally draining the padded audio tail and letting the model emit whatever final delayed text it wants.

### Test plan

Tested with the above example and a few other audio files to observe the behavior improvement and the absence of gibberish/incorrect end-of-stream output.
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng