Fix Voxtral Realtime runner flush #18339
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18339

Note: links to docs will display an error until the docs builds have been completed.

❌ 1 Awaiting Approval, 2 Cancelled Jobs, 2 Unrelated Failures as of commit 579bb82 with merge base eb92cec.

- CANCELLED JOBS: the following jobs were cancelled. Please retry.
- BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
I realize now that `--delay-tokens` is a configurable model export variable, and as such we probably want to read this from exported model data instead of the tokenizer file. Moved to draft for this reason.
Pull request overview
This PR updates the Voxtral Realtime example runner's streaming `flush()` behavior to avoid duplicate trailing tokens: the stream now ends by draining a finite padded audio tail (similar to vLLM) rather than by switching into post-audio text-only decoding.
Changes:
- Split transcription config into offline vs. streaming configs and update CLI flagging accordingly.
- Add `delay_tokens` as exported model constant metadata and use it to pad trailing silence during streaming flush (see the metadata-reading sketch after this list).
- Update runner/docs to reflect the new audio-conditioned flush path.
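For orientation, here is a minimal sketch of how a runner could read such an exported constant method, assuming the `executorch::extension::Module` API; only the `delay_tokens` method name comes from this PR, while the helper and its fallback behavior are hypothetical, not the actual runner code:

```cpp
#include <cstdint>

#include <executorch/extension/module/module.h>

using ::executorch::extension::Module;

// Hypothetical helper: read the exported `delay_tokens` constant method,
// falling back to a caller-provided default when the model was exported
// without this metadata (e.g., an older .pte file).
int64_t read_delay_tokens(Module& module, int64_t fallback) {
  auto result = module.execute("delay_tokens");
  if (!result.ok() || result->empty()) {
    return fallback;
  }
  return result->at(0).toInt();
}
```

Reading the value from the exported program (rather than the tokenizer file) keeps the runner in sync with whatever delay the model was actually exported with, which matches the draft-comment reasoning above.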
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | Splits config types; adds delay_tokens_; updates streaming flush/decode signatures/docs. |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | Reads delay_tokens; removes text-only flush; pads silence tail and drains via normal streaming path. |
| examples/models/voxtral_realtime/main.cpp | Renames CLI token-limit flag to --offline_max_new_tokens; enforces it is offline-only; adds streaming config helper. |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Exports delay_tokens as a constant method via metadata. |
| examples/models/voxtral_realtime/model.md | Updates documentation of streaming flush() semantics. |
| examples/models/voxtral_realtime/README.md | Updates CLI docs for the renamed offline-only max token flag. |
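The offline/streaming config split described in the table might look roughly like the following; the struct and field names here are guesses for illustration, and the actual types in `voxtral_realtime_runner.h` and the flag handling in `main.cpp` will differ:

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative only: separate offline and streaming transcription configs,
// so token limits apply solely to the offline path (names are assumptions).
struct OfflineTranscribeConfig {
  int64_t max_new_tokens = 256;  // bounded decode for offline transcription
};

struct StreamingTranscribeConfig {
  // No token cap: the stream ends by draining the padded audio tail.
};

// Hypothetical CLI validation mirroring the "offline-only" flag rule.
void validate_flags(bool streaming, bool offline_max_new_tokens_set) {
  if (streaming && offline_max_new_tokens_set) {
    throw std::runtime_error(
        "--offline_max_new_tokens is only valid without --streaming");
  }
}
```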
```cpp
int64_t pad_to =
    (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
    right_lookahead;
```
In `flush()`, `pad_to` is computed using `ceil(remaining / step)`. When the stream ends with fewer than `stft_right_lookahead_` samples after the last full step (i.e., `audio_len = N*step + r` with `0 < r < right_lookahead`), `remaining` becomes `step + r`, so `ceil(remaining / step)` evaluates to 2 and you end up padding an extra full step of silence. That adds an unintended extra decode step (beyond the last real audio step + lookahead + delay) and can reintroduce spurious trailing tokens.

Consider computing the number of pending audio steps based on `(remaining - right_lookahead)` (clamped), rounding that up to a whole number of steps while ensuring at least 1 step, then adding `delay_tokens_` steps on top. This avoids over-padding by a full step when only the lookahead is partially missing. A worked numeric check follows the suggested change below.
```diff
- int64_t pad_to =
-     (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
-     right_lookahead;
+ // Compute the number of pending audio steps excluding the right lookahead,
+ // ensuring at least one step, then add the model's delay steps.
+ const int64_t pending_audio =
+     std::max<int64_t>(remaining - right_lookahead, 0);
+ int64_t pending_audio_steps =
+     (pending_audio + step - 1) / step; // ceil(pending_audio / step)
+ if (pending_audio_steps < 1) {
+   pending_audio_steps = 1;
+ }
+ const int64_t total_steps = pending_audio_steps + right_pad_audio_steps;
+ const int64_t pad_to = total_steps * step + right_lookahead;
```
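To make the over-padding concrete, here is a small standalone check with made-up numbers (the `step`, `right_lookahead`, and `remaining` values are illustrative, not the model's real frame sizes) comparing the original formula against the suggested one:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative values only: suppose each audio step is 8 samples, the
  // STFT right lookahead is 3 samples, and the model needs 2 delay steps.
  const int64_t step = 8;
  const int64_t right_lookahead = 3;
  const int64_t right_pad_audio_steps = 2;  // derived from delay_tokens
  // Stream ends 2 samples into the lookahead region: remaining = step + r.
  const int64_t remaining = step + 2;  // r = 2, with 0 < r < right_lookahead

  // Original: ceil(remaining / step) counts the partial lookahead as a
  // whole extra step.
  const int64_t pad_orig =
      (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
      right_lookahead;

  // Suggested: exclude the lookahead before rounding up to whole steps.
  const int64_t pending_audio =
      std::max<int64_t>(remaining - right_lookahead, 0);
  int64_t pending_steps = (pending_audio + step - 1) / step;
  if (pending_steps < 1) {
    pending_steps = 1;
  }
  const int64_t pad_fixed =
      (pending_steps + right_pad_audio_steps) * step + right_lookahead;

  // Prints pad_orig=35 pad_fixed=27: one full silent step (8 samples) less.
  std::printf("pad_orig=%lld pad_fixed=%lld\n",
              static_cast<long long>(pad_orig),
              static_cast<long long>(pad_fixed));
  return 0;
}
```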
Thanks @mattjcly for the contribution. The bug fix looks correct.
@pytorchbot cherry-pick --onto release/1.2 -c release |
Cherry picking #18339: the cherry pick PR is at #18387 (cherry picked from commit 4c51461). The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
### Summary

`VoxtralRealtimeRunner` was outputting excessive duplicate tokens/gibberish on stream flush. For an audio file where I say "The weather is clear today", run like

```
voxtral_realtime_runner \
  --model_path model.pte \
  --tokenizer_path tekken.json \
  --preprocessor_path preprocessor.pte \
  --streaming \
  --audio_path audio.wav
```

I would get output: `The weather is clear todayoday.</s>`

I also experienced this with periods and in many other circumstances, with repeating tokens at the end of the stream.

Upon investigating vLLM (Mistral's recommended inference runner for [Voxtral Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602#vllm-recommended)), I observed that vLLM finishes the stream by closing the streaming input and draining model-defined right-padding audio, whereas the ExecuTorch `flush()` finished by switching into post-audio text-only decoding after audio ended. See the vLLM reference: https://github.com/vllm-project/vllm/blob/2f9f946/vllm/model_executor/models/voxtral_realtime.py#L239-L270.

Therefore, apply similar logic here by converting the model-defined transcription delay into a finite number of trailing silent streaming steps to properly conclude the stream; a rough sketch of this draining approach follows.
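As a hypothetical sketch only: the function shape, its parameters, and the `feed_silence`/`decode_step` hooks below are stand-ins for the runner's real streaming members and signatures, which differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical sketch of the audio-conditioned flush: instead of switching
// to text-only decoding once audio ends, pad the tail with silent steps and
// drain them through the normal streaming path. `feed_silence` and
// `decode_step` stand in for the runner's real streaming hooks.
void flush_sketch(int64_t buffered_samples,
                  int64_t step,
                  int64_t right_lookahead,
                  int64_t delay_tokens,
                  const std::function<void(int64_t)>& feed_silence,
                  const std::function<void()>& decode_step) {
  // Pending real-audio steps, excluding the partially-missing lookahead.
  const int64_t pending_audio =
      std::max<int64_t>(buffered_samples - right_lookahead, 0);
  int64_t pending_steps = (pending_audio + step - 1) / step;
  if (pending_steps < 1) {
    pending_steps = 1;
  }
  // delay_tokens extra silent steps give the model room to emit its
  // delayed transcription before the stream ends.
  const int64_t total_steps = pending_steps + delay_tokens;
  for (int64_t i = 0; i < total_steps; ++i) {
    feed_silence(step);  // append one step's worth of zero samples
    decode_step();       // run one normal streaming decode step
  }
}
```

The key property of this shape is that the flush path reuses the same streaming decode step as live audio, so the model sees trailing silence rather than an abrupt switch to text-only decoding, which is what lets it emit its delayed tail text without duplicates.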
After this change, the same command outputs:

```
The weather is clear today.
```

as I would expect. No `</s>`, because, like vLLM, the stream ends by naturally draining the padded audio tail and letting the model emit whatever final delayed text it wants.

### Test plan

Tested with the above example and a few other audio files to observe the behavior improvement and the absence of gibberish/incorrect end-of-stream output.
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng