
Fix Voxtral Realtime runner flush #18339

Merged
mergennachin merged 3 commits into pytorch:main from mattjcly:matt/fix-voxtral-flush
Mar 20, 2026

Conversation

@mattjcly
Contributor

@mattjcly mattjcly commented Mar 19, 2026

Summary

`VoxtralRealtimeRunner` was outputting excessive duplicate tokens/gibberish on stream flush. For an audio file in which I say "The weather is clear today", running:

```
voxtral_realtime_runner \
--model_path model.pte \
--tokenizer_path tekken.json \
--preprocessor_path preprocessor.pte \
--streaming \
--audio_path audio.wav
```

I would get output: `The weather is clear todayoday.</s>`

I also saw this with periods, and in many other cases where tokens repeated at the end of the stream.

Upon investigating vLLM (Mistral's recommended inference runner for [Voxtral Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602#vllm-recommended)), I observed that vLLM finishes the stream by closing the streaming input and draining the model-defined right-padding audio, whereas the ExecuTorch `flush()` finished by switching into post-audio text-only decoding after the audio ended. See the vLLM reference: https://github.com/vllm-project/vllm/blob/2f9f946/vllm/model_executor/models/voxtral_realtime.py#L239-L270.

Therefore, apply similar logic here by converting the model-defined transcription delay into a finite number of trailing silent streaming steps to properly conclude the stream.
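
As a rough illustration of that logic, here is a minimal, self-contained sketch. All names (`StreamState`, `pending_audio`, `decode_streaming_step`) and constants are hypothetical stand-ins for the runner's real state, not the actual diff:

```
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for the runner's streaming state.
struct StreamState {
  std::deque<float> pending_audio;  // samples not yet consumed
  int64_t samples_per_step = 480;   // audio consumed per decode step (placeholder)
  int64_t delay_tokens = 6;         // transcription delay from exported metadata (placeholder)

  // Consume one step of audio and (in the real runner) emit decoded text.
  void decode_streaming_step() {
    for (int64_t i = 0; i < samples_per_step && !pending_audio.empty(); ++i) {
      pending_audio.pop_front();
    }
  }

  // End of stream: instead of switching to post-audio text-only decoding,
  // append delay_tokens steps of silence and drain them through the normal
  // streaming path, letting the model emit its delayed text along the way.
  void flush() {
    std::vector<float> silence(delay_tokens * samples_per_step, 0.0f);
    pending_audio.insert(pending_audio.end(), silence.begin(), silence.end());
    while (!pending_audio.empty()) {
      decode_streaming_step();
    }
  }
};
```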

After this, the same command outputs:

```
The weather is clear today.
```

as I would expect. There is no `</s>` because, like vLLM, the stream ends by naturally draining the padded audio tail and letting the model emit whatever final delayed text it wants.

Test plan

Tested with the above example and a few other audio files to confirm the improved behavior and the absence of gibberish/incorrect end-of-stream output.

cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng

@mattjcly mattjcly requested a review from lucylq as a code owner March 19, 2026 20:08
@pytorch-bot

pytorch-bot Bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Awaiting Approval, 2 Cancelled Jobs, 2 Unrelated Failures

As of commit 579bb82 with merge base eb92cec:

AWAITING APPROVAL - The following workflow needs approval before CI can run:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mattjcly mattjcly marked this pull request as draft March 19, 2026 21:37
@mattjcly
Contributor Author

mattjcly commented Mar 19, 2026

I realize now that `--delay-tokens` is a configurable export-time model variable, and as such we probably want to read it from the exported model data instead of the tokenizer file. Moved to draft for this reason.
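
For illustration, the runner-side read might look like the following minimal sketch. It assumes ExecuTorch's `Module::get()` convenience (run a named method, return its first output) and a constant method named `delay_tokens` attached at export time (e.g., via the `constant_methods` argument to `to_edge` on the Python side); this is a hedged sketch, not the PR's exact code:

```
#include <cstdint>

#include <executorch/extension/module/module.h>

// Read the exported `delay_tokens` constant method from the .pte instead
// of the tokenizer file; fall back to zero delay if it is absent.
int64_t read_delay_tokens(executorch::extension::Module& module) {
  auto result = module.get("delay_tokens");
  if (!result.ok()) {
    return 0;
  }
  return result.get().toInt();
}
```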

@mattjcly mattjcly marked this pull request as ready for review March 19, 2026 22:19
@mergennachin mergennachin requested review from Copilot, manuelcandales and mergennachin and removed request for lucylq March 19, 2026 23:00
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the Voxtral Realtime example runner’s streaming flush() behavior to avoid duplicate trailing tokens by ending the stream via draining a finite padded audio tail (similar to vLLM), rather than switching into post-audio text-only decoding.

Changes:

  • Split the transcription config into offline vs. streaming configs and update the CLI flags accordingly.
  • Add `delay_tokens` as exported model constant metadata and use it to pad trailing silence during streaming flush.
  • Update the runner and docs to reflect the new audio-conditioned flush path.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • examples/models/voxtral_realtime/voxtral_realtime_runner.h: Splits config types; adds delay_tokens_; updates streaming flush/decode signatures and docs.
  • examples/models/voxtral_realtime/voxtral_realtime_runner.cpp: Reads delay_tokens; removes the text-only flush; pads a silence tail and drains it via the normal streaming path.
  • examples/models/voxtral_realtime/main.cpp: Renames the CLI token-limit flag to --offline_max_new_tokens; enforces that it is offline-only; adds a streaming config helper (see the usage example after this list).
  • examples/models/voxtral_realtime/export_voxtral_rt.py: Exports delay_tokens as a constant method via metadata.
  • examples/models/voxtral_realtime/model.md: Updates the documentation of streaming flush() semantics.
  • examples/models/voxtral_realtime/README.md: Updates the CLI docs for the renamed offline-only max-token flag.
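
To make the flag rename concrete, an offline (non-streaming) invocation would presumably look like the following; the flag name comes from the summary above, and the value is illustrative:

```
voxtral_realtime_runner \
--model_path model.pte \
--tokenizer_path tekken.json \
--preprocessor_path preprocessor.pte \
--audio_path audio.wav \
--offline_max_new_tokens 128
```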


Comment on lines +677 to +679

```
int64_t pad_to =
    (((remaining + step - 1) / step) + right_pad_audio_steps) * step +
    right_lookahead;
```

Copilot AI Mar 19, 2026


In `flush()`, `pad_to` is computed using `ceil(remaining / step)`. When the stream ends with fewer than `stft_right_lookahead_` samples after the last full step (i.e., `audio_len = N * step + r` with `0 < r < right_lookahead`), `remaining` becomes `step + r`, so `ceil(remaining / step)` evaluates to 2 and you end up padding an extra full step of silence. That adds an unintended extra decode step (beyond the last real audio step + lookahead + delay) and can reintroduce spurious trailing tokens.

Consider computing the number of pending audio steps based on `remaining - right_lookahead` (clamped to zero) and rounding that up to whole steps, ensuring at least 1 step, then adding `delay_tokens_` steps on top. This avoids over-padding by a full step when only the lookahead is partially missing.

Suggested change
int64_t pad_to =
(((remaining + step - 1) / step) + right_pad_audio_steps) * step +
right_lookahead;
// Compute the number of pending audio steps excluding the right lookahead,
// ensuring at least one step, then add the model's delay steps.
const int64_t pending_audio =
std::max<int64_t>(remaining - right_lookahead, 0);
int64_t pending_audio_steps =
(pending_audio + step - 1) / step; // ceil(pending_audio / step)
if (pending_audio_steps < 1) {
pending_audio_steps = 1;
}
const int64_t total_steps = pending_audio_steps + right_pad_audio_steps;
const int64_t pad_to = total_steps * step + right_lookahead;
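
To check the arithmetic with illustrative numbers (not the model's actual constants): let `step = 480`, `right_lookahead = 160`, and suppose the stream ends with `remaining = 580`, i.e. one full step plus `r = 100` extra samples where `0 < r < right_lookahead`. The original formula computes `ceil(580 / 480) = 2` pending steps and pads a full extra step of silence; the suggested version computes `pending_audio = 580 - 160 = 420`, hence `ceil(420 / 480) = 1` step, matching the single real audio step that remains.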

@mergennachin
Contributor

Thanks @mattjcly for the contribution. The bug fix looks correct.

@mergennachin mergennachin merged commit 4c51461 into pytorch:main Mar 20, 2026
238 of 242 checks passed
@manuelcandales
Contributor

@pytorchbot cherry-pick --onto release/1.2 -c release

pytorchbot pushed a commit that referenced this pull request Mar 20, 2026
@pytorchbot
Collaborator

Cherry picking #18339

The cherry-pick PR is at #18387.

@nil-is-all nil-is-all added the module: examples Issues related to demos under examples/ label Apr 1, 2026

Labels

  • ciflow/cuda
  • ciflow/metal
  • CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
  • module: examples: Issues related to demos under examples/
