[multimodal] Let Audio take float data blob #14427
If the processed audio has gone through a Mel transform, the spectrogram values are floats. We should allow the `Audio` class to take this, since the multimodal runner pybind API will have to accept preprocessed input. Once we have the pybind API we can do something like:

```python
model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)

audio_url = "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_url},
            {
                "type": "text",
                "text": "What can you tell me about this audio?",
            },
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True, return_tensors="pt"
)

inputs_combined = [
    make_text_input("<s>[INST][BEGIN_AUDIO]"),
    make_audio_input(inputs["input_features"]),
    make_text_input("\nWhat can you tell me about this audio?[/INST]"),
]

runner = MultimodalRunner("voxtral.pte", "tekken.json", None)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```
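For context, here is a minimal sketch of what the float-accepting path could look like on the Python side. `make_audio_input` appears in the example above, but the `Audio` layout sketched here (a flat float `data` buffer plus `batch_size`/`n_bins`/`n_frames`) is an assumption for illustration, not the merged API:

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class Audio:
    # Hypothetical mirror of the runner's Audio input: after this change it
    # can carry float feature data instead of only raw uint8 audio bytes.
    data: List[float]
    batch_size: int
    n_bins: int
    n_frames: int


def make_audio_input(features: torch.Tensor) -> Audio:
    """Wrap a preprocessed float spectrogram for the multimodal runner.

    `features` is the (batch, n_mels, n_frames) float tensor that the HF
    processor returns as `input_features` after the Mel transform.
    """
    features = features.to(torch.float32).contiguous()
    batch, n_bins, n_frames = features.shape
    return Audio(features.flatten().tolist(), batch, n_bins, n_frames)
```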
See inline
@pytorchbot cherry-pick --onto release/1.0 -c critical
(cherry picked from commit 8b11418)
Cherry picking #14427: the cherry pick PR is at #14971. It is recommended to link a critical cherry pick PR with an issue.