Include audio preprocessing for raw audio tensor #13752

jackzhxng · 2025-08-28T03:21:38Z

Summary

Runs audio preprocessing (mel spectrogram conversion) on a raw audio tensor `.bin` file, using a `.pte` exported from https://github.com/pytorch/executorch/blob/main/extension/audio/mel_spectrogram.py

Current limitation: the spectrogram processing module does not batch its output, so only audio shorter than 30 seconds is supported.
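To make the length limit concrete, here is a minimal sketch of loading a raw audio tensor and rejecting clips that are too long before they reach the preprocessor. The `float32` sample format, the 16 kHz sample rate, and the helper name `load_raw_audio` are assumptions for illustration; the PR description only states that the input is a raw audio tensor `.bin` file and that audio must be under 30 seconds.

```python
import numpy as np

# Assumptions (not stated in the PR): mono float32 samples at 16 kHz.
SAMPLE_RATE_HZ = 16_000
MAX_SECONDS = 30.0  # limit quoted in the PR summary


def load_raw_audio(path, sample_rate=SAMPLE_RATE_HZ, max_seconds=MAX_SECONDS):
    """Read a headerless .bin file of samples and enforce the length limit."""
    samples = np.fromfile(path, dtype=np.float32)
    duration_s = samples.size / sample_rate
    if duration_s >= max_seconds:
        raise ValueError(
            f"audio is {duration_s:.2f} s; only clips under {max_seconds:.0f} s are supported"
        )
    return samples
```

Checking the length up front gives a clear error instead of a failure inside the exported preprocessing module; the actual runner may handle this differently.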

The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might
PyTorchObserver {"prompt_tokens":388,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1756351346381,"inference_end_ms":1756351362602,"prompt_eval_end_ms":1756351351435,"first_token_ms":1756351351435,"aggregate_sampling_time_ms":99,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:24.036773 executorch:stats.h:104]       Prompt Tokens: 388    Generated Tokens: 99
I 00:00:24.036800 executorch:stats.h:110]       Model Load Time:                0.000000 (seconds)
I 00:00:24.036805 executorch:stats.h:117]       Total inference time:           16.221000 (seconds)              Rate:  6.103200 (tokens/second)
I 00:00:24.036815 executorch:stats.h:127]               Prompt evaluation:      5.054000 (seconds)               Rate:  76.770875 (tokens/second)
I 00:00:24.036819 executorch:stats.h:136]               Generated 99 tokens:    11.167000 (seconds)              Rate:  8.865407 (tokens/second)
I 00:00:24.036822 executorch:stats.h:147]       Time to first generated token:  5.054000 (seconds)
I 00:00:24.036828 executorch:stats.h:153]       Sampling time over 487 tokens:  0.099000 (seconds)
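The figures in the log above can be reproduced from the `PyTorchObserver` JSON line. The sketch below is an inference from the numbers in this PR, not a copy of the logic in `stats.h`; the function name `summarize` and the output keys are made up for illustration.

```python
import json

# The PyTorchObserver payload, copied from the log in this PR.
OBSERVER_JSON = (
    '{"prompt_tokens":388,"generated_tokens":99,"model_load_start_ms":0,'
    '"model_load_end_ms":0,"inference_start_ms":1756351346381,'
    '"inference_end_ms":1756351362602,"prompt_eval_end_ms":1756351351435,'
    '"first_token_ms":1756351351435,"aggregate_sampling_time_ms":99,'
    '"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
)


def summarize(stats):
    """Derive durations (seconds) and rates (tokens/second) from raw timestamps."""
    scale = stats["SCALING_FACTOR_UNITS_PER_SECOND"]
    total_s = (stats["inference_end_ms"] - stats["inference_start_ms"]) / scale
    prompt_s = (stats["prompt_eval_end_ms"] - stats["inference_start_ms"]) / scale
    generate_s = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / scale
    first_token_s = (stats["first_token_ms"] - stats["inference_start_ms"]) / scale
    return {
        "total_s": total_s,
        "total_rate": stats["generated_tokens"] / total_s,
        "prompt_s": prompt_s,
        "prompt_rate": stats["prompt_tokens"] / prompt_s,
        "generate_s": generate_s,
        "generate_rate": stats["generated_tokens"] / generate_s,
        "time_to_first_token_s": first_token_s,
        "sampled_tokens": stats["prompt_tokens"] + stats["generated_tokens"],
        "sampling_s": stats["aggregate_sampling_time_ms"] / scale,
    }


summary = summarize(json.loads(OBSERVER_JSON))
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the overall rate divides generated tokens by the total inference time, prompt evaluation included, which is why it is lower than the generation-only rate.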

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]

pytorch-bot · 2025-08-28T03:21:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13752

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ca19f68 with merge base 99e6349:

NEW FAILURE - The following job has failed:

Build documentation / build (buck2) / Build doc (gh)
At least one of the pre-conditions you specified did not hold

This comment was automatically generated by Dr. CI and updates every 15 minutes.


(Messed up the merge for the original stack; this is a reland. Original PR with comments: #13752) Differential Revision: [D81498748](https://our.internmc.facebook.com/intern/diff/D81498748) [ghstack-poisoned]


Include audio preprocessing for raw audio tensor

e928315

[ghstack-poisoned]

jackzhxng requested a review from lucylq as a code owner August 28, 2025 03:21

This was referenced Aug 28, 2025

Add audio to multimodal runner #13662

Merged

Add Voxtral runner #13663

Merged

jackzhxng mentioned this pull request Aug 28, 2025

Make token and stat callback optional in multimodal runner #13664

Merged

meta-cla bot added the CLA Signed label Aug 28, 2025

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

844132d

ghstack-source-id: 1ce35e1 Pull Request resolved: #13752

jackzhxng added the release notes: examples label Aug 28, 2025

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

93791af

ghstack-source-id: 658b6f8 Pull Request resolved: #13752

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

21eb519

ghstack-source-id: d693c52 Pull Request resolved: #13752

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

c255041

ghstack-source-id: b2f0f19 Pull Request resolved: #13752

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

3b38157

ghstack-source-id: 397e8a3 Pull Request resolved: #13752

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

26d8ae7

ghstack-source-id: b193412 Pull Request resolved: #13752

jackzhxng added a commit that referenced this pull request Aug 28, 2025

Include audio preprocessing for raw audio tensor

64c2753

ghstack-source-id: b9c5709 Pull Request resolved: #13752

jackzhxng requested review from mergennachin, kimishpatel and kirklandsign August 29, 2025 17:25

mergennachin approved these changes Aug 29, 2025


jackzhxng merged commit ca19f68 into gh/jackzhxng/33/base Sep 2, 2025
111 of 112 checks passed

jackzhxng deleted the gh/jackzhxng/33/head branch September 2, 2025 02:54

jackzhxng had a problem deploying to cherry-pick-bot September 2, 2025 02:54 — with GitHub Actions Failure

pytorchbot mentioned this pull request Sep 2, 2025

Include audio preprocessing for raw audio tensor #13855

Merged

jackzhxng mentioned this pull request Sep 2, 2025

Include audio preprocessing for raw audio tensor #13873

Merged

kirklandsign pushed a commit that referenced this pull request Sep 3, 2025

Include audio preprocessing for raw audio tensor

f6fdd5d

ghstack-source-id: b9c5709 Pull Request resolved: #13752
