Whisper audio processor #13538

rohansjoshi · 2025-08-20T01:43:06Z

Summary:
Running Whisper to convert audio -> text consists of two steps:
(1) Audio preprocessor (aka Mel spectrogram feature extractor), and
(2) Whisper model (encoder+decoder)
Currently, examples/qualcomm/oss_scripts/whisper has a flow for exporting the Whisper encoder+decoder (2) and running it on device. That flow takes Mel spectrogram tensors as input and produces text output.

This class implements part (1), the audio processing stage, in PyTorch. It is equivalent to HuggingFace WhisperFeatureExtractor (which computes Mel spectrograms with NumPy). It takes in an audio waveform sampled at 16 kHz (as a 1D tensor) and outputs Mel spectrograms that can be fed directly to the Whisper model (2).
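As a rough illustration of what the audio processing stage involves, here is a minimal pure-Python sketch of the framing arithmetic and the log compression used by the reference Whisper feature extraction. It is not the class added in this PR. The 30-second chunk length and hop length of 160 are assumed from the WhisperFeatureExtractor defaults, and the helper names (`pad_or_trim`, `num_frames`, `log_mel_normalize`) are hypothetical. The STFT and Mel filterbank steps are omitted.

```python
import math

SAMPLE_RATE = 16_000   # Hz, as stated in the PR description
CHUNK_SECONDS = 30     # assumption: WhisperFeatureExtractor default chunk length
HOP_LENGTH = 160       # assumption: WhisperFeatureExtractor default hop length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # samples per fixed-size chunk


def pad_or_trim(waveform, length=N_SAMPLES):
    """Zero-pad or truncate a 1D sequence of samples to exactly `length`."""
    trimmed = list(waveform[:length])
    return trimmed + [0.0] * (length - len(trimmed))


def num_frames(n_samples=N_SAMPLES, hop_length=HOP_LENGTH):
    """Number of spectrogram frames kept for one chunk."""
    return n_samples // hop_length


def log_mel_normalize(mel_power):
    """Whisper-style log compression of a 2D list of Mel power values.

    Takes log10 with a 1e-10 floor, clamps everything to within 8 of the
    global peak, then shifts and scales by (x + 4) / 4.
    """
    log_spec = [[math.log10(max(v, 1e-10)) for v in row] for row in mel_power]
    peak = max(max(row) for row in log_spec)
    floor = peak - 8.0
    return [[(max(v, floor) + 4.0) / 4.0 for v in row] for row in log_spec]
```

Under these assumptions, one chunk is 480,000 samples long and yields 3,000 frames, so the padding step is what gives the model a fixed-size input regardless of clip length.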

The script (see test plan) compares this class against WhisperFeatureExtractor; the two have a very small numerical discrepancy (<1e-5).
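The test-plan script itself is not shown in this thread. A check of this kind could be sketched as below, assuming both feature extractors' outputs have been converted to equally shaped 2D lists. The function names and the use of a maximum absolute difference are assumptions for illustration, not the PR's actual test.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2D lists."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("shape mismatch between the two spectrograms")
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def within_tolerance(a, b, atol=1e-5):
    """True if every element of `a` is within `atol` of the matching element of `b`."""
    return max_abs_diff(a, b) < atol
```

With real tensors, the same comparison would normally be done with the framework's own closeness check rather than nested lists.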

Differential Revision: D80215714

pytorch-bot · 2025-08-20T01:43:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13538

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 17 Pending, 1 Unrelated Failure

As of commit d64bdba with merge base 9359481:

NEW FAILURES - The following jobs have failed:

Build documentation / build (buck2) / Build doc (gh)
At least one of the pre-conditions you specified did not hold
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh)
RuntimeError: Command docker exec -t 1c120c7182c529c69cc4f50e086c3afe50486e24f4f1dd248a4d78496336d167 /exec failed with exit code 1
pull / unittest-arm-backend-with-no-fvp (test_pytest_ops) / linux-job (gh)
RuntimeError: Command docker exec -t 030bab030167436ffaa17e5cb8c4b87f5e5c767f2ec2166fc953002fa18e3c65 /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-models-linux (emformer_transcribe, portable, linux.2xlarge) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-08-20T01:43:17Z

This pull request was exported from Phabricator. Differential Revision: D80215714

github-actions · 2025-08-20T01:43:52Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


facebook-github-bot · 2025-08-20T04:28:37Z

This pull request was exported from Phabricator. Differential Revision: D80215714


cccclai

Please update the output file in the unit test and it's all good

mergennachin · 2025-08-20T15:22:22Z

cc @msluszniak @chmjkb @mkopcins FYI, the audio preprocessing part can now just be exported

Reviewed By: jackzhxng

facebook-github-bot · 2025-08-20T21:40:12Z

This pull request was exported from Phabricator. Differential Revision: D80215714


facebook-github-bot · 2025-08-20T22:23:38Z

This pull request was exported from Phabricator. Differential Revision: D80215714


facebook-github-bot · 2025-08-20T23:03:13Z

This pull request was exported from Phabricator. Differential Revision: D80215714


facebook-github-bot · 2025-08-21T02:33:35Z

This pull request was exported from Phabricator. Differential Revision: D80215714

Differential Revision: D80215714 Pull Request resolved: pytorch#13538

rohansjoshi requested review from jackzhxng and lucylq as code owners August 20, 2025 01:43

meta-cla bot added the CLA Signed label Aug 20, 2025

facebook-github-bot added the fb-exported label Aug 20, 2025

rohansjoshi force-pushed the export-D80215714 branch from 50c5ad4 to 7723951 August 20, 2025 04:24

rohansjoshi force-pushed the export-D80215714 branch from 7723951 to f3acabf August 20, 2025 04:28

cccclai approved these changes Aug 20, 2025


mergennachin requested a review from kimishpatel August 20, 2025 15:10

rohansjoshi force-pushed the export-D80215714 branch from f3acabf to 4ad4ab5 August 20, 2025 21:40

rohansjoshi force-pushed the export-D80215714 branch from 4ad4ab5 to 795b80f August 20, 2025 22:20

rohansjoshi force-pushed the export-D80215714 branch from 795b80f to da05959 August 20, 2025 22:23

rohansjoshi force-pushed the export-D80215714 branch from da05959 to 23c0fc6 August 20, 2025 22:59

rohansjoshi force-pushed the export-D80215714 branch from 23c0fc6 to f90b4fe August 20, 2025 23:03

rohansjoshi force-pushed the export-D80215714 branch from f90b4fe to 186fd1f August 21, 2025 02:29

rohansjoshi force-pushed the export-D80215714 branch from 186fd1f to d64bdba August 21, 2025 02:33

facebook-github-bot merged commit 624b38e into pytorch:main Aug 21, 2025
99 of 106 checks passed

agrima1304 pushed a commit to agrima1304/executorch that referenced this pull request Aug 26, 2025

Whisper audio processor

4019da4

Differential Revision: D80215714 Pull Request resolved: pytorch#13538
