Whisper audio processor #13538
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13538. Note: links to docs will display an error until the docs builds have completed.
As of commit d64bdba with merge base 9359481: ❌ 3 new failures, 17 pending, 1 unrelated failure.
FLAKY: the unrelated job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D80215714
Summary: Running Whisper to convert audio to text consists of two steps:
(1) an audio preprocessor (a.k.a. the Mel spectrogram feature extractor), and
(2) the Whisper model (encoder + decoder).

Currently, in examples/qualcomm/oss_scripts/whisper, we have a flow for exporting the Whisper encoder + decoder (2) and running it on device. It takes Mel spectrogram tensors as input and produces text output.

**This class implements part (1), the audio processing stage, in PyTorch.** It is equivalent to HuggingFace's WhisperFeatureExtractor (which computes Mel spectrograms with NumPy). It takes an audio waveform at 16 kHz (as a 1D tensor) and outputs Mel spectrograms that can be fed directly to the Whisper model (2).

The script (see test plan) compares this class against WhisperFeatureExtractor; the two have a very small numerical discrepancy (<1e-5).

Differential Revision: D80215714
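For context, here is a minimal sketch of what such a PyTorch-native log-Mel front end can look like, using Whisper's standard front-end parameters (16 kHz input, 25 ms window / n_fft=400, 10 ms hop / hop_length=160, 80 Mel bins, 30-second chunks). The class name and exact structure are illustrative assumptions, not the code in this PR:

```python
import torch
import torchaudio


class MelSpectrogramProcessor(torch.nn.Module):
    """Illustrative Whisper-style log-Mel feature extractor (not this PR's code)."""

    def __init__(self, n_mels: int = 80, chunk_seconds: int = 30, sample_rate: int = 16000):
        super().__init__()
        self.n_samples = chunk_seconds * sample_rate  # 480000 samples = 30 s at 16 kHz
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=400,           # 25 ms window
            hop_length=160,      # 10 ms hop
            n_mels=n_mels,
            power=2.0,           # power spectrogram, as in Whisper
            norm="slaney",
            mel_scale="slaney",  # librosa-style filter bank, matching HF/Whisper
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Pad (or truncate) the 1D 16 kHz waveform to exactly 30 s, as Whisper expects.
        waveform = waveform[: self.n_samples]
        waveform = torch.nn.functional.pad(waveform, (0, self.n_samples - waveform.numel()))
        mel = self.mel(waveform)[:, :-1]  # drop the trailing frame -> 3000 frames
        log_mel = torch.clamp(mel, min=1e-10).log10()
        # Whisper's dynamic-range compression: clamp to 8 below the peak, then rescale.
        log_mel = torch.maximum(log_mel, log_mel.max() - 8.0)
        return (log_mel + 4.0) / 4.0
```

And a sketch of the kind of parity check the test plan describes, against HuggingFace's NumPy-based extractor:

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor

extractor = WhisperFeatureExtractor()  # defaults: 16 kHz, 80 Mel bins, 30 s chunks
waveform = np.random.randn(16000 * 5).astype(np.float32)  # 5 s of dummy audio

reference = extractor(waveform, sampling_rate=16000, return_tensors="pt").input_features[0]
candidate = MelSpectrogramProcessor()(torch.from_numpy(waveform))

print(torch.abs(reference - candidate).max())  # should be tiny (<1e-5) if the pipelines match
```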
Force-pushed 50c5ad4 to 7723951 (Compare)
Force-pushed 7723951 to f3acabf (Compare)
Please update the output file in the unit test, and it's all good.
cc @msluszniak @chmjkb @mkopcins FYI, the audio preprocessing part can now just be exported.
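As a rough illustration of that point, a hedged sketch of capturing such a module with torch.export; this assumes the illustrative MelSpectrogramProcessor from the sketch above, and the subsequent ExecuTorch lowering steps are not shown:

```python
import torch
from torch.export import export

# Assumes the illustrative MelSpectrogramProcessor sketched earlier.
processor = MelSpectrogramProcessor().eval()
example_inputs = (torch.randn(16000 * 30),)  # 30 s of 16 kHz audio

# Capture the preprocessing graph; lowering to ExecuTorch
# (to_edge / to_executorch) would follow from here.
exported_program = export(processor, example_inputs)
print(exported_program.graph_module.graph)
```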
Force-pushed f3acabf to 4ad4ab5 (Compare)
Force-pushed 4ad4ab5 to 795b80f (Compare)
Force-pushed 795b80f to da05959 (Compare)
Force-pushed da05959 to 23c0fc6 (Compare)
Force-pushed 23c0fc6 to f90b4fe (Compare)
Force-pushed f90b4fe to 186fd1f (Compare)
Force-pushed 186fd1f to d64bdba (Compare)
Differential Revision: D80215714
Pull Request resolved: pytorch#13538