[multimodal] Let Audio take float data blob #14427
If the processed audio has gone through a Mel transform, the spectrogram values are floats. We should allow the `Audio` class to take this, since the multimodal runner pybind API will have to accept preprocessed input. Once we have the pybind API we can do something like:

```python
model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)

audio_url = "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_url},
            {
                "type": "text",
                "text": "What can you tell me about this audio?",
            },
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True, return_tensors="pt"
)

inputs_combined = [
    make_text_input("<s>[INST][BEGIN_AUDIO]"),
    make_audio_input(inputs["input_features"]),
    make_text_input("\nWhat can you tell me about this audio?[/INST]"),
]

runner = MultimodalRunner("voxtral.pte", "tekken.json", None)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```
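For context, here is a minimal sketch of what the float-accepting path could look like on the Python side. `make_audio_input` appears in the example above, but the `Audio` layout sketched here (a flat float `data` buffer plus `batch_size`/`n_bins`/`n_frames`) is an assumption for illustration, not the merged API:

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class Audio:
    # Hypothetical mirror of the runner's Audio input: after this change it
    # can carry float feature data instead of only raw uint8 audio bytes.
    data: List[float]
    batch_size: int
    n_bins: int
    n_frames: int


def make_audio_input(features: torch.Tensor) -> Audio:
    """Wrap a preprocessed float spectrogram for the multimodal runner.

    `features` is the (batch, n_mels, n_frames) float tensor that the HF
    processor returns as `input_features` after the Mel transform.
    """
    features = features.to(torch.float32).contiguous()
    batch, n_bins, n_frames = features.shape
    return Audio(features.flatten().tolist(), batch, n_bins, n_frames)
```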
See inline
@pytorchbot cherry-pick --onto release/1.0 -c critical
(cherry picked from commit 8b11418)
Cherry picking #14427: the cherry pick PR is at #14971. It is recommended to link a critical cherry pick PR with an issue.