Feature request
Hello, I would like to request support for audio inputs in the apply_chat_template function.
Motivation
With the rapid advancement of multimodal models, audio has become an increasingly important input alongside images and text. Models such as Qwen2-Audio and Phi-4-multimodal now support audio understanding, making this feature essential for modern multimodal applications.
Supporting audio inputs would enable conversations like the following:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Follow the instruction in the audio with this image."},
        ],
    },
]
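For illustration, here is a minimal sketch of how such a conversation might be consumed once audio is supported. The Qwen2-Audio checkpoint name and the exact keyword arguments accepted by a processor-level apply_chat_template are assumptions for the sake of the example, not a confirmed API:

from transformers import AutoProcessor

# Sketch only: assumes apply_chat_template learns to load/download "audio"
# entries the same way "image" entries are handled today. The checkpoint
# and keyword arguments below are illustrative assumptions.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

inputs = processor.apply_chat_template(
    messages,  # the multimodal conversation defined above
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)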
This enhancement would significantly expand the library's ability to handle the full spectrum of multimodal inputs that state-of-the-art models now support, keeping transformers at the forefront of multimodal AI development.
Your contribution
I've tested an implementation of this with several multimodal models, and it handles audio inputs alongside images and text well. I'd be happy to contribute the code upstream if there's interest.