Add Audio inputs available in apply_chat_template #36769

Closed
@junnei

Description

Feature request

Hello, I would like to request support for audio processing in the apply_chat_template function.

Motivation

With the rapid advancement of multimodal models, audio has become an increasingly important input alongside images and text. Models such as Qwen2-Audio and Phi-4-multimodal now support audio understanding, making this feature essential for modern AI applications.

Supporting audio inputs would enable conversations like the following to be expressed directly in the chat message format:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Follow the instruction in the audio with this image."}
        ]
    }
]
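For reference, a hedged sketch of what the call site could look like once audio is handled, mirroring the way processors' apply_chat_template already loads image URLs when called with tokenize=True and return_dict=True. The audio behavior shown is the proposed extension, not current API, and the checkpoint name is only an example:

from transformers import AutoProcessor

# Example checkpoint; any audio-capable processor would do.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# Proposed behavior: the template call fetches and decodes the image and
# audio URLs itself and returns ready-to-use model inputs, exactly as it
# already does for image content entries today.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

This would keep the user-facing API identical across text, image, and audio content entries.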

This enhancement would significantly expand the capabilities of the library to handle the full spectrum of multimodal inputs that state-of-the-art models now support, keeping the transformers library at the forefront of multimodal AI development.

Your contribution

I've tested this implementation with several multimodal models, and it works well for processing audio inputs alongside images and text; a rough sketch of the approach follows. I'd be happy to contribute the code to the repository if there's interest.
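Purely illustrative of the kind of helper this would involve, the sketch below collects and decodes the audio entries from the chat messages, analogous to how image URLs are gathered for vision models. The load_audio_urls name, the requests/librosa loading path, and the 16 kHz default are assumptions for illustration, not the actual patch:

import io

import librosa
import requests

def load_audio_urls(messages, sampling_rate=16000):
    # Hypothetical helper: walk the chat messages and decode every
    # {"type": "audio"} entry into a waveform, mirroring how image
    # URLs are resolved for vision models today.
    audios = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            continue
        for item in content:
            if item.get("type") == "audio":
                response = requests.get(item["audio"], timeout=30)
                response.raise_for_status()
                waveform, _ = librosa.load(
                    io.BytesIO(response.content), sr=sampling_rate
                )
                audios.append(waveform)
    return audios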
