Add Audio inputs available in apply_chat_template #36769

Open
junnei opened this issue Mar 17, 2025 · 1 comment
Labels
Feature request Request for a new feature

Comments


junnei commented Mar 17, 2025

Feature request

Hello, I would like to request support for audio processing in the apply_chat_template function.

Motivation

With the rapid advancement of multimodal models, audio processing has become as important as image and text inputs. Models such as Qwen2-Audio and Phi-4-multimodal already support audio understanding, so chat templates need a way to pass audio to them.

Supporting audio inputs would enable messages like the following, where an audio entry sits alongside the image and text entries:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Follow the instruction in the audio with this image."}
        ]
    }
]

This enhancement would let the library handle the full range of multimodal inputs that state-of-the-art models now support, keeping transformers current with multimodal development.
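To make the request concrete, here is a minimal sketch of the first step any audio-aware template handling would need: walking the messages and collecting media references by type. This is not the transformers implementation; `collect_media` and `MEDIA_TYPES` are hypothetical names, and the file names are placeholders standing in for the URLs above.

```python
from typing import Any

# Hypothetical: the content types this sketch treats as media.
MEDIA_TYPES = ("image", "audio")


def collect_media(messages: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group media references by type, in order of appearance."""
    found: dict[str, list[str]] = {kind: [] for kind in MEDIA_TYPES}
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            # Plain-string content carries no media entries.
            continue
        for part in content:
            kind = part.get("type")
            # A media entry stores its reference under a key named after its type.
            if kind in found and kind in part:
                found[kind].append(part[kind])
    return found


# Placeholder paths; the issue's example uses full URLs instead.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "bee.jpg"},
            {"type": "audio", "audio": "question.wav"},
            {"type": "text", "text": "Follow the instruction in the audio with this image."},
        ],
    },
]

media = collect_media(messages)
```

A real implementation would then load each audio reference into a waveform and hand it to the model's feature extractor, which is the part this request asks the library to take on.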

Your contribution

I have an implementation that I've tested with several multimodal models, and it works well for processing audio inputs alongside images and text. I'd be happy to contribute this code to the repository if there's interest.

junnei added the Feature request label on Mar 17, 2025

zucchini-nlp (Member) commented

Hey! Yes, a PR for this is in progress and can be tracked in #34601.
