Qwen2-Audio is the new series of Qwen large audio-language models. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or responding directly in text to speech instructions. The model supports more than eight languages and dialects, e.g., Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese, and can work in two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
- audio analysis: users can provide both audio and text instructions for analysis during the interaction.
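Following the Qwen2-Audio model card, the two modes map to two conversation formats. The sketch below illustrates the difference; the audio file names are placeholders:

```python
# Conversation formats for the two interaction modes
# (following the Qwen2-Audio model card; file names are placeholders).

# Voice chat: the user turn contains only audio -- the spoken audio
# itself carries the instruction, no text input is needed.
voice_chat_conversation = [
    {
        "role": "user",
        "content": [{"type": "audio", "audio_url": "spoken_question.wav"}],
    }
]

# Audio analysis: the user turn pairs audio with a text instruction about it.
audio_analysis_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "speech_sample.wav"},
            {"type": "text", "text": "Transcribe this recording."},
        ],
    }
]
```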
More details about the model can be found in the model card, blog, original repository, and technical report.
In this tutorial we consider how to convert and optimize the Qwen2-Audio model for creating a multimodal chatbot. Additionally, we demonstrate how to apply a stateful transformation to the LLM part and model optimization techniques such as weight compression using NNCF.
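As a rough illustration of the weight-compression step, the sketch below applies NNCF 4-bit weight compression to an already converted OpenVINO IR. The file paths and compression parameters are illustrative assumptions, not the notebook's exact settings:

```python
import openvino as ov
import nncf

core = ov.Core()
# Assumes the language-model part has already been converted to
# OpenVINO IR; the path below is a placeholder.
ov_model = core.read_model("qwen2audio/openvino_language_model.xml")

# 4-bit asymmetric weight compression; group_size and ratio here are
# illustrative values, not tuned settings.
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=128,
    ratio=0.8,
)
ov.save_model(compressed_model, "qwen2audio/openvino_language_model_int4.xml")
```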
The tutorial consists of the following steps:
- Install requirements
- Convert and Optimize model
- Run OpenVINO model inference
- Launch Interactive demo
In this demonstration, you'll create an interactive chatbot that accepts voice instructions and answers questions about the content of the provided audio.
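To give a feel for that interaction, here is a minimal audio-analysis sketch based on the Qwen2-Audio model card. It uses the reference PyTorch model; the notebook replaces it with the OpenVINO-converted model, which exposes the same `generate()` interface. The file `sample.wav` is a placeholder:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
# Reference PyTorch model; in the notebook this is swapped for the
# converted OpenVINO model with the same generate() interface.
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "sample.wav"},  # placeholder file
            {"type": "text", "text": "What is being said in this audio?"},
        ],
    }
]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```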
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the Installation Guide.