### Voice Chatting with Qwen2.5-Omni

This notebook demonstrates how to hold a voice conversation with Qwen2.5-Omni: the model takes spoken audio as input and replies with both text and speech.

In [None]:
!pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
!pip install qwen-omni-utils
!pip install openai
!pip install flash-attn --no-build-isolation

In [None]:
from qwen_omni_utils import process_mm_info

# @title inference function
def inference(audio_path):
    messages = [
        {"role": "system", "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."},
        {"role": "user", "content": [
                {"type": "audio", "audio": audio_path},
            ]
        },
    ]
    # Render the chat template to a prompt string and load the media referenced in the messages.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)
    # Move the inputs to the model's device and dtype.
    inputs = inputs.to(model.device).to(model.dtype)

    # With return_audio=True, the output holds the text token IDs (index 0) and the speech waveform (index 1).
    output = model.generate(**inputs, use_audio_in_video=True, return_audio=True)

    text = processor.batch_decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    audio = output[1]
    return text, audio



Load the model and processor.

In [3]:
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

model_path = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

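The loading cell above hard-codes `attn_implementation="flash_attention_2"`, which fails if the `flash-attn` build did not succeed. A minimal sketch of a fallback follows; `pick_attn_implementation` is a hypothetical helper, and it assumes that `"sdpa"` is accepted by this model class at the pinned transformers commit, which has not been verified here.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    # find_spec only checks that the package can be located; it does not import it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: PyTorch's scaled-dot-product attention is supported as a fallback.
    return "sdpa"
```

The result could then be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call.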

In [4]:
import librosa

from io import BytesIO
from urllib.request import urlopen

from IPython.display import Audio

#### Voice Chat Example 1

In [None]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the locally loaded Hugging Face model.
response = inference(audio_path)
print(response[0][0])
display(Audio(response[1], rate=24000))

Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.


system
You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.
user

assistant
Well, I can't really guess your age and gender just from your voice. I don't have that kind of superpower. But if you want to talk about something else, like your hobbies or what you like to do, that could be really interesting.
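As the output above shows, the decoded string contains the whole transcript (system prompt, user turn, and assistant turn), not just the reply. A minimal sketch for keeping only the reply follows, assuming the decoded text puts the `assistant` role marker on a line of its own as in the printed output; `extract_assistant_reply` is a hypothetical helper, not part of transformers or qwen-omni-utils.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text that follows the last line consisting solely of 'assistant'."""
    lines = decoded.splitlines()
    # Scan backwards so that, in a multi-turn transcript, only the final reply is returned.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "assistant":
            return "\n".join(lines[i + 1:]).strip()
    # No role marker found: return the input with surrounding whitespace removed.
    return decoded.strip()
```

For the examples in this notebook it would be called as `extract_assistant_reply(response[0][0])`.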


#### Voice Chat Example 2

In [None]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the locally loaded Hugging Face model.
response = inference(audio_path)
print(response[0][0])
display(Audio(response[1], rate=24000))

Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.


system
You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.
user

assistant
每个人都想被欣赏，所以如果你欣赏某人，就别让它成为秘密。如果还有其他翻译相关的问题或者别的事，都可以跟我说哦。
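To keep the generated speech rather than only playing it inline, the waveform can be written to a WAV file. The sketch below uses only the standard library and assumes the audio has first been flattened to a sequence of floats in [-1, 1], for example with `response[1].reshape(-1).float().cpu().tolist()` (which in turn assumes the returned audio is a torch tensor). The 24 kHz default matches the playback rate used in the cells above; `save_wav` is a hypothetical helper name.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    frames = bytearray()
    for s in samples:
        # Clip out-of-range values so they cannot overflow the 16-bit range.
        s = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return len(frames) // 2
```

Looping over samples in pure Python is slow for long clips; it is used here only to keep the example free of extra dependencies.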
