### Universal Audio Understanding for Qwen2.5-Omni

This notebook demonstrates how to use Qwen2.5-Omni for audio tasks such as speech recognition, speech-to-text translation, and audio analysis.

In [None]:
!pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
!pip install qwen-omni-utils
!pip install openai
!pip install flash-attn --no-build-isolation

In [2]:
from qwen_omni_utils import process_mm_info

# @title inference function
def inference(audio_path, prompt, sys_prompt):
    """Run one audio + text prompt through the model and return the decoded text.

    Relies on the global `model` and `processor` loaded in the next cell.
    """
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_path},
            ]
        },
    ]
    # Render the conversation into the model's chat-template prompt.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("text:", text)
    # Load the audio (and any images or videos) referenced in the messages.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)
    # Move the inputs to the model's device and dtype.
    inputs = inputs.to(model.device).to(model.dtype)

    # Generate text only; return_audio=False requests no speech output.
    output = model.generate(**inputs, use_audio_in_video=True, return_audio=False)

    # Each decoded string holds the prompt turns followed by the assistant reply.
    text = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return text
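
The message layout that `inference()` passes to the chat template can be inspected without loading the model. The sketch below is only an illustration: `build_messages` is a hypothetical helper, not part of `qwen_omni_utils` or `transformers`, and it simply mirrors the system turn plus the text-and-audio user turn used above.

```python
def build_messages(audio_path, prompt, sys_prompt):
    """Return chat messages in the same layout that inference() builds.

    Hypothetical helper for illustration; not a library function.
    """
    return [
        {"role": "system", "content": sys_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_path},
            ],
        },
    ]


messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav",
    "Classify the given human vocal sound in English.",
    "You are a vocal sound classification model.",
)
# Print each turn so the structure is easy to check by eye.
for message in messages:
    print(message["role"], "->", message["content"])
```

The user turn carries two content parts, one of type `text` and one of type `audio`; the audio part holds the URL or local path of the recording.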



Load the model and processor.

In [3]:
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

model_path = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
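
The loading cell hard-codes `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package installed earlier. On machines where that build is unavailable, one option is to choose the value at runtime. The sketch below is a hedged illustration: `pick_attn_implementation` is a hypothetical helper, and falling back to `"sdpa"` assumes that the pinned `transformers` commit supports that attention backend for this model, which this notebook does not verify.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Hypothetical helper. It only checks that the package is installed,
    not that it works on the current GPU.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: "sdpa" is accepted for this model at the pinned commit.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result could then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call above.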


In [4]:
import librosa

from io import BytesIO
from urllib.request import urlopen

from IPython.display import Audio

#### 1. Speech Recognition

In [5]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
prompt = "Transcribe the English audio into text without any punctuation marks."

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])



text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the English audio into text without any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the English audio into text without any punctuation marks.
assistant
mr quilter is the apostle of the middle classes and we are glad to welcome his gospel
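
As the output shows, the decoded string contains the system and user turns as well as the model's answer. When only the transcript is needed, the text after the final `assistant` line can be split off. This is a minimal sketch: `extract_assistant_reply` is a hypothetical helper, and it assumes the decoded text has the `system / user / assistant` layout printed above.

```python
def extract_assistant_reply(decoded):
    """Return the text after the last line that consists solely of "assistant".

    Hypothetical helper. If no such line exists, the whole string is
    returned with surrounding whitespace stripped.
    """
    # rpartition returns ("", "", decoded) when the marker is absent,
    # so the last element is the right answer in both cases.
    _, _, reply = decoded.rpartition("\nassistant\n")
    return reply.strip()


decoded = (
    "system\nYou are a speech recognition model.\n"
    "user\nTranscribe the English audio into text without any punctuation marks.\n"
    "assistant\n"
    "mr quilter is the apostle of the middle classes and we are glad to welcome his gospel"
)
print(extract_assistant_reply(decoded))
```

String splitting would misfire if a reply itself contained a line reading exactly `assistant`. An alternative, not shown here, is to slice the prompt tokens off the generated IDs before decoding.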


In [6]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/BAC009S0764W0121.wav"
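# Chinese prompt; in English: "Please convert this Chinese speech into plain text, removing punctuation marks."
prompt = "请将这段中文语音转换为纯文本，去掉标点符号。"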

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])



text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\n请将这段中文语音转换为纯文本，去掉标点符号。<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
请将这段中文语音转换为纯文本，去掉标点符号。
assistant
甚至出现交易几乎停滞的情况


In [7]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/10000611681338527501.wav"
prompt = "Transcribe the Russian audio into text without including any punctuation marks."

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])



text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the Russian audio into text without including any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the Russian audio into text without including any punctuation marks.
assistant
в древнем китае использовали уникальный способ обозначения периодов времени каждый этап китая или каждая семья находившаяся у власти были особой династии


In [8]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/7105431834829365765.wav"
prompt = "Transcribe the French audio into text without including any punctuation marks."

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])



text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the French audio into text without including any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the French audio into text without including any punctuation marks.
assistant
les voyageurs à destination de pays où les taxes sont élevées peuvent parfois faire des économies considérables en particulier sur des produits comme les boissons alcoolisées ou le tabac
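
The four recognition cells differ only in the audio URL and the prompt, so they can be driven from a table. The sketch below is one way to do that, not the notebook's own method: `transcribe_all` and `fake_infer` are hypothetical names. The inference function is passed in as an argument so that the loop can be tried without a GPU; in the notebook, the real `inference` function would be passed instead of the stand-in.

```python
BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com"

# (language, audio URL, prompt) triples copied from the cells above.
ASR_SAMPLES = [
    ("English", f"{BASE}/Qwen2-Audio/audio/1272-128104-0000.flac",
     "Transcribe the English audio into text without any punctuation marks."),
    ("Chinese", f"{BASE}/Qwen2.5-Omni/BAC009S0764W0121.wav",
     "请将这段中文语音转换为纯文本，去掉标点符号。"),
    ("Russian", f"{BASE}/Qwen2.5-Omni/10000611681338527501.wav",
     "Transcribe the Russian audio into text without including any punctuation marks."),
    ("French", f"{BASE}/Qwen2.5-Omni/7105431834829365765.wav",
     "Transcribe the French audio into text without including any punctuation marks."),
]


def transcribe_all(samples, infer, sys_prompt="You are a speech recognition model."):
    """Call infer() once per sample and map each language to the first decoded string."""
    results = {}
    for language, audio_path, prompt in samples:
        response = infer(audio_path, prompt=prompt, sys_prompt=sys_prompt)
        results[language] = response[0]
    return results


def fake_infer(audio_path, prompt, sys_prompt):
    """Stand-in for inference(): returns a placeholder, not a real transcript."""
    return [f"<transcript of {audio_path.rsplit('/', 1)[-1]}>"]


results = transcribe_all(ASR_SAMPLES, fake_infer)
for language, text in results.items():
    print(f"{language}: {text}")
```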


#### 2. Speech Translation

In [9]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
prompt = "Listen to the provided English speech and produce a translation in Chinese text."

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech translation model.")
print(response[0])



text: ['<|im_start|>system\nYou are a speech translation model.<|im_end|>\n<|im_start|>user\nListen to the provided English speech and produce a translation in Chinese text.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech translation model.
user
Listen to the provided English speech and produce a translation in Chinese text.
assistant
奎尔特先生是中产阶级的使徒,我们很高兴欢迎他的福音。


#### 3. Vocal Sound Classification

In [10]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"
prompt = "Classify the given human vocal sound in English."

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local Hugging Face model.
response = inference(audio_path, prompt=prompt, sys_prompt="You are a vocal sound classification model.")
print(response[0])



text: ['<|im_start|>system\nYou are a vocal sound classification model.<|im_end|>\n<|im_start|>user\nClassify the given human vocal sound in English.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a vocal sound classification model.
user
Classify the given human vocal sound in English.
assistant
Cough
