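# Qwen2.5-Omni: A Complete Guide to Audio Understanding

Use Qwen2.5-Omni for audio tasks such as speech recognition, speech-to-text translation, and audio analysis.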


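# Environment Setup

Skip this section if you have already completed the setup in an earlier part.

## Install dependencies
Install transformers, qwen-omni-utils, and flash-attn (the last one with `--no-build-isolation`).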

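```bash
# -y skips the confirmation prompt so the commands can run unattended
pip uninstall -y transformers
# Install transformers from the pinned commit
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install accelerate
pip install qwen-omni-utils
pip install -U flash-attn --no-build-isolation
pip install triton
# Used to download models from the ModelScope model hub
pip install modelscope
```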
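Before moving on, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of any of the installed packages. If a package is reported as missing even though it imports, its distribution name may be spelled with an underscore instead of a hyphen.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands above.
REQUIRED = ["transformers", "accelerate", "qwen-omni-utils", "flash-attn", "triton", "modelscope"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```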



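## Download the model

Download the model to local storage ahead of time.
Use the `snapshot_download` function from modelscope (install modelscope first).
The first argument is the model name; the `cache_dir` argument sets the download location.
```python
from modelscope import snapshot_download

# snapshot_download returns the local directory that holds the model files
model_dir = snapshot_download('Qwen/Qwen2.5-Omni-7B', cache_dir='/root/autodl-tmp', revision='master')
```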
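An interrupted download can leave the directory incomplete, which only surfaces later as a loading error. As a rough sanity check, the sketch below looks for a `config.json` and at least one `*.safetensors` weight file. Both file names are assumptions about the layout of the downloaded snapshot, and `missing_model_files` is a hypothetical helper, not a modelscope function.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """Return the expected entries that are absent from model_dir.

    The expected names are assumptions about the snapshot layout:
    a config.json plus at least one *.safetensors weight shard.
    """
    root = Path(model_dir)
    missing = [name for name in required if not (root / name).is_file()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

Calling `missing_model_files(model_dir)` right after the download should return an empty list if the assumed files are present.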

# Load the model and import dependencies

```python
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
import librosa
from io import BytesIO
from urllib.request import urlopen
from IPython.display import Audio, display

# Local directory that holds the downloaded model
model_path = "/root/autodl-tmp/Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # flash_attention_2 supports only fp16 and bf16
    device_map="auto",  # place the weights on the available devices automatically
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
```

```
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Qwen2_5OmniToken2WavModel must inference with fp32, but flash_attention_2 only supports fp16 and bf16, attention implementation of Qwen2_5OmniToken2WavModel will fallback to sdpa.
```
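If flash-attn fails to build on a machine, loading with `attn_implementation="flash_attention_2"` will not work. One option, sketched below under the assumption that `"sdpa"` (the fallback named in the warning above) is acceptable for the whole model, is to choose the value at runtime. `pick_attn_implementation` is a hypothetical helper, not a transformers API.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` if the flash_attn package is importable, otherwise `fallback`."""
    if importlib.util.find_spec("flash_attn") is not None:
        return preferred
    return fallback
```

The result can be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call.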

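# Define the inference function

The function sends multimodal input (text and audio) to the model and returns text output. It builds the chat messages, preprocesses the audio, runs generation, and decodes the result.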

```python
from qwen_omni_utils import process_mm_info  # helper that extracts the multimodal inputs from the messages

def inference(audio_path, prompt, sys_prompt):
    """
    Run the model on an audio input plus a text prompt and return the decoded text.

    Args:
    - audio_path: path or URL of the audio file
    - prompt: text prompt from the user
    - sys_prompt: system prompt that defines the model's role and behavior

    Returns:
    - text: list of decoded strings (the prompt turns are included in each string)
    """
    # Build the conversation: one system message and one user message
    messages = [
        {"role": "system", "content": sys_prompt},  # system message: the model's role and behavior
        {"role": "user", "content": [  # user message: text prompt plus audio input
                {"type": "text", "text": prompt},  # text prompt
                {"type": "audio", "audio": audio_path},  # audio path or URL
            ]
        },
    ]

    # Apply the chat template to turn the messages into the text format the model expects.
    # tokenize=False returns text instead of token ids; add_generation_prompt=True appends the assistant turn header.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("text:", text)  # print the templated text for debugging

    # Extract the audio, image, and video inputs referenced in the messages.
    # use_audio_in_video=True also uses the audio track of any video input.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    # Convert the text and media into model input tensors.
    # return_tensors="pt" returns PyTorch tensors; padding=True pads the inputs to a common length.
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)

    # Move the inputs to the model's device and dtype
    inputs = inputs.to(model.device).to(model.dtype)

    # Generate the output. return_audio=False requests text only, with no speech output.
    output = model.generate(**inputs, use_audio_in_video=True, return_audio=False)

    # Decode the generated token ids.
    # skip_special_tokens=True drops special tokens; clean_up_tokenization_spaces=False leaves spacing untouched.
    text = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Return the decoded text
    return text
```
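The decoded string contains the whole conversation: the `system` and `user` turns come first, followed by `assistant` and the model's reply. When only the reply is needed, a small post-processing step can strip the rest. The sketch below assumes the decoded text marks the reply with a line that reads `assistant`, as in the printed results in this guide; `extract_reply` is a hypothetical helper.

```python
def extract_reply(decoded, marker="\nassistant\n"):
    """Return the text after the last assistant marker, or the whole string if the marker is absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()
```

For example, `extract_reply(response[0])` would return only the transcription for the speech recognition calls in the following sections.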

# Speech recognition

## Transcribe speech to text (English)

This code transcribes an audio file to text with the locally loaded Qwen2.5-Omni multimodal model.

```python
# Remote copy of the same sample:
# audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
audio_path = "files/1272-128104-0000.flac"
# User prompt: transcribe the English audio to text without any punctuation
prompt = "Transcribe the English audio into text without any punctuation marks."

# Load the audio for playback; the branch depends on whether audio_path is a URL
if audio_path.startswith("http"):
    # Remote file: download the bytes and decode them from memory
    audio, sr = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)
else:
    # Local file: load it directly
    audio, sr = librosa.load(audio_path, sr=16000)
display(Audio(audio, rate=sr))  # play the audio in the notebook at the loaded sample rate

# Run inference with the local HuggingFace model:
# 1. Pass the audio path, user prompt, and system prompt to inference()
# 2. The system prompt sets the model's role to "a speech recognition model"
# 3. The model generates the transcription from the audio and the prompt
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])
```



```
text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the English audio into text without any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the English audio into text without any punctuation marks.
assistant
mr quilter is the apostle of the middle classes and we are glad to welcome his gospel
```
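The URL-or-local-file branching used for playback repeats in every example in this guide. It could be factored into one helper, roughly as sketched here. `is_remote` and `load_audio` are hypothetical names; `librosa` is imported inside the function so that the helper module can be imported even where librosa is not installed.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def is_remote(path):
    """Return True if path is an http(s) URL rather than a local file path."""
    return urlparse(path).scheme in ("http", "https")


def load_audio(path, sr=16000):
    """Load audio from a URL or a local path; returns (samples, sample_rate)."""
    import librosa  # imported lazily; only needed when audio is actually loaded

    # Remote files are downloaded into memory first, local files are read directly.
    source = BytesIO(urlopen(path).read()) if is_remote(path) else path
    return librosa.load(source, sr=sr)
```

With this in place, each example would reduce to `audio, sr = load_audio(audio_path)` followed by `display(Audio(audio, rate=sr))`.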


## Transcribe speech to text (Chinese)

```python
# Remote copy of the same sample:
# audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/BAC009S0764W0121.wav"
audio_path = "files/BAC009S0764W0121.wav"
# Prompt (in Chinese): convert this Chinese speech to plain text and remove punctuation
prompt = "请将这段中文语音转换为纯文本，去掉标点符号。"

# Load the audio for playback; the branch depends on whether audio_path is a URL
if audio_path.startswith("http"):
    # Remote file: download the bytes and decode them from memory
    audio, sr = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)
else:
    # Local file: load it directly
    audio, sr = librosa.load(audio_path, sr=16000)
display(Audio(audio, rate=sr))

# Run inference with the HuggingFace model; the system prompt sets the model's role to "a speech recognition model"
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])
```



```
text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\n请将这段中文语音转换为纯文本，去掉标点符号。<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
请将这段中文语音转换为纯文本，去掉标点符号。
assistant
甚至出现交易几乎停滞的情况
```


## Transcribe speech to text (Russian)

```python
# Remote copy of the same sample:
# audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/10000611681338527501.wav"
audio_path = "files/10000611681338527501.wav"
prompt = "Transcribe the Russian audio into text without including any punctuation marks."

# Load the audio from the local path for playback
audio = librosa.load(audio_path, sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the HuggingFace model; the system prompt sets the model's role to "a speech recognition model"
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])
```



```
text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the Russian audio into text without including any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the Russian audio into text without including any punctuation marks.
assistant
в древнем китае использовали уникальный способ обозначения периодов времени каждый этап китая или каждая семья находившаяся у власти были особой династии
```


## Transcribe speech to text (French)

```python
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/7105431834829365765.wav"
prompt = "Transcribe the French audio into text without including any punctuation marks."

# Download the audio and decode it from memory for playback
audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the HuggingFace model; the system prompt sets the model's role to "a speech recognition model"
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech recognition model.")
print(response[0])
```



```
text: ['<|im_start|>system\nYou are a speech recognition model.<|im_end|>\n<|im_start|>user\nTranscribe the French audio into text without including any punctuation marks.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech recognition model.
user
Transcribe the French audio into text without including any punctuation marks.
assistant
les voyageurs à destination de pays où les taxes sont élevées peuvent parfois faire des économies considérables en particulier sur des produits comme les boissons alcoolisées ou le tabac
```
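The Russian and French examples differ only in the language name inside the prompt and in the audio path, so several files can be transcribed in one loop. The sketch below is one way to do that. `build_asr_prompt` and `transcribe_all` are hypothetical helpers, and the `infer` argument is assumed to be a callable with the same signature as the `inference` function defined earlier.

```python
ASR_PROMPT = "Transcribe the {language} audio into text without including any punctuation marks."
ASR_SYSTEM_PROMPT = "You are a speech recognition model."


def build_asr_prompt(language):
    """Fill the language name into the transcription prompt used in the examples above."""
    return ASR_PROMPT.format(language=language)


def transcribe_all(samples, infer):
    """Transcribe several audio files.

    samples: iterable of (language, audio_path) pairs
    infer:   callable taking (audio_path, prompt=..., sys_prompt=...) and returning a list of strings
    Returns a dict that maps each audio path to the first decoded string.
    """
    results = {}
    for language, audio_path in samples:
        response = infer(audio_path, prompt=build_asr_prompt(language), sys_prompt=ASR_SYSTEM_PROMPT)
        results[audio_path] = response[0]
    return results
```

A call such as `transcribe_all([("Russian", "files/10000611681338527501.wav")], inference)` would then reuse the loaded model for every sample.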


# Speech translation

```python
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
# User prompt: listen to the English speech and translate it into Chinese text
prompt = "Listen to the provided English speech and produce a translation in Chinese text."

# Download the audio and decode it from memory for playback
audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the HuggingFace model; the system prompt sets the model's role to "a speech translation model"
response = inference(audio_path, prompt=prompt, sys_prompt="You are a speech translation model.")
print(response[0])
```



```
text: ['<|im_start|>system\nYou are a speech translation model.<|im_end|>\n<|im_start|>user\nListen to the provided English speech and produce a translation in Chinese text.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a speech translation model.
user
Listen to the provided English speech and produce a translation in Chinese text.
assistant
奎尔特先生是中产阶级的使徒,我们很高兴欢迎他的福音。
```


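# Sound classification

```python
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"
# User prompt: classify the human vocal sound and answer in English
prompt = "Classify the given human vocal sound in English."

# Download the audio and decode it from memory for playback
audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the local HuggingFace model; the system prompt sets the model's role to "a vocal sound classification model"
response = inference(audio_path, prompt=prompt, sys_prompt="You are a vocal sound classification model.")
print(response[0])
```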



```
text: ['<|im_start|>system\nYou are a vocal sound classification model.<|im_end|>\n<|im_start|>user\nClassify the given human vocal sound in English.<|audio_bos|><|AUDIO|><|audio_eos|><|im_end|>\n<|im_start|>assistant\n']
system
You are a vocal sound classification model.
user
Classify the given human vocal sound in English.
assistant
Cough
```
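The model answers in free text, so downstream code that expects a fixed set of classes may want to map the reply onto known labels. A minimal sketch follows. The label set is purely illustrative and is not taken from the model or its documentation, and `normalize_label` is a hypothetical helper.

```python
# Illustrative label set; replace it with the classes your application cares about.
LABELS = ("cough", "laughter", "sneeze", "sigh")


def normalize_label(reply, labels=LABELS):
    """Return the first label that occurs in the reply (case-insensitive), or None if none match."""
    text = reply.lower()
    for label in labels:
        if label.lower() in text:
            return label
    return None
```

Substring matching is deliberately simple here; a reply that mentions more than one label resolves to whichever label comes first in `labels`.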
