# Qwen2.5-Omni: Voice Chat in Practice
Chat with Qwen2.5-Omni using spoken input and spoken output.

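# Environment Setup

Skip this section if you have already completed the setup.

## Install Dependencies
Install transformers, qwen-omni-utils, and flash-attn. transformers is pinned to a specific commit, and flash-attn is installed with `--no-build-isolation`.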

```bash
pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install accelerate
pip install triton
pip install qwen-omni-utils
pip install -U flash-attn --no-build-isolation
# Used to download the model from the ModelScope model hub
pip install modelscope
```
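
After running the install commands, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the distribution names are simply copied from the pip commands above.

```python
from importlib import metadata


def check_packages(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands above.
required = ["transformers", "accelerate", "triton",
            "qwen-omni-utils", "flash-attn", "modelscope"]
for name, version in check_packages(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A missing entry usually means the corresponding `pip install` step failed or ran in a different environment than the notebook kernel.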



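## Download the Model

Download the model to local storage before running the examples.
Use the `snapshot_download` function from modelscope (install modelscope first).
The first argument is the model name, and the `cache_dir` argument specifies the directory the model is downloaded to.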
```python

from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-Omni-7B', cache_dir='/root/autodl-tmp', revision='master')

```
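
Before loading, a quick look at the download directory can catch an incomplete download. This is a rough sketch under the assumption that the checkpoint follows the usual Hugging Face layout (a `config.json` plus `*.safetensors` weight shards); the helper `summarize_model_dir` is hypothetical and not part of modelscope.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Report whether a directory looks like a downloaded checkpoint."""
    root = Path(model_dir)
    exists = root.is_dir()
    # Assumption: weights are stored as *.safetensors shards in the top-level folder.
    shards = sorted(p.name for p in root.glob("*.safetensors")) if exists else []
    return {
        "exists": exists,
        "has_config": (root / "config.json").is_file(),
        "weight_shards": shards,
    }


# Path used later in this tutorial; adjust it to the value returned by snapshot_download.
print(summarize_model_dir("/root/autodl-tmp/Qwen/Qwen2.5-Omni-7B"))
```

Passing the `model_dir` value returned by `snapshot_download` instead of a hard-coded string avoids mismatches between the download location and the load path.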

# Load the Model and Import Dependencies

Load the locally downloaded model and enable flash_attention_2 for faster attention.

In [3]:
import torch
torch.cuda.empty_cache()  # release cached GPU memory

In [4]:
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
import librosa
from io import BytesIO
from urllib.request import urlopen
from IPython.display import Audio

model_path = "/root/autodl-tmp/Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Qwen2_5OmniToken2WavModel must inference with fp32, but flash_attention_2 only supports fp16 and bf16, attention implementation of Qwen2_5OmniToken2WavModel will fallback to sdpa.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
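
The load call above requests `flash_attention_2`, which only works when the flash-attn package installed successfully. As a hedged sketch, the value passed to `attn_implementation` could be chosen at runtime; the helper `pick_attn_implementation` is hypothetical, and `"sdpa"` is used as the fallback because the log above names it as the fallback implementation. Note that being importable does not guarantee that the GPU supports Flash Attention.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` if the flash_attn package is importable, else `fallback`."""
    if importlib.util.find_spec("flash_attn") is not None:
        return preferred
    return fallback


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The resulting string can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call.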

# Define the Inference Function

This function processes an audio input and generates both text and audio output.

In [5]:
from qwen_omni_utils import process_mm_info

# @title inference function
def inference(audio_path):
    """
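    Run inference on an input audio file and generate text and audio output.

    The main steps are:
    1. Build the input messages with a system role and a user role; the user message carries the audio input.
    2. Use the processor to render the messages into a text prompt via the chat template.
    3. Extract the multimedia inputs (audio, images, video) and preprocess them.
    4. Pass the processed inputs to the model for generation.
    5. Decode the model output to obtain the text and audio results.

    Args:
        audio_path (str): Path or URL of the input audio file.

    Returns:
        tuple: The generated text (a list of decoded strings, as returned by batch_decode) and the audio (exact format depends on the model output).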
    """
    # Build the input messages with a system role and a user role
    messages = [
        {"role": "system", "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."},
        # User message carrying the audio input
        {"role": "user", "content": [
                {"type": "audio", "audio": audio_path},
            ]
        },
    ]

    # Render the messages into a text prompt with the chat template and append the generation prompt
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Extract the multimedia inputs (audio, images, video) and preprocess them
    # use_audio_in_video=True means the audio track is also used when processing video
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    # Combine the text, audio, image, and video data into model inputs
    # return_tensors="pt" returns PyTorch tensors; padding=True pads the inputs
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)

    # Move the inputs to the model's device and cast them to the model's dtype
    inputs = inputs.to(model.device).to(model.dtype)

    # Run generation; the model returns both the generated text tokens and the audio
    # use_audio_in_video=True also considers audio during generation; return_audio=True returns the audio result
    output = model.generate(**inputs, use_audio_in_video=True, return_audio=True)

    # Decode the model output to get the text result
    # skip_special_tokens=True drops special tokens; clean_up_tokenization_spaces=False keeps spaces as generated
    text = processor.batch_decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Extract the audio result
    audio = output[1]

    # Return the generated text and audio
    return text, audio
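
The function returns the audio in memory, and the examples in this tutorial play it back at 24000 Hz. To keep a reply on disk, something like the following sketch could be used. It relies only on the standard library and assumes the audio has first been converted to a flat sequence of floats in the range [-1.0, 1.0] (for a tensor, for example via `audio.reshape(-1).detach().cpu().tolist()`, which is an assumption about the output format). The helper `write_wav` is hypothetical.

```python
import math
import struct
import wave


def write_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2        # number of samples written


# Stand-in for model output: one second of a 440 Hz tone at 24 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
write_wav("reply_demo.wav", tone)
```

The per-sample loop is slow for long clips; with NumPy or soundfile available, a vectorized write would be the more practical choice.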

# Voice Chat Example 1

In [6]:
#audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"
audio_path = "files/guess_age_gender.wav"

#audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
audio = librosa.load(audio_path, sr=16000)[0] # for a local file path, load it directly
display(Audio(audio, rate=16000))

## Run inference with the locally loaded Hugging Face model.
response = inference(audio_path)
print(response[0][0])
display(Audio(response[1], rate=24000))

Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.


system
You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.
user

assistant
Well, I can't really guess your age and gender just from your voice. There are so many factors that can affect how a voice sounds, like the environment you're in, your health, and even how you're feeling at the moment. But if you want to share more about your voice, like if it's high - pitched or low - pitched, that might give me a bit of an idea. So, what can you tell me about your voice?
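
As the output above shows, the decoded text contains the whole conversation (system prompt, user turn, and assistant turn), not just the reply. A small sketch for keeping only the assistant's answer follows. It assumes the decoded string has the `system` / `user` / `assistant` line layout printed above; the helper `extract_assistant_reply` is hypothetical.

```python
def extract_assistant_reply(decoded, marker="\nassistant\n"):
    """Return the text after the last role marker, or the whole text if absent."""
    # rpartition yields ("", "", decoded) when the marker is missing,
    # so the full text is returned unchanged in that case.
    _, _, tail = decoded.rpartition(marker)
    return tail.strip()


decoded = (
    "system\nYou are Qwen, a virtual human.\n"
    "user\n\n"
    "assistant\nWell, I can't really guess your age and gender just from your voice."
)
print(extract_assistant_reply(decoded))
```

With the inference function from this tutorial, the call would look like `extract_assistant_reply(response[0][0])`. A reply that itself contains a line consisting only of `assistant` would be cut at that point, so this is a convenience rather than a robust parser.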


# Voice Chat Example 2


In [7]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"

audio = librosa.load(BytesIO(urlopen(audio_path).read()), sr=16000)[0]
display(Audio(audio, rate=16000))

# Run inference with the model
response = inference(audio_path)
print(response[0][0])
display(Audio(response[1], rate=24000))

Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.


system
You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.
user

assistant
每个人都想被欣赏，所以如果你欣赏某人，就别让它成为秘密。如果还有其他翻译相关的问题，或者别的事，都可以跟我说哦。
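
The two examples load the preview audio differently: one from a local file and one from a URL wrapped in `BytesIO`. As a sketch, both cases could go through a single helper that returns an in-memory buffer. The name `open_audio_source` and the `timeout` default are hypothetical choices; only standard-library calls are used.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def open_audio_source(source, timeout=30):
    """Return a BytesIO with the audio bytes from an http(s) URL or a local path."""
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source, timeout=timeout) as response:
            return BytesIO(response.read())
    # Anything else is treated as a local file path.
    with open(source, "rb") as f:
        return BytesIO(f.read())
```

The preview line in either example could then be written as `audio = librosa.load(open_audio_source(audio_path), sr=16000)[0]`, since `librosa.load` accepts file-like objects, as the URL example already relies on.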
