-
-
Notifications
You must be signed in to change notification settings - Fork 676
Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Original review guide in EnglishReviewer's Guide by SourceryThis pull request adds support for processing audio inputs for the Gemini model. It involves modifying the Gemini source adapter to handle audio data and updating the message processing pipeline to detect and include audio messages in the request sent to the model. No diagrams generated as the changes look simple and do not need a visual representation. File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嘿 @AliveGh0st - 我已经查看了你的更改 - 这里有一些反馈:
总体评论:
- 考虑使仅存在音频时添加的默认提示可配置,而不是在
llm_request.py
中对其进行硬编码。 assemble_context
中处理图像 URL 和音频 URL 的逻辑具有相似之处;探索重构为更通用的媒体处理函数以减少重复。
以下是我在审查期间查看的内容
- 🟢 一般问题:一切看起来都不错
- 🟢 安全性:一切看起来都不错
- 🟢 测试:一切看起来都不错
- 🟡 复杂性:发现 1 个问题
- 🟢 文档:一切看起来都不错
帮助我变得更有用!请点击每个评论上的 👍 或 👎,我将使用反馈来改进你的评论。
Original comment in English
Hey @AliveGh0st - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider making the default prompt added when only audio is present configurable instead of hardcoding it in
llm_request.py
. - The logic for processing image URLs and audio URLs in
assemble_context
shares similarities; explore refactoring into a more generic media handling function to reduce duplication.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
此外目前音频数据没有持久化保存下来,因此第二次请求的时候不会附带历史的音频信息。这个感觉可以研究一下该怎么存。 |
bfca546
to
a3d469d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
支持, 话说是不是可以加一个设置, 启用stt来把音频存入上下文?
except Exception as e: | ||
logger.error(f"音频文件处理失败: {e}") | ||
return None, None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor一下:
async def encode_audio_data(self, audio_url: str) -> tuple[Optional[bytes], Optional[str]]:
"""
读取音频文件并返回二进制数据
Args:
audio_url (str): 音频文件路径
Returns:
tuple: (音频二进制数据, MIME类型)
"""
# 推断 MIME 类型
mime_type = mimetypes.guess_type(audio_url)[0]
if not mime_type:
extension_to_mime = {
".wav": "audio/wav",
".mp3": "audio/mpeg",
".ogg": "audio/ogg",
".flac": "audio/flac",
".m4a": "audio/mp4",
}
extension = os.path.splitext(audio_url)[1].lower()
mime_type = extension_to_mime.get(extension, "application/octet-stream")
try:
# 直接读取文件二进制数据
with open(audio_url, "rb") as f:
audio_bytes = f.read()
logger.info(f"音频文件处理成功: {audio_url},mime类型: {mime_type},大小: {len(audio_bytes)} 字节")
|
||
# 如果只有音频没有文本,添加默认文本 | ||
if not req.prompt and has_audio: | ||
req.prompt = "[用户发送的音频将其视为文本输入与其进行聊天]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor一下:
# 处理消息中的图片和音频
for comp in event.message_obj.message:
if isinstance(comp, Image):
# 处理图片消息
image_path = await comp.convert_to_file_path()
req.image_urls.append(image_path)
elif isinstance(comp, Record):
# 处理音频消息
audio_path = await comp.convert_to_file_path()
req.audio_urls.append(audio_path)
# 如果只有音频没有文本,添加默认文本
if not req.prompt and req.audio_urls:
req.prompt = "[用户发送的音频将其视为文本输入与其进行聊天]"
发现一个较为严重的问题,有些模型是不支持语音输入的,而当前的更改在消息链中带有 Record 时会自动传入给模型。 在这一块是否需要专门对多模态模型(包括后面也会适配的全模态的 Qwen2.5-Omni 模型)单独进行适配? |
感觉确实需要,但是我这能发语音的测试环境只有napcat( |
那服务提供商配置里加个总的开关?手动开启 |
部分实现了 #1343
Motivation
增加对Gemini系列模型音频输入的支持。对于使用Gemini作为主要语言模型的用户来说,添加音频输入功能将大大提高用户体验,尤其是在处理语音消息时。此更新使Gemini能够像处理图片一样处理音频输入,更好地融入多模态交互场景。
Modifications
gemini_source.py
中添加了音频文件处理方法encode_audio_data
将音频文件转化为rb二进制数据。返回多种音频格式的MIME格式类型(wav、mp3、ogg等)在
assemble_context
方法中增加音频内容处理,使用inline_data
格式发送给Gemini在
_prepare_conversation
函数以处理音频数据修改
LLMRequestSubStage
类,使其能检测到Record
类型的音频消息组件在用户发送语音消息时附加prompt,使模型将其直接视为聊天输入,而不是转写语音内容
Check
requirements.txt
和pyproject.toml
文件相应位置。好的,这是翻译成中文的 pull request 摘要:
Sourcery 总结
为 Gemini 提供程序引入音频输入支持。
新功能:
Record
类型)。增强功能:
ProviderRequest
数据结构,以包含音频文件 URL 列表。Original summary in English
Summary by Sourcery
Introduce audio input support for the Gemini provider.
New Features:
Record
type).Enhancements:
ProviderRequest
data structure to include a list of audio file URLs.