Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414

AliveGh0st · 2025-04-26T10:18:33Z

部分实现了 #1343

Motivation

增加对Gemini系列模型音频输入的支持。对于使用Gemini作为主要语言模型的用户来说，添加音频输入功能将大大提高用户体验，尤其是在处理语音消息时。此更新使Gemini能够像处理图片一样处理音频输入，更好地融入多模态交互场景。

Modifications

gemini_source.py 中添加了音频文件处理方法encode_audio_data将音频文件转化为rb二进制数据。返回多种音频格式的MIME格式类型（wav、mp3、ogg等）
在assemble_context方法中增加音频内容处理，使用inline_data格式发送给Gemini
在_prepare_conversation函数以处理音频数据
修改LLMRequestSubStage类，使其能检测到Record类型的音频消息组件
在用户发送语音消息时附加prompt，使模型将其直接视为聊天输入，而不是转写语音内容

Check

[✅] 😊 我的 Commit Message 符合良好的规范
[✅] 👀 我的更改经过良好的测试
[✅] 🤓 我确保没有引入新依赖库，或者引入了新依赖库的同时将其添加到了 requirements.txt 和 pyproject.toml 文件相应位置。
[✅] 😮 我的更改没有引入恶意代码

好的，这是翻译成中文的 pull request 摘要：

Sourcery 总结

为 Gemini 提供程序引入音频输入支持。

新功能：

启用 Gemini 模型处理音频输入，从而实现涉及音频数据的多模式交互。
为纯音频消息添加默认提示，以指示模型将音频视为对话输入，而不是转录。
支持各种音频格式（wav、mp3、ogg、flac、m4a），通过编码它们并确定正确的 MIME 类型。
更新 LLM 请求阶段以检测和处理音频消息（Record 类型）。

增强功能：

更新 ProviderRequest 数据结构，以包含音频文件 URL 列表。
修改 Gemini 源代码，以组装包含音频数据以及文本和图像的请求上下文。

Original summary in English

Summary by Sourcery

Introduce audio input support for the Gemini provider.

New Features:

Enable processing of audio inputs by the Gemini model, allowing multi-modal interactions involving audio data.
Add a default prompt for audio-only messages to instruct the model to treat the audio as conversational input rather than transcribing it.
Support various audio formats (wav, mp3, ogg, flac, m4a) by encoding them and determining the correct MIME type.
Update the LLM request stage to detect and handle audio messages (Record type).

Enhancements:

Update the ProviderRequest data structure to include a list of audio file URLs.
Modify the Gemini source to assemble request contexts containing audio data alongside text and images.

sourcery-ai · 2025-04-26T10:18:42Z

## Sourcery 提供的审查者指南

此 pull request 增加了对 Gemini 模型处理音频输入的支持。它涉及修改 Gemini 源代码适配器以处理音频数据，并更新消息处理管道以检测音频消息并将其包含在发送到模型的请求中。

_由于更改看起来很简单，不需要可视化表示，因此未生成图表。_

### 文件级别更改

| 变更 | 详情 | 文件 |
| ------ | ------- | ----- |
| 实现读取音频文件、确定其 MIME 类型并将其格式化为 Gemini API 的内联数据部分的功能。 | <ul><li>添加 `process_inline_data` 函数来处理像音频这样的内联数据。</li><li>添加 `encode_audio_data` 函数来读取音频文件并确定 MIME 类型。</li><li>修改 `assemble_context` 以接受和处理音频 URL 列表，将其格式化为内联数据。</li><li>更新 `text_chat` 和 `text_chat_stream` 方法以接受和传递 `audio_urls`。</li></ul> | `astrbot/core/provider/sources/gemini_source.py` |
| 修改 LLM 请求处理阶段以检测音频消息组件，并将它们的文件路径包含在 provider 请求中。 | <ul><li>修改 `process` 方法以迭代消息组件。</li><li>添加逻辑以检测 `Record` 类型组件（音频消息）。</li><li>从 `Record` 组件中提取音频文件路径，并将它们添加到 `ProviderRequest`。</li><li>如果消息仅包含音频而没有文本，则添加默认文本提示。</li></ul> | `astrbot/core/pipeline/process_stage/method/llm_request.py` |
| 更新 ProviderRequest 实体以包含音频 URL 的字段。 | <ul><li>向 `ProviderRequest` 数据类添加一个可选的 `audio_urls` 字段。</li></ul> | `astrbot/core/provider/entities.py` |

### 可能相关的 issue

- **#1343**: 此 PR 添加了音频处理，并将音频数据包含在 Gemini 的 API 请求上下文中，从而实现了该 issue 的功能。

---

<details>
<summary>提示和命令</summary>

#### 与 Sourcery 互动

- **触发新的审查：** 在 pull request 上评论 `@sourcery-ai review`。
- **继续讨论：** 直接回复 Sourcery 的审查评论。
- **从审查评论生成 GitHub issue：** 通过回复审查评论，让 Sourcery 从该评论创建一个 issue。您也可以回复审查评论并使用 `@sourcery-ai issue` 从该评论创建一个 issue。
- **生成 pull request 标题：** 在 pull request 标题中的任何位置写入 `@sourcery-ai` 以随时生成标题。您也可以在 pull request 上评论 `@sourcery-ai title` 以随时（重新）生成标题。
- **生成 pull request 摘要：** 在 pull request 正文中的任何位置写入 `@sourcery-ai summary` 以随时在您想要的位置生成 PR 摘要。您也可以在 pull request 上评论 `@sourcery-ai summary` 以随时（重新）生成摘要。
- **生成审查者指南：** 在 pull request 上评论 `@sourcery-ai guide` 以随时（重新）生成审查者指南。
- **解决所有 Sourcery 评论：** 在 pull request 上评论 `@sourcery-ai resolve` 以解决所有 Sourcery 评论。如果您已经处理了所有评论并且不想再看到它们，这将非常有用。
- **驳回所有 Sourcery 审查：** 在 pull request 上评论 `@sourcery-ai dismiss` 以驳回所有现有的 Sourcery 审查。如果您想重新开始新的审查，这将特别有用 - 不要忘记评论 `@sourcery-ai review` 以触发新的审查！

#### 自定义您的体验

访问您的 [仪表板](https://app.sourcery.ai) 以：
- 启用或禁用审查功能，例如 Sourcery 生成的 pull request 摘要、审查者指南等。
- 更改审查语言。
- 添加、删除或编辑自定义审查说明。
- 调整其他审查设置。

#### 获取帮助

- [联系我们的支持团队](mailto:support@sourcery.ai) 提出问题或反馈。
- 访问我们的 [文档](https://docs.sourcery.ai) 获取详细的指南和信息。
- 通过在 [X/Twitter](https://x.com/SourceryAI), [LinkedIn](https://www.linkedin.com/company/sourcery-ai/) 或 [GitHub](https://github.com/sourcery-ai) 上关注我们，与 Sourcery 团队保持联系。

</details>

Original review guide in English

Reviewer's Guide by Sourcery

This pull request adds support for processing audio inputs for the Gemini model. It involves modifying the Gemini source adapter to handle audio data and updating the message processing pipeline to detect and include audio messages in the request sent to the model.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change	Details	Files
Implement functions to read audio files, determine their MIME type, and format them as inline data parts for the Gemini API.	Add `process_inline_data` function to handle inline data like audio. Add `encode_audio_data` function to read audio files and determine MIME type. Modify `assemble_context` to accept and process a list of audio URLs, formatting them as inline data. Update `text_chat` and `text_chat_stream` methods to accept and pass `audio_urls`.	`astrbot/core/provider/sources/gemini_source.py`
Modify the LLM request processing stage to detect audio message components and include their file paths in the provider request.	Modify the `process` method to iterate through message components. Add logic to detect `Record` type components (audio messages). Extract audio file paths from `Record` components and add them to the `ProviderRequest`. Add a default text prompt if the message contains only audio and no text.	`astrbot/core/pipeline/process_stage/method/llm_request.py`
Update the ProviderRequest entity to include a field for audio URLs.	Add an optional `audio_urls` field to the `ProviderRequest` dataclass.	`astrbot/core/provider/entities.py`

Possibly linked issues

[Feature] 希望实现用户音频输入的支持 #1343: The PR adds audio handling and includes audio data in the API request context for Gemini, implementing the issue's feature.

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

嘿 @AliveGh0st - 我已经查看了你的更改 - 这里有一些反馈：

总体评论：

考虑使仅存在音频时添加的默认提示可配置，而不是在 llm_request.py 中对其进行硬编码。
assemble_context 中处理图像 URL 和音频 URL 的逻辑具有相似之处；探索重构为更通用的媒体处理函数以减少重复。

以下是我在审查期间查看的内容

🟢 一般问题：一切看起来都不错
🟢 安全性：一切看起来都不错
🟢 测试：一切看起来都不错
🟡 复杂性：发现 1 个问题
🟢 文档：一切看起来都不错

Sourcery 对开源是免费的 - 如果你喜欢我们的评论，请考虑分享它们 ✨

_{帮助我变得更有用！请点击每个评论上的 👍 或 👎，我将使用反馈来改进你的评论。}

Original comment in English

Hey @AliveGh0st - I've reviewed your changes - here's some feedback:

Overall Comments:

Consider making the default prompt added when only audio is present configurable instead of hardcoding it in llm_request.py.
The logic for processing image URLs and audio URLs in assemble_context shares similarities; explore refactoring into a more generic media handling function to reduce duplication.

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

astrbot/core/provider/sources/gemini_source.py

astrbot/core/pipeline/process_stage/method/llm_request.py

Soulter · 2025-04-26T13:48:03Z

此外目前音频数据没有持久化保存下来，因此第二次请求的时候不会附带历史的音频信息。这个感觉可以研究一下该怎么存。

astrbot/core/provider/sources/gemini_source.py

anka-afk

支持, 话说是不是可以加一个设置, 启用stt来把音频存入上下文?

anka-afk · 2025-04-29T04:16:38Z

astrbot/core/provider/sources/gemini_source.py

+        except Exception as e:
+            logger.error(f"音频文件处理失败: {e}")
+            return None, None
+


refactor一下:

async def encode_audio_data(self, audio_url: str) -> tuple[Optional[bytes], Optional[str]]: """ 读取音频文件并返回二进制数据 Args: audio_url (str): 音频文件路径 Returns: tuple: (音频二进制数据, MIME类型) """ # 推断 MIME 类型 mime_type = mimetypes.guess_type(audio_url)[0] if not mime_type: extension_to_mime = { ".wav": "audio/wav", ".mp3": "audio/mpeg", ".ogg": "audio/ogg", ".flac": "audio/flac", ".m4a": "audio/mp4", } extension = os.path.splitext(audio_url)[1].lower() mime_type = extension_to_mime.get(extension, "application/octet-stream") try: # 直接读取文件二进制数据 with open(audio_url, "rb") as f: audio_bytes = f.read() logger.info(f"音频文件处理成功: {audio_url}，mime类型: {mime_type}，大小: {len(audio_bytes)} 字节")

anka-afk · 2025-04-29T04:18:30Z

astrbot/core/pipeline/process_stage/method/llm_request.py

+
+            # 如果只有音频没有文本，添加默认文本
+            if not req.prompt and has_audio:
+                req.prompt = "[用户发送的音频将其视为文本输入与其进行聊天]"


refactor一下:

# 处理消息中的图片和音频 for comp in event.message_obj.message: if isinstance(comp, Image): # 处理图片消息 image_path = await comp.convert_to_file_path() req.image_urls.append(image_path) elif isinstance(comp, Record): # 处理音频消息 audio_path = await comp.convert_to_file_path() req.audio_urls.append(audio_path) # 如果只有音频没有文本，添加默认文本 if not req.prompt and req.audio_urls: req.prompt = "[用户发送的音频将其视为文本输入与其进行聊天]"

Soulter · 2025-05-01T15:16:59Z

发现一个较为严重的问题，有些模型是不支持语音输入的，而当前的更改在消息链中带有 Record 时会自动传入给模型。

在这一块是否需要专门对多模态模型（包括后面也会适配的全模态的 Qwen2.5-Omni 模型）单独进行适配？

@Raven95676

Raven95676 · 2025-05-01T16:01:40Z

发现一个较为严重的问题，有些模型是不支持语音输入的，而当前的更改在消息链中带有 Record 时会自动传入给模型。

在这一块是否需要专门对多模态模型（包括后面也会适配的全模态的 Qwen2.5-Omni 模型）单独进行适配？

@Raven95676

感觉确实需要，但是我这能发语音的测试环境只有napcat（

AliveGh0st · 2025-05-01T17:28:11Z

发现一个较为严重的问题，有些模型是不支持语音输入的，而当前的更改在消息链中带有 Record 时会自动传入给模型。

在这一块是否需要专门对多模态模型（包括后面也会适配的全模态的 Qwen2.5-Omni 模型）单独进行适配？

@Raven95676

那服务提供商配置里加个总的开关？手动开启

✨feat: 添加对Gemini模型的音频处理支持，更新 ProviderRequest 以包含音频 URL 列表

a3d469d

sourcery-ai bot reviewed Apr 26, 2025

View reviewed changes

astrbot/core/provider/sources/gemini_source.py Outdated Show resolved Hide resolved

Soulter reviewed Apr 26, 2025

View reviewed changes

astrbot/core/pipeline/process_stage/method/llm_request.py Outdated Show resolved Hide resolved

Soulter approved these changes Apr 26, 2025

View reviewed changes

Soulter requested review from anka-afk and Raven95676 April 26, 2025 13:48

Soulter force-pushed the gemini_audio_support branch from bfca546 to a3d469d Compare April 26, 2025 14:55

🐛fix: 移除对req对象audio_urls属性无用的判断

da0eb2a

Raven95676 requested changes Apr 27, 2025

View reviewed changes

astrbot/core/provider/sources/gemini_source.py Outdated Show resolved Hide resolved

astrbot/core/provider/sources/gemini_source.py Outdated Show resolved Hide resolved

anka-afk approved these changes Apr 29, 2025

View reviewed changes

Raven95676 self-requested a review April 29, 2025 04:51

Raven95676 approved these changes Apr 29, 2025

View reviewed changes

AliveGh0st and others added 2 commits April 29, 2025 16:31

fix: 将音频上下文正确组装为OpenAI格式base64数据

7ac151a

🎈 perf: 完善接口定义

83bc1c5

Soulter changed the title ~~✨feat: 初步添加对Gemini模型的音频处理支持，更新 ProviderRequest 以包含音频 URL 列表~~ Feature 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 May 1, 2025

Soulter changed the title ~~Feature 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持~~ Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 May 1, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414

Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414

Uh oh!

AliveGh0st commented Apr 26, 2025 •

edited by sourcery-ai bot

Loading

Uh oh!

sourcery-ai bot commented Apr 26, 2025 •

edited

Loading

Reviewer's Guide by Sourcery

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Soulter commented Apr 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

anka-afk left a comment

Uh oh!

anka-afk Apr 29, 2025

Uh oh!

anka-afk Apr 29, 2025

Uh oh!

Soulter commented May 1, 2025 •

edited

Loading

Uh oh!

Raven95676 commented May 1, 2025

Uh oh!

AliveGh0st commented May 1, 2025

Uh oh!

Uh oh!

Uh oh!

Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414

Are you sure you want to change the base?

Feature: 添加对 Gemini 模型的音频模态和 Qwen2.5-Omni 全模态支持 #1414

Uh oh!

Conversation

AliveGh0st commented Apr 26, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Check

Sourcery 总结

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Apr 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide by Sourcery

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Soulter commented Apr 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anka-afk left a comment

Choose a reason for hiding this comment

Uh oh!

anka-afk Apr 29, 2025

Choose a reason for hiding this comment

Uh oh!

anka-afk Apr 29, 2025

Choose a reason for hiding this comment

Uh oh!

Soulter commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Raven95676 commented May 1, 2025

Uh oh!

AliveGh0st commented May 1, 2025

Uh oh!

Uh oh!

AliveGh0st commented Apr 26, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Apr 26, 2025 •

edited

Loading

Soulter commented Apr 26, 2025 •

edited

Loading

Soulter commented May 1, 2025 •

edited

Loading