fix(storage): sanitize vectordb unicode recovery #2103
Merged
zhoujh01
approved these changes
May 18, 2026
Description
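Fixes unstable handling of lone Unicode surrogates when the local vectordb writes and recovers field JSON.
With this change, field data that enters the context collection after a session commit is uniformly sanitized for Unicode before it is written: normal Chinese text and emoji are preserved as-is, while lone surrogates (for example \ud800) are replaced with \ufffd, so the downstream JSON / UTF-8 / index replay path never encounters unencodable characters. Local index recovery is also hardened: recovery still prefers batch replay of deltas; if a batch fails because of historical dirty records, it automatically falls back to per-record replay, skips the bad records, and logs a warning, so that a single dirty record cannot block recovery of the whole collection/index.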
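The two behaviours described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: sanitize_text, sanitize_value, replay_with_fallback, apply_batch, and apply_one are hypothetical names, and the sketch assumes that a failed batch replay leaves the index untouched (all-or-nothing), so replaying record by record afterwards does not apply anything twice.

```python
import logging
from typing import Any, Callable, Sequence

logger = logging.getLogger(__name__)


def sanitize_text(text: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD.

    In a Python str, a valid emoji is a single code point outside this
    range, so normal Chinese text and emoji pass through unchanged.
    """
    return "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text)


def sanitize_value(value: Any) -> Any:
    """Recursively sanitize strings inside dicts and lists."""
    if isinstance(value, str):
        return sanitize_text(value)
    if isinstance(value, dict):
        return {sanitize_value(k): sanitize_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_value(v) for v in value]
    return value  # numbers, bools, None are left alone


def replay_with_fallback(
    deltas: Sequence[Any],
    apply_batch: Callable[[Sequence[Any]], None],  # hypothetical batch replay hook
    apply_one: Callable[[Any], None],              # hypothetical single-record hook
) -> int:
    """Replay deltas in one batch; on failure, replay one by one.

    Returns the number of records that were skipped.
    Assumes apply_batch is all-or-nothing when it raises.
    """
    try:
        apply_batch(deltas)
        return 0
    except Exception as exc:
        logger.warning("batch replay failed (%s); falling back to per-record replay", exc)

    skipped = 0
    for delta in deltas:
        try:
            apply_one(delta)
        except Exception as exc:
            skipped += 1
            logger.warning("skipping bad delta: %s", exc)
    return skipped
```

Replacing the surrogate with U+FFFD rather than dropping it keeps the position of the damaged character visible in the stored text, which matches the behaviour described in this PR.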
Example:
- {"abstract": "face 😀"} is kept as face 😀
- {"abstract": "bad \ud800"} is sanitized to bad �
Related Issue
N/A
Type of Change
Changes Made
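- ensure_ascii=False keeps normal Chinese text and emoji from being split into \uXXXX escapes.
- DataProcessor.convert_fields_for_index() sanitizes Unicode after JSON deserialization, then performs field type conversion and serialization.
- PersistCollection._recover() adds a per-record replay fallback after a batch failure; a single bad delta is skipped and a warning is logged.
Testing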
Test commands:
Additional verification:
- bad \ud800 ok 😀 is read back as bad � ok 😀
- search_by_vector confirms that index recovery and search work correctly
- git diff --check passes
Checklist
Screenshots (if applicable)
N/A
Additional Notes
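This fix works at two levels:
ensure_ascii=False is not a complete fix by itself, so surrogates are sanitized before serialization; otherwise a lone surrogate can still fail during UTF-8 encoding or in the underlying index processing.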
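A small standalone check of this point, using only the Python standard library. The record contents and variable names are illustrative and are not taken from the PR's test suite; the inline sanitization is a simplified stand-in for whatever the project actually does.

```python
import json

record = {"abstract": "bad \ud800 ok 😀"}

# ensure_ascii=False keeps the emoji readable, but it also passes the
# lone surrogate through into the serialized string unchanged.
raw = json.dumps(record, ensure_ascii=False)

try:
    raw.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # Strict UTF-8 cannot represent a lone surrogate.
    encodable = False

# Sanitizing before serialization makes the same payload encodable.
clean = {
    key: "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in value)
    for key, value in record.items()
}
payload = json.dumps(clean, ensure_ascii=False).encode("utf-8")
```

With the default ensure_ascii=True, the surrogate would be written as the escape \ud800 and the bytes would encode, but loading the JSON again would reproduce the same lone surrogate, so the problem would only move downstream.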