fix(storage): sanitize vectordb unicode recovery by qin-ctx · Pull Request #2103 · volcengine/OpenViking

qin-ctx · 2026-05-18T03:52:24Z

Description

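Fixes unstable handling of lone Unicode surrogates when the local vectordb writes and recovers field JSON.

This change sanitizes the field data that enters the context collection after a session commit, before it is written. Normal Chinese text and emoji are preserved as-is, while lone surrogates (for example \ud800) are replaced with \ufffd, so the downstream JSON/UTF-8/index replay path never encounters unencodable characters.

It also hardens local index recovery. Recovery still replays deltas in batches first; if a batch fails because of a historical dirty record, it automatically falls back to record-by-record replay, skips the bad records, and logs a warning, so a single dirty record cannot block recovery of the whole collection/index.

Examples:

Normal data: {"abstract": "face 😀"} is still preserved as face 😀
Abnormal data: {"abstract": "bad \ud800"} is sanitized to bad �
Historical recovery: if a batch of deltas contains only one bad record, the other valid records are still recovered, and the service no longer fails to start because that batch replay failed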
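
The sanitization behaviour in the examples above can be sketched roughly as follows. This is a minimal illustration, not the code in json_safety.py: the names `sanitize_unicode` and `safe_json_dumps` are hypothetical, and the sketch assumes that valid emoji are already stored as single code points (which is what `json.loads` produces), so any remaining surrogate code point in a Python `str` is a lone one.

```python
import json
import re
from typing import Any

# Matches any UTF-16 surrogate code point. In a Python str, a valid non-BMP
# character such as an emoji is a single code point and never matches this.
_SURROGATE_RE = re.compile(r"[\ud800-\udfff]")


def sanitize_unicode(value: Any) -> Any:
    """Recursively replace lone surrogates in strings with U+FFFD (hypothetical helper)."""
    if isinstance(value, str):
        return _SURROGATE_RE.sub("\ufffd", value)
    if isinstance(value, dict):
        return {sanitize_unicode(k): sanitize_unicode(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_unicode(v) for v in value]
    return value  # numbers, booleans, None are returned unchanged


def safe_json_dumps(value: Any) -> str:
    """Sanitize first, then dump without \\uXXXX escaping (hypothetical helper)."""
    return json.dumps(sanitize_unicode(value), ensure_ascii=False)


clean = safe_json_dumps({"abstract": "bad \ud800 ok 😀"})
clean.encode("utf-8")  # succeeds because the lone surrogate is gone
```

One limitation of this sketch: a string that holds an unmerged high/low surrogate pair as two separate code points would have both halves replaced. Handling that case would need an extra merge step.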

Related Issue

N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

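Added a safe JSON serialization utility for vectordb that recursively sanitizes lone UTF-16 surrogates in strings while preserving valid non-BMP characters (such as emoji).
Local collection field writes now use the safe JSON dump with ensure_ascii=False, so normal Chinese text and emoji are not split into \uXXXX escapes.
DataProcessor.convert_fields_for_index() now sanitizes Unicode after JSON deserialization, before field type conversion and serialization.
PersistCollection._recover() now falls back to record-by-record replay when a batch fails; a single bad delta is skipped and a warning is logged.
Added unit tests covering lone surrogate sanitization and preservation of valid emoji.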
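
The batch-then-single fallback described for PersistCollection._recover() might look roughly like the sketch below. It is not the project's implementation: `replay_with_fallback`, `apply_batch`, and `apply_one` are hypothetical names, and the sketch assumes that a failed batch leaves nothing applied (or that re-applying a delta is harmless), which the PR text does not state.

```python
import logging
from typing import Callable, Sequence, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def replay_with_fallback(
    deltas: Sequence[T],
    apply_batch: Callable[[Sequence[T]], None],  # hypothetical batch replay hook
    apply_one: Callable[[T], None],              # hypothetical single-record hook
    batch_size: int = 100,
) -> int:
    """Replay deltas in batches; on batch failure, retry one by one.

    Returns the number of records that were skipped.
    """
    skipped = 0
    for start in range(0, len(deltas), batch_size):
        batch = deltas[start:start + batch_size]
        try:
            apply_batch(batch)  # fast path: the whole batch at once
            continue
        except Exception as exc:
            logger.warning("batch replay failed at offset %d: %s", start, exc)

        # Slow path: isolate the bad record(s) so the rest still recover.
        for offset, delta in enumerate(batch):
            try:
                apply_one(delta)
            except Exception as exc:
                skipped += 1
                logger.warning("skipping bad delta at index %d: %s", start + offset, exc)
    return skipped
```

Catching the broad `Exception` is deliberate in this sketch, since the goal is to keep startup going regardless of how a dirty record fails; a real implementation may prefer a narrower exception type.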

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Test command:

.venv/bin/python -m pytest tests/vectordb/test_data_processor.py

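Additional verification:

Manually created a local collection + index, wrote bad \ud800 ok 😀, and read back bad � ok 😀
Closed and reopened the collection, then used search_by_vector to verify that index recovery and search work correctly
git diff --check passes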

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

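This fix works at two levels:

The write side stops new dirty data from entering vectordb.
The recovery side tolerates historical dirty deltas that already exist, so a single bad record cannot affect service startup.

ensure_ascii=False is not a complete fix on its own, so surrogates are sanitized before serialization; otherwise a lone surrogate can still fail during UTF-8 encoding or in the underlying index processing.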
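
The point about ensure_ascii=False can be checked with the standard library alone. The snippet below is an illustrative sketch (the helper `can_encode_utf8` is introduced here, not taken from the PR): with ensure_ascii=False the lone surrogate passes through `json.dumps` unchanged and only fails later at UTF-8 encoding, while the default ensure_ascii=True merely defers the problem until the value is parsed again.

```python
import json


def can_encode_utf8(text: str) -> bool:
    """Return True if text can be encoded as UTF-8 (illustrative helper)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


value = "bad \ud800"  # contains a lone high surrogate

escaped = json.dumps(value)                  # default: ASCII-only, escaped as \ud800
raw = json.dumps(value, ensure_ascii=False)  # surrogate kept as a raw character

escaped_ok = can_encode_utf8(escaped)              # the escaped form is plain ASCII
raw_ok = can_encode_utf8(raw)                      # the raw surrogate is not encodable
roundtrip_ok = can_encode_utf8(json.loads(escaped))  # parsing restores the surrogate
```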

github-actions · 2026-05-18T03:53:27Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add JSON unicode sanitization for vectordb fields
Relevant files:
openviking/storage/vectordb/utils/json_safety.py
openviking/storage/vectordb/collection/local_collection.py
openviking/storage/vectordb/utils/data_processor.py
tests/vectordb/test_data_processor.py

Sub-PR theme: Add batch failure fallback for index delta recovery
Relevant files:
openviking/storage/vectordb/collection/local_collection.py

⚡ No major issues detected

github-actions · 2026-05-18T03:54:39Z

PR Code Suggestions ✨

No code suggestions found for the PR.

fix(storage): sanitize vectordb unicode recovery

282bad7

github-project-automation Bot added this to OpenViking project May 18, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 18, 2026

zhoujh01 approved these changes May 18, 2026

View reviewed changes

zhoujh01 merged commit 813db21 into main May 18, 2026
5 checks passed

zhoujh01 deleted the fix/vectordb-unicode-recovery branch May 18, 2026 03:58

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 18, 2026
