
fix(storage): sanitize vectordb unicode recovery#2103

Merged: zhoujh01 merged 1 commit into main from fix/vectordb-unicode-recovery on May 18, 2026

Conversation

@qin-ctx (Collaborator) commented May 18, 2026

Description

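Fixes unstable handling of lone Unicode surrogates when the local vectordb writes and recovers field JSON.

This change applies a uniform Unicode sanitization pass to field data that enters the context collection after a session commit, before that data is written. Ordinary Chinese text and emoji are preserved as-is; lone surrogates (for example \ud800) are replaced with \ufffd, so the downstream JSON/UTF-8/index replay path never encounters unencodable characters.

It also hardens local index recovery. Recovery still prefers replaying deltas in batches; if a batch fails because of a historical dirty record, it automatically falls back to replaying that batch record by record, skipping bad records and logging a warning, so a single dirty record cannot block recovery of the whole collection/index.

Examples:

  • Normal data: {"abstract": "face 😀"} is still stored as face 😀
  • Bad data: {"abstract": "bad \ud800"} is sanitized to bad �
  • Historical recovery: if a delta batch contains only one bad record, the other records are still recovered, and the service no longer fails to start because that batch's replay failed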
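The string-level behavior in the examples above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the name `sanitize_text` is hypothetical, and the sketch replaces every surrogate code point, including any that happen to sit next to each other as a pair inside a Python `str`. A normal emoji is a single code point above U+FFFF in Python, so it is left untouched.

```python
import re

# Matches any code point in the UTF-16 surrogate range (U+D800..U+DFFF).
# A regular emoji such as U+1F600 is outside this range and never matches.
_SURROGATE_RE = re.compile("[\ud800-\udfff]")


def sanitize_text(text: str) -> str:
    """Replace surrogate code points with U+FFFD (hypothetical helper)."""
    return _SURROGATE_RE.sub("\ufffd", text)


if __name__ == "__main__":
    # Encode with backslash escapes so the result is printable on any terminal.
    print(sanitize_text("face 😀").encode("unicode_escape"))
    print(sanitize_text("bad \ud800").encode("unicode_escape"))
```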

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

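  • Added a vectordb JSON-safe serialization utility that recursively sanitizes lone UTF-16 surrogates in strings while preserving valid non-BMP characters (such as emoji).
  • Local collection fields are now written with the safe JSON dump, using ensure_ascii=False so that normal Chinese text and emoji are not split into \uXXXX escapes.
  • DataProcessor.convert_fields_for_index() now sanitizes Unicode after JSON deserialization, then performs field type conversion and serialization.
  • PersistCollection._recover() gains a fallback to record-by-record replay after a batch failure; a single bad delta is skipped and a warning is logged.
  • Added unit tests covering lone surrogate sanitization and preservation of valid emoji.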
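The batch-then-per-record fallback described for PersistCollection._recover() might look roughly like the sketch below. It is not the PR's implementation: `replay_with_fallback` and the `apply_batch` callback are hypothetical names, and the sketch assumes that a failed batch leaves no partial state behind (or that replay is idempotent), so re-applying its records one at a time is safe.

```python
import logging
from typing import Callable, Iterable, Sequence, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def replay_with_fallback(
    batches: Iterable[Sequence[T]],
    apply_batch: Callable[[Sequence[T]], None],
) -> int:
    """Replay delta batches; on a batch failure, retry record by record.

    Returns the number of records that were skipped.
    Assumption: apply_batch is atomic or idempotent.
    """
    skipped = 0
    for batch in batches:
        try:
            apply_batch(batch)  # fast path: the whole batch at once
            continue
        except Exception as exc:
            logger.warning("batch replay failed, retrying per record: %s", exc)
        for record in batch:
            try:
                apply_batch([record])
            except Exception as exc:
                skipped += 1
                logger.warning("skipping bad delta record: %s", exc)
    return skipped
```

Keeping the batch path as the default preserves the normal recovery speed; the slower per-record path is only paid for batches that actually contain a bad record.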

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Test command:

.venv/bin/python -m pytest tests/vectordb/test_data_processor.py

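Additional verification:

  • Manually created a local collection + index, wrote "bad \ud800 ok 😀", and read back "bad � ok 😀"
  • Closed and reopened the collection, then verified via search_by_vector that index recovery and search work correctly
  • git diff --check passes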

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

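This fix works at two levels:

  1. On the write side, it prevents new dirty data from entering the vectordb.
  2. On the recovery side, it tolerates historical dirty deltas that already exist, so a single bad record does not affect service startup.

ensure_ascii=False is not a complete fix on its own, so surrogates are sanitized before serialization; otherwise a lone surrogate can still fail during UTF-8 encoding or in the underlying index processing.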
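The point about ensure_ascii=False can be reproduced with the standard library alone. In the sketch below, `sanitize_value`, `safe_json_dumps`, and `utf8_encodable` are hypothetical names rather than the functions added in json_safety.py; it shows that `json.dumps(..., ensure_ascii=False)` passes a lone surrogate through unchanged, so the later UTF-8 encode fails unless the value is sanitized first.

```python
import json
import re
from typing import Any

_SURROGATE_RE = re.compile("[\ud800-\udfff]")


def sanitize_value(value: Any) -> Any:
    """Recursively replace surrogate code points in strings with U+FFFD."""
    if isinstance(value, str):
        return _SURROGATE_RE.sub("\ufffd", value)
    if isinstance(value, dict):
        return {sanitize_value(k): sanitize_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_value(v) for v in value]
    return value  # numbers, bools, None pass through unchanged


def safe_json_dumps(value: Any) -> str:
    """Sanitize first, then dump without escaping non-ASCII characters."""
    return json.dumps(sanitize_value(value), ensure_ascii=False)


def utf8_encodable(text: str) -> bool:
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    record = {"abstract": "bad \ud800", "tags": ["face 😀"]}
    raw = json.dumps(record, ensure_ascii=False)
    print(utf8_encodable(raw))                      # lone surrogate survives the dump
    print(utf8_encodable(safe_json_dumps(record)))  # sanitized before dumping
```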

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add JSON unicode sanitization for vectordb fields

Relevant files:

  • openviking/storage/vectordb/utils/json_safety.py
  • openviking/storage/vectordb/collection/local_collection.py
  • openviking/storage/vectordb/utils/data_processor.py
  • tests/vectordb/test_data_processor.py

Sub-PR theme: Add batch failure fallback for index delta recovery

Relevant files:

  • openviking/storage/vectordb/collection/local_collection.py

⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@zhoujh01 zhoujh01 merged commit 813db21 into main May 18, 2026
5 checks passed
@zhoujh01 zhoujh01 deleted the fix/vectordb-unicode-recovery branch May 18, 2026 03:58
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 18, 2026