feat(config): configure retrieval scoring and embedding input #1770

Merged
zhoujh01 merged 6 commits into main from feat/retrieval-hotness-alpha-config
Apr 28, 2026
@qin-ctx (Collaborator) commented Apr 28, 2026

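Background

This PR addresses two configuration problems:

  1. Retrieval scores blend in hotness by default, so the final score is not necessarily 1 even when the vectors are identical.
  2. Text resource vectorization prefers the summary by default, and the raw-content fallback has no configurable input length limit.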

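Changes

  • Adds retrieval.hotness_alpha, default 0.0. The hotness boost is off by default, so the final score goes back to pure semantic similarity.
  • Adds retrieval.score_propagation_alpha, default 0.5, which sets the blend ratio between a child node's own score and the score propagated from its parent in hierarchical retrieval.
  • Changes the embedding.text_source default to content_only; supported values are content_only / summary_first / summary_only.
  • Adds embedding.max_input_tokens, default 4096, which caps the estimated token count of raw content before it is sent to the embedding model.
  • Raw-content embedding truncation uses a local estimate inside openviking/utils/embedding_utils.py only. It adds no public token utility and does not change the existing token estimation logic of Message/Session/Rerank/Bot/Markdown.
  • Does not introduce tiktoken or a provider-specific tokenizer; a "token" here is a heuristic estimate used only for embedding input truncation.
  • Updates the README, the English/Chinese configuration docs, and examples/ov.conf.example.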
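
The interaction between embedding.text_source and embedding.max_input_tokens described above can be sketched roughly as follows. This is an illustration only, not the code in openviking/utils/embedding_utils.py: the function names, the characters-per-token heuristic, and the choice to leave summaries untruncated are all assumptions made for the example.

```python
import math


def estimate_tokens(text: str) -> int:
    """Hypothetical heuristic: about 4 ASCII chars per token, 1 token per non-ASCII char."""
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return math.ceil(ascii_chars / 4) + (len(text) - ascii_chars)


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Return the longest prefix whose estimated token count fits within max_tokens."""
    if estimate_tokens(text) <= max_tokens:
        return text
    # The estimate never decreases as the prefix grows, so binary search is valid.
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if estimate_tokens(text[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid - 1
    return text[:lo]


def select_embedding_text(
    content: str,
    summary: str,
    text_source: str = "content_only",
    max_input_tokens: int = 4096,
) -> str:
    """Pick the text to embed; only raw content is truncated (an assumption)."""
    if text_source not in ("content_only", "summary_first", "summary_only"):
        raise ValueError(f"unsupported text_source: {text_source!r}")
    if text_source == "summary_only":
        return summary
    if text_source == "summary_first" and summary:
        return summary
    # content_only, or summary_first with an empty summary: fall back to raw content.
    return truncate_to_tokens(content, max_input_tokens)
```

With the defaults, the raw content is embedded as written and is shortened only when its estimated size exceeds 4096 tokens. Because the count is an estimate, the truncated text may be somewhat longer or shorter than 4096 tokens under a real model tokenizer.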

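Compatibility notes

  • hotness_alpha=0.0 turns off the default hotness blending, so query scores are closer to vector similarity.
  • score_propagation_alpha=0.5 keeps the previous 50/50 blend of parent and child scores.
  • The embedding.text_source default changes from preferring the summary to preferring the raw content, so that the vectorized content of an ordinary text resource matches what the user actually wrote.
  • max_input_tokens is an estimated token count; it is not guaranteed to equal the exact token count of any specific model tokenizer.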
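
To make the hotness_alpha=0.0 behavior concrete, here is a minimal sketch of a hotness blend. The PR description does not state the blending formula, so the linear interpolation, the [0, 1] range check, and the function name below are assumptions for illustration rather than the logic in hierarchical_retriever.py.

```python
def blend_hotness(
    semantic_score: float,
    hotness_score: float,
    hotness_alpha: float = 0.0,
) -> float:
    """Blend a semantic score with a hotness score (assumed linear interpolation)."""
    if not 0.0 <= hotness_alpha <= 1.0:
        raise ValueError("hotness_alpha must be within [0, 1] (assumed range)")
    if hotness_alpha == 0.0:
        # Hotness is disabled: return the semantic score untouched.
        return semantic_score
    return (1.0 - hotness_alpha) * semantic_score + hotness_alpha * hotness_score
```

Under this assumed formula, an exact vector match with semantic score 1.0 stays at 1.0 when hotness_alpha is 0.0, whereas any positive alpha pulls it toward the item's hotness score.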

Tests

  • .venv/bin/python -m pytest tests/unit/test_vectorize_file_strategy.py tests/unit/test_embedding_vectorize_strategy.py tests/retrieve/test_hierarchical_retriever_rerank.py::test_score_propagation_alpha_uses_configured_weight tests/retrieve/test_hierarchical_retriever_rerank.py::test_retrieval_hotness_alpha_blends_when_configured tests/retrieve/test_hierarchical_retriever_rerank.py::test_default_retrieval_config_uses_semantic_score_without_hotness tests/test_config_loader.py::test_openviking_config_retrieval_hotness_alpha_defaults_to_zero tests/test_config_loader.py::test_openviking_config_retrieval_alpha_validates_range: 16 passed, 4 warnings
  • .venv/bin/python -m ruff check openviking/utils/embedding_utils.py openviking_cli/utils/config/embedding_config.py openviking_cli/utils/config/retrieval_config.py openviking/retrieve/hierarchical_retriever.py tests/unit/test_vectorize_file_strategy.py tests/unit/test_embedding_vectorize_strategy.py tests/retrieve/test_hierarchical_retriever_rerank.py tests/test_config_loader.py: passed
  • git diff --check origin/main..HEAD: passed
  • jq empty examples/ov.conf.example: passed

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@qin-ctx changed the title from "feat(retrieval): configure hotness score blending" to "feat(config): configure retrieval scoring and embedding input" Apr 28, 2026
@zhoujh01 merged commit b35d38a into main Apr 28, 2026
8 checks passed
@zhoujh01 deleted the feat/retrieval-hotness-alpha-config branch April 28, 2026 11:04
@github-project-automation (Bot) moved this from Backlog to Done in OpenViking project Apr 28, 2026
r266-tech added a commit to r266-tech/OpenViking that referenced this pull request Apr 29, 2026
volcengine#1770 refactored the hardcoded `SCORE_PROPAGATION_ALPHA = 0.5` constant
in `openviking/retrieve/hierarchical_retriever.py` into a `RetrievalConfig`
field but set the new field default to 0.0. With the existing formula

    final_score = alpha * embedding_score + (1 - alpha) * parent_score

a default of 0.0 silently drops every child's own embedding score and
keeps only the propagated parent score, changing default ranking
behavior for any caller that does not provide an explicit `retrieval`
block.

The same PR's README, docs/{en,zh}/concepts/07-retrieval.md,
docs/{en,zh}/guides/01-configuration.md and examples/ov.conf.example all
state the default is `0.5` and explicitly note that `0.5 keeps the
existing equal blend`. Restoring the field default to `0.5` aligns the
shipped code with the documented and pre-refactor behavior.
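
The effect described in this commit message can be checked directly against the quoted formula. The sketch below uses a hypothetical helper name, propagate_score; only the formula itself comes from the commit message.

```python
def propagate_score(
    embedding_score: float,
    parent_score: float,
    alpha: float = 0.5,
) -> float:
    """final_score = alpha * embedding_score + (1 - alpha) * parent_score."""
    return alpha * embedding_score + (1.0 - alpha) * parent_score


child_score, parent_score = 0.9, 0.4

# alpha = 0.5: equal blend of the child's own score and the parent's score.
balanced = propagate_score(child_score, parent_score, alpha=0.5)

# alpha = 0.0: the child's own embedding score is dropped entirely.
parent_only = propagate_score(child_score, parent_score, alpha=0.0)
```

With alpha=0.0 every child under the same parent receives the same final score, so the children can no longer be ranked against each other by their own similarity to the query.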
A0nameless0man pushed a commit to A0nameless0man/OpenViking that referenced this pull request Apr 30, 2026
* feat(retrieval): configure hotness score blending

* feat(retrieval): configure score propagation alpha

* test(retrieval): trim redundant propagation coverage

* feat(embedding): centralize token estimation

* fix(embedding): use shared token estimator

* fix(embedding): narrow token truncation scope
sponge225 added a commit to sponge225/OpenViking that referenced this pull request May 6, 2026
PR volcengine#1770 changes:
- RetrievalConfig: configurable hotness_alpha(0.0) and score_propagation_alpha(0.5)
- hierarchical_retriever: skip hotness calc when alpha=0
- viking_fs/core: pass retrieval_config through pipeline
- embedding_utils: token-based truncation (max_input_tokens=4096)
- embedding_config: text_source default to content_only
- open_viking_config: add retrieval field

Other changes:
- benchmark/RAG: nanobot runner, L0/L1 analysis script, pipeline updates
- bot/vikingbot: context and tool factory updates
