fix: skill search ranking - use overview for embedding and fix visited set filtering #228
Merged
MaojiaSheng merged 1 commit into volcengine:main on Feb 20, 2026
…d set filtering
1. skill_processor.py: Use LLM-generated overview for vectorization instead of short abstract, aligning with how resources handle directory vectorization in semantic_processor.py.
2. hierarchical_retriever.py: Separate 'visited for traversal' from 'collected as result'. The visited set previously dropped the most relevant results: global search found them first and marked them visited, then the parent directory search skipped them as children.
ee4532c to b7fe917
Collaborator
lgtm
MaojiaSheng approved these changes on Feb 20, 2026
ZaynJarvis added a commit to ZaynJarvis/OpenViking that referenced this pull request on Feb 20, 2026
…verview Reverts the skill_processor embedding change from volcengine#228 while keeping the retriever fix. Skills should embed using the frontmatter description (abstract), not the LLM-generated overview.
Problem
After add-skill, ov search returned incorrect/incomplete skill rankings.
Root Causes
1. Poor embedding text for skills (skill_processor.py)
What: The vectorization text for skills was set to just the short description field from SKILL.md frontmatter, typically one sentence. The LLM-generated overview (a rich multi-paragraph summary) was generated after the vectorize text was set and was never used for embedding.
Compare with resources: semantic_processor.py correctly uses overview text for directory vectorization and full content/summary for files. Skills were the only context type embedding with just the short abstract.
Fix: Generate the overview first, then use it as the vectorization text, aligning skills with how resources handle directory vectorization.
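The ordering change can be illustrated with a minimal sketch. The names below (SkillContext, build_vectorize_text, generate_overview) are hypothetical and are not taken from skill_processor.py, whose code is not shown in this description; the fallback to the abstract when the overview is empty is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillContext:
    """Hypothetical stand-in for a parsed skill, not the real OpenViking type."""
    name: str
    abstract: str             # short description from the SKILL.md frontmatter
    overview: str = ""        # LLM-generated summary, filled in below
    vectorize_text: str = ""  # text that would be sent to the embedding model


def build_vectorize_text(
    skill: SkillContext,
    generate_overview: Callable[[SkillContext], str],
) -> SkillContext:
    # Generate the overview first, so it exists before the embedding text is chosen.
    skill.overview = generate_overview(skill)
    # Assumption: fall back to the short abstract if the overview comes back empty.
    skill.vectorize_text = skill.overview or skill.abstract
    return skill


def fake_overview(skill: SkillContext) -> str:
    # Placeholder for the LLM call; a real overview would be multi-paragraph.
    return f"{skill.name}: {skill.abstract} (expanded summary)"


skill = build_vectorize_text(
    SkillContext(name="adding-memory", abstract="Add a memory."),
    fake_overview,
)
print(skill.vectorize_text)
```

The point of the sketch is only the order of operations: the embedding text is assigned after the overview has been generated, not before.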
2. Visited set drops the most relevant results (hierarchical_retriever.py)
What: The _recursive_search method conflated two concepts: "visited for directory traversal" and "collected as a search result." This caused the highest-scoring results to be systematically dropped.
Detailed flow (before fix):
The retriever works in stages:
  1. Global vector search finds top-3 non-leaf entries across the entire collection. For a query like "adding memory" against skills, this returns e.g. adding-memory (0.5), searching-context (0.3), openviking (0.2).
  2. Merge starting points combines these with root URIs. Priority queue becomes: [(adding-memory, 0.5), (searching-context, 0.3), (openviking, 0.2), (viking://agent/skills, 0.0)]
  3. Recursive search pops highest-score first:
     - adding-memory (0.5) → adds to visited set → searches for children with parent_uri=adding-memory → 0 results (skills have no children)
     - searching-context (0.3) → adds to visited → 0 children
     - openviking (0.2) → adds to visited → 0 children
     - viking://agent/skills (0.0) → searches children → finds all 4 skills
     - if passes_threshold(score) and uri not in visited: three skills are in visited, so they are silently skipped
     - adding-resource (not in global top-3) gets collected
When it happens: Any time global search returns URIs that are also children of a root/parent directory. This affects all context types (skills, resources, memories) since they share HierarchicalRetriever. Ironically, the most relevant results (highest global scores → processed earliest → visited first) are the ones most likely to be dropped.
Fix: Separate "visited for traversal" from "collected as result." The visited set now only prevents re-entering directories for child search. Results that pass the score threshold are always collected, regardless of visited status.
Testing
Before fix:
After fix:
All 4 skills appear with correct ranking order.
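The visited-set behavior described under root cause 2 can be reproduced with a toy model. This is a sketch under stated assumptions, not the code in hierarchical_retriever.py: the function recursive_search, its separate_sets flag, the in-memory SCORES and CHILDREN tables, and the 0.1 score for adding-resource are all made up for illustration. Only the three global scores, the root URI, and the passes_threshold / visited check come from the description.

```python
import heapq

# Scores for the query "adding memory"; the adding-resource value is an assumption.
SCORES = {
    "adding-memory": 0.5,
    "searching-context": 0.3,
    "openviking": 0.2,
    "adding-resource": 0.1,
}
ROOT = "viking://agent/skills"
CHILDREN = {ROOT: list(SCORES)}  # the skills themselves have no children

# Global top-3 merged with the root URI, as in stage 2 of the walkthrough.
STARTING_POINTS = [
    ("adding-memory", 0.5),
    ("searching-context", 0.3),
    ("openviking", 0.2),
    (ROOT, 0.0),
]


def passes_threshold(score, threshold=0.0):
    return score >= threshold


def recursive_search(starting_points, separate_sets=True):
    """Return collected URIs, highest score first (toy model)."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    queue = [(-score, uri) for uri, score in starting_points]
    heapq.heapify(queue)
    visited = set()   # guards directory traversal
    collected = {}    # search results
    while queue:
        _, uri = heapq.heappop(queue)
        if uri in visited:
            continue
        visited.add(uri)
        for child in CHILDREN.get(uri, []):
            score = SCORES[child]
            if not passes_threshold(score):
                continue
            if not separate_sets and child in visited:
                continue  # pre-fix behavior: visited also blocks collection
            collected[child] = score
    return sorted(collected, key=collected.get, reverse=True)


print(recursive_search(STARTING_POINTS, separate_sets=False))
print(recursive_search(STARTING_POINTS, separate_sets=True))
```

With separate_sets=False the model returns only adding-resource, matching the walkthrough; with separate_sets=True it returns all four skills ordered adding-memory, searching-context, openviking, adding-resource.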