fix: skill search ranking - use overview for embedding and fix visited set filtering by ZaynJarvis · Pull Request #228 · volcengine/OpenViking

ZaynJarvis · 2026-02-20T03:56:18Z

Problem

After add-skill, ov search returned incorrect or incomplete skill rankings:

- Some skills were completely missing from results
- Rankings were swapped (e.g. searching "adding memory" returned "searching-context", as reported in [Bug]: search()/find() of skills returns irrelevant and duplicate result #216)
- The more relevant a result was, the more likely it was to be dropped

Root Causes

1. Poor embedding text for skills (`skill_processor.py`)

What: The vectorization text for skills was set to just the short description field from the SKILL.md frontmatter, typically one sentence. The LLM-generated overview (a rich multi-paragraph summary) was produced only after the vectorization text had been set, so it was never used for embedding.

Compare with resources: semantic_processor.py correctly uses overview text for directory vectorization and full content/summary for files. Skills were the only context type embedding with just the short abstract.

Fix: Generate overview first, then use it as the vectorization text — aligning skills with how resources handle directory vectorization.
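
The ordering issue can be illustrated with a minimal sketch. Everything here is hypothetical (`SkillContext`, `generate_overview`, `build_before`, `build_after` are illustrative names, not the actual `skill_processor.py` API); it only shows why setting the vectorization text before the overview exists leaves the embedding with the one-sentence description.

```python
from dataclasses import dataclass


@dataclass
class SkillContext:
    """Hypothetical stand-in for a skill record; not the real OpenViking type."""
    description: str          # short frontmatter description from SKILL.md
    overview: str = ""        # longer LLM-generated summary
    vectorize_text: str = ""  # text that is handed to the embedder


def generate_overview(description: str) -> str:
    """Placeholder for the LLM call that writes the overview (assumption)."""
    return f"Overview: {description} Includes when to use the skill and examples."


def build_before(description: str) -> SkillContext:
    """Old ordering: the embedding text is fixed before the overview exists."""
    ctx = SkillContext(description=description)
    ctx.vectorize_text = ctx.description           # only the short abstract
    ctx.overview = generate_overview(description)  # generated too late to be used
    return ctx


def build_after(description: str) -> SkillContext:
    """Ordering described in the fix: generate the overview, then embed it."""
    ctx = SkillContext(description=description)
    ctx.overview = generate_overview(description)
    ctx.vectorize_text = ctx.overview
    return ctx
```

With this sketch, `build_before("Add a memory.").vectorize_text` is the bare description, while `build_after("Add a memory.").vectorize_text` is the longer overview string.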

2. Visited set drops the most relevant results (`hierarchical_retriever.py`)

What: The _recursive_search method conflated two concepts: "visited for directory traversal" and "collected as a search result." This caused the highest-scoring results to be systematically dropped.

Detailed flow (before fix):

The retriever works in stages:

1. Global vector search finds the top-3 non-leaf entries across the entire collection. For a query like "adding memory" against skills, this returns e.g. adding-memory (0.5), searching-context (0.3), openviking (0.2).
2. Merge starting points combines these with root URIs. The priority queue becomes: [(adding-memory, 0.5), (searching-context, 0.3), (openviking, 0.2), (viking://agent/skills, 0.0)]
3. Recursive search pops the highest score first:
   - Pops adding-memory (0.5) → adds to visited set → searches for children with parent_uri=adding-memory → 0 results (skills have no children)
   - Pops searching-context (0.3) → adds to visited → 0 children
   - Pops openviking (0.2) → adds to visited → 0 children
   - Pops viking://agent/skills (0.0) → searches children → finds all 4 skills
   - Old code: `if passes_threshold(score) and uri not in visited`. Three skills are already in visited, so they are silently skipped
   - Only adding-resource (not in global top-3) gets collected
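
The flow above can be reproduced with a toy model. This is a sketch under stated assumptions, not the code in `hierarchical_retriever.py`: the `CHILDREN` index, the `recursive_search_old` function, the threshold value, and the score for adding-resource are all made up for illustration; only the URIs and the three starting scores come from the example above.

```python
import heapq

# Toy index (assumption): the skills root has four children, each with a
# query-specific score. Skills themselves have no children.
CHILDREN = {
    "viking://agent/skills": {
        "adding-memory": 0.5,
        "searching-context": 0.3,
        "openviking": 0.2,
        "adding-resource": 0.24,  # illustrative score, not from the PR
    },
}

STARTING_POINTS = [
    ("adding-memory", 0.5),
    ("searching-context", 0.3),
    ("openviking", 0.2),
    ("viking://agent/skills", 0.0),
]


def recursive_search_old(starting_points, threshold=0.1):
    """Buggy variant: one `visited` set guards both traversal and collection."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    queue = [(-score, uri) for uri, score in starting_points]
    heapq.heapify(queue)
    visited = set()
    collected = {}
    while queue:
        _, uri = heapq.heappop(queue)
        if uri in visited:
            continue
        visited.add(uri)
        for child, score in CHILDREN.get(uri, {}).items():
            # Bug: a child that was already popped as a starting point is in
            # `visited`, so it is skipped even though it passes the threshold.
            if score >= threshold and child not in visited:
                collected[child] = score
                heapq.heappush(queue, (-score, child))
    return collected
```

Calling `recursive_search_old(STARTING_POINTS)` in this toy setup collects only adding-resource, matching the last step of the flow.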

When it happens: Any time global search returns URIs that are also children of a root/parent directory. This affects all context types (skills, resources, memories) since they share HierarchicalRetriever. Ironically, the most relevant results (highest global scores → processed earliest → visited first) are the ones most likely to be dropped.

Fix: Separate "visited for traversal" from "collected as result." The visited set now only prevents re-entering directories for child search. Results that pass the score threshold are always collected, regardless of visited status.
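
A minimal sketch of the separated logic, assuming a toy in-memory index. The names `CHILDREN`, `STARTING_POINTS`, and `recursive_search_fixed`, the threshold, and the adding-resource score are hypothetical and do not mirror the real `_recursive_search` signature; the point is only that `visited` gates traversal while collection depends on the threshold alone.

```python
import heapq

# Toy index (assumption): same shape as the example flow, illustrative scores.
CHILDREN = {
    "viking://agent/skills": {
        "adding-memory": 0.5,
        "searching-context": 0.3,
        "openviking": 0.2,
        "adding-resource": 0.24,
    },
}

STARTING_POINTS = [
    ("adding-memory", 0.5),
    ("searching-context", 0.3),
    ("openviking", 0.2),
    ("viking://agent/skills", 0.0),
]


def recursive_search_fixed(starting_points, threshold=0.1):
    """`visited` only prevents re-entering a node for child search."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    queue = [(-score, uri) for uri, score in starting_points]
    heapq.heapify(queue)
    visited = set()   # traversal bookkeeping only
    collected = {}    # results, keyed by URI, keeping the best score seen
    while queue:
        _, uri = heapq.heappop(queue)
        if uri in visited:
            continue
        visited.add(uri)
        for child, score in CHILDREN.get(uri, {}).items():
            if score < threshold:
                continue
            # Always collect a passing result, whether or not it was visited.
            collected[child] = max(score, collected.get(child, 0.0))
            # Only schedule the child for traversal if it has not been entered.
            if child not in visited:
                heapq.heappush(queue, (-score, child))
    # Return results ranked by score, highest first.
    return sorted(collected.items(), key=lambda item: item[1], reverse=True)
```

In this toy setup all four skills are returned, with adding-memory ranked first. The scores are illustrative, so the order of the remaining three need not match the real output listed under Testing.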

Testing

Before fix:

ov search "adding memory" -u viking://agent/skills  →  searching-context (wrong, only 1 result)
ov search "search context" -u viking://agent/skills  →  adding-memory (wrong, only 1 result)
ov search "RAG semantic search" -u viking://agent/skills  →  adding-memory (wrong, only 1 result)

After fix:

ov search "adding memory" -u viking://agent/skills
  → adding-memory #1 (0.508) ✓
  → adding-resource #2 (0.241)
  → openviking #3 (0.222)
  → searching-context #4 (0.185)

ov search "search context" -u viking://agent/skills
  → searching-context #1 (0.500) ✓
  → openviking #2 (0.323)
  → adding-resource #3 (0.190)
  → adding-memory #4 (0.163)

ov search "RAG semantic search" -u viking://agent/skills
  → openviking #1 (0.372) ✓
  → searching-context #2 (0.343)
  → adding-resource #3 (0.259)
  → adding-memory #4 (0.141)

All 4 skills appear with correct ranking order.

…d set filtering 1. skill_processor.py: Use LLM-generated overview for vectorization instead of short abstract, aligning with how resources handle directory vectorization in semantic_processor.py. 2. hierarchical_retriever.py: Separate 'visited for traversal' from 'collected as result'. The visited set previously dropped the most relevant results - global search found them first, marked visited, then parent directory search skipped them as children.

MaojiaSheng · 2026-02-20T04:16:47Z

lgtm

…verview (#229) Reverts the skill_processor embedding change from #228 while keeping the retriever fix. Skills should embed using the frontmatter description (abstract), not the LLM-generated overview.

github-project-automation bot added this to OpenViking project Feb 20, 2026

github-project-automation bot moved this to Backlog in OpenViking project Feb 20, 2026

ZaynJarvis changed the title ~~fix: skill search ranking - use overview for embedding and fix visited set filtering~~ fix: hierarchical retriever visited set drops most relevant search results Feb 20, 2026

ZaynJarvis force-pushed the fix/skill-embedding-ranking branch from ee4532c to b7fe917 February 20, 2026 04:06

ZaynJarvis changed the title ~~fix: hierarchical retriever visited set drops most relevant search results~~ fix: skill search ranking - use overview for embedding and fix visited set filtering Feb 20, 2026

MaojiaSheng approved these changes Feb 20, 2026

MaojiaSheng merged commit 3f65a32 into volcengine:main Feb 20, 2026
4 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Feb 20, 2026

ZaynJarvis mentioned this pull request Feb 20, 2026

fix: use frontmatter description for skill vectorization instead of overview #229

Merged
