
feat(embedding): skip re-embedding when content hash is unchanged#890

Closed
mvanhorn wants to merge 1 commit into volcengine:main from mvanhorn:osc/feat-embedding-content-hash-cache

Conversation

@mvanhorn
Contributor

Description

Add SHA-256 content hash check to vectorize_file() before enqueuing embedding messages. If the text to embed matches a previously stored hash, the embedding API call is skipped. Hashes are stored as sidecar files via AGFS.

Related Issue

Relates to #350

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added skip_unchanged parameter to vectorize_file() (default: True)
  • Before enqueuing an embedding message, computes SHA-256 of the text to embed
  • Reads existing hash from {file_path}.__embed_hash sidecar file
  • If hash matches, skips the enqueue and decrements the tracker
  • If different, proceeds with embedding and writes the new hash
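
The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual patch: plain filesystem I/O stands in for AGFS, and the helper name `should_skip_embedding` is hypothetical; only the `{file_path}.__embed_hash` sidecar naming and the SHA-256 comparison come from the PR description.

```python
import hashlib
from pathlib import Path

def should_skip_embedding(file_path: str, text: str) -> bool:
    """Return True when the text's SHA-256 matches the stored sidecar hash.

    Sketch of the skip_unchanged check; local files stand in for AGFS.
    """
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    sidecar = Path(f"{file_path}.__embed_hash")
    if sidecar.exists() and sidecar.read_text().strip() == new_hash:
        # Content unchanged: skip enqueuing the embedding message.
        return True
    # Content is new or changed: record the hash and proceed with embedding.
    sidecar.write_text(new_hash)
    return False
```

On a re-index, unchanged files hit the first branch and never reach the embedding API; only new or modified files fall through to the enqueue path.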

Why this matters

When re-indexing resource directories, every file gets re-embedded even if nothing changed. For users with large collections, this wastes embedding API credits on duplicate work. #350 describes batch ingestion pain where @sponge225 reported 4,435 sections triggering rate limits. Skipping unchanged embeddings reduces the number of API calls during re-indexing.

The existing _check_file_content_changed() in semantic_processor.py already skips re-summarization for unchanged files. This extends the same principle to the embedding stage.

The skip_unchanged parameter defaults to True and can be set to False to force re-embedding when needed. All existing callers work without changes.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Post-build dogfooding skipped (score 6/10). Tested via code review and ruff linting only.

This contribution was developed with AI assistance (Claude Code).

Add SHA-256 content hash check to vectorize_file() before enqueuing
embedding messages. If the text to embed matches a previously stored
hash, the embedding API call is skipped entirely. Hashes are stored
as sidecar files in AGFS.

This reduces API costs and ingestion time when re-indexing directories
where most files haven't changed.

Relates to volcengine#350

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@qin-ctx
Collaborator

qin-ctx commented Mar 23, 2026

Hi @mvanhorn, thanks for the contribution! The intent to reduce redundant embedding API calls during re-indexing makes total sense, and we appreciate you looking into #350.

After reviewing, we noticed that the current incremental update mechanism (SemanticDagExecutor) already handles this at a higher level — it compares file content before deciding whether to enqueue vectorization tasks. So in the incremental update path, vectorize_file() only receives files that have actually changed, which means the hash check here would rarely trigger.

For reference, here are the PRs related to the existing incremental update mechanism:

Additionally, introducing per-file .__embed_hash sidecar files via AGFS adds storage overhead and system complexity (orphan cleanup, namespace collision, etc.) that we'd prefer to avoid.

If there are specific scenarios where the existing incremental update doesn't work as expected, we'd love to hear more — the better path forward is probably iterating on the incremental update feature itself rather than adding a second caching layer downstream.

If there's nothing else to discuss, we may close this PR for now. Thanks again for the effort!

@qin-ctx qin-ctx self-assigned this Mar 23, 2026
@qin-ctx qin-ctx closed this Mar 23, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 23, 2026