
feat(embedding): skip re-embedding when content hash is unchanged#890

Closed
mvanhorn wants to merge 1 commit into volcengine:main from mvanhorn:osc/feat-embedding-content-hash-cache

Conversation

@mvanhorn
Contributor

Description

Add SHA-256 content hash check to vectorize_file() before enqueuing embedding messages. If the text to embed matches a previously stored hash, the embedding API call is skipped. Hashes are stored as sidecar files via AGFS.

Related Issue

Relates to #350

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added skip_unchanged parameter to vectorize_file() (default: True)
  • Before enqueuing an embedding message, computes SHA-256 of the text to embed
  • Reads existing hash from {file_path}.__embed_hash sidecar file
  • If hash matches, skips the enqueue and decrements the tracker
  • If different, proceeds with embedding and writes the new hash
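
The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual patch: plain filesystem I/O stands in for AGFS, and the helper name `should_skip_embedding` is hypothetical; only the `{file_path}.__embed_hash` sidecar naming and the SHA-256 comparison come from the PR description.

```python
import hashlib
from pathlib import Path

def should_skip_embedding(file_path: str, text: str) -> bool:
    """Return True when the text's SHA-256 matches the stored sidecar hash.

    Sketch of the skip_unchanged check; local files stand in for AGFS.
    """
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    sidecar = Path(f"{file_path}.__embed_hash")
    if sidecar.exists() and sidecar.read_text().strip() == new_hash:
        # Content unchanged: skip enqueuing the embedding message.
        return True
    # Content is new or changed: record the hash and proceed with embedding.
    sidecar.write_text(new_hash)
    return False
```

On a re-index, unchanged files hit the first branch and never reach the embedding API; only new or modified files fall through to the enqueue path.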

Why this matters

When re-indexing resource directories, every file gets re-embedded even if nothing changed. For users with large collections, this wastes embedding API credits on duplicate work. #350 describes batch ingestion pain where @sponge225 reported 4,435 sections triggering rate limits. Skipping unchanged embeddings reduces the number of API calls during re-indexing.

The existing _check_file_content_changed() in semantic_processor.py already skips re-summarization for unchanged files. This extends the same principle to the embedding stage.

The skip_unchanged parameter defaults to True and can be set to False to force re-embedding when needed. All existing callers work without changes.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Post-build dogfooding skipped (score 6/10). Tested via code review and ruff linting only.

This contribution was developed with AI assistance (Claude Code).

Add SHA-256 content hash check to vectorize_file() before enqueuing
embedding messages. If the text to embed matches a previously stored
hash, the embedding API call is skipped entirely. Hashes are stored
as sidecar files in AGFS.

This reduces API costs and ingestion time when re-indexing directories
where most files haven't changed.

Relates to volcengine#350

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@qin-ctx
Collaborator

qin-ctx commented Mar 23, 2026

Hi @mvanhorn, thanks for the contribution! The intent to reduce redundant embedding API calls during re-indexing makes total sense, and we appreciate you looking into #350.

After reviewing, we noticed that the current incremental update mechanism (SemanticDagExecutor) already handles this at a higher level — it compares file content before deciding whether to enqueue vectorization tasks. So in the incremental update path, vectorize_file() only receives files that have actually changed, which means the hash check here would rarely trigger.

For reference, here are the PRs related to the existing incremental update mechanism:

Additionally, introducing per-file .__embed_hash sidecar files via AGFS adds storage overhead and system complexity (orphan cleanup, namespace collision, etc.) that we'd prefer to avoid.

If there are specific scenarios where the existing incremental update doesn't work as expected, we'd love to hear more — the better path forward is probably iterating on the incremental update feature itself rather than adding a second caching layer downstream.

If there's nothing else to discuss, we may close this PR for now. Thanks again for the effort!

@qin-ctx qin-ctx self-assigned this Mar 23, 2026
@qin-ctx qin-ctx closed this Mar 23, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 23, 2026