feat(embedding): skip re-embedding when content hash is unchanged #890
mvanhorn wants to merge 1 commit into volcengine:main
Add SHA-256 content hash check to vectorize_file() before enqueuing embedding messages. If the text to embed matches a previously stored hash, the embedding API call is skipped entirely. Hashes are stored as sidecar files in AGFS. This reduces API costs and ingestion time when re-indexing directories where most files haven't changed.
Relates to volcengine#350
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
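As a rough illustration of the check described in this commit message, the sketch below hashes the text to embed with SHA-256 and compares it against a previously stored value. The names content_hash and should_embed, and the plain dict standing in for the AGFS sidecar files, are assumptions made for illustration; they are not taken from the PR's code.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the text that would be embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_embed(path: str, text: str, stored_hashes: dict) -> bool:
    """Return True if the text is new or changed, and record its hash.

    stored_hashes is a hypothetical in-memory stand-in for the sidecar
    files that the PR stores in AGFS.
    """
    new_hash = content_hash(text)
    if stored_hashes.get(path) == new_hash:
        return False  # unchanged: the embedding API call can be skipped
    stored_hashes[path] = new_hash
    return True
```

One design point this sketch glosses over: it records the new hash before any embedding happens. A real implementation would probably want to write the hash only after the embedding message has been enqueued successfully, so that a failed enqueue is retried on the next run.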
Hi @mvanhorn, thanks for the contribution! The intent to reduce redundant embedding API calls during re-indexing makes total sense, and we appreciate you looking into #350.
After reviewing, we noticed that the current incremental update mechanism ( [...]
For reference, here are the PRs related to the existing incremental update mechanism: [...]
Additionally, introducing per-file [...]
If there are specific scenarios where the existing incremental update doesn't work as expected, we'd love to hear more; the better path forward is probably iterating on the incremental update feature itself rather than adding a second caching layer downstream.
If there's nothing else to discuss, we may close this PR for now. Thanks again for the effort!
Description
Add SHA-256 content hash check to vectorize_file() before enqueuing embedding messages. If the text to embed matches a previously stored hash, the embedding API call is skipped. Hashes are stored as sidecar files via AGFS.
Related Issue
Relates to #350
Type of Change
Changes Made
- skip_unchanged parameter to vectorize_file() (default: True)
- {file_path}.__embed_hash sidecar file
Why this matters
When re-indexing resource directories, every file gets re-embedded even if nothing changed. For users with large collections, this wastes embedding API credits on duplicate work. #350 describes batch ingestion pain where @sponge225 reported 4,435 sections triggering rate limits. Skipping unchanged embeddings reduces the number of API calls during re-indexing.
The existing _check_file_content_changed() in semantic_processor.py already skips re-summarization for unchanged files. This extends the same principle to the embedding stage.
The skip_unchanged parameter defaults to True and can be set to False to force re-embedding when needed. All existing callers work without changes.
Testing
Post-build dogfooding skipped (score 6/10). Tested via code review and ruff linting only.
This contribution was developed with AI assistance (Claude Code).
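To make the skip_unchanged behavior from the description concrete, here is a minimal sketch of how such a flag could gate the enqueue step. Only the skip_unchanged name, its default of True, and the {file_path}.__embed_hash sidecar naming come from the PR text. The function sketch_vectorize_file, its other parameters, the list used as a queue, and the FakeFS class standing in for AGFS are all hypothetical; the real vectorize_file() signature is not shown on this page.

```python
import hashlib
from typing import Optional


class FakeFS:
    """In-memory placeholder for AGFS (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.files = {}

    def read(self, path: str) -> Optional[str]:
        return self.files.get(path)

    def write(self, path: str, data: str) -> None:
        self.files[path] = data


def sketch_vectorize_file(fs, file_path, text, queue, skip_unchanged=True) -> bool:
    """Enqueue text for embedding unless its hash matches the sidecar file.

    Returns True if an embedding message was enqueued.
    """
    sidecar = f"{file_path}.__embed_hash"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if skip_unchanged and fs.read(sidecar) == digest:
        return False  # hash matches: skip the embedding call
    queue.append((file_path, text))  # stand-in for enqueuing an embedding message
    fs.write(sidecar, digest)  # remember what was embedded
    return True


fs, queue = FakeFS(), []
sketch_vectorize_file(fs, "doc.md", "same text", queue)  # first run: enqueued
sketch_vectorize_file(fs, "doc.md", "same text", queue)  # unchanged: skipped
sketch_vectorize_file(fs, "doc.md", "same text", queue, skip_unchanged=False)  # forced
print(len(queue))  # 2
```

Because the flag defaults to True, a caller that passes no extra argument gets the skipping behavior automatically, which is consistent with the claim that existing callers work without changes.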