Use two-level dedup in named data store #18351
Conversation
Replace single SHA-256 hash with a two-level approach:

1. Fast fingerprint (length + first 32 bytes) for cheap rejection
2. SHA-256 only when the fingerprint matches, to confirm without full byte comparison

For a 35B MoE model with ~29 GB of named data where most buffers are unique, the fingerprint rejects non-matches instantly. SHA-256 is only computed on the rare fingerprint match, avoiding the ~98s cost of hashing everything upfront.

Fingerprint collisions are handled by storing a list of candidate buffer indices per fingerprint, so no dedup opportunities are lost.

Test plan:
- All 12 tests pass in test_named_data_store.py
- Added test_fingerprint_collision: same fingerprint, different content produces separate buffers
- Added test_fingerprint_collision_with_dedup: after a collision, a true duplicate of an earlier blob still dedupes correctly
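The scheme described above can be sketched as a small standalone class (the class and attribute names here are illustrative, not the actual NamedDataStore API): a fingerprint bucket maps to a list of candidate indices, and SHA-256 is computed only when a bucket already has candidates.

```python
import hashlib


class TwoLevelDedupStore:
    """Minimal sketch of two-level dedup: fingerprint buckets + lazy SHA-256."""

    def __init__(self):
        self.buffers = []
        # Fingerprint -> list of candidate buffer indices (collisions allowed).
        self.fingerprint_to_indices = {}
        # Lazily computed SHA-256 hex digest per buffer index.
        self.sha_cache = {}

    def _get_buffer_sha256(self, idx: int) -> str:
        # Hash a stored buffer at most once, on first request.
        if idx not in self.sha_cache:
            self.sha_cache[idx] = hashlib.sha256(self.buffers[idx]).hexdigest()
        return self.sha_cache[idx]

    def add_buffer(self, data: bytes) -> int:
        # Level 1: cheap fingerprint rejects almost all non-duplicates.
        fingerprint = (len(data), data[:32])
        candidates = self.fingerprint_to_indices.setdefault(fingerprint, [])
        if candidates:
            # Level 2: SHA-256 confirms a true match without a full byte compare.
            data_sha = hashlib.sha256(data).hexdigest()
            for idx in candidates:
                if self._get_buffer_sha256(idx) == data_sha:
                    return idx  # true duplicate: reuse the existing buffer
        self.buffers.append(data)
        idx = len(self.buffers) - 1
        candidates.append(idx)
        return idx
```

Because a unique buffer almost always lands in an empty bucket, the common path touches only 32 bytes plus a length; the O(n) hash runs only on the rare fingerprint match.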
Pull request overview
This PR updates NamedDataStore deduplication to avoid hashing every buffer up front by introducing a two-level dedup strategy (cheap fingerprint first, SHA-256 only on fingerprint matches). This fits into the EXIR serialization path by reducing time spent preparing large named-data payloads while preserving correct dedup behavior.
Changes:
- Replace the single SHA-256-based dedup map with `(len, first 32 bytes)` fingerprint buckets and lazy per-buffer SHA-256 computation for confirmation.
- Add unit tests to validate correct behavior under fingerprint collisions and post-collision dedup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| exir/_serialize/_named_data_store.py | Implements fingerprint-based candidate lookup and lazy SHA-256 confirmation for dedup. |
| exir/_serialize/test/test_named_data_store.py | Adds regression tests covering fingerprint-collision scenarios and ensuring true duplicates still dedupe. |
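Two buffers collide at the fingerprint level whenever they share the same length and the same first 32 bytes, so a collision pair like the one the new tests need can be constructed directly (an illustrative sketch, not the actual test code):

```python
import hashlib

# Same length and identical first 32 bytes -> identical fingerprint,
# even though the overall contents differ.
prefix = b"\x00" * 32
blob_a = prefix + b"tail-a"
blob_b = prefix + b"tail-b"

fp_a = (len(blob_a), blob_a[:32])
fp_b = (len(blob_b), blob_b[:32])
assert fp_a == fp_b      # fingerprint collision
assert blob_a != blob_b  # but the buffers are genuinely different

# SHA-256 still distinguishes them, which is exactly what the
# second-level confirmation relies on.
assert hashlib.sha256(blob_a).digest() != hashlib.sha256(blob_b).digest()
```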
lucylq left a comment:
Nice, so now we only calculate the SHA-256 when the fingerprint matches, and we walk the candidates list checking _get_buffer_sha256.
I guess some byte-level checking at the start is unavoidable when we don't calculate the SHA up front.
What kind of speedup did you see from this?
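One way to see where the speedup comes from is to time the two levels on a single large buffer (a rough standalone measurement sketch, not the numbers reported in the PR):

```python
import hashlib
import time

data = bytes(16 * 1024 * 1024)  # one 16 MiB buffer of zeros

start = time.perf_counter()
fingerprint = (len(data), data[:32])  # O(1): length plus a 32-byte slice
fp_time = time.perf_counter() - start

start = time.perf_counter()
digest = hashlib.sha256(data).hexdigest()  # O(n): reads every byte
sha_time = time.perf_counter() - start

# The fingerprint is constant-time, so it is orders of magnitude cheaper
# than hashing the whole buffer.
print(f"fingerprint: {fp_time:.6f}s  sha256: {sha_time:.6f}s")
assert fp_time < sha_time
```

Multiplied across ~29 GB of mostly unique named data, skipping the O(n) hash on the common path is what removes the ~98s upfront cost.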
```diff
-buffer_idx = self.data_hash_to_buffer_idx.get(hashed, -1)
+# Two-level dedup: cheap fingerprint rejects non-matches fast,
+# SHA-256 confirms matches without full byte comparison.
+fingerprint = (len(data), data[:32])
```