Use two-level dedup in named data store #18351
Conversation
Replace single SHA-256 hash with a two-level approach:

1. Fast fingerprint (length + first 32 bytes) for cheap rejection
2. SHA-256 only when the fingerprint matches, to confirm without full byte comparison

For a 35B MoE model with ~29 GB of named data where most buffers are unique, the fingerprint rejects non-matches instantly. SHA-256 is only computed on the rare fingerprint match, avoiding the ~98s cost of hashing everything upfront.

Fingerprint collisions are handled by storing a list of candidate buffer indices per fingerprint, so no dedup opportunities are lost.

Test plan:
- All 12 tests pass in test_named_data_store.py
- Added test_fingerprint_collision: same fingerprint, different content produces separate buffers
- Added test_fingerprint_collision_with_dedup: after a collision, a true duplicate of an earlier blob still dedupes correctly
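The scheme described above can be sketched as a small standalone class (the class and attribute names here are illustrative, not the actual NamedDataStore API): a fingerprint bucket maps to a list of candidate indices, and SHA-256 is computed only when a bucket already has candidates.

```python
import hashlib


class TwoLevelDedupStore:
    """Minimal sketch of two-level dedup: fingerprint buckets + lazy SHA-256."""

    def __init__(self):
        self.buffers = []
        # Fingerprint -> list of candidate buffer indices (collisions allowed).
        self.fingerprint_to_indices = {}
        # Lazily computed SHA-256 hex digest per buffer index.
        self.sha_cache = {}

    def _get_buffer_sha256(self, idx: int) -> str:
        # Hash a stored buffer at most once, on first request.
        if idx not in self.sha_cache:
            self.sha_cache[idx] = hashlib.sha256(self.buffers[idx]).hexdigest()
        return self.sha_cache[idx]

    def add_buffer(self, data: bytes) -> int:
        # Level 1: cheap fingerprint rejects almost all non-duplicates.
        fingerprint = (len(data), data[:32])
        candidates = self.fingerprint_to_indices.setdefault(fingerprint, [])
        if candidates:
            # Level 2: SHA-256 confirms a true match without a full byte compare.
            data_sha = hashlib.sha256(data).hexdigest()
            for idx in candidates:
                if self._get_buffer_sha256(idx) == data_sha:
                    return idx  # true duplicate: reuse the existing buffer
        self.buffers.append(data)
        idx = len(self.buffers) - 1
        candidates.append(idx)
        return idx
```

Because a unique buffer almost always lands in an empty bucket, the common path touches only 32 bytes plus a length; the O(n) hash runs only on the rare fingerprint match.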
Pull request overview
This PR updates NamedDataStore deduplication to avoid hashing every buffer up front by introducing a two-level dedup strategy (cheap fingerprint first, SHA-256 only on fingerprint matches). This fits into the EXIR serialization path by reducing time spent preparing large named-data payloads while preserving correct dedup behavior.
Changes:
- Replace the single SHA-256-based dedup map with `(len, first 32 bytes)` fingerprint buckets and lazy per-buffer SHA-256 computation for confirmation.
- Add unit tests to validate correct behavior under fingerprint collisions and post-collision dedup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| exir/_serialize/_named_data_store.py | Implements fingerprint-based candidate lookup and lazy SHA-256 confirmation for dedup. |
| exir/_serialize/test/test_named_data_store.py | Adds regression tests covering fingerprint-collision scenarios and ensuring true duplicates still dedupe. |
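Two buffers collide at the fingerprint level whenever they share the same length and the same first 32 bytes, so a collision pair like the one the new tests need can be constructed directly (an illustrative sketch, not the actual test code):

```python
import hashlib

# Same length and identical first 32 bytes -> identical fingerprint,
# even though the overall contents differ.
prefix = b"\x00" * 32
blob_a = prefix + b"tail-a"
blob_b = prefix + b"tail-b"

fp_a = (len(blob_a), blob_a[:32])
fp_b = (len(blob_b), blob_b[:32])
assert fp_a == fp_b      # fingerprint collision
assert blob_a != blob_b  # but the buffers are genuinely different

# SHA-256 still distinguishes them, which is exactly what the
# second-level confirmation relies on.
assert hashlib.sha256(blob_a).digest() != hashlib.sha256(blob_b).digest()
```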
lucylq left a comment:
Nice, so now we only calculate the SHA-256 when the fingerprint matches, and we walk the candidates list checking _get_buffer_sha256.
I guess some byte-level checking at the start is unavoidable when we don't calculate the SHA up front.
What kind of speedup did you see from this?
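One way to see where the speedup comes from is to time the two levels on a single large buffer (a rough standalone measurement sketch, not the numbers reported in the PR):

```python
import hashlib
import time

data = bytes(16 * 1024 * 1024)  # one 16 MiB buffer of zeros

start = time.perf_counter()
fingerprint = (len(data), data[:32])  # O(1): length plus a 32-byte slice
fp_time = time.perf_counter() - start

start = time.perf_counter()
digest = hashlib.sha256(data).hexdigest()  # O(n): reads every byte
sha_time = time.perf_counter() - start

# The fingerprint is constant-time, so it is orders of magnitude cheaper
# than hashing the whole buffer.
print(f"fingerprint: {fp_time:.6f}s  sha256: {sha_time:.6f}s")
assert fp_time < sha_time
```

Multiplied across ~29 GB of mostly unique named data, skipping the O(n) hash on the common path is what removes the ~98s upfront cost.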
```diff
-buffer_idx = self.data_hash_to_buffer_idx.get(hashed, -1)
+# Two-level dedup: cheap fingerprint rejects non-matches fast,
+# SHA-256 confirms matches without full byte comparison.
+fingerprint = (len(data), data[:32])
```