Use two-level dedup in named data store #18351

Merged
mergennachin merged 1 commit into main from mnachin/fast-data-fingerprint on Mar 20, 2026

Conversation

@mergennachin
Contributor

Replace single SHA-256 hash with a two-level approach:

  1. Fast fingerprint (length + first 32 bytes) for cheap rejection
  2. SHA-256 only when the fingerprint matches, to confirm without
    full byte comparison

For a 35B MoE model with ~29 GB of named data where most buffers
are unique, the fingerprint rejects non-matches instantly. SHA-256
is only computed on the rare fingerprint match, avoiding the ~98s
cost of hashing everything upfront.

Fingerprint collisions are handled by storing a list of candidate
buffer indices per fingerprint, so no dedup opportunities are lost.

Test plan:

  • All 12 tests pass in test_named_data_store.py
  • Added test_fingerprint_collision: same fingerprint, different
    content produces separate buffers
  • Added test_fingerprint_collision_with_dedup: after a collision,
    a true duplicate of an earlier blob still dedupes correctly
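
The approach described above can be sketched roughly as follows. This is a minimal illustration; the class, attribute, and method names are hypothetical, not the actual `NamedDataStore` API:

```python
import hashlib


class TwoLevelDedupStore:
    """Illustrative sketch of two-level dedup: a cheap fingerprint
    rejects most non-matches, and SHA-256 is computed lazily only
    when a fingerprint matches. Names are hypothetical."""

    def __init__(self) -> None:
        self.buffers: list[bytes] = []
        # Level 1: fingerprint -> candidate buffer indices.
        self.fingerprint_to_indices: dict[tuple[int, bytes], list[int]] = {}
        # Level 2: lazily computed SHA-256 per stored buffer.
        self._sha_cache: dict[int, bytes] = {}

    @staticmethod
    def _fingerprint(data: bytes) -> tuple[int, bytes]:
        # O(1) key regardless of blob size: length plus first 32 bytes.
        return (len(data), data[:32])

    def _buffer_sha256(self, idx: int) -> bytes:
        if idx not in self._sha_cache:
            self._sha_cache[idx] = hashlib.sha256(self.buffers[idx]).digest()
        return self._sha_cache[idx]

    def add_buffer(self, data: bytes) -> int:
        candidates = self.fingerprint_to_indices.setdefault(
            self._fingerprint(data), []
        )
        if candidates:
            # Rare path: fingerprint matched, so confirm with SHA-256
            # instead of a full byte-by-byte comparison.
            digest = hashlib.sha256(data).digest()
            for idx in candidates:
                if self._buffer_sha256(idx) == digest:
                    return idx  # true duplicate: reuse existing buffer
        # Unique content (or a fingerprint collision with different
        # content): store it as a new candidate for this fingerprint.
        self.buffers.append(data)
        idx = len(self.buffers) - 1
        candidates.append(idx)
        return idx
```

Storing a list of candidates per fingerprint is what makes collisions safe: a colliding blob becomes a second candidate, and a later true duplicate of either blob still resolves to the right index via the SHA-256 confirmation.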

Copilot AI review requested due to automatic review settings March 19, 2026 22:31
@pytorch-bot

pytorch-bot Bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18351

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 5e0e1e6 with merge base 3f33e54:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mergennachin mergennachin requested a review from lucylq March 19, 2026 22:31
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates NamedDataStore deduplication to avoid hashing every buffer up front by introducing a two-level dedup strategy (cheap fingerprint first, SHA-256 only on fingerprint matches). This fits into the EXIR serialization path by reducing time spent preparing large named-data payloads while preserving correct dedup behavior.

Changes:

  • Replace single SHA-256-based dedup map with (len, first 32 bytes) fingerprint buckets and lazy per-buffer SHA-256 computation for confirmation.
  • Add unit tests to validate correct behavior under fingerprint collisions and post-collision dedup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files changed:

  • exir/_serialize/_named_data_store.py: Implements fingerprint-based candidate lookup and lazy SHA-256 confirmation for dedup.
  • exir/_serialize/test/test_named_data_store.py: Adds regression tests covering fingerprint-collision scenarios and ensuring true duplicates still dedupe.


Contributor

@lucylq lucylq left a comment


Nice, so now we only compute the SHA-256 when the fingerprint matches, and we go through the candidates list checking _get_buffer_sha256.

I guess it is unavoidable to have the byte-to-byte check at the beginning when we don't compute the SHA upfront.

What kind of speed up did you see from this?
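
For intuition on the cost gap the reviewer is asking about: SHA-256 must read every byte of a blob, while the fingerprint touches only the length and the first 32 bytes. A rough, hypothetical micro-benchmark (blob size arbitrary; not measurements from this PR):

```python
import hashlib
import os
import time

# One sample blob; the fingerprint cost is constant regardless of size,
# while SHA-256 scales linearly with it.
data = os.urandom(16 * 1024 * 1024)  # 16 MiB

t0 = time.perf_counter()
digest = hashlib.sha256(data).digest()
sha_time = time.perf_counter() - t0

t0 = time.perf_counter()
fingerprint = (len(data), data[:32])
fp_time = time.perf_counter() - t0

print(f"sha256: {sha_time:.4f}s, fingerprint: {fp_time:.6f}s")
```

Scaled up to ~29 GB of mostly unique buffers, skipping the per-blob SHA-256 on the common path is where the reported ~98 s saving comes from.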

@mergennachin mergennachin merged commit 7976dc3 into main Mar 20, 2026
146 of 150 checks passed
@mergennachin mergennachin deleted the mnachin/fast-data-fingerprint branch March 20, 2026 03:54
buffer_idx = self.data_hash_to_buffer_idx.get(hashed, -1)
# Two-level dedup: cheap fingerprint rejects non-matches fast,
# SHA-256 confirms matches without full byte comparison.
fingerprint = (len(data), data[:32])
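
Since the fingerprint in the diff above is just (length, first 32 bytes), two distinct blobs can share it; that is exactly the case the candidate list handles. A quick illustration:

```python
def fingerprint(data: bytes) -> tuple[int, bytes]:
    # Level-1 key from the diff above: length plus first 32 bytes.
    return (len(data), data[:32])

a = b"\x00" * 32 + b"tail-A"
b = b"\x00" * 32 + b"tail-B"

# Same length and same 32-byte prefix -> identical fingerprints...
assert fingerprint(a) == fingerprint(b)
# ...but different content, so the SHA-256 pass must disambiguate.
assert a != b
```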
Contributor


Nice!
