Prefetch mmap'd weight blobs to eliminate page fault bottleneck #18236

Merged
mergennachin merged 1 commit into main from metal-prefetch-weights
Mar 17, 2026
@mergennachin
Contributor

@mergennachin mergennachin commented Mar 17, 2026

Weight loading via update_constants_from_blob was achieving only
0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from
mmap'd pages triggers synchronous page faults — each 16K page traps
into the kernel for NVMe I/O.

Call madvise(MADV_WILLNEED) on the weights blob
early in Metal backend init, before writing/dlopen'ing the .so file.
The kernel prefaults pages asynchronously during the ~200ms of other
init work. By the time memcpy runs, pages are already resident and
throughput reaches 5-8 GB/s.

Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model.

Copilot AI review requested due to automatic review settings March 17, 2026 16:12
@pytorch-bot

pytorch-bot Bot commented Mar 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18236

Note: Links to docs will display an error until the docs builds have been completed.

❌ 18 New Failures, 13 Pending, 3 Unrelated Failures

As of commit dd9de4e with merge base 1e17e28 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Mar 17, 2026
@mergennachin mergennachin force-pushed the metal-prefetch-weights branch 2 times, most recently from 246e20d to 0cb2141 Compare March 17, 2026 16:16
Contributor

Copilot AI left a comment

Pull request overview

This PR targets a Metal backend initialization bottleneck by prefetching an mmap’d weights blob early (via madvise) so later weight copies avoid synchronous page faults and better utilize disk bandwidth.

Changes:

  • Add madvise(MADV_SEQUENTIAL) + madvise(MADV_WILLNEED) prefetch for the weights blob during MetalBackend::init.
  • Compute weights_blob_key once early and reuse it later when loading constants.
Comments suppressed due to low confidence (1)

backends/apple/metal/runtime/metal_backend.cpp:271

  • The prefetch get_data() result is scoped to this block, so the FreeableBuffer will be freed/unmapped at scope exit (its destructor calls Free()). That likely defeats the intended overlap with the subsequent write/dlopen work and also forces a second get_data() later for the same key. Consider keeping the FreeableBuffer alive until update_constants_from_blob runs and reusing it (prefetch + consume) instead of fetching twice.
    // This overlaps disk I/O with the .so write + dlopen (~200ms).
    std::string weights_blob_key =
        method_name.empty() ? "weights_blob" : method_name + "_weights_blob";
    {
      auto prefetch_buf = named_data_map->get_data(weights_blob_key.c_str());
      if (prefetch_buf.ok() && prefetch_buf->data() != nullptr) {
        madvise(
            const_cast<void*>(prefetch_buf->data()),
            prefetch_buf->size(),
            MADV_WILLNEED);
      }
    }

    ET_LOG(
        Info,
        "MetalBackend::init - Looking for blob key: %s",
        so_blob_key.c_str());
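
The "prefetch + consume" suggestion in the comment above can be sketched as follows. This is a self-contained illustration under stated assumptions, not ExecuTorch code: `FakeDataMap`, `OwnedBuffer`, `prefetch_hint`, `consume`, and `init_with_single_fetch` are all hypothetical stand-ins, and the only behavior borrowed from the review is that the handle releases its data when it is destroyed.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a named data map; it counts fetches and live
// handles so the lifetime behavior is observable.
struct FakeDataMap {
  std::vector<unsigned char> bytes;
  int fetch_count = 0;
  int live_handles = 0;
};

// Hypothetical owning handle: it gives up its claim on the data when it is
// destroyed, which is how the review describes FreeableBuffer.
class OwnedBuffer {
 public:
  explicit OwnedBuffer(FakeDataMap& map) : map_(&map) {
    ++map_->fetch_count;
    ++map_->live_handles;
  }
  OwnedBuffer(const OwnedBuffer&) = delete;
  OwnedBuffer& operator=(const OwnedBuffer&) = delete;
  ~OwnedBuffer() { --map_->live_handles; }

  const unsigned char* data() const { return map_->bytes.data(); }
  std::size_t size() const { return map_->bytes.size(); }

 private:
  FakeDataMap* map_;
};

// Placeholder for the madvise-based hint; it does nothing in this sketch.
inline void prefetch_hint(const OwnedBuffer&) {}

// Placeholder for the copy into backend memory; here it only sums the bytes.
inline unsigned long consume(const OwnedBuffer& buf) {
  return std::accumulate(buf.data(), buf.data() + buf.size(), 0UL);
}

// Fetch once, hint early, keep the handle alive across the other init work,
// then consume the same handle instead of fetching a second time.
inline unsigned long init_with_single_fetch(FakeDataMap& map) {
  OwnedBuffer weights(map);  // stays alive until the function returns
  prefetch_hint(weights);
  // ... the .so write and dlopen would happen here ...
  return consume(weights);
}
```

Whether holding the handle changes prefetch behavior in practice depends on how the real loader frees the mapping; the sketch only shows the single-fetch structure.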


@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mergennachin
Contributor Author

@Gasoonjia try this on CUDA too

    size_t page_size = getpagesize();
    uintptr_t aligned_addr = addr & ~(page_size - 1);
    size_t aligned_size = prefetch_buf->size() + (addr - aligned_addr);
    int ret = madvise(
Contributor

Also add MADV_SEQUENTIAL?

I guess a stronger version would be MAP_POPULATE right after mapping, since we know we will be loading these in. But this hint would be better.

Contributor Author

I did both and it didn't have any improvement. For now, I'll keep it simple, unless it is required

@mergennachin
Contributor Author

@JacobSzwejbka Eventually should we have this logic directly in extension/data_loader/mmap_data_loader.cpp?

@mergennachin mergennachin merged commit b7ca1a4 into main Mar 17, 2026
204 of 225 checks passed
@mergennachin mergennachin deleted the metal-prefetch-weights branch March 17, 2026 18:06
@manuelcandales
Contributor

@pytorchbot cherry-pick --onto release/1.2 -c release

pytorchbot pushed a commit that referenced this pull request Mar 20, 2026
(cherry picked from commit b7ca1a4)
@pytorchbot
Collaborator

Cherry picking #18236

The cherry pick PR is at #18356. The following tracker issues are updated:



Labels

CLA Signed


5 participants