Prefetch mmap'd weight blobs to eliminate page fault bottleneck by mergennachin · Pull Request #18236 · pytorch/executorch

mergennachin · 2026-03-17T16:12:43Z

Weight loading via update_constants_from_blob was achieving only
0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from
mmap'd pages triggers synchronous page faults — each 16K page traps
into the kernel for NVMe I/O.

Call madvise(MADV_WILLNEED) on the weights blob
early in Metal backend init, before writing/dlopen'ing the .so file.
The kernel prefaults pages asynchronously during the ~200ms of other
init work. By the time memcpy runs, pages are already resident and
throughput reaches 5-8 GB/s.

Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model.
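
To make the mechanism concrete, here is a minimal, self-contained sketch of the same idea outside ExecuTorch: map a file, issue `madvise(MADV_WILLNEED)` right away, and copy out of the mapping later. `MappedFile`, `map_and_prefetch`, `copy_out`, and `unmap` are hypothetical names invented for this illustration; they are not the PR's code, and the real change works on the buffer returned by the backend's named data map rather than calling `mmap` itself.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical holder for a read-only file mapping (illustration only).
struct MappedFile {
  void* data = nullptr;
  std::size_t size = 0;
};

// Map `path` read-only and ask the kernel to start reading it in.
// Returns an empty MappedFile (data == nullptr) on failure.
inline MappedFile map_and_prefetch(const char* path) {
  MappedFile out;
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return out;
  }
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size <= 0) {
    close(fd);
    return out;
  }
  const std::size_t size = static_cast<std::size_t>(st.st_size);
  void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // The mapping remains valid after the descriptor is closed.
  if (p == MAP_FAILED) {
    return out;
  }
  // Advisory hint only: if it fails, later reads still work, they just
  // fault pages in on demand.
  (void)madvise(p, size, MADV_WILLNEED);
  out.data = p;
  out.size = size;
  return out;
}

// Later consumer: copy the mapped bytes into ordinary memory.
inline std::vector<unsigned char> copy_out(const MappedFile& m) {
  std::vector<unsigned char> dst(m.size);
  if (m.size != 0) {
    std::memcpy(dst.data(), m.data, m.size);
  }
  return dst;
}

// Release the mapping and reset the holder.
inline void unmap(MappedFile& m) {
  if (m.data != nullptr) {
    munmap(m.data, m.size);
  }
  m = MappedFile{};
}
```

The benefit depends on doing other work between `map_and_prefetch` and `copy_out`; if the copy follows immediately, the hint has little time to take effect.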

pytorch-bot · 2026-03-17T16:12:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18236

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 18 New Failures, 13 Pending, 3 Unrelated Failures

As of commit dd9de4e with merge base 1e17e28:

NEW FAILURES - The following jobs have failed:

pull / test-coreml-bc-macos (macos-m1-stable) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
pull / test-coreml-bc-macos (macos-m2-stable) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
pull / unittest / macos / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
pull / unittest-arm-backend-with-no-deps (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 38c34143d3d3e9f693aa4bf700bacceb8602726b982a99bb887aee60f9a74216 /exec failed with exit code 1
pull / unittest-buck / macos / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
pull / unittest-editable / macos / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t 054b7ce9d3ab14901ba12a425f244208a502decda1f160f555e26d1cb0bfdc1e /exec failed with exit code 1
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-3B-2507, non-quantized) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-metal) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-4B-Realtime-2602, quantized-int4-metal) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (nvidia, parakeet-tdt, non-quantized) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (nvidia, parakeet-tdt, quantized-int4-metal) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (openai, whisper-large-v3-turbo, non-quantized) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (openai, whisper-large-v3-turbo, quantized-int4-metal) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (openai, whisper-small, non-quantized) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / export-model-metal-artifact (openai, whisper-small, quantized-int4-metal) / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / test-executorch-metal-build / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?
Test Metal Backend / test-metal-backend-modules / macos-job (gh)
Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/Users/ec2-user/runner/_work/executorch/executorch/test-infra/.github/actions/check-disk-space'. Did you forget to run actions/checkout before running your local action?

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

Test CUDA Builds / test-model-cuda-e2e (nvidia, parakeet-tdt, quantized-int4-tile-packed) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, quantized-int4-tile-packed) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
Test CUDA Builds / test-models-cuda (sdpa) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot

Pull request overview

This PR targets a Metal backend initialization bottleneck by prefetching an mmap’d weights blob early (via madvise) so later weight copies avoid synchronous page faults and better utilize disk bandwidth.

Changes:

Add madvise(MADV_SEQUENTIAL) + madvise(MADV_WILLNEED) prefetch for the weights blob during MetalBackend::init.
Compute weights_blob_key once early and reuse it later when loading constants.

Comments suppressed due to low confidence (1)

backends/apple/metal/runtime/metal_backend.cpp:271

The prefetch get_data() result is scoped to this block, so the FreeableBuffer will be freed/unmapped at scope exit (its destructor calls Free()). That likely defeats the intended overlap with the subsequent write/dlopen work and also forces a second get_data() later for the same key. Consider keeping the FreeableBuffer alive until update_constants_from_blob runs and reusing it (prefetch + consume) instead of fetching twice.

    // This overlaps disk I/O with the .so write + dlopen (~200ms).
    std::string weights_blob_key =
        method_name.empty() ? "weights_blob" : method_name + "_weights_blob";
    {
      auto prefetch_buf = named_data_map->get_data(weights_blob_key.c_str());
      if (prefetch_buf.ok() && prefetch_buf->data() != nullptr) {
        madvise(
            const_cast<void*>(prefetch_buf->data()),
            prefetch_buf->size(),
            MADV_WILLNEED);
      }
    }

    ET_LOG(
        Info,
        "MetalBackend::init - Looking for blob key: %s",
        so_blob_key.c_str());
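
The lifetime concern raised in this comment can be illustrated with a small stand-in type. `ScopedBuffer` and `prefetch` below are hypothetical names written for this sketch; `ScopedBuffer` only mimics the behavior the comment describes (the destructor releases the buffer) and is not ExecuTorch's `FreeableBuffer`. The point is that returning the handle from the prefetch step lets the caller keep it alive across the intervening init work and reuse it when the constants are consumed.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical move-only buffer handle whose destructor runs a release
// callback, standing in for a buffer that is freed at scope exit.
class ScopedBuffer {
 public:
  ScopedBuffer(const void* data, std::size_t size, std::function<void()> on_free)
      : data_(data), size_(size), on_free_(std::move(on_free)) {}

  ScopedBuffer(ScopedBuffer&& other) noexcept
      : data_(other.data_),
        size_(other.size_),
        on_free_(std::move(other.on_free_)) {
    // Explicitly disarm the moved-from handle so it never releases.
    other.data_ = nullptr;
    other.size_ = 0;
    other.on_free_ = nullptr;
  }

  ScopedBuffer(const ScopedBuffer&) = delete;
  ScopedBuffer& operator=(const ScopedBuffer&) = delete;
  ScopedBuffer& operator=(ScopedBuffer&&) = delete;

  ~ScopedBuffer() {
    if (on_free_) {
      on_free_();
    }
  }

  const void* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const void* data_;
  std::size_t size_;
  std::function<void()> on_free_;
};

// Prefetch step that hands the live handle back to the caller instead of
// dropping it in an inner scope.
inline ScopedBuffer prefetch(
    const void* data, std::size_t size, std::function<void()> on_free) {
  ScopedBuffer buf(data, size, std::move(on_free));
  // A real implementation would issue madvise(..., MADV_WILLNEED) here.
  return buf;
}
```

Whether releasing the buffer early actually cancels the kernel's readahead depends on how the loader maps and frees the data, which this sketch does not model.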

github-actions · 2026-03-17T16:17:47Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mergennachin · 2026-03-17T16:35:50Z

@Gasoonjia try this on CUDA too

digantdesai · 2026-03-17T17:15:23Z

+        size_t page_size = getpagesize();
+        uintptr_t aligned_addr = addr & ~(page_size - 1);
+        size_t aligned_size = prefetch_buf->size() + (addr - aligned_addr);
+        int ret = madvise(


Also add MADV_SEQUENTIAL?

I guess a stronger version would be MAP_POPULATE right after mapping, since we know we will be loading these in. But this hint would be better.

I tried both and neither made any improvement. For now, I'll keep it simple unless it is required.
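
The page-alignment arithmetic in the hunk quoted at the top of this thread can be sketched on its own. `madvise` generally expects a page-aligned start address (Linux, for example, returns `EINVAL` otherwise), so the start is rounded down to a page boundary and the length grows by the same amount. `AlignedRange` and `align_for_madvise` are hypothetical names for this illustration, and the page size is passed in as a parameter rather than read from `getpagesize()` so the arithmetic can be checked in isolation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: a page-aligned start and the adjusted length.
struct AlignedRange {
  std::uintptr_t addr;
  std::size_t size;
};

// Round `addr` down to a multiple of `page_size` and extend `size` so the
// range still ends at the same byte. `page_size` must be a power of two.
inline AlignedRange align_for_madvise(
    std::uintptr_t addr, std::size_t size, std::size_t page_size) {
  const std::uintptr_t mask = static_cast<std::uintptr_t>(page_size) - 1;
  const std::uintptr_t aligned_addr = addr & ~mask;
  const std::size_t aligned_size =
      size + static_cast<std::size_t>(addr - aligned_addr);
  return AlignedRange{aligned_addr, aligned_size};
}
```

With a 16K page (0x4000), a buffer starting at 0x4010 with 100 bytes becomes a range starting at 0x4000 with 116 bytes; an already aligned buffer is unchanged.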

mergennachin · 2026-03-17T17:23:34Z

@JacobSzwejbka Eventually should we have this logic directly in extension/data_loader/mmap_data_loader.cpp?

manuelcandales · 2026-03-20T01:11:03Z

@pytorchbot cherry-pick --onto release/1.2 -c release

Weight loading via update_constants_from_blob was achieving only 0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from mmap'd pages triggers synchronous page faults — each 16K page traps into the kernel for NVMe I/O. Call madvise(MADV_WILLNEED) on the weights blob early in Metal backend init, before writing/dlopen'ing the .so file. The kernel prefaults pages asynchronously during the ~200ms of other init work. By the time memcpy runs, pages are already resident and throughput reaches 5-8 GB/s. Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model. (cherry picked from commit b7ca1a4)

pytorchbot · 2026-03-20T01:13:44Z

Cherry picking #18236

The cherry pick PR is at #18356 The following tracker issues are updated:

[v1.2.0] Release Schedule and Tracker #17016 (comment)

mergennachin requested review from cccclai and shoumikhin as code owners March 17, 2026 16:12

Copilot AI review requested due to automatic review settings March 17, 2026 16:12

meta-cla Bot added the CLA Signed label Mar 17, 2026

mergennachin requested review from digantdesai and manuelcandales March 17, 2026 16:12

Copilot started reviewing on behalf of mergennachin March 17, 2026 16:14

mergennachin force-pushed the metal-prefetch-weights branch 2 times, most recently from 246e20d to 0cb2141 Compare March 17, 2026 16:16

Copilot AI reviewed Mar 17, 2026

Comment thread backends/apple/metal/runtime/metal_backend.cpp

Comment thread backends/apple/metal/runtime/metal_backend.cpp Outdated

mergennachin force-pushed the metal-prefetch-weights branch from 0cb2141 to dd9de4e Compare March 17, 2026 16:26

mergennachin requested review from Gasoonjia and JacobSzwejbka March 17, 2026 16:35

manuelcandales approved these changes Mar 17, 2026

digantdesai reviewed Mar 17, 2026

mergennachin merged commit b7ca1a4 into main Mar 17, 2026
204 of 225 checks passed

mergennachin deleted the metal-prefetch-weights branch March 17, 2026 18:06

pytorchbot mentioned this pull request Mar 20, 2026

[v1.2.0] Release Schedule and Tracker #17016

Closed
