Prefetch mmap'd weight blobs to eliminate page fault bottleneck#18236
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18236
Note: Links to docs will display an error until the docs builds have been completed. ❌ 18 New Failures, 13 Pending, 3 Unrelated FailuresAs of commit dd9de4e with merge base 1e17e28 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
246e20d to
0cb2141
Compare
There was a problem hiding this comment.
Pull request overview
This PR targets a Metal backend initialization bottleneck by prefetching an mmap’d weights blob early (via madvise) so later weight copies avoid synchronous page faults and better utilize disk bandwidth.
Changes:
- Add
madvise(MADV_SEQUENTIAL)+madvise(MADV_WILLNEED)prefetch for the weights blob duringMetalBackend::init. - Compute
weights_blob_keyonce early and reuse it later when loading constants.
Comments suppressed due to low confidence (1)
backends/apple/metal/runtime/metal_backend.cpp:271
- The prefetch
get_data()result is scoped to this block, so theFreeableBufferwill be freed/unmapped at scope exit (its destructor callsFree()). That likely defeats the intended overlap with the subsequent write/dlopen work and also forces a secondget_data()later for the same key. Consider keeping theFreeableBufferalive untilupdate_constants_from_blobruns and reusing it (prefetch + consume) instead of fetching twice.
// This overlaps disk I/O with the .so write + dlopen (~200ms).
std::string weights_blob_key =
method_name.empty() ? "weights_blob" : method_name + "_weights_blob";
{
auto prefetch_buf = named_data_map->get_data(weights_blob_key.c_str());
if (prefetch_buf.ok() && prefetch_buf->data() != nullptr) {
madvise(
const_cast<void*>(prefetch_buf->data()),
prefetch_buf->size(),
MADV_WILLNEED);
}
}
ET_LOG(
Info,
"MetalBackend::init - Looking for blob key: %s",
so_blob_key.c_str());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
This PR needs a
|
Weight loading via update_constants_from_blob was achieving only 0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from mmap'd pages triggers synchronous page faults — each 16K page traps into the kernel for NVMe I/O. Call madvise(MADV_WILLNEED) on the weights blob early in Metal backend init, before writing/dlopen'ing the .so file. The kernel prefaults pages asynchronously during the ~200ms of other init work. By the time memcpy runs, pages are already resident and throughput reaches 5-8 GB/s. Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model.
0cb2141 to
dd9de4e
Compare
|
@Gasoonjia try this on CUDA too |
| size_t page_size = getpagesize(); | ||
| uintptr_t aligned_addr = addr & ~(page_size - 1); | ||
| size_t aligned_size = prefetch_buf->size() + (addr - aligned_addr); | ||
| int ret = madvise( |
There was a problem hiding this comment.
Also add MADV_SEQUENTIAL?
I guess a stronger version would be MAP_POPULATE right after mapping, since we know we will be loading these in. But this hint would be better.
There was a problem hiding this comment.
I did both and it didn't have any improvement. For now, I'll keep it simple, unless it is required
|
@JacobSzwejbka Eventually should we have this logic directly in extension/data_loader/mmap_data_loader.cpp? |
|
@pytorchbot cherry-pick --onto release/1.2 -c release |
Weight loading via update_constants_from_blob was achieving only 0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from mmap'd pages triggers synchronous page faults — each 16K page traps into the kernel for NVMe I/O. Call madvise(MADV_WILLNEED) on the weights blob early in Metal backend init, before writing/dlopen'ing the .so file. The kernel prefaults pages asynchronously during the ~200ms of other init work. By the time memcpy runs, pages are already resident and throughput reaches 5-8 GB/s. Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model. (cherry picked from commit b7ca1a4)
Cherry picking #18236The cherry pick PR is at #18356 The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
Weight loading via update_constants_from_blob was achieving only
0.3-0.4 GB/s (vs 8 GB/s hardware capability) because memcpy from
mmap'd pages triggers synchronous page faults — each 16K page traps
into the kernel for NVMe I/O.
Call madvise(MADV_WILLNEED) on the weights blob
early in Metal backend init, before writing/dlopen'ing the .so file.
The kernel prefaults pages asynchronously during the ~200ms of other
init work. By the time memcpy runs, pages are already resident and
throughput reaches 5-8 GB/s.
Metal init time: ~25s -> ~9s (2.7x faster) on int4 Voxtral model.