
scx_userland: use a custom memory allocator to prevent page faults #87

Merged

3 commits merged into main on Jan 15, 2024

Conversation

arighi
Collaborator

@arighi arighi commented Jan 13, 2024

To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults.

To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer.

This, coupled with memory locking (via mlockall()), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition.
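For reference, the memory-locking step boils down to a single mlockall() call early at startup. A minimal sketch (not the actual scx_rustland code; raw libc binding with the Linux MCL_* constants from <sys/mman.h>):

```rust
// Illustrative sketch: lock all current and future pages so they can
// never be paged out, eliminating major page faults for locked memory.
const MCL_CURRENT: i32 = 1; // lock pages currently mapped
const MCL_FUTURE: i32 = 2;  // lock pages mapped in the future

extern "C" {
    fn mlockall(flags: i32) -> i32;
}

fn lock_all_memory() -> i32 {
    // Returns 0 on success, -1 on failure (e.g. RLIMIT_MEMLOCK too low).
    unsafe { mlockall(MCL_CURRENT | MCL_FUTURE) }
}

fn main() {
    let ret = lock_all_memory();
    println!("mlockall returned {}", ret);
}
```

Note that mlockall() alone is not enough: Rust's default allocator can still grow the heap at runtime, and touching freshly mapped pages faults, which is why the pre-allocated arena is needed on top of it.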

This memory allocator is completely transparent to the user-space scheduler code and is applied automatically when the bpf module is imported.

In the future we may decide to move this allocator to a more generic place (the scx_utils crate), so that other user-space Rust schedulers can also use it.

This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free).

This allocator can be improved in the future, but even in its current simple form it shows reasonable speed and efficiency in serving memory requests from the user-space scheduler, which are mostly small and uniformly sized.
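The design described above can be sketched as follows (purely illustrative; the block size, arena size, and names are made up and do not reflect the actual RustLandAllocator implementation):

```rust
const BLOCK_SIZE: usize = 64;   // hypothetical allocation granularity
const NUM_BLOCKS: usize = 1024; // hypothetical arena = 64 KiB

struct BlockAllocator {
    arena: Vec<u8>,           // stands in for the pre-allocated, mlocked buffer
    used: [bool; NUM_BLOCKS], // per-block allocated/free flags
}

impl BlockAllocator {
    fn new() -> Self {
        Self {
            arena: vec![0; BLOCK_SIZE * NUM_BLOCKS],
            used: [false; NUM_BLOCKS],
        }
    }

    // First-fit scan for enough contiguous free blocks to hold `size`
    // bytes; returns the byte offset into the arena, or None when full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let nblocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        let mut contiguous = 0;
        for i in 0..NUM_BLOCKS {
            if self.used[i] {
                contiguous = 0; // reset on an allocated block
            } else {
                contiguous += 1;
                if contiguous == nblocks {
                    let first = i + 1 - nblocks;
                    for b in first..=i {
                        self.used[b] = true;
                    }
                    return Some(first * BLOCK_SIZE);
                }
            }
        }
        None
    }

    fn free(&mut self, offset: usize, size: usize) {
        let first = offset / BLOCK_SIZE;
        let nblocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for b in first..first + nblocks {
            self.used[b] = false;
        }
    }
}

fn main() {
    let mut a = BlockAllocator::new();
    let o1 = a.alloc(100).unwrap(); // needs 2 blocks -> offset 0
    let o2 = a.alloc(10).unwrap();  // 1 block, right after -> offset 128
    a.free(o1, 100);
    let o3 = a.alloc(64).unwrap();  // first-fit reuses the freed space
    println!("{} {} {}", o1, o2, o3);
}
```

Since the arena is pre-allocated and mlocked, alloc() never touches a new page, which is exactly the property the scheduler needs.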

With this change in place, scx_rustland survived more than 10 hours on a heavily stressed system (with stress-ng and kernel builds running in a loop):

 $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
     PID   RSS     ELAPSED CMD
   34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland

Without this change, it is possible to trigger the sched-ext watchdog timeout in less than 5 minutes under the same system load conditions.

Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).

Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.

Moreover, group nr_sched_congested and nr_failed_dispatches together with nr_page_faults, and use the sum of all these counters to determine the health status of the user-space scheduler (reporting it to stdout as well).
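On Linux, one way to obtain such a per-process page fault counter (illustrative only; the scheduler may source it differently) is to parse the minflt/majflt fields from /proc/self/stat:

```rust
use std::fs;

// Hypothetical helper (not the actual scx_rustland code): read the
// minor/major page fault counters of the current process from
// /proc/self/stat (fields 10 and 12, 1-based).
fn page_faults() -> (u64, u64) {
    let stat = fs::read_to_string("/proc/self/stat").expect("no procfs?");
    // The command name (field 2) may contain spaces and parentheses, so
    // skip past the last ')' before splitting the remaining fields.
    let rest = &stat[stat.rfind(')').unwrap() + 2..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // After the ')' the first field is state (field 3), so minflt
    // (field 10) is at index 7 and majflt (field 12) is at index 9.
    (fields[7].parse().unwrap(), fields[9].parse().unwrap())
}

fn main() {
    let (minflt, majflt) = page_faults();
    println!("page faults: min={} maj={}", minflt, majflt);
}
```

Sampling this periodically and printing the delta alongside the other counters is enough to spot an allocator that has started faulting.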

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Contributor

@htejun htejun left a comment


Looks good to me. It may not be relevant to scx_rustland right now but in a future where this allocator is used more widely, it may be useful to provide a per-thread override to use the default allocator which can expand the heap.

Having a fixed arena size may become a scalability pain point for schedulers which have per-task memory allocations. An arena size big enough to accommodate tail-end use cases, which can have a very high number of tasks, can be too large for other cases. This can be worked around in a fairly straightforward way by escaping to the heap-expanding allocator when allocating from .init_task(). .init_task() isn't in the memory reclaim path, can block, and can be made to operate synchronously w.r.t. userspace. So, if we provide a separate channel for .init_task() and do all per-task allocations from there with the "don't allocate from the fixed arena" flag set, this problem can be resolved.
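The per-thread override idea could take roughly this shape (a hypothetical sketch, not code from the PR; the arena path here is a trivial bump allocator standing in for the real block allocator, and a production GlobalAlloc would need care around thread-local initialization reentrancy):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 64 * 1024;
static mut ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];
static NEXT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // When set, allocations on this thread bypass the fixed arena and go
    // to the system allocator, which is allowed to expand the heap.
    static USE_SYSTEM_ALLOC: Cell<bool> = Cell::new(false);
}

struct HybridAllocator;

unsafe impl GlobalAlloc for HybridAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if USE_SYSTEM_ALLOC.with(|f| f.get()) {
            return System.alloc(layout); // may page-fault, but only here
        }
        // Bump allocation from the fixed arena (never reclaimed in this
        // sketch; a real implementation would track blocks instead).
        let align = layout.align();
        loop {
            let cur = NEXT.load(Ordering::Relaxed);
            let offset = (cur + align - 1) & !(align - 1);
            if offset + layout.size() > ARENA_SIZE {
                return std::ptr::null_mut();
            }
            if NEXT
                .compare_exchange(cur, offset + layout.size(),
                                  Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return (std::ptr::addr_of_mut!(ARENA) as *mut u8).add(offset);
            }
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let start = std::ptr::addr_of!(ARENA) as usize;
        if (ptr as usize) < start || (ptr as usize) >= start + ARENA_SIZE {
            System.dealloc(ptr, layout); // only system allocations are freed
        }
    }
}

fn main() {
    unsafe {
        let l = Layout::from_size_align(64, 8).unwrap();
        let p = HybridAllocator.alloc(l); // served from the fixed arena
        USE_SYSTEM_ALLOC.with(|f| f.set(true));
        let q = HybridAllocator.alloc(l); // served by the system allocator
        println!("arena={:p} system={:p}", p, q);
        HybridAllocator.dealloc(q, l);
    }
}
```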

unsafe impl GlobalAlloc for RustLandAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let align = layout.align();
        if align > BLOCK_SIZE {
Contributor


Alignment >64 should be really rare, so this should be fine, but larger alignments can be supported relatively easily by updating the scan that finds the first block to skip ahead to the next aligned position.

Collaborator Author


True. I completely ignored align because right now the scheduler doesn't request any specific alignment, but I can add that, considering it's a fairly easy change.

if is_allocated {
    // Reset consecutive blocks count if an allocated block is encountered.
    contiguous_blocks = 0;
} else {
Contributor


I.e., here the code can just consider the block to be busy if contiguous_blocks is zero and the offset isn't aligned.
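The suggested tweak can be sketched like this (illustrative, not the PR's actual code; BLOCK_SIZE and the function name are made up):

```rust
const BLOCK_SIZE: usize = 64; // hypothetical block size

// First-fit scan that only starts a run of free blocks at an offset
// satisfying the requested alignment: an otherwise free block is treated
// as busy when it would begin a new run at an unaligned offset.
fn find_first_fit(used: &[bool], nblocks: usize, align: usize) -> Option<usize> {
    let mut contiguous = 0;
    for (i, &busy) in used.iter().enumerate() {
        if busy || (contiguous == 0 && (i * BLOCK_SIZE) % align != 0) {
            contiguous = 0;
        } else {
            contiguous += 1;
            if contiguous == nblocks {
                return Some(i + 1 - nblocks); // index of the first block
            }
        }
    }
    None
}

fn main() {
    // Block 1 (offset 64) is free but unaligned for align=128, so the
    // scan skips it and starts the run at block 2 (offset 128).
    let used = [true, false, false, false];
    println!("{:?}", find_first_fit(&used, 1, 128)); // Some(2)
}
```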

@arighi
Collaborator Author

arighi commented Jan 15, 2024

> Looks good to me. It may not be relevant to scx_rustland right now but in a future where this allocator is used more widely, it may be useful to provide a per-thread override to use the default allocator which can expand the heap.
>
> Having a fixed arena size may become a scalability pain point for schedulers which have per-task memory allocations. An arena size big enough to accommodate tail-end use cases, which can have a very high number of tasks, can be too large for other cases. This can be worked around in a fairly straightforward way by escaping to the heap-expanding allocator when allocating from .init_task(). .init_task() isn't in the memory reclaim path, can block, and can be made to operate synchronously w.r.t. userspace. So, if we provide a separate channel for .init_task() and do all per-task allocations from there with the "don't allocate from the fixed arena" flag set, this problem can be resolved.

Hm... but the per-task allocation will be performed by the BPF part. Are you suggesting to allocate some memory from .init_task() and share that memory in the address space of the user-space scheduler?

Even though the current implementation of the user-space scheduler doesn't require aligned memory allocations, add simple support for aligned allocations to RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi arighi merged commit 09e7905 into main Jan 15, 2024
2 checks passed
@htejun
Contributor

htejun commented Jan 16, 2024

> Hm... but the per-task allocation will be performed by the BPF part. Are you suggesting to allocate some memory from .init_task() and share that memory in the address space of the user-space scheduler?

No, I was just thinking about the more general case of a userspace component needing per-task memory allocation. For rustland, the per-task allocation happens in drain_queued_tasks() with the .or_insert_with_key() call on task_map. This can alternatively be structured so that BPF ops.init_task() calls into userspace to trigger the insertion, so that we don't have dynamic allocation in the scheduling path. This isn't an immediate problem for rustland, but if you imagine a scheduler implementation that wants to service setups where there may be hundreds of thousands of tasks (which does happen in some fringes), setting up the fixed pool upfront can be a bit challenging. Being able to split out the per-task context allocation from the mlockall-backed fixed heap would allow sidestepping the issue in most cases.

It isn't anything we need to worry about now. Just something to keep in mind for the future.

@Byte-Lab Byte-Lab deleted the scx-rustland-allocator branch March 14, 2024 18:21