scx_userland: use a custom memory allocator to prevent page faults #87
To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults.
To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer.
This, coupled with memory locking (via mlockall), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition.
This memory allocator is completely transparent to the user-space scheduler code and it is applied automatically when the bpf module is imported.
In the future we may decide to move this allocator to a more generic place (scx_utils crate), so that other user-space Rust schedulers can use it too.
This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free).
This allocator can be improved in the future, but right now, despite its simplicity, it shows reasonable speed and efficiency in meeting memory requests from the user-space scheduler, which mostly issues small and uniformly sized allocations.
With this change in place scx_rustland survived more than 10 hours on a heavily stressed system (with stress-ng and kernel builds running in a loop):
    $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
      PID   RSS     ELAPSED CMD
    34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland
Without this change it is possible to trigger the sched-ext watchdog timeout in less than 5 minutes, under the same system load conditions.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
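The block-tracking scheme described in this commit message can be modeled roughly as follows. This is a minimal sketch, not the PR's code: `BlockArena`, `NUM_BLOCKS`, and the 64-byte value of `BLOCK_SIZE` are assumptions chosen for illustration, and the sketch hands out byte offsets into a notional pre-allocated buffer instead of implementing `GlobalAlloc` on real memory.

```rust
// Assumed sizes, for illustration only.
pub const BLOCK_SIZE: usize = 64;
pub const NUM_BLOCKS: usize = 1024;

/// Hypothetical model of a fixed arena: one flag per block,
/// `true` meaning the block is currently allocated.
pub struct BlockArena {
    used: [bool; NUM_BLOCKS],
}

impl BlockArena {
    pub const fn new() -> Self {
        Self { used: [false; NUM_BLOCKS] }
    }

    /// Number of blocks needed to hold `size` bytes (at least one).
    fn blocks_for(size: usize) -> usize {
        (size.max(1) + BLOCK_SIZE - 1) / BLOCK_SIZE
    }

    /// First-fit scan for a run of free blocks large enough for `size`.
    /// Returns the byte offset of the run inside the arena, or `None`
    /// when no such run exists (the arena never grows).
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let needed = Self::blocks_for(size);
        let mut contiguous = 0;
        for i in 0..NUM_BLOCKS {
            if self.used[i] {
                // An allocated block interrupts the current run.
                contiguous = 0;
                continue;
            }
            contiguous += 1;
            if contiguous == needed {
                let start = i + 1 - needed;
                self.used[start..=i].fill(true);
                return Some(start * BLOCK_SIZE);
            }
        }
        None
    }

    /// Mark the blocks backing an earlier allocation as free again.
    /// The caller passes the same `size` it used for `alloc`.
    pub fn dealloc(&mut self, offset: usize, size: usize) {
        let start = offset / BLOCK_SIZE;
        let needed = Self::blocks_for(size);
        self.used[start..start + needed].fill(false);
    }
}
```

In this model a 100-byte request on an empty arena takes blocks 0 and 1 and yields offset 0. Because no memory is ever requested from the kernel after start-up, an allocation cannot fault as long as the backing buffer has been touched and locked beforehand.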
Periodically report a page fault counter in the scheduler output.
The user-space scheduler should never trigger page faults, otherwise we may experience deadlocks (that would trigger the sched-ext watchdog, unloading the scheduler).
Reporting a page fault counter periodically to stdout can be really helpful to debug potential issues with the custom allocator.
Moreover, group nr_sched_congested and nr_failed_dispatches together with nr_page_faults and use the sum of all these counters to determine the health status of the user-space scheduler (reporting it to stdout as well).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
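As a rough illustration of the reporting idea, the sketch below sums the three counters named in this commit message and derives a status from them. It is not the PR's implementation: the `Counters` type, the "OK"/"WARNING" strings, and the rule that any non-zero sum means unhealthy are assumptions. It also shows one way a Linux process could read its own fault counts from `/proc/self/stat` (fields 10 and 12, `minflt` and `majflt`), which may differ from how the scheduler actually obtains `nr_page_faults`.

```rust
use std::fs;

/// Extract (minor, major) page fault counts from the text of a
/// /proc/<pid>/stat file. Parsing starts after the last ')' because
/// the command name (field 2) may itself contain spaces or parentheses.
pub fn parse_page_faults(stat: &str) -> Option<(u64, u64)> {
    let (_, rest) = stat.rsplit_once(')')?;
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so field 10 is index 7
    // and field 12 is index 9.
    let minflt = fields.get(7)?.parse().ok()?;
    let majflt = fields.get(9)?.parse().ok()?;
    Some((minflt, majflt))
}

/// Read the fault counters of the current process (Linux only).
pub fn read_own_page_faults() -> Option<(u64, u64)> {
    parse_page_faults(&fs::read_to_string("/proc/self/stat").ok()?)
}

/// Hypothetical snapshot of the counters grouped by this commit.
pub struct Counters {
    pub nr_page_faults: u64,
    pub nr_sched_congested: u64,
    pub nr_failed_dispatches: u64,
}

impl Counters {
    /// Assumption: the scheduler is healthy only if every counter is zero.
    pub fn is_healthy(&self) -> bool {
        self.nr_page_faults + self.nr_sched_congested + self.nr_failed_dispatches == 0
    }

    /// One status line suitable for periodic printing to stdout.
    pub fn report(&self) -> String {
        format!(
            "nr_page_faults={} nr_sched_congested={} nr_failed_dispatches={} status={}",
            self.nr_page_faults,
            self.nr_sched_congested,
            self.nr_failed_dispatches,
            if self.is_healthy() { "OK" } else { "WARNING" }
        )
    }
}
```

A periodic loop could build a `Counters` value on each tick and print `report()`; a status other than "OK" would then point at the allocator or the dispatch path as the place to investigate.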
Looks good to me. It may not be relevant to scx_rustland right now, but in a future where this allocator is used more widely, it may be useful to provide a per-thread override to use the default allocator, which can expand the heap.
Having a fixed arena size may become a scalability pain point for schedulers which have per-task memory allocations. An arena big enough to accommodate tail-end use cases, which can have a very high number of tasks, can be too large for other cases. This can be worked around in a fairly straightforward way by escaping to the heap-expanding allocator when allocating from .init_task(). .init_task() isn't in the memory reclaim path, can block, and can be made to operate synchronously w.r.t. userspace. So, if we provide a separate channel for .init_task() and do all per-task allocations from there with the "don't allocate from the fixed arena" flag set, this problem can be resolved.
    unsafe impl GlobalAlloc for RustLandAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            let align = layout.align();
            if align > BLOCK_SIZE {
64-byte alignment should be really rare, so this should be fine, but larger alignments can be supported relatively easily by updating the scan for the first block so that it skips ahead to the next aligned position.
True. I completely ignored align because right now the scheduler doesn't request any alignment at all, but I can add that considering that it's a fairly easy change.
    if is_allocated {
        // Reset consecutive blocks count if an allocated block is encountered.
        contiguous_blocks = 0;
    } else {
I.e., here the code can just consider the block to be busy if contiguous_blocks is zero and the offset isn't aligned.
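The suggestion above can be sketched as a standalone scan. This is an illustrative model under stated assumptions, not the code that was merged: `find_aligned_run` and its parameters are hypothetical names, offsets are measured from the start of the arena, and the arena base itself is assumed to be aligned at least as strictly as the requested alignment.

```rust
/// Find the first run of `needed` free blocks whose starting byte
/// offset is a multiple of `align`. `used[i]` is true when block `i`
/// is allocated. Returns the index of the first block of the run.
pub fn find_aligned_run(
    used: &[bool],
    block_size: usize,
    needed: usize,
    align: usize,
) -> Option<usize> {
    assert!(needed > 0 && block_size > 0 && align.is_power_of_two());
    let mut contiguous_blocks = 0;
    for (i, &is_allocated) in used.iter().enumerate() {
        // A free block cannot start a run unless its offset is aligned,
        // so it is treated exactly like a busy block in that case.
        let misaligned_start = contiguous_blocks == 0 && (i * block_size) % align != 0;
        if is_allocated || misaligned_start {
            contiguous_blocks = 0;
            continue;
        }
        contiguous_blocks += 1;
        if contiguous_blocks == needed {
            return Some(i + 1 - needed);
        }
    }
    None
}
```

With 64-byte blocks and only block 0 in use, a two-block request with 256-byte alignment skips blocks 1 to 3 and starts at block 4 (offset 256), whereas the same request with 64-byte alignment starts at block 1. Only the first block of a run is checked, so blocks inside a run are never rejected for alignment.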
Hm... but the per-task allocation will be performed by the BPF part. Are you suggesting to allocate some memory from
Even though the current implementation of the user-space scheduler doesn't need to allocate aligned memory, add simple support for aligned allocations to RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No, I was just thinking about the more general case of a userspace component needing per-task memory allocation. For rustland, the per-task allocation happens in It isn't anything we need to worry about now. Just something to keep in mind for the future.