
scx_userland: use a custom memory allocator to prevent page faults #87

Merged

3 commits merged into main on Jan 15, 2024

Conversation

arighi
Collaborator

@arighi arighi commented Jan 13, 2024

To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults.

To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer.

This, coupled with memory locking (via mlockall()), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition.
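For reference, the memory-locking step boils down to a single mlockall() call early at startup. A minimal sketch (not the actual scx_rustland code; raw libc binding with the Linux MCL_* constants from <sys/mman.h>):

```rust
// Illustrative sketch: lock all current and future pages so they can
// never be paged out, eliminating major page faults for locked memory.
const MCL_CURRENT: i32 = 1; // lock pages currently mapped
const MCL_FUTURE: i32 = 2;  // lock pages mapped in the future

extern "C" {
    fn mlockall(flags: i32) -> i32;
}

fn lock_all_memory() -> i32 {
    // Returns 0 on success, -1 on failure (e.g. RLIMIT_MEMLOCK too low).
    unsafe { mlockall(MCL_CURRENT | MCL_FUTURE) }
}

fn main() {
    let ret = lock_all_memory();
    println!("mlockall returned {}", ret);
}
```

Note that mlockall() alone is not enough: Rust's default allocator can still grow the heap at runtime, and touching freshly mapped pages faults, which is why the pre-allocated arena is needed on top of it.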

This memory allocator is completely transparent to the user-space scheduler code and is applied automatically when the bpf module is imported.

In the future we may decide to move this allocator to a more generic place (the scx_utils crate), so that other user-space Rust schedulers can also use it.

This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free).

This allocator can be improved in the future, but even in its current simple form it shows reasonable speed and efficiency in serving memory requests from the user-space scheduler, which are mostly small and uniformly sized.
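The design described above can be sketched as follows (purely illustrative; the block size, arena size, and names are made up and do not reflect the actual RustLandAllocator implementation):

```rust
const BLOCK_SIZE: usize = 64;   // hypothetical allocation granularity
const NUM_BLOCKS: usize = 1024; // hypothetical arena = 64 KiB

struct BlockAllocator {
    arena: Vec<u8>,           // stands in for the pre-allocated, mlocked buffer
    used: [bool; NUM_BLOCKS], // per-block allocated/free flags
}

impl BlockAllocator {
    fn new() -> Self {
        Self {
            arena: vec![0; BLOCK_SIZE * NUM_BLOCKS],
            used: [false; NUM_BLOCKS],
        }
    }

    // First-fit scan for enough contiguous free blocks to hold `size`
    // bytes; returns the byte offset into the arena, or None when full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let nblocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        let mut contiguous = 0;
        for i in 0..NUM_BLOCKS {
            if self.used[i] {
                contiguous = 0; // reset on an allocated block
            } else {
                contiguous += 1;
                if contiguous == nblocks {
                    let first = i + 1 - nblocks;
                    for b in first..=i {
                        self.used[b] = true;
                    }
                    return Some(first * BLOCK_SIZE);
                }
            }
        }
        None
    }

    fn free(&mut self, offset: usize, size: usize) {
        let first = offset / BLOCK_SIZE;
        let nblocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for b in first..first + nblocks {
            self.used[b] = false;
        }
    }
}

fn main() {
    let mut a = BlockAllocator::new();
    let o1 = a.alloc(100).unwrap(); // needs 2 blocks -> offset 0
    let o2 = a.alloc(10).unwrap();  // 1 block, right after -> offset 128
    a.free(o1, 100);
    let o3 = a.alloc(64).unwrap();  // first-fit reuses the freed space
    println!("{} {} {}", o1, o2, o3);
}
```

Since the arena is pre-allocated and mlocked, alloc() never touches a new page, which is exactly the property the scheduler needs.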

With this change in place, scx_rustland survived more than 10 hours on a heavily stressed system (with stress-ng and kernel builds running in a loop):

 $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
     PID   RSS     ELAPSED CMD
   34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland

Without this change, it is possible to trigger the sched-ext watchdog timeout in less than 5 minutes under the same system load conditions.

Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).

Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.

Moreover, group nr_sched_congested and nr_failed_dispatches together with nr_page_faults, and use the sum of all these counters to determine the health status of the user-space scheduler (reporting it to stdout as well).
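On Linux, one way to obtain such a per-process page fault counter (illustrative only; the scheduler may source it differently) is to parse the minflt/majflt fields from /proc/self/stat:

```rust
use std::fs;

// Hypothetical helper (not the actual scx_rustland code): read the
// minor/major page fault counters of the current process from
// /proc/self/stat (fields 10 and 12, 1-based).
fn page_faults() -> (u64, u64) {
    let stat = fs::read_to_string("/proc/self/stat").expect("no procfs?");
    // The command name (field 2) may contain spaces and parentheses, so
    // skip past the last ')' before splitting the remaining fields.
    let rest = &stat[stat.rfind(')').unwrap() + 2..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // After the ')' the first field is state (field 3), so minflt
    // (field 10) is at index 7 and majflt (field 12) is at index 9.
    (fields[7].parse().unwrap(), fields[9].parse().unwrap())
}

fn main() {
    let (minflt, majflt) = page_faults();
    println!("page faults: min={} maj={}", minflt, majflt);
}
```

Sampling this periodically and printing the delta alongside the other counters is enough to spot an allocator that has started faulting.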

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Contributor

@htejun htejun left a comment


Looks good to me. It may not be relevant to scx_rustland right now but in a future where this allocator is used more widely, it may be useful to provide a per-thread override to use the default allocator which can expand the heap.

Having a fixed arena size may become a scalability pain point for schedulers which have per-task memory allocations. An arena size big enough to accommodate tail-end use cases, which can have a very high number of tasks, can be too large for other cases. This can be worked around in a fairly straightforward way by escaping to the heap-expanding allocator when allocating from .init_task(). .init_task() isn't in the memory reclaim path, can block, and can be made to operate synchronously w.r.t. userspace. So, if we provide a separate channel for .init_task() and do all per-task allocations from there with the "don't allocate from the fixed arena" flag set, this problem can be resolved.
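The per-thread override idea could take roughly this shape (a hypothetical sketch, not code from the PR; the arena path here is a trivial bump allocator standing in for the real block allocator, and a production GlobalAlloc would need care around thread-local initialization reentrancy):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 64 * 1024;
static mut ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];
static NEXT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // When set, allocations on this thread bypass the fixed arena and go
    // to the system allocator, which is allowed to expand the heap.
    static USE_SYSTEM_ALLOC: Cell<bool> = Cell::new(false);
}

struct HybridAllocator;

unsafe impl GlobalAlloc for HybridAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if USE_SYSTEM_ALLOC.with(|f| f.get()) {
            return System.alloc(layout); // may page-fault, but only here
        }
        // Bump allocation from the fixed arena (never reclaimed in this
        // sketch; a real implementation would track blocks instead).
        let align = layout.align();
        loop {
            let cur = NEXT.load(Ordering::Relaxed);
            let offset = (cur + align - 1) & !(align - 1);
            if offset + layout.size() > ARENA_SIZE {
                return std::ptr::null_mut();
            }
            if NEXT
                .compare_exchange(cur, offset + layout.size(),
                                  Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return (std::ptr::addr_of_mut!(ARENA) as *mut u8).add(offset);
            }
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let start = std::ptr::addr_of!(ARENA) as usize;
        if (ptr as usize) < start || (ptr as usize) >= start + ARENA_SIZE {
            System.dealloc(ptr, layout); // only system allocations are freed
        }
    }
}

fn main() {
    unsafe {
        let l = Layout::from_size_align(64, 8).unwrap();
        let p = HybridAllocator.alloc(l); // served from the fixed arena
        USE_SYSTEM_ALLOC.with(|f| f.set(true));
        let q = HybridAllocator.alloc(l); // served by the system allocator
        println!("arena={:p} system={:p}", p, q);
        HybridAllocator.dealloc(q, l);
    }
}
```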

unsafe impl GlobalAlloc for RustLandAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let align = layout.align();
        if align > BLOCK_SIZE {
Contributor


Alignment >64 should be really rare, so this should be fine, but larger alignments can be supported relatively easily by updating the scan that finds the first block to skip ahead to the next aligned position.

Collaborator Author


True. I completely ignored align because right now the scheduler doesn't request any specific alignment, but I can add that, considering it's a fairly easy change.

if is_allocated {
    // Reset consecutive blocks count if an allocated block is encountered.
    contiguous_blocks = 0;
} else {
Contributor


I.e., here the code can just consider the block to be busy if contiguous_blocks is zero and the offset isn't aligned.
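The suggested tweak can be sketched like this (illustrative, not the PR's actual code; BLOCK_SIZE and the function name are made up):

```rust
const BLOCK_SIZE: usize = 64; // hypothetical block size

// First-fit scan that only starts a run of free blocks at an offset
// satisfying the requested alignment: an otherwise free block is treated
// as busy when it would begin a new run at an unaligned offset.
fn find_first_fit(used: &[bool], nblocks: usize, align: usize) -> Option<usize> {
    let mut contiguous = 0;
    for (i, &busy) in used.iter().enumerate() {
        if busy || (contiguous == 0 && (i * BLOCK_SIZE) % align != 0) {
            contiguous = 0;
        } else {
            contiguous += 1;
            if contiguous == nblocks {
                return Some(i + 1 - nblocks); // index of the first block
            }
        }
    }
    None
}

fn main() {
    // Block 1 (offset 64) is free but unaligned for align=128, so the
    // scan skips it and starts the run at block 2 (offset 128).
    let used = [true, false, false, false];
    println!("{:?}", find_first_fit(&used, 1, 128)); // Some(2)
}
```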

@arighi
Collaborator Author

arighi commented Jan 15, 2024

> Looks good to me. It may not be relevant to scx_rustland right now but in a future where this allocator is used more widely, it may be useful to provide a per-thread override to use the default allocator which can expand the heap.
>
> Having a fixed arena size may become a scalability pain point for schedulers which have per-task memory allocations. An arena size big enough to accommodate tail-end use cases, which can have a very high number of tasks, can be too large for other cases. This can be worked around in a fairly straightforward way by escaping to the heap-expanding allocator when allocating from .init_task(). .init_task() isn't in the memory reclaim path, can block, and can be made to operate synchronously w.r.t. userspace. So, if we provide a separate channel for .init_task() and do all per-task allocations from there with the "don't allocate from the fixed arena" flag set, this problem can be resolved.

Hm... but the per-task allocation will be performed by the BPF part. Are you suggesting to allocate some memory from .init_task() and share that memory in the address space of the user-space scheduler?

Even though the current implementation of the user-space scheduler doesn't require aligned memory allocations, add simple support for aligned allocations to RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi arighi merged commit 09e7905 into main Jan 15, 2024
2 checks passed
@htejun
Contributor

htejun commented Jan 16, 2024

> Hm... but the per-task allocation will be performed by the BPF part. Are you suggesting to allocate some memory from .init_task() and share that memory in the address space of the user-space scheduler?

No, I was just thinking about the more general case of a userspace component needing per-task memory allocation. For rustland, the per-task allocation happens in drain_queued_tasks() with the .or_insert_with_key() call on task_map. This can alternatively be structured so that BPF ops.init_task() calls into userspace to trigger the insertion, so that we don't have dynamic allocation in the scheduling path. This isn't an immediate problem for rustland, but if you imagine a scheduler implementation that wants to service setups where there may be hundreds of thousands of tasks (which does happen in some fringes), setting up the fixed pool upfront can be a bit challenging. Being able to split out the per-task context allocation from the mlockall-backed fixed heap would allow sidestepping the issue in most cases.

It isn't anything we need to worry about now. Just something to keep in mind for the future.

@Byte-Lab Byte-Lab deleted the scx-rustland-allocator branch March 14, 2024 18:21