scx_rustland: performance improvements #136

arighi · 2024-02-11T13:31:59Z

Reduce custom memory allocator overhead by switching to buddy-alloc and use a HashMap to speed up search by PID in tasks BTreeSet.

In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we perform an O(N) search each time we add an item, to verify whether the PID already exists or not. Under heavy stress test conditions the O(N) complexity can have a potential impact on the overall performance. To mitigate this, introduce a HashMap that can be used to retrieve tasks by PID typically with a O(1) complexity. This could potentially degrade to O(N) in presence of hash collisions, but even in this case, accessing the hash map is still more efficient than scanning all the entries in the BTreeSet to search for the target PID. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Currently, the primary bottleneck in scx_rustland lies within its custom memory allocator, which is used to prevent page faults in the user-space scheduler. This is pretty evident looking at perf top: 39.95% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 3.41% [kernel] [k] _copy_from_user 3.20% [kernel] [k] __kmem_cache_alloc_node 2.59% [kernel] [k] __sys_bpf 2.30% [kernel] [k] __kmem_cache_free 1.48% libc.so.6 [.] syscall 1.45% [kernel] [k] __virt_addr_valid 1.42% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc 1.31% [kernel] [k] _copy_to_user 1.23% [kernel] [k] entry_SYSRETQ_unsafe_stack However, there's no need to reinvent the wheel here, rather than relying on an overly simplistic and inefficient allocator, we can rely on buddy-alloc [1], which is also capable of operating on a preallocated memory buffer. After switching to buddy-alloc, the performance profile under the same workload conditions looks like the following: 6.01% [kernel] [k] _copy_from_user 5.21% [kernel] [k] __kmem_cache_alloc_node 4.45% [kernel] [k] __sys_bpf 3.80% [kernel] [k] __kmem_cache_free 2.79% libc.so.6 [.] syscall 2.34% [kernel] [k] __virt_addr_valid 2.26% [kernel] [k] _copy_to_user 2.14% [kernel] [k] __check_heap_object 2.10% [kernel] [k] __check_object_size.part.0 2.02% [kernel] [k] entry_SYSRETQ_unsafe_stack With this change in place, the primary overhead is now moved to the bpf() syscall and the copies between kernel and user-space (this could potentially be optimized in the future using BPF ring buffers, instead of BPF FIFO queues). A better focus at the allocator overhead before vs after this change: [before] 39.95% scx_rustland [.] core::alloc::global::GlobalAlloc>::alloc 1.42% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [after] 1.50% scx_rustland [.] core::alloc::global::GlobalAlloc>::alloc 0.76% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [1] https://crates.io/crates/buddy-alloc Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

arighi added 2 commits February 11, 2024 14:11

arighi force-pushed the rustland-perf-improvement branch from e364e44 to fc889c6 Compare February 11, 2024 13:33

htejun approved these changes Feb 11, 2024

View reviewed changes

arighi merged commit b5f3f7f into main Feb 11, 2024
1 check passed

Byte-Lab deleted the rustland-perf-improvement branch March 14, 2024 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scx_rustland: performance improvements #136

scx_rustland: performance improvements #136

arighi commented Feb 11, 2024

scx_rustland: performance improvements #136

scx_rustland: performance improvements #136

Conversation

arighi commented Feb 11, 2024