futex2: Implement wait and wake functions
Create a new set of futex syscalls known as futex2. This new interface
aims to provide more maintainable code, while removing obsolete
features and adding new functionality.
Implement wait and wake semantics for futexes, along with the base
infrastructure for future operations. The whole wait path is designed to
be used by N waiters, which makes it easier to implement vectorized wait.
* Syscalls implemented by this patch:
- futex_wait(void *uaddr, unsigned int val, unsigned int flags,
struct timespec *timo)
The user thread is put to sleep, waiting for a futex_wake() at uaddr,
if the value at *uaddr is the same as val (otherwise, the syscall
returns immediately with -EAGAIN). timo is an optional timeout value
for the operation.
Return 0 on success, error code otherwise.
- futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
Wake `nr_wake` threads waiting at uaddr.
Return the number of woken threads on success, error code otherwise.
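The wait/wake contract described above (sleep only if *uaddr still equals
val, otherwise fail with EAGAIN; wake returns the number of woken threads)
is the same one the existing futex() syscall already provides. Since the
proposed futex_wait()/futex_wake() have no syscall numbers in released
kernels, the sketch below uses the legacy multiplexed futex() only to
illustrate those semantics. The helper names legacy_futex_wait() and
legacy_futex_wake() are hypothetical and are not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: block while *uaddr == val, using the legacy
 * futex() syscall. Returns 0 when woken, or -1 with errno set
 * (EAGAIN if *uaddr != val at the time of the call).
 */
long legacy_futex_wait(uint32_t *uaddr, uint32_t val)
{
    assert(uaddr != NULL);
    /* NULL timeout: wait without a time limit */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

/*
 * Hypothetical helper: wake up to nr_wake waiters blocked on uaddr.
 * Returns the number of woken threads, or -1 with errno set.
 */
long legacy_futex_wake(uint32_t *uaddr, int nr_wake)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_PRIVATE, nr_wake, NULL, NULL, 0);
}
```

Calling legacy_futex_wait(&word, 0) while word holds 1 returns at once with
errno set to EAGAIN, and legacy_futex_wake(&word, 1) with nobody waiting
returns 0.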
** The `flags` argument
The `flags` argument specifies the size of the futex word
(FUTEX_[8, 16, 32]). A size must always be given, since there's no
default size.
By default, the timeout is measured against the monotonic clock; setting
the FUTEX_REALTIME_CLOCK flag makes it use the realtime clock instead.
By default, futexes are of the private type, meaning that the user
address will only be accessed by threads that share the same memory region.
This allows for some internal optimizations, so they are faster.
However, if the address needs to be shared with different processes
(like using `mmap()` or `shm()`), the futex needs to be defined as shared,
which is done by setting the FUTEX_SHARED_FLAG flag.
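As a rough illustration of the rules above (a size is mandatory, the other
flags are optional modifiers), flag validation could look like the sketch
below. The numeric values, the SKETCH_ prefix and both function names are
assumptions made for this example only; the text above does not define the
actual bit layout.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values, NOT the ones used by the patch. */
enum {
    SKETCH_FUTEX_8              = 0x1,
    SKETCH_FUTEX_16             = 0x2,
    SKETCH_FUTEX_32             = 0x3,
    SKETCH_FUTEX_SIZE_MASK      = 0x3,
    SKETCH_FUTEX_SHARED_FLAG    = 0x8,
    SKETCH_FUTEX_REALTIME_CLOCK = 0x10,
    SKETCH_FUTEX_NUMA_FLAG      = 0x20,
};

/* The modifier bits must not overlap the size field. */
static_assert(((SKETCH_FUTEX_SHARED_FLAG | SKETCH_FUTEX_REALTIME_CLOCK |
                SKETCH_FUTEX_NUMA_FLAG) & SKETCH_FUTEX_SIZE_MASK) == 0,
              "modifier flags overlap the size field");

/* Size of the futex word in bytes, or 0 if no size was specified. */
size_t sketch_futex_word_size(unsigned int flags)
{
    switch (flags & SKETCH_FUTEX_SIZE_MASK) {
    case SKETCH_FUTEX_8:  return 1;
    case SKETCH_FUTEX_16: return 2;
    case SKETCH_FUTEX_32: return 4;
    default:              return 0; /* no default size exists */
    }
}

/* Nonzero if flags carry a size and contain no unknown bits. */
int sketch_futex_flags_valid(unsigned int flags)
{
    const unsigned int known = SKETCH_FUTEX_SIZE_MASK |
                               SKETCH_FUTEX_SHARED_FLAG |
                               SKETCH_FUTEX_REALTIME_CLOCK |
                               SKETCH_FUTEX_NUMA_FLAG;

    if (flags & ~known)
        return 0;
    return sketch_futex_word_size(flags) != 0;
}
```

With these placeholder values, SKETCH_FUTEX_SHARED_FLAG alone is rejected
because it carries no size, while SKETCH_FUTEX_16 | SKETCH_FUTEX_SHARED_FLAG
describes a shared 2-byte futex.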
By default, the operation has no NUMA-awareness, meaning that the user
can't choose the memory node where the kernel side futex data will be
stored. The user can choose the node where it wants to operate by
setting the FUTEX_NUMA_FLAG and using the following structure (where X
can be 8, 16, or 32):
struct futexX_numa {
__uX value;
__sX hint;
};
This structure should be passed as the `void *uaddr` argument of the futex
functions. The address of the structure is the address that is waited on
and woken, and `value` is compared to `val` as usual. The `hint`
member defines which node the futex will use. When waiting,
the futex is registered in a kernel-side table stored on that
node; when waking, the futex is searched for in that same table.
That means there's no redundancy between tables, and a wrong
`hint` value will lead to undesired behavior. Userspace is responsible
for dealing with any node migration issues that may occur. `hint` can
be a value in [0, MAX_NUMA_NODES], to specify a node, or -1, to use
the same node the current process is using.
When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be
stored on a global table on some node, defined at compilation time.
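For X = 32, the structure above can be written in userspace roughly as
follows. This is only a sketch: the fixed-width types from <stdint.h> stand
in for the kernel's __u32/__s32, and the helper sketch_numa_init() and the
SKETCH_NUMA_CURRENT_NODE name are hypothetical. The point it illustrates is
that `value` is the first member, so the address of the structure and the
address of the futex word coincide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace rendering of struct futexX_numa for X = 32. */
struct futex32_numa {
    uint32_t value; /* the futex word, compared against val */
    int32_t  hint;  /* NUMA node to use, or -1 for the current node */
};

/* The struct address doubles as the futex address. */
static_assert(offsetof(struct futex32_numa, value) == 0,
              "value must be the first member");

/* Hypothetical name for the "use the current node" hint. */
#define SKETCH_NUMA_CURRENT_NODE (-1)

/* Hypothetical helper: set the initial value and the node hint. */
void sketch_numa_init(struct futex32_numa *f, uint32_t value, int32_t node)
{
    assert(f != NULL);
    f->value = value;
    f->hint = node;
}
```

Waiters and wakers must agree on `hint`, since each node has its own table.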
** The `timo` argument
As per the Y2038 work done in the kernel, new interfaces shouldn't add
timeout options known to be buggy. Given that, `timo` should be a 64-bit
timeout on all platforms, expressed as an absolute value.
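Since the timeout is absolute, a caller that thinks in relative terms
("wait at most 200ms") has to add the interval to the current time of the
chosen clock. A minimal sketch of that arithmetic, assuming a hypothetical
struct sketch_timespec64 with two 64-bit fields (the exact userspace type
is not spelled out above) and hypothetical helper names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Hypothetical 64-bit timespec, the same width on every platform. */
struct sketch_timespec64 {
    int64_t tv_sec;
    int64_t tv_nsec;
};

/* Absolute deadline that lies rel_ns nanoseconds after 'now'. */
struct sketch_timespec64 sketch_deadline_after(struct sketch_timespec64 now,
                                               int64_t rel_ns)
{
    assert(rel_ns >= 0);

    struct sketch_timespec64 d = now;
    d.tv_sec += rel_ns / SKETCH_NSEC_PER_SEC;
    d.tv_nsec += rel_ns % SKETCH_NSEC_PER_SEC;
    if (d.tv_nsec >= SKETCH_NSEC_PER_SEC) {
        /* carry the overflowing nanoseconds into seconds */
        d.tv_sec += 1;
        d.tv_nsec -= SKETCH_NSEC_PER_SEC;
    }
    return d;
}

/* Deadline on the monotonic clock, the default clock for `timo`. */
struct sketch_timespec64 sketch_monotonic_deadline(int64_t rel_ns)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    struct sketch_timespec64 now = { ts.tv_sec, ts.tv_nsec };
    return sketch_deadline_after(now, rel_ns);
}
```

For example, 200ms after {1s, 900000000ns} is {2s, 100000000ns}.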
Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
[RFC Add futex2 syscall 0/0]
Hi,
This patch series introduces the futex2 syscalls.
* What happened to the current futex()?
For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept them, given that the
multiplexed interface is full of legacy features and makes big changes
tricky. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Considering that, it's not possible to merge new features into the
current futex.
** The NUMA problem
In the current implementation, all futex kernel-side infrastructure is
stored on a single node. Given that, every futex() call issued by a
processor that isn't located on that node pays a memory access
penalty.
** The 32bit sized futex problem
Embedded systems, or anything with memory constraints, would benefit from
using smaller sizes for the futex userspace integer. Also, a mutex
implementation can be done using just three values, so 8 bits is enough
for various scenarios.
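To make the "three values" remark concrete, here is a sketch of the
well-known three-state mutex (unlocked, locked, locked with waiters) kept
in a single byte with C11 atomics. It is not code from this patchset. The
sketch_wait()/sketch_wake_one() hooks are hypothetical placeholders for
where an 8-bit futex_wait()/futex_wake() would go; here they only yield
the CPU, so the lock degrades to a yielding spinlock.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

enum { SKETCH_UNLOCKED = 0, SKETCH_LOCKED = 1, SKETCH_CONTENDED = 2 };

typedef struct {
    _Atomic uint8_t state; /* one of the three values above */
} mutex8_t;

static_assert(sizeof(mutex8_t) == 1, "the whole mutex fits in one byte");

/* Placeholder for futex_wait(&m->state, expected, FUTEX_8, NULL). */
static void sketch_wait(mutex8_t *m, uint8_t expected)
{
    if (atomic_load(&m->state) == expected)
        sched_yield();
}

/* Placeholder for futex_wake(&m->state, 1, FUTEX_8). */
static void sketch_wake_one(mutex8_t *m)
{
    (void)m;
}

void mutex8_lock(mutex8_t *m)
{
    uint8_t c = SKETCH_UNLOCKED;

    /* Fast path: UNLOCKED -> LOCKED without touching the kernel. */
    if (atomic_compare_exchange_strong(&m->state, &c, SKETCH_LOCKED))
        return;

    /* Slow path: advertise contention, then sleep until it is free. */
    if (c != SKETCH_CONTENDED)
        c = atomic_exchange(&m->state, SKETCH_CONTENDED);
    while (c != SKETCH_UNLOCKED) {
        sketch_wait(m, SKETCH_CONTENDED);
        c = atomic_exchange(&m->state, SKETCH_CONTENDED);
    }
}

void mutex8_unlock(mutex8_t *m)
{
    /* Only issue a wake if somebody announced that they are waiting. */
    if (atomic_exchange(&m->state, SKETCH_UNLOCKED) == SKETCH_CONTENDED)
        sketch_wake_one(m);
}
```

An uncontended lock/unlock pair moves the byte from 0 to 1 and back to 0
without ever reaching the wait or wake hooks.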
** The wait on multiple problem
The use case lies in the Wine implementation of the Windows NT interface
WaitForMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signals,
console input, etc.) to signal. Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and quickly go to sleep on the
consumer side, is essential for good performance of applications running
over Wine.
[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
* The solution
As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
keep the API maintainable as it gets extended, and will help
users with type checking of arguments.
In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.
[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
* The interface
The new interface can be seen in detail in the following patches, but
this is a high level summary of what the interface can do:
- Supports wake/wait semantics, as in futex()
- Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
individual flags for each address
- Supports waiting for a vector of futexes, using a new syscall named
futex_waitv()
- Supports variable sized futexes (8bits, 16bits and 32bits)
- Supports NUMA-aware operations, where the user can specify on
which memory node they would like to operate
* Implementation
The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describe the details and corner cases of
the implementation.
Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.
* The patchset
This patchset can also be found in my git tree:
https://gitlab.collabora.com/tonyk/linux/-/tree/futex2
- Patch 1: Implements wait/wake, and the basic foundations of futex2
- Patches 2-4: Implement the remaining features (shared, waitv, requeue).
- Patch 5: Adds the x86_x32 ABI handling. I kept it in a separate
patch since I'm not sure if x86_x32 is still a thing, or if it should
return -ENOSYS.
- Patch 6: Adds a documentation file which details the interface and
the internal implementation.
- Patches 7-13: Selftests for all operations along with perf
support for futex2.
- Patch 14: While working on porting glibc to futex2, I found out
that there's a futex_wake() call in the user thread exit path, if
that thread was created with clone(..., CLONE_CHILD_CLEARTID, ...). In
order to make pthreads work with futex2, it was required to add
this patch. Note that this is more a proof-of-concept of what we
will need to do in the future than part of the interface, and it
shouldn't be merged as it is.
* Testing:
This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:
** Stability
To stress the interface in "real world scenarios":
- glibc[4]: nptl's low level locking was modified to use futex2 API
(except for robust and PI things). All relevant nptl/ tests passed.
- Wine[5]: Proton/Wine was modified in order to use futex2() for the
emulation of Windows NT sync mechanisms based on futex, called "fsync".
Triple-A games with heavy CPU loads and tons of parallel jobs worked
as expected when compared with the previous FUTEX_WAIT_MULTIPLE
implementation in futex(). Some games issue 42k futex2() calls
per second.
- Full GNU/Linux distro: I installed the modified glibc on my host
machine, so all pthread programs would use futex2(). After tweaking
systemd[6] to allow futex2() calls in seccomp, everything worked as
expected (web browsers do some syscall sandboxing and need some
configuration as well).
- perf: The perf benchmark tests can also be used to stress the
interface, and they can be found in this patchset.
** Performance
- For comparing futex() and futex2() performance, I used the artificial
benchmarks implemented in perf (wake, wake-parallel, hash and
requeue). The setup was 200 runs for each test, using 8, 80, 800 and
8000 as the number of threads. Note that for this test, I'm not using
patch 14 ("kernel: Enable waitpid() for futex2"), for reasons explained
in "The patchset" section.
- For the first three, I measured an average gain of 4% in
performance. This is not a big step, but it shows that the new
interface is at least comparable in performance with the current one.
- For requeue, I measured an average decrease of 21% in performance
compared to the original futex implementation. This is expected given
the new design with individual flags. The performance trade-offs are
explained in patch 4 ("futex2: Implement requeue operation").
[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd
* FAQ
** "Where's the code for NUMA and FUTEX_8/16?"
The current code is already complex enough to take some time to
review, so I believe it's better to split that work out to a future
iteration of this patchset. Besides that, this RFC is the core part of the
infrastructure, and those features will not pose big design
changes to it; the work will be more about wiring up the flags and
modifying some functions.
** "And what about FUTEX_64?"
By supporting 64 bit futexes, the kernel structure for futex would
need to have a 64 bit field for the value, and that could defeat one of
the purposes of having different sized futexes in the first place:
supporting smaller ones to decrease memory usage. This might be
something that could be disabled for 32bit archs (and even for
CONFIG_BASE_SMALL).
Which use case would benefit from FUTEX_64? Is it worth the trade-offs?
** "Where's the PI/robust stuff?"
As said by Peter Zijlstra at [3], all those new features are related to
the "simple" futex interface, which doesn't use PI or robust. Do we want
to have this complexity in futex2() and if so, should it be part of
this patchset or can it be future work?
Thanks,
André