Enabling shared-nothing executors #13
Comments
There was also some discussion around similar problems elsewhere.
What exactly is the concern with making `Wake` require `Send + Sync`?
Both. Some kinds of systems are not designed to support cross-thread awakening of a task, some don't even have threads, and some can't afford it because of latency restrictions.

This is a nice example: http://seastar.io/shared-nothing/

Generally, I think users should not pay for what they don't use.
I'm familiar with seastar's design, and it's totally possible to write an executor like that which locks a task or set of tasks to a particular thread. However, you're right that such a system would still incur some amount of synchronization overhead from having atomic reference-counting instead of non-atomic reference counting.

When writing an executor, the thread-safety requirements mostly come from the possibility of tasks or wakeups crossing thread boundaries. If you need none of these things (no task or IO object will ever cross a thread boundary or interact with another task on a different thread), then I suppose we could enable this use case via runtime checks by getting rid of the static `Send` and `Sync` bounds.

The biggest concern I have with this approach is that all libraries would still have to choose whether they used the thread-safe or the non-thread-safe versions of the task and wakeup types.

If you know of any benchmarks showing significant performance loss in seastar-style applications due to the overhead of atomic reference counting or synchronization in wakeup notifications, I'd love to dig in and read about why that is and what the issues are. Benchmarking I've done on my own applications hasn't shown this to be an issue. I believe @carllerche and @alexcrichton also did similar benchmarks when they decided to make the task system thread-safe by default.
This is a great summary; I totally agree on the pros/cons you described. I'd add that what troubles me most is that the synchronization cost is imposed whether or not you need it.

I'm not aware of any benchmarks in this area; I'd love to see them too. Will google, and probably write a few myself, though proper benchmarks are hard. If you have any ideas on the approach I'd be happy to hear them. A ring of tasks comes to mind.
Yup, I agree. It means that you need to use something like a lock-free queue of tasks, or a mutex over a slab of tasks, rather than a plain old unsynchronized queue.
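To make the contrast concrete, here's a minimal sketch (all names are hypothetical, not part of any real executor) of the two shapes of run queue being discussed: a plain single-threaded queue versus a mutex-guarded one that can accept wakeups from other threads:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Shared-nothing flavor: non-atomic refcount, no lock. Cannot leave its thread.
type LocalQueue = Rc<RefCell<VecDeque<usize>>>;

// Thread-safe flavor: atomic refcount plus a mutex, so any thread may wake.
type SharedQueue = Arc<Mutex<VecDeque<usize>>>;

fn wake_local(q: &LocalQueue, task: usize) {
    q.borrow_mut().push_back(task); // plain pointer bump, no synchronization
}

fn wake_shared(q: &SharedQueue, task: usize) {
    q.lock().unwrap().push_back(task); // lock acquisition on every wakeup
}

fn main() {
    let local: LocalQueue = Rc::new(RefCell::new(VecDeque::new()));
    wake_local(&local, 1);
    assert_eq!(local.borrow_mut().pop_front(), Some(1));

    let shared: SharedQueue = Arc::new(Mutex::new(VecDeque::new()));
    wake_shared(&shared, 2);
    assert_eq!(shared.lock().unwrap().pop_front(), Some(2));
}
```

The types are identical in shape; the entire difference is whether `Rc`/`RefCell` or `Arc`/`Mutex` sits around the queue, which is exactly the cost being debated.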
That would be awesome! I'd love to see what you find out, especially if it pushes us in a different direction that allows us to get noticeable performance gains.
I know it's not as good as a benchmark inside a real-world computation, but here are some latencies for atomics measured in this paper. These all assume the memory is in the cache of one of the processors. Fetch-and-add had the same latency as compare-and-swap.
One thing I love about Rust is that its fine-grained abstractions map very well to the fundamental constraints of the underlying problem space (i.e. concurrent programming on general-purpose processors). I expect most of these fundamental constraints to stay the same for a long time. So far, I'm not convinced that an atomic operation being shadowed by other computation is fundamental to the problem space; that may say more about the types of problems people solve with such abstractions today than about anything else.

Now, to stir the pot a bit, let's take a possibly-controversial example from Seastar's homepage: processing packets at 10 GB/s. I'm going to use a 10 GB/s data rate after packet overhead, since I know Infiniband networks can handle this pretty easily.
If we're handling 1024-byte packets at 10 GB/s, that's roughly ten million packets per second, or about 100 ns of budget per packet, so a single uncontested atomic operation could be 10-30% of our processing time even in the best case. Of course, this might not be a reasonable thing to do with an async abstraction. But is there a fundamental reason why not? The relative speeds of RAM, cache, CPU, and networks are always changing, and with them the types of problems that we solve. Plus, adding ergonomic async/await syntax to a high-performance language with zero-cost abstractions may bring out some unforeseen use cases!

It would be great if someone more knowledgeable than me could offer concrete examples of problems that might benefit from such an abstraction. I know that in the world of dataflow computations, a heat-diffusion simulation is a popular example. The linked code uses futures, but I don't know much about its performance characteristics beyond that.
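The arithmetic behind that 10-30% figure can be checked back-of-the-envelope. The 10-30 ns atomic-op latency range below is an assumption for illustration, roughly in line with uncontested, cache-resident atomics on common hardware:

```rust
// Per-packet time budget at a given payload rate.
fn ns_per_packet(bytes_per_packet: f64, bytes_per_sec: f64) -> f64 {
    bytes_per_packet / bytes_per_sec * 1e9
}

fn main() {
    let budget = ns_per_packet(1024.0, 10e9);
    println!("budget per packet: {:.1} ns", budget); // 102.4 ns

    // Assumed latency range for one uncontested atomic op, in ns.
    let (lo, hi) = (10.0, 30.0);
    println!(
        "one atomic op eats {:.0}%-{:.0}% of the budget",
        lo / budget * 100.0,
        hi / budget * 100.0
    );
}
```

So even a single wakeup-related atomic per packet is a double-digit fraction of the per-packet budget at this data rate.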
In reply to @cramertj: I don't believe any serious benchmarks were made, not on my part at least. I do know that the current implementation does incur this synchronization overhead.

That said, I plan on heavily leveraging thread-locals to provide fast paths for cases where the work happens on the same thread, so in those cases most synchronization operations can be avoided.
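The fast-path idea can be sketched roughly like this (a toy illustration under my own assumptions, not the actual implementation; comparing `ThreadId`s stands in for whatever bookkeeping a real executor would do):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::thread::{self, ThreadId};

// Hypothetical task handle: remembers the thread it belongs to.
struct Task {
    home: ThreadId,
    id: usize,
}

thread_local! {
    // Per-thread run queue: pushing onto it needs no locks or atomics.
    static LOCAL_QUEUE: RefCell<VecDeque<usize>> = RefCell::new(VecDeque::new());
}

// Returns true if the fast (same-thread, unsynchronized) path was taken.
fn wake(task: &Task) -> bool {
    if task.home == thread::current().id() {
        // Fast path: plain, non-atomic queue push.
        LOCAL_QUEUE.with(|q| q.borrow_mut().push_back(task.id));
        true
    } else {
        // Slow path: a real executor would hand off to the task's home
        // thread here (e.g. via a synchronized channel). Elided.
        false
    }
}

fn main() {
    let t = Task { home: thread::current().id(), id: 7 };
    assert!(wake(&t)); // same thread, so the fast path is taken
    let queued = LOCAL_QUEUE.with(|q| q.borrow_mut().pop_front());
    assert_eq!(queued, Some(7));
    println!("fast path queued task {:?}", queued);
}
```

The synchronization cost then only shows up on the cross-thread branch, which same-thread workloads never hit.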
Small plug for the thread_local crate: https://docs.rs/thread_local/0.3.5/thread_local/ The regex crate uses it precisely because it optimizes for the same-thread use case. (Apologies if my suggestion is out of left field; I'm not following the broader context too closely.)
I did some experiments last weekend: https://github.com/rozaliev/futures-sync-bench

The benchmark includes a simple local task scheduler with a swappable wake queue. Two queues are implemented, one synchronized and one not.

At the moment there is only one test case implemented: a ring of tasks. We spawn a task, it spawns one more and yields, the next one does the same, until we reach the Nth task. Now we have N tasks; the Nth can wake the (N-1)th, and so on, until the root task is awakened, and the root one wakes the Nth. It's a ring.

This test case is clearly synthetic and measures only the direct impact of synchronization. These are some preliminary results I got on my MacBook, so they should be taken with a grain of salt.
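For reference, the ring structure described above can be modeled like this (a synchronous toy that only reproduces the wake order, not the actual futures machinery in the linked repo):

```rust
use std::collections::VecDeque;

// Simulate one lap of the ring: waking task i enqueues task (i - 1) mod n,
// mirroring "the Nth wakes the (N-1)th, ..., the root wakes the Nth".
fn ring_lap(n: usize) -> Vec<usize> {
    let mut queue: VecDeque<usize> = VecDeque::new();
    let mut order = Vec::new();
    queue.push_back(n - 1); // the Nth task is runnable first
    while order.len() < n {
        let task = queue.pop_front().expect("queue is never empty mid-lap");
        order.push(task);
        queue.push_back((task + n - 1) % n); // wake the previous task
    }
    order
}

fn main() {
    // One lap over a ring of 4 tasks runs each task exactly once,
    // in descending order, ending at the root (task 0).
    assert_eq!(ring_lap(4), vec![3, 2, 1, 0]);
    println!("wake order for one lap: {:?}", ring_lap(4));
}
```

In the real benchmark, the interesting variable is what `push_back` costs: a plain queue operation in the local case versus a synchronized one in the thread-safe case.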
To make something actionable out of the benchmarks, I'm going to continue working on this as time allows. Ideas, suggestions, code reviews and PRs are very welcome.
I've been thinking a lot about this and talking with some other folks at Google (cc @ctiller) about how different async executors work and what requirements they have around thread-safety.

My initial plan for how to support these use cases was to make the default wakeup type non-thread-safe and make thread-safety opt-in. Unfortunately, actually implementing that ran into problems.

@aturon and I discussed a solution which would allow us to future-proof the wakeup APIs so that non-synchronized wakeups could be added later. What do y'all think? It's not the most satisfying solution in that it doesn't give us immediate access to non-synchronized wakeups, but it leaves that door open.
@cramertj I think that proposal makes sense. Essentially you would add a new method that lets an executor opt in to local (non-`Send`) behavior.
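As a rough illustration of the `spawn` / `spawn_local` split being discussed (all names are hypothetical, not the real `futures` API; plain closures stand in for futures to keep the sketch self-contained):

```rust
use std::rc::Rc;

// Stand-ins for tasks: Send for the cross-thread case, not for the local one.
type SendTask = Box<dyn FnOnce() + Send>;
type LocalTask = Box<dyn FnOnce()>;

trait Executor {
    /// Thread-safe spawn: the task may be moved to another thread.
    fn spawn(&mut self, task: SendTask);

    /// Opt-in local spawn: executors that may move tasks reject it.
    fn spawn_local(&mut self, _task: LocalTask) -> Result<(), &'static str> {
        Err("this executor only supports Send tasks")
    }
}

// A shared-nothing executor: everything runs on the current thread.
struct SingleThread {
    queue: Vec<LocalTask>,
}

impl Executor for SingleThread {
    fn spawn(&mut self, task: SendTask) {
        self.queue.push(task); // a Send task is also a valid local task
    }
    fn spawn_local(&mut self, task: LocalTask) -> Result<(), &'static str> {
        self.queue.push(task);
        Ok(())
    }
}

fn main() {
    let mut ex = SingleThread { queue: Vec::new() };
    let data = Rc::new(5); // Rc is !Send: only spawnable via spawn_local
    ex.spawn_local(Box::new(move || println!("local task sees {}", data)))
        .unwrap();
    ex.spawn(Box::new(|| println!("send task")));
    for task in ex.queue.drain(..) {
        task();
    }
}
```

The point of the default `Err` is that library code can call `spawn_local` speculatively and fall back to the `Send` path, without the trait forcing every executor to support locals.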
edit: spawning is orthogonal to async/await; must have been tired.
Overview
Today's Futures design seems to be focused more on M:N scheduling. It's indeed the most popular and generic use case, but I think we should also discuss future-proofing Futures for shared-nothing and `no_std` use cases.

To summarize: there's usually either only one thread, or many threads with each one pinned to a CPU core. No synchronization is allowed, including atomics. For `no_std`, task wakening is usually custom-built using platform-specific tools, and `std` systems wake tasks only in the current thread; cross-thread wakening is very specific to the higher-level system design. Do concurrency, leave parallelism to the user.

For the implementation this means that none of the standard operations (spawning a task, polling, wakening) should use mutexes or atomics. There should also be no `Sync` and `Send` bounds.

Today it feels very awkward to implement something like this with futures.
Details
`Wake`, for example, has to be inside an `Arc` and has to be `Send + Sync`. For an implementer this means either paying for synchronization that isn't needed, or making `Wake::wake` a no-op dummy and using a different waking strategy.

The `Executor` trait also requires every task to be `Send`, making every Future that wants to use a default executor incompatible with a shared-nothing system.

As far as I can tell after a quick look, this PR seems to be dealing with similar issues.
Ideas?
There might already be a way to solve all these problems that I just don't know about; in that case we should just write some docs describing how to approach it.

If there is no clear way, then off the top of my head I'd say: make `Wake` thread-local and add an `upgrade` method that returns `Option<SyncWake>`. For `Executor`, add `spawn_local`.

This leaves the question, though: how should an intermediate Future know which API version it should use, local or sync?

What do you think?
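The `upgrade` idea sketched above might look something like this (hypothetical names throughout; `SyncWake` stands in for today's `Send + Sync` wake handle):

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::Arc;

// Thread-safe wake handle, as futures requires today.
trait SyncWake: Send + Sync {
    fn wake(&self);
}

// Thread-local wake handle: no Send/Sync, no Arc required.
trait LocalWake {
    fn wake_local(&self);

    /// Returns a thread-safe handle if this waker supports cross-thread
    /// wakeups, or None on a shared-nothing executor.
    fn upgrade(&self) -> Option<Arc<dyn SyncWake>> {
        None
    }
}

// A shared-nothing waker: wakes by flipping a local flag, never upgrades.
struct FlagWake(Rc<Cell<bool>>);

impl LocalWake for FlagWake {
    fn wake_local(&self) {
        self.0.set(true);
    }
}

fn main() {
    let flag = Rc::new(Cell::new(false));
    let waker = FlagWake(flag.clone());
    waker.wake_local();
    assert!(flag.get()); // woken without any atomics
    assert!(waker.upgrade().is_none()); // cannot cross threads
    println!("woken: {}", flag.get());
}
```

An intermediate Future would then call `upgrade()` only when it actually needs to send a wakeup to another thread, paying for synchronization exactly at that point and failing loudly (via `None`) on executors that don't support it.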