Task, Timer, I/O stealing and pathological starvation #3873
Replies: 4 comments 6 replies
-
I'm still thinking this through, but a few comments:
|
Beta Was this translation helpful? Give feedback.
-
I've been thinking about this and I am fairly confident that we can implement I/O stealing for io_uring without locking. io_uring is special in this regard because we can directly read completion queue events from memory without needing to make any syscalls. |
Beta Was this translation helpful? Give feedback.
-
The current implementation will generating unexpected results if you using the timer to generate lock step frames, that's why I use a dedicated HashWheelTimer for this |
Beta Was this translation helpful? Give feedback.
-
I have some thoughts on this. Going to fork off into a separate discussion since I think it's philosophically interesting in its own right, but the bottom line is that starvation mitigation is best-effort; we'll take what we can get, and we won't intentionally create suboptimal behavior, but we also won't compromise on the primary purpose. The primary purpose of stealing is balancing load across available workers to improve throughput. |
Beta Was this translation helpful? Give feedback.
-
As of CE 3.6, the
WorkStealingThreadPool
is a fully-integrated runtime, where tasks, timers, and I/O are scheduled locally on theWorkerThread
where they were created. By doing so, we avoid a great deal of context switches and synchronization and overall have better cache utilization. Unfortunately, this optimization leaves us vulnerable to pathological starvation: if aWorkerThread
gets stuck on some long-running task (e.g. CPU bound or a rogue blocking op) then it cannot run its other tasks, or trigger its timers, or poll for I/O.One possible mitigation for this sort of starvation is stealing, which we've currently implemented for tasks and timers. It empowers threads to steal tasks and timers from another thread that may be starving in that moment. However, currently a thread will only attempt to steal if it has no work of its own. This means that under load, when all threads are fully occupied with their own work, there are no threads available to rescue tasks and timers from a starving thread. Also, threads will only attempt to steal expired timers before giving up, but a starved thread may need help with its unexpired timers in the near future.
To properly mitigate against starvation, threads would need to make deliberate attempts to steal from other threads even when they already have their own work. Furthermore, unnecessarily moving work across threads is not ideal, so I imagine a
WorkerThread
should keep some timestamp of its last iteration through its runloop that can be used by other threads to determine if it is starving or not and thus whether its work should be aggressively stolen.There's also the problem that implementing I/O stealing is non-trivial and generally will require some form of locking, which we want to avoid as it undermines the aforementioned optimizations. So it's possible that I/O is fundamentally vulnerable to pathological starvation and it begs the question whether it is worth working so hard to create mitigations for tasks and timers when we cannot solve the problem in general.
Beta Was this translation helpful? Give feedback.
All reactions