Design: alternative scheduling models #32
Currently we use a simple round-robin scheduling strategy. It isn't technically FIFO, because we randomize the order of execution within each "batch" in an attempt to force people not to make too many assumptions about the scheduling strategy :-). But basically FIFO in its sophistication.
Can we do better?
There's a rich literature on fancy scheduling, e.g. the weighted-fair queuing used by the Linux kernel's CFS ("completely fair scheduler"). And there are a lot of challenges in building async applications that come down to scheduling issues (e.g., #14 and https://stackoverflow.com/questions/40601119). But AFAICT a lot of the existing scheduling literature assumes that you have a pre-emptive scheduler; there's very little out there on applying these ideas to cooperative scheduling systems. in some ways the network packet scheduling literature is more relevant to us, because packets come in different sizes and the scheduler doesn't get to change that. (OTOH, packet scheduling algorithms tend to assume you can look into the future and predict how long a packet will spend transmitting, whereas when we start a task step we don't know how long it will run before yielding.)
Does it matter though?
It's not entirely clear that "fairness" is actually the solution to the problems linked above! Fair task scheduling is in part about negotiating between somewhat adversarial users (e.g. different TCP connections fighting for their share of a link), which isn't as obvious a fit to the tasks in our system. Though OTOH those tasks may be doing work on behalf of different competing users. And scheduler algorithms only matter when a program is CPU-bound, which hopefully trio programs usually aren't. But OTOH even programs that are I/O-bound overall will become CPU-bound in bursts, and it's exactly when things are melting down under load that you'd most like to handle things gracefully. OTOOH the real solution is often going to be something like load shedding or otherwise rate-limiting incoming work; better scheduling isn't a silver bullet to fix load issues.
So, I really don't know whether this is actually a good/useful idea for trio or not. But it might be a place where we can make a substantial difference to trio programs' usability in a central, principled way, so that seems worth exploring!
In principle, a WFQ scheduler is actually very simple to implement (see below). What's not so clear is whether this would actually help in practice. There are two obvious issues:
Maybe what we really need is better knobs to let users set the priority of different tasks. (WFQ makes it easy to implement relative priorities, and relatively easy to implement hierarchical scheduling policies; it's also pretty straightforward to implement strictly tiered priorities like the Linux kernel's realtime priorities.) But "here's 100 knobs to tune, try tweaking some whenever your server performance degrades" is a pretty bad UX too. There's something to be said for the predictability of FIFO.
In general, this is a deep problem that will definitely require experience and data and visualizations of real apps on top of trio.
Ironically, the Python GIL is essentially a cooperative scheduling system (though a weird one), and probably the one with the best public analysis! Interesting documents:
The classic Mac OS used cooperative scheduling heavily. This was known as the "thread manager". I'm very curious what their scheduling strategy looked like, but I don't know how well documented it is. The "Inside Macintosh: Thread Manager" book (PDFs are floating around on the web) might have more details. [Edit: there's now some notes on this on the reading list wiki page.]
Alternatively, if we decide that we don't want a fancy scheduling system, we could go the other way, and actually guarantee deterministic FIFO scheduling, on the theory that determinism is generally a nice thing if you aren't losing anything else to get it.
The text was updated successfully, but these errors were encountered:
Weighted fair queuing
Weighted fair queuing (WFQ) is super elegant and simple, but I found it hard to work this out from the literature (partly because our context is a bit weird). So here are my notes on how it works.
The goal is to emulate a ideal multi-tasking CPU (in the academic literature this is called the "generalized process sharing" (GPS) model). When an ideal multi-tasking CPU has N tasks that want to run, then it splits itself into N mini-CPUs that each run at 1/N times the speed of the original CPU, and then each task runs on its own mini-CPU. So if there are N runnable tasks, then over any time span of T seconds they each get (T/N) seconds of CPU time. It's very simple, but, of course, reality doesn't work this way – we only have 1 CPU and our tasks actually have to take turns. So the idea is that we keep track of how much CPU time each task got on the real CPU, versus how long it should have gotten on its mini-CPU, and try to make them match up over time on average.
So this suggests a naive implementation of WFQ: whenever a task gets added to the run queue (i.e., on an ideal CPU it would be running, but in reality it's sitting and waiting for it's turn), then start a timer to track how much time it should have run for. Each time it ticks, the counter assigns the task another 1/(current length of run queue) ticks worth of CPU credit, since that's how much it should have gotten over that tick interval. (Notice that this might vary from tick to tick, because the length of the run queue varies over time.) And then whenever a task actually runs, we watch how much CPU time it uses, and decrement the accumulated credit by the appropriate amount. Now our scheduling policy is simple: at each schedule point, pick the task that has the most credit accumulated, and start running it.
Theoretically, this works great! That's all there is to it. BUT, the problem is that this implementation is slow, because if we have N tasks in the run queue then on every tick we have to do an annoying O(N) loop to scan through and update their credit balance. So we want to keep that behavior, but find a way to implement it that's more algorithmically efficient. One impulse might be to start looking for heuristics, but it turns out that we don't need any – there is a beautiful trick that lets us make this more efficient while remaining exactly fair!
Here's the trick: we introduce a "virtual clock" (or vclock for short). Our virtual clock ticks in virtual seconds (vseconds). The vclock always ticks at a rate of 1 vsecond per N wall seconds, where N is the number of runnable tasks – so this means that it's constantly speeding up and slowing down as tasks enter and leave the run queue. But don't worry about that – it will turn out that we don't have to keep track of the relationship between vtime and realtime in any kind of detailed way. The relationship is well-defined, but like, pretend that someone else is doing all the annoying bookkeeping work to keep track of it.
Instead of doing that, we're going to take advantage of a simple invariant. Imagine that we had an ideal "GPS" CPU and two tasks running on it, so they're each receiving 50% of the CPU time. This means that over 2 real seconds, they each receive 1 second of CPU time. It also means that over 2 real seconds, the vclock advances by 1 vsecond. In fact, this holds in general: for an ideal GPS CPU, over any time period where a task gets a total of 1 real second of CPU time, the vclock advances by 1 second. This works for arbitrary numbers of tasks, and it even works if the number of tasks fluctuates over time: if a new task starts running then every task starts getting a smaller proportion of the CPU, and the vclock slows down by exactly the same amount, so it still works; if a task goes to sleep then something similar happens in reverse.
It's kind of brain-melty, but it suggests a shockingly simple algorithm, which is efficient and gives perfectly fair scheduling. Give each task a vtime counter. This counter shows the time on the vclock at which the task starts deserving more CPU time. At the start, we initialize these all to the same arbitrary value. Then:
One way to think about this: we're simulating our ideal CPU running these tasks over virtual time, and our simulation has a single linear timeline measured by the vclock; each task's vtime counter shows the point on this shared timeline that our simulation of this task has reached.
Another, way more intuitive way to think about it: we just keep track of how long each task has been allowed to run, and pick the one where this value is smallest. However, the limitation of this simple intuition is that it doesn't help you deal with initializing the value when a new task enters the run queue.
Open question: handling new + woken tasks
For me this is the biggest open question. Notice that in the algorithm summary above, I skipped over the question of how to initialize the vtime of a newly spawned task, or what to do with the vtime of a task that's been sleeping and just woke up. The classic answer is that in both cases, you initialize the new vtime to
For new tasks, this makes sense for us too. Though there are alternatives, e.g., from a theoretical point of view it seems natural to check the parent's vtime at the moment they call
For woken tasks, this isn't going to work, because it would mean that a CPU hog could block the event loop for 10 seconds, then sleep for 0.1 second and start over with a blank slate. After playing with a few options on paper, I think a plausible thing to try might be: when a task wakes up, set its vtime to
For a pre-emptive scheduler, you want to add some hysteresis so as to avoid the situation where you have two tasks with the same vclock, you pick one, run it for 1 nanosecond, now the other task has a smaller vclock so you context switch, run that one for 1 nanosecond, now the first task has a smaller vclock again... basically the pre-emption rule is that the current task runs until either it sleeps or its vclock becomes > (min(other task's vclock) + some slop). In CFS this slop is called the "granularity" and is a tunable, but is usually set to a few milliseconds. For a cooperative scheduler, it's not clear whether this matters, since we rarely have the option of immediately re-running the same task, and there's an implicit maximum context-switch rate imposed by the granularity of individual task steps.
Priorities / niceness values
If you want a task to receive x% of its fair share, then when incrementing its vclock, first divide by x%. (So if a task with a 50% share runs for 1 second, then its vclock increments by 2 vseconds.)
Hierarchical weighted fair queuing
The idea here is that you split tasks into groups (possibly nested), and then use WFQ between and within groups. So if you have two groups you might say that they each collectively get 50% of the CPU time, and then within each group the subdivide their 50% fairly. Of course if the groups are of equal size then this is just the same as WFQ without groups, but if the groups are different sizes then it's different.
The intuition behind this is that our idea of fairness might not have the same granularity as our tasks (which are kind of an implementation detail). For example, consider a full-duplex protocol where both sides can read and write simultaneously. In trio this would generally involve making two tasks per session. And let's say we have two session, so four tasks total: send-to-A, recv-from-A, send-to-B, recv-from-B. And now let's say that for whatever reason, A is sending and receiving at full bandwidth, and B is sending at full bandwidth but not receiving. So we have 3 active tasks, and regular WFQ will end up allocating 66% of our server CPU to A, while B gets only 33%. Maybe it would be better if we put these tasks into an A group and a B group, so they each got 50% of the resources.
Implementing HWFQ is a bit more complicated, but not too much. I'll defer to the literature for the details though.
Actually measuring usage
For an OS kernel, it's trivial to measure how long a task has been running for, because they control the whole system. For us it's not quite so obvious – we can ask the OS what time it is when the task starts and when it ends and then subtract them, but if our process gets scheduled away, or some other thread steals the GIL, or a GC collection runs, then whatever task happened to be running at the time will get stuck with the bill. We could potentially measure CPU time instead (Linux:
I guess virtualized Linux systems have basically the same problem and they seem to do OK.
A related issue is the resolution of these timers – if a task runs for a very short period, can we measure its CPU time accurately? I suppose if it comes back as 0 then we can just reschedule it and keep the clock running :-).
Noting for future reference, a sketch of how to implement
The idea is that
One efficient approach would be to keep two internal queues: one sorted by vtime, and one sorted by arrival time. When we want to dequeue, first inspect the vtime queue for any tasks that have become potentially runnable, and move them into the arrival-time queue (which is sorted by the time they called
Perhaps a better way to think about it is: when a
Putting a big pile of tasks on/off the scheduler like this might be very difficult to do without quadratic costs, though. I think for regular WFQ this is probably avoidable (since there we know which task is going to unpark if any; the only possibility is that it might get pre-empted); for HWFQ it's... challenging.
The echo client in the tutorial is an interesting example of a case where ideally we might want to support task priorities (and maybe even hierarchical priorities).
The tutorial makes a big deal out of how splitting sending and receiving into two separate tasks avoids deadlocks and preserves good latency, but there's still a nasty case it could potentially get into where if the receiver just isn't scheduled often enough relative to the sender, then latency could accumulate up to the limits of the network buffer. (So not an infinite amount – eventually the
One solution would be to explicitly declare that the receiver task always has priority over the corresponding sender task. (Hierarchical scheduling would make it possible to declare exactly this; without hierarchical scheduling, we could still say that all receiver tasks have priority over all sender tasks.)
I'm not sure how typical the echo client is of real-world scheduling problems, but it's always nice to have simple concrete examples!
There's a little bit of discussion of better-than-FIFO scheduling policies in this thread on async-sig. This also spurred me to read a bit more, and I've discovered that what I call WFQ above might be what other people call "SFQ", as e.g. discussed in this chapter. (The usual cite for SFQ is GGV96, A hierarchical cpu scheduler for multimedia operating systems.)
There's also Earliest eligible virtual deadline first, which is what Con Kolivas cites as the inspiration for BFS/MuQSS (though I'm not sure those schedulers as implemented correspond to any well-articulated model).
I don't currently understand how and why SFQ and EEVDF differ.
That's weird. Usually, determinism is a valued property in scheduling, as it offers "higher" fairness. As an example, another asyncio alternative had to go out of its way to offer deterministic scheduling, because otherwise it was "randomized" in a way up to not being fair: micropython/micropython-lib#140 ("out of its way" there, is to spend 33-50% more memory on the queue structure, with the whole implementation tightly optimized for memory usage).
Hi @pfalcon! Can you give more details on the problems you ran into with lack of fairness in uasyncio? (Interesting project by the way, I hadn't encountered it before.) I read the linked bug report but couldn't figure out what exactly the problem was, or what degree of unfairness the prior algorithm allowed. To make sure it's clear: trio's current scheduling algorithm is randomized, but is still careful to put a strict bound on worst-case unfairness. If you have two tasks executing 'await sleep(0)' in a loop, then strict FIFO would give you the schedule ABABABAB... Trio doesn't guarantee that, but the worst that the current scheduler would allow is ABBAABBAABB... (and in practice since it's randomized you'll get something in between those two schedules – it's easy to try this with a print in the loop and see what happens).
uasyncio initially used heapq module for scheduling queue, with the usual trick of letting few tasks be scheduled for the same time unit : https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes (store entries as 3-element list including the priority, an entry count, and the task.). Having 3-element entry instead of 2-element meant spending 50% more memory on it (with MicroPython heaps counted in kilobytes), and was one of the big motivations to rewrite it to dedicated "timeq" modules.
But then turned out, that because heap data structure doesn't provide "stable sort" order, then if certain number of tasks are scheduled for the same time unit, then some of them don't get scheduled fair. If the other tasks also called
There're of course few conditions which lead to that, namely: a) finite precision of time scheduling; b) not really true randomized, but rather biased, "sorting". But the point is still: having a deterministic scheduling is boon, which helps to fight even unfortunate circumstances like above. So, in uasyncio, if tasks were scheduled in the order 1, 2, 3, 4, they will run in that order (and if they just yield to the loop each (not timed wait, no I/O), they will keep running in that order).
You can't imagine how I find Trio (mostly its docs) insightful. Discovering Trio actually prompted me to compile the timeline: https://github.com/pfalcon/micropython-lib/wiki/AsyncioAlternativesTimeline . You can compare my rhetoric from https://groups.google.com/d/msg/python-tulip/JA0-FC_pliA/knMvVGxp2WsJ with what you write in the first part of https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/ .
Ah, right, that kind of task starvation is definitely bad. Curio also had a number of issues like this before I fixed them. Trio avoids this by dequeuing tasks in batches: first it collects up everything that's ready to run right now, then it runs all those, and then only after it's finished that batch does it go back and look for more things to run. So if tasks 1, 2, 3 reschedule themselves immediately, that's fine, they get put in the queue, but we don't look at that queue until after we've finished running task 4 and go back to create our next batch. And the randomization happens within each batch, so it can't create any kind of long-term unfairness – at the boundaries between batches, every task has (deterministically!) run exactly as many times as it would have with strict FIFO.
Well, now I feel even sillier for having missed uasyncio before! It's definitely exciting to find others exploring the space of asyncio alternatives...
Another interesting real-world example of scheduling difficulties, from #twisted:
Just learned some more about how twisted deals with CPU-bound tasks.
The recommended method is to use
The default stopping condition is: if it's been less than 10 ms, run another task. (Or put another way: on each tick it runs as many tasks as can fit in 10 ms, plus one.)
So basically this is a way of scheduling CPU-bound tasks as a group, and then throttling that group as a whole.
I suspect that a smart scheduler could do some of this automatically, on the assumption that CPU-bound tasks will use more CPU than IO-bound tasks and thus get automatically throttled for fairness. Some way to explicitly mark tasks as lower priority might also be useful. It's also interesting to think about what the twisted approach actually ends up doing, in this conceptual framework – it's not quite like either a Linux nice value (= the weight in WFQ) or a Linux realtime priority (= lower priority tasks don't run at all while a higher priority task can run).
This article on "scheduling capabilities" also looks neat! I haven't read it yet, and it's probably more about static "realtime"-ish priorities than the WFQ-type stuff discussed above, but I want to at least capture the link: https://dl.acm.org/citation.cfm?doid=3190508.3190539
Another likely example of strict-checkpoint-fairness causing problems: https://gitter.im/python-trio/general?at=5f526634dfaaed4ef52ef17f
Best-guess analysis: https://gitter.im/python-trio/general?at=5f536ec9a5788a3c29d5f248
[Edit: turns out that this was probably a false alarm, though the scenario described there is still plausible, even if it's not what was going on here.]