Append instead of enqueue tasks woken up by notifications #902
"Notifications should influence scheduling as little as possible." -- this is not obviously true to me. It's not obvious what the right thing is here at all to me without thinking deeply about it. The main use case I can think is to skip threads stuck on a spin lock that might be waiting for a thread blocked on a notification to kick the lock. Deeper thinking needed here to validate that's a sensible scenario though . |
I cannot be sure why the implementation ended up the way it is, but I suspect it goes back to originally notifications sharing most of the code path with IPC (that was back in the days when they were misleadingly called "asynchronous IPC", which caused no end of confusion, see https://trustworthy.systems/publications/nictaabstracts/Heiser_Elphinstone_16.abstract). Conceptually I agree with @Indanz: Notifications should not change scheduling more than necessary, and in particular should not re-order round-robin queues (that's sort-of implied in "round robin"). In general, system calls should avoid side effects that are not required. Having said that, I'm not sure I exactly understand the situation here. Is the scenario you're describing the following?
In that scenario I'd expect A to precede B in the ready queue, behind any threads that were already in the queue of prio P before the first signal. If the order is the opposite then that's the wrong behaviour in my view. Now I may completely misread the scenario, in which case I'd appreciate if someone enlightened me. (This is likely one of the things around scheduling not explicitly specified, so it would not be an implementation bug and couldn't be captured by verification, but it'd be a bug in the informal spec that was implemented in lieu of a formal one. And a reminder that we'll eventually have to specify more of the scheduling behaviour to fully support reasoning about timeliness.)
Blocking while holding a spinlock would be bad locking design. But assuming a more sensible variant, why would a just woken up thread be put ahead of an already running one? If the running one is holding a different lock, it will only increase lock contention. If the server is kicking multiple locks, you don't want starvation of any of them either. CPU distribution should be done fairly, not arbitrarily depending on chance. The reason it is okay to move a thread to the front of the queue for IPC is because with IPC you know the task was at the head and running at the time of the call, but has been blocked the whole time since then, so moving it ahead again is not unfair towards other tasks. There are no such guarantees with notifications. Maybe it's not as bad as I think it is because scheduling will happen when a task runs out of its time-slice, but even then this behaviour will cause extra, unnecessary scheduling and minor unfairness, more so on SMP.
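The fairness argument for IPC head-insertion can be sketched with a toy model of one priority's ready queue. This is only an illustration under the assumptions stated in the comment above (the caller was at the head and running when it blocked, and same-prio threads make no progress while the higher-prio server runs); it is not the actual kernel code.

```python
from collections import deque

# Toy model: one priority level's ready queue, head on the left.
ready = deque(["C", "A", "B"])   # C is at the head and currently running

caller = ready.popleft()         # C blocks on an IPC call to a server
assert list(ready) == ["A", "B"]

# Modelling assumption: while C is blocked, the higher-prio server runs,
# so A and B make no progress. On reply, the kernel head-inserts C:
ready.appendleft(caller)
assert list(ready) == ["C", "A", "B"]   # C is exactly where it was before
```

Because C's old position is restored unchanged, A and B lose nothing; with notifications there is no such guarantee about the woken thread's prior position.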
Yes.
I'll make a test to see if this happens or not.
In practice I doubt it's a serious issue. If you have multiple threads waiting on the same notification, then that only seems to make sense if they are multiple worker threads, in which case their relative execution order shouldn't matter. But I really don't see why this should affect any other threads at the same prio that have nothing to do with that particular notification.
My understanding is that the actual behavior is that B will be executed before A. However, if C is at the same priority as A and B, then A will be executed before B. For a system that has lots of threads at priority P, the current implementation is able to force a higher amount of temporal locality by allowing threads A and B to execute immediately after thread C completes. I can see how this would lead to better latency for interrupt handling and slightly better cache utilization, as fewer cache evictions are required than when executing all the other threads at prio P before getting to A and B.
I fear it is, as it gives unpredictable behaviour and unfairness, with no control by high priority tasks about how lower priority tasks are scheduled.
If you look at one incident, then order doesn't matter. If you look at the overall system behaviour over time, all tasks should get their fair CPU share and similar scheduling latency. Guaranteeing this is very hard without round-robin guarantees.
Because the higher priority task can signal multiple notifications, and if it does that a lot all the time (e.g. IRQ triggered), everything gets messed up. We have a high priority ethernet task that distributes incoming packets among many worker threads, and it does that by queueing it in shared memory and signalling the tasks via their own notification. As we have many packets per second, under high load it will be signalling all tasks all the time. The current behaviour means that the first task to be woken up to handle a packet will be the last to actually finish handling it, leading to high packet handling latency and jitter. There is a collapse in performance when increasing the load and number of threads past a certain point. I haven't looked into it in detail yet, but I suspect this issue is partially to blame.
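The "first woken, last to run" effect described above can be shown with a small model of one priority level. This is a sketch, assuming the high-prio driver signals all workers before any of them gets the CPU; `wake_order` and the worker names are hypothetical, not seL4 API.

```python
from collections import deque

def wake_order(signals, head_insert):
    """Return the order in which woken workers run, given the order in
    which the driver signalled them. head_insert=True models the current
    head-insertion on wake-up; False models a round-robin append."""
    ready = deque()
    for w in signals:
        if head_insert:
            ready.appendleft(w)   # LIFO: each wake-up jumps the queue
        else:
            ready.append(w)       # FIFO: join the back of the queue
    return list(ready)            # the scheduler pops from the left

packets = ["w1", "w2", "w3", "w4"]   # workers signalled oldest packet first
assert wake_order(packets, head_insert=True) == ["w4", "w3", "w2", "w1"]
assert wake_order(packets, head_insert=False) == ["w1", "w2", "w3", "w4"]
```

With head-insertion the oldest packet's worker runs last, which matches the latency and jitter problem described for the ethernet workload.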
This is a bad assumption; where is this temporal locality coming from? Say thread T was running prior to C waking up: surely continuing with T would have higher temporal locality than running A and B first? The working set of a running thread is unlikely to be smaller than any data being passed from C to A or B, even if it's partially thrashed by C. Especially if T was woken up by C before.
@Indanz: I don't disagree with you at all. The "should not affect" was not meant to refer to the current implementation, but to what I consider the right behaviour.
@kent-mcleod: This is kernel-enforced policy, which we should not have. Locality (and whether it matters) depends on the system architecture; the kernel second-guessing userland is the wrong approach for a microkernel. If userland wants to optimise locality then it should do that with the existing mechanisms (incl prios) rather than relying on the kernel to do it.
@gernotheiser Yeah, sorry, I got confused. |
Fwiw, I don't see any problem from the verification side with the proposed change, I don't think we have any properties/invariants that would break because of it. It'd just be a bunch of work on spec/proof adjustments. As @gernotheiser said, the current behaviour was likely implemented in analogy with IPC.
(I'm only considering non-MCS and non-SMP here.) In the current implementation, whenever a thread is made runnable it is almost always inserted at the front of the queue for its priority (LIFO). Threads are only appended to the back of the ready queue (round-robin/FIFO) in the following cases:
All other conditions (that I inspected at least) insert threads in the front of the queue. I've always understood this to be deliberate because it minimizes latency for newly awoken threads while still implementing round-robin for long-running tasks that are usually executing for most of their time-slices by appending them only when they use up their time-slice.
The kernel tracks the time-slice remaining for each thread as a number of ticks and stores this in the TCB. The tick count is only decremented if the thread is active when there is a kernel timer tick, and the thread is only sent to the back of the scheduler queue when the count reaches 0, at which point the count is reset. Before that point, the kernel considers that the thread hasn't used up its timeslice and so it can be inserted at the front of the queue of its priority whenever it becomes unblocked. So deciding to change the implementation to start putting all unblocked threads at the back of the ready queues can potentially have a big impact on the performance of existing systems. The current approach has remained largely unchanged for a while, as it was established before my time...
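The tick accounting described above can be sketched as a toy model. The `TCB`, `timer_tick` and `unblock` names here are illustrative stand-ins, not the kernel's actual code; the model only captures the rule "append to the back on tick exhaustion, head-insert on unblock".

```python
class TCB:
    """Minimal model of the described accounting: the remaining tick
    count is stored per thread and only reset when it reaches zero."""
    def __init__(self, name, timeslice):
        self.name = name
        self.timeslice = timeslice
        self.ticks_left = timeslice

def timer_tick(current, ready):
    """One kernel timer tick charged to the running thread."""
    current.ticks_left -= 1
    if current.ticks_left == 0:
        current.ticks_left = current.timeslice  # reset the count
        ready.append(current)                   # back of queue: round-robin
        return None                             # caller picks a new thread
    return current                              # keeps the CPU and its place

def unblock(tcb, ready):
    """Timeslice not yet used up, so the woken thread jumps the queue."""
    ready.insert(0, tcb)

ready = []
a = TCB("A", timeslice=2)
assert timer_tick(a, ready) is a            # one tick left: keeps running
assert timer_tick(a, ready) is None         # exhausted: appended to the back
assert ready == [a] and a.ticks_left == 2   # count was reset
unblock(TCB("B", 2), ready)
assert ready[0].name == "B"                 # woken thread is at the head
```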
From the typically lower response time of the LIFO scheduling behavior, and the lower number of context switches between when the task becomes available and when it is eventually executed.
But even the 20-year paper claimed that microkernels can't get away from scheduling policy in the kernel :)
I don't think that's what it says. It said that it's still unresolved – and we solved it meanwhile with MCS. Yes, that still has a policy in the kernel, but it's a primitive policy on which you can build others, and it's well specified. And nowhere did we justify ad-hoc policies that try to second-guess userland intentions. In fact, if anything is specified (informally) about the behaviour of our scheduling queues then it's "round robin", which the present implementation violates.
Not 100% sure this is applicable in your context but I think it is so…
Putting the woken up notifications on the back of the queue can have performance benefits as long as you implement work queues between communicating services, as it makes it possible to batch up operations for improved efficiency.

The sizes of the batches are dynamic and increase with the system load, so the efficiency increases with load. At low loads the batch size is small so latency is low; at higher loads the system is more efficient, latency is increased but is still lower than it would be if the system was less efficient.
A couple of related (expired) patents:
https://patentimages.storage.googleapis.com/64/b7/69/9efbb8725ef2f0/US7415559.pdf
https://patents.google.com/patent/GB2348306A/en
It doesn't seem to be:
- Your first link is about IRQ coalescing in a slightly alternative way.
- Your second link is about a technique to improve the i-cache hit ratio in a data processing system.

Please only post links to scientific papers and not bullshit software patents for obvious solutions to specific problems.
Both are dependent on appending the notification rather than adding it to the head of the queue.
FIFO ordering is standard for queue handling; it's seL4 that is strange in switching to LIFO sometimes. The links you posted were not about the reasoning behind FIFO versus LIFO ordering, but about other details of work queues, and don't help with this issue as far as I could tell. It's fine to try to help by posting links to possibly relevant literature, but please post links to scientific documents instead of patents.
The problem is that putting unblocked tasks at the head of the queue often makes sense. Of the list of cases you posted, the only problematic one is unblocking because of a notification; the others either make sense or are practically harmless.

It leads to weird behaviour when unblocked tasks start interrupting and overtaking each other; it may then lead to unfair scheduling, possibly starvation, and bad performance if no one gets anything done before being interrupted. It's okay if the kernel decides to give an unblocked task an advantage, but it shouldn't immediately take it away again, and it shouldn't lead to starvation (because an unblocked task that gets another signal isn't put ahead of the queue, so the longer it's interrupted, the less chance it has to ever catch up). With many tasks, SMP and high load the problem gets worse. Lowering scheduling latency for some tasks should not lead to unreasonable worst-case scheduling latency for others.

Conceptually I think sorting the release queue by last run time would probably be best: the task that waited the longest runs first. (To avoid unnecessary scheduling, the "last run time" should only be updated when a task blocks or gets pre-empted by an equal priority task.) On MCS this may be possible to implement without much extra overhead if it's combined with other sorted queues, but for non-MCS this seems impractical.

What guarantees does seL4 want to give about CPU time fairness between equal priority threads and (worst-case) scheduling latency?
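The "sort by last run time" idea above can be sketched in a few lines. This is a model of the proposed policy only, under the assumption that each runnable thread carries a last-run timestamp; `pick_next` and the timestamps are hypothetical.

```python
def pick_next(last_run, runnable):
    """Among runnable same-prio threads, pick the one that has waited
    longest, i.e. the one with the smallest last-run timestamp."""
    return min(runnable, key=lambda t: last_run[t])

# Hypothetical timestamps of when each thread last ran:
last_run = {"A": 10, "B": 3, "C": 7}

assert pick_next(last_run, ["A", "B", "C"]) == "B"  # B waited the longest
assert pick_next(last_run, ["A", "C"]) == "C"       # among A and C, C did
```

Keeping such a queue sorted is what makes this cheap on MCS (which already maintains sorted queues) and impractical on non-MCS, as the comment notes.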
The present behaviour certainly leads to a DOS attack on non-MCS. Threads A and B both run at prio P:
Together they block all other threads at prio P. Of course, the unprincipled pre-MCS scheduling allows other DOS attacks (invoking a higher-prio server in a tight loop)...
Deliberate DOS is something else; that's usually easily done. I'm more concerned about behaviour under high load with many threads and signals. How is this problem addressed with MCS? I understand how a DOS attack is harder to do with MCS, as scheduling happens per task period and not per timer tick, but I don't see how the situation is much better with many threads under high load. Even if you make all threads periodic and divide a period between all tasks, the scheduling latency would still be bad and all over the place, with extra scheduling happening overall. I don't think your example would work, as equal priority notifications don't put the woken up task at the head of the queue (@kent-mcleod's point 3, the
Stepping back... We have to accept that pre-MCS scheduling is unprincipled and basically impossible to reason about. Ad-hoc policies (aka hacks) try to work around some of the deficiencies but create others. The model is essentially unfixable without major surgery, and that's exactly what MCS does. In that respect, I'm not overly fussed about pre-MCS where, scheduling-wise, we're moving deck chairs on the Titanic. But I really need MCS to be clean and principled, and if there are features that aren't, then they need cleaning up. One such principle is that threads at the same prio are scheduled round-robin. Any queue-jumping must be principled and only happen in well-justified cases. One such justification is time-slice donation. If a thread calls a higher-prio passive server and that server returns, then the caller is still on the original budget given to it by the scheduler. It should keep executing until it forfeits (by blocking or yielding) or exhausts its budget. That's pretty much what a budget is about. In that sense I cannot see any justification for queue jumping in the cases @kent-mcleod lists above, other than
(2) is presently only the explicit yield(), but could be extended to include other operations that operate on a thread at the same prio. (Donating budget to a lower-prio thread can only work if the operation blocks the caller.) |
If you design your system so (a) the efficiency (e.g. packets per CPU cycle) increases with increasing workload, and (b) you get the relative priority of processing correct for the different packet types of the protocol (for example you might need to process ack packets with higher priority than bulk data, to avoid bulk data delaying ack packets and causing the bulk data flow to oscillate), then you should end up with a nice stable hockey stick. The links I provided are good for (a); (b) needs to be done for the specific protocol. The protocol has to be good too, obviously.

If you think about the information content of "signalling all tasks all the time" (symbol 1 probability is 1, symbol 0 probability 0), you'll see the information content is zero. Same as not signalling any of the tasks: information content still zero, but significantly more efficient.

The proportion of work done in industry that is published is even smaller than the proportion that is patented.

Enjoy your high performance, multithreaded real time networking project. Sounds like my kind of fun.
Further to my comments above: I think this can all be done very cleanly. In MCS terminology:
This means
We could add a yield-and-wait invocation, which donates the budget and blocks the caller. That way, the budget is exactly what it is meant to be, the right to use the CPU for a while at the present time, but with a clean delegation mechanism and without any ad-hocisms. In particular it means holding a time slice doesn't allow you to mess with scheduling, only with resources you already have. Non-MCS: replace "budget" with "time slice" (but, as I said, I care less about that version). This should make it possible to produce, in a principled way, the results the existing hacks are trying to achieve.
How does yield map onto the existing mechanisms? E.g. a high priority thread performs send on an endpoint (or notification) and unblocks a low priority thread. Both are now runnable. Is this a yield operation? Or is yield only the rendezvous operations which result in the active thread becoming blocked?
A and B wouldn't be able to prevent threads at the same priority from running indefinitely because once their timeslice expires, they go to the back of the ready queue and aren't able to receive notifications until they are scheduled again. If their time-slice was consumed each time they block and was replenished each time they are unblocked then they could prevent other threads from running, but the kernel doesn't implement time-slices this way. Currently, they can't block other threads of the same priority for longer than if they were running continuously for their entire time-slice. If they're able to leverage a higher priority thread to use its time-slice to run while they are blocked and then use a signal to unblock them, that's essentially using privilege escalation to prevent other threads of the old priority from running for a longer time because of the time spent running at the higher privilege level. |
Inserting a call to seL4_Yield() before the call to seL4_Wait() would explicitly send the thread to the back of the ready queue before receiving the next notification. |
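The effect of that yield-before-wait pattern on queue order can be shown with a toy model. This is a sketch under the head-insertion behaviour described earlier in the thread, not actual seL4 code; "W" is a hypothetical worker and "A"/"B" are other runnable threads at the same priority.

```python
from collections import deque

others = ["A", "B"]   # other runnable threads at the same priority

# Without seL4_Yield(): the worker calls seL4_Wait() straight away and,
# when the next signal arrives, is head-inserted in front of A and B.
no_yield = deque(others)
no_yield.appendleft("W")
assert list(no_yield) == ["W", "A", "B"]

# With seL4_Yield() first: the worker is appended behind A and B, and
# only reaches its seL4_Wait() after they have had the CPU.
with_yield = deque(others)
with_yield.append("W")
assert list(with_yield) == ["A", "B", "W"]
```

So the yield explicitly restores round-robin order at the cost of the worker forfeiting the rest of its timeslice.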
Maybe the kernel needs an additional scheduler queue... If we assume there are two types of threads, ones that have more work to perform than their timeslice allows and ones that usually complete within their timeslice (CPU bound vs I/O bound), the kernel doesn't distinguish between them, and you can't always split them by static priority assignment because whether a thread is CPU bound or I/O bound can be transient.

Non-MCS doesn't have a release queue and so always puts CPU bound threads at the back of the ready queue each time they consume their timeslice. With only one queue, if I/O bound threads are always put at the back each time they're unblocked, it's easy for latency and utilization to degrade compared to LIFO, as they often get stuck behind CPU bound threads using their full timeslice when the I/O bound threads hardly use their timeslice each time.

The way the kernel currently schedules threads is effectively LIFO for threads that are I/O bound and moving from blocked to unblocked, and FIFO for threads that are CPU bound and moving through the back of the ready queue each time. But because the kernel doesn't distinguish, once an I/O bound thread spends enough time on the CPU, it will finish its timeslice and be moved to the back of the ready queue, so overall I/O bound threads cannot prevent CPU bound threads from running more than if the I/O bound threads were CPU bound in the first place. But because the kernel only has one queue, it can only insert I/O bound threads at the front, and thus schedules I/O bound threads at the same priority in LIFO order.

If there was a separate queue for unblocked threads, then the kernel would be able to append threads to it in FIFO order, and only pull off the front of the existing ready queue if this new queue is empty. This is more applicable for non-MCS than MCS, but MCS could also benefit, as it has a similar issue with distinguishing between round robin threads and sporadic threads of the same priority.
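The two-queue proposal above can be sketched as follows. The class and method names are hypothetical; this only models the dispatch rule "drain the FIFO unblocked queue before the regular round-robin ready queue".

```python
from collections import deque

class TwoQueueScheduler:
    """Sketch of the proposed separate 'unblocked' queue for one
    priority level: newly woken threads are appended FIFO to their own
    queue, which is drained before the regular ready queue."""
    def __init__(self):
        self.unblocked = deque()  # FIFO queue of just-woken threads
        self.ready = deque()      # round-robin queue (timeslice expiry)

    def wake(self, t):
        self.unblocked.append(t)  # append, never head-insert

    def expire(self, t):
        self.ready.append(t)      # classic round-robin on timeslice end

    def pick(self):
        if self.unblocked:        # woken threads run first, in FIFO order
            return self.unblocked.popleft()
        return self.ready.popleft()

s = TwoQueueScheduler()
s.expire("cpu1"); s.expire("cpu2")
s.wake("io1"); s.wake("io2")                     # signalled in this order
assert [s.pick(), s.pick()] == ["io1", "io2"]    # FIFO among woken threads
assert s.pick() == "cpu1"                        # then the ready queue
```

Woken threads still run ahead of threads that merely exhausted their timeslice, but among themselves they keep signal order instead of being reversed.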
One more point is that the kernel would still benefit from all of the direct thread switching it does, which is LIFO-like but is more about skipping a reschedule operation altogether. So a separate unblocked queue is more for being able to queue in FIFO order when multiple threads are unblocked by signals, ReplyRecvs, NBSendRecvs and other operations that unblock multiple threads at a time, which should still execute before the rest of the threads in the ready queue as they still have budget/timeslice left.
This is what I expect from seL4 as a user.
From what I understood, they can't cause queue re-ordering because that's not done for same-prio notifications, so they can't stop same prio tasks from running at all. But if a high prio task sends the signals and there are other tasks being woken up, they can take over and keep starving the non-notified threads. With enough signals and tasks this can go on indefinitely, it seems. (Also, it's not necessary to block tasks indefinitely for a DOS attack to be successful; just slowing things down enough is sufficient. That's why doing DOS attacks is easy.)
Is your point that if tasks do a yield before the wait, they can block others indefinitely?
I don't think this would solve the original problem, because the just woken up I/O bound thread is being interrupted by a higher prio task that signals another I/O bound thread. So the kernel would need to remember whether a task is I/O bound or not and put (or rather keep) the pre-empted task on this new queue instead of the ready queue. And then you have the black magic of deciding whether a task is I/O bound or not. A second queue could solve the starvation of I/O bound tasks, but not of CPU bound tasks. It's probably good to realise that at high load I/O bound tasks behave more like CPU bound tasks because of scheduling latency: they may not have had time to block. All in all it doesn't seem robust. Mixed CPU bound and I/O bound loads can often be solved via priorities, especially with MCS. And with MCS the period can be configured small enough that scheduling latency is acceptable if many tasks want to run at the same time.
I don't think seL4 has any operations that unblock more than one task at a time. If e.g. a notification with multiple waiters is signalled, only the task at the head of the notification's wait queue is woken up.
seL4_ReplyRecv can unblock two threads if there's a thread blocked on the reply and a thread blocked on send on the EP from performing seL4_Send instead of seL4_Call. In this case the current thread will also remain unblocked.
Notifications should influence scheduling as little as possible. Tasks woken up by notifications should not be put at the head of the release queue.
Enqueue is used to optimise the case where a high priority server replies to a lower priority client. Basically higher priority tasks are allowed to influence the release queue order of lower priority tasks. This makes sense in general.
I don't think this makes sense for notifications though. If I understood correctly, this means that if a high priority task notifies multiple tasks, they will be woken up in reverse order. For network packets this means that the last received packets will be handled first instead of the oldest.
For high priority servers it's easy to wake-up multiple tasks by sending multiple notifications. If the newly woken-up tasks keeps getting priority, it could lead to starvation of other equal priority tasks.
For IPC calls the lower priority task being woken up was likely the caller. Although it's possible to handle multiple calls at the same time, that's uncommon, as long blocking operations are generally better implemented with signals.
See here for some context.