
Append instead of enqueue tasks woken up by notifications #902

Open
Indanz opened this issue Aug 21, 2022 · 28 comments

Comments

@Indanz
Contributor

Indanz commented Aug 21, 2022

Notifications should influence scheduling as little as possible. Tasks woken up by notifications should not be put at the head of the release queue.

Enqueue is used to optimise the case where a high priority server replies to a lower priority client. Basically higher priority tasks are allowed to influence the release queue order of lower priority tasks. This makes sense in general.

I don't think this makes sense for notifications though. If I understood correctly, it means that if a high priority task notifies multiple tasks, they will be woken up in reverse order. For network packets this means that the most recently received packets will be handled first instead of the oldest.

For high priority servers it's easy to wake up multiple tasks by sending multiple notifications. If the newly woken-up tasks keep getting priority, it could lead to starvation of other equal priority tasks.

For IPC calls the lower priority task being woken up was likely the caller. Although it's possible to handle multiple calls at the same time, that's uncommon, as long blocking operations are generally better implemented with signals.

See here for some context.

@pingerino
Contributor

pingerino commented Aug 23, 2022

"Notifications should influence scheduling as little as possible." -- this is not obviously true to me. It's not obvious what the right thing is here at all to me without thinking deeply about it.
Perhaps @kevinelp or @gernotheiser can remember the reasoning for the current logic?

The main use case I can think of is to skip past threads stuck on a spin lock that might be waiting for a thread blocked on a notification to kick the lock. Deeper thinking is needed here to validate that this is a sensible scenario, though.

@gernotheiser
Member

I cannot be sure why the implementation ended up the way it is, but I suspect it goes back to originally notifications sharing most of the code path with IPC (that was back in the days when they were misleadingly called "asynchronous IPC", which caused no end of confusion, see https://trustworthy.systems/publications/nictaabstracts/Heiser_Elphinstone_16.abstract).

Conceptually I agree with @Indanz: Notifications should not change scheduling more than necessary, and in particular should not re-order round-robin queues (that's sort-of implied in "round robin"). In general, system calls should avoid side effects that are not required.

Having said that, I'm not sure I exactly understand the situation here. Is the scenario you're describing the following?

  1. thread A, B, both of prio P, are blocked on notification N
  2. thread C with prio >P signals N, A is moved from N's queue to the ready queue of prio P
  3. thread C signals N again, B is moved from N's queue to the ready queue of prio P
  4. C blocks, and the highest runnable prio is now P
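
As a concrete sketch of those four steps in terms of the user-level API (cap names like ntfn_N and ntfn_M are hypothetical, and what C blocks on in step 4 is arbitrary):

```c
#include <sel4/sel4.h>

/* Threads A and B, both at prio P: block on notification N (step 1). */
void waiter(seL4_CPtr ntfn_N)
{
    seL4_Word badge;
    seL4_Wait(ntfn_N, &badge);
    /* ... handle the wake-up ... */
}

/* Thread C, prio > P. */
void signaller(seL4_CPtr ntfn_N, seL4_CPtr ntfn_M)
{
    seL4_Word badge;
    seL4_Signal(ntfn_N);        /* step 2: A moves from N's queue to the prio-P ready queue */
    seL4_Signal(ntfn_N);        /* step 3: B moves from N's queue to the prio-P ready queue */
    seL4_Wait(ntfn_M, &badge);  /* step 4: C blocks, prio P is now the highest runnable prio */
}
```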

In that scenario I'd expect A to precede B in the ready queue, behind any threads that were already in the queue of prio P before the first signal. If the order is the opposite then that's the wrong behaviour in my view. Now I may completely misread the scenario, in which case I'd appreciate if someone enlightened me.

(This is likely one of the things around scheduling that are not explicitly specified, so it would not be an implementation bug and couldn't be captured by verification, but it'd be a bug in the informal spec that was implemented in lieu of a formal one. And a reminder that we'll eventually have to specify more of the scheduling behaviour to fully support reasoning about timeliness.)

@Indanz
Contributor Author

Indanz commented Aug 23, 2022

The main use case I can think is to skip threads stuck on a spin lock that might be waiting for a thread blocked on a notification to kick the lock.

Blocking while holding a spinlock would be bad locking design. But assuming a more sensible variant, why would a just woken up thread be put ahead of an already running one? If the running one is holding a different lock, it will only increase lock contention. If the server is kicking multiple locks, you don't want starvation of any of them either. CPU distribution should be done fairly, not arbitrarily depending on chance.

The reason it is okay to move a thread to the front of the queue for IPC is because with IPC you know the task was at the head and running at the time of the call, but has been blocked the whole time since then, so moving it ahead again is not unfair towards other tasks. There are no such guarantees with notifications.

Maybe it's not as bad as I think it is because scheduling will happen when a task runs out of its time-slice, but even then this behaviour will cause extra, unnecessary scheduling and minor unfairness, more so on SMP.

Is the scenario you're describing the following?

Yes.

In that scenario I'd expect A to precede B in the ready queue, behind any threads that were already in the queue of prio P before the first signal. If the order is the opposite then that's the wrong behaviour in my view.

I'll make a test to see if this happens or not.

@gernotheiser
Member

In practice I doubt it's a serious issue. If you have multiple threads waiting on the same notification, then that seems to only make sense if they are multiple working threads, in which case their relative execution order shouldn't matter. But I really don't see why this should affect any other threads at the same prio that have nothing to do with that particular notification.

@kent-mcleod
Member

In that scenario I'd expect A to precede B in the ready queue, behind any threads that were already in the queue of prio P before the first signal. If the order is the opposite then that's the wrong behaviour in my view. Now I may completely misread the scenario, in which case I'd appreciate if someone enlightened me.

My understanding is that the actual behavior is that B will be executed before A. However, if C is at the same priority as A and B, then A will be executed before B.

For a system that has lots of threads at priority P, the current implementation is able to force a higher amount of temporal locality by allowing threads A and B to execute immediately after thread C completes. I can see how this would lead to better latency for interrupt handling and slightly better cache utilization, since fewer cache evictions are required than if all the other threads at prio P ran before getting to A and B.

@Indanz
Contributor Author

Indanz commented Aug 23, 2022

In practice I doubt it's a serious issue.

I fear it is, as it gives unpredictable behaviour and unfairness, with no control by high priority tasks about how lower priority tasks are scheduled.

If you have multiple threads waiting on the same notification, then that seems to only make sense if they are multiple working threads, in which case their relative execution order shouldn't matter.

If you look at one incident, then order doesn't matter. If you look at the overall system behaviour over time, all tasks should get their fair CPU share and similar scheduling latency. Guaranteeing this is very hard without round-robin guarantees.

But I really don't see why this should affect any other threads at the same prio that have nothing to do with that particular notification.

Because the higher priority task can signal multiple notifications, and if it does that a lot all the time (e.g. IRQ triggered), everything gets messed up.

We have a high priority ethernet task that distributes incoming packets among many worker threads, and it does that by queueing them in shared memory and signalling each task via its own notification. As we have many packets per second, under high load it will be signalling all tasks all the time. The current behaviour means that the first task to be woken up to handle a packet will be the last to actually finish handling it, leading to high packet handling latency and jitter.
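
Roughly the pattern in use, sketched with a hypothetical per-worker single-producer/single-consumer ring (all names and the ring layout are illustrative, not our actual code):

```c
#include <sel4/sel4.h>
#include <stdint.h>

#define RING_SLOTS 256

/* Hypothetical per-worker ring in shared memory (overflow handling omitted). */
struct pkt_ring {
    uint32_t head, tail;
    void *slots[RING_SLOTS];
};

/* High-priority ethernet task: queue the packet, then signal the worker.
 * Each signal currently puts the worker at the *front* of its ready queue. */
static void dispatch_packet(struct pkt_ring *ring, seL4_CPtr worker_ntfn, void *pkt)
{
    ring->slots[ring->head % RING_SLOTS] = pkt;
    __atomic_store_n(&ring->head, ring->head + 1, __ATOMIC_RELEASE);
    seL4_Signal(worker_ntfn);
}

/* Worker at prio P: drain the ring each time it is signalled. */
void worker_loop(struct pkt_ring *ring, seL4_CPtr my_ntfn)
{
    seL4_Word badge;
    for (;;) {
        seL4_Wait(my_ntfn, &badge);
        while (ring->tail != __atomic_load_n(&ring->head, __ATOMIC_ACQUIRE)) {
            void *pkt = ring->slots[ring->tail % RING_SLOTS];
            ring->tail++;
            /* ... process pkt ... */
            (void)pkt;
        }
    }
}
```

Under high load the ethernet task signals many such workers back to back, and each signal pushes the most recently signalled worker ahead of the ones signalled before it.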

There is a collapse in performance when increasing the load and number of threads past a certain point. I haven't looked into it in detail yet, but I suspect this issue is partially to blame.

For a system that has lots of threads at priority P, the current implementation is able to force a higher amount of temporal locality by allowing threads A and B to execute immediately after thread C completes.

This is a bad assumption: where is this temporal locality coming from? Say thread T was running prior to C waking up; surely continuing with T would have higher temporal locality than running A and B first? The working set of a running thread is unlikely to be smaller than any data being passed from C to A or B, even if it's partially thrashed by C. Especially if T was woken up by C before.

@gernotheiser
Member

@Indanz: I don't disagree with you at all. The "should not affect" was not meant to refer to the current implementation, but to what I consider the right behaviour.

@gernotheiser
Member

@kent-mcleod: This is kernel-enforced policy, which we should not have. Locality (and whether it matters) depends on the system architecture; the kernel second-guessing userland is the wrong approach for a microkernel. If userland wants to optimise locality then it should do that with the existing mechanisms (incl prios) rather than relying on the kernel to do it.

@Indanz
Contributor Author

Indanz commented Aug 23, 2022

@gernotheiser Yeah, sorry, I got confused.

@lsf37
Member

lsf37 commented Aug 23, 2022

Fwiw, I don't see any problem from the verification side with the proposed change; I don't think we have any properties/invariants that would break because of it. It'd just be a bunch of work on spec/proof adjustments.

As @gernotheiser said, the current behaviour was likely implemented in analogy with IPC.

@kent-mcleod
Member

(I'm only considering non-mcs and non-smp here)

In the current implementation, whenever a thread is made runnable it is almost always inserted at the front of the queue for its priority (LIFO).

Threads are only appended to the back of the ready queue (Roundrobin/FIFO) in the following cases:

  1. When a thread calls seL4_Yield it is inserted at the back of its ready queue.
  2. When a thread runs out of its timeslice (when its timeslice counter reaches 0 on a timer interrupt).
  3. When a thread becomes ready but is the same priority and domain as the current thread and the current thread isn't also becoming blocked.

All other conditions (that I inspected at least) insert threads at the front of the queue. I've always understood this to be deliberate because it minimizes latency for newly awoken threads while still implementing round-robin for long-running tasks: tasks that usually execute for most of their time-slices are appended only when they use up their time-slice.
I've listed some examples, but there are quite a lot if you follow SCHED_ENQUEUE() and SCHED_ENQUEUE_CURRENT_TCB in the code:

  • Any time a thread is unblocked from a reply, endpoint or notification object apart from when point 3 matches above
  • Any time a thread is preempted by a higher priority thread becoming runnable (eg by an interrupt or a signal to high prio thread) it is enqueued in the ready queue.
  • cancelBadgedSends, cancelAllSignals: Every thread that is unblocked by cancelBadgedSends is inserted at the front of the queue iteratively from front to back of the endpoint queue.
  • setDomain: The thread is moved to the front of the queue for that domain
  • setPriority: If the thread is runnable it is moved to the front of the queue for that priority.
  • TCB_Resume: The thread is inserted at the front of the queue if it was suspended
  • TCB_WriteRegisters, TCB_CopyRegisters: if the thread is also resumed it moves to the front of the queue.

The kernel tracks the time-slice remaining for each thread as a number of ticks and stores this in the TCB. The tick count is only decremented if the thread is active when there is a kernel timer tick, and the thread is only sent to the back of the scheduler queue when the count reaches 0, at which point the count is reset. Before that point, the kernel appears to consider that the thread hasn't used up its timeslice and so it can be inserted at the front of the queue of its priority whenever it becomes unblocked.
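
As a rough model, the head/tail decision described above boils down to something like the following sketch (a simplification of the summary in this comment, not the kernel sources):

```c
#include <stdbool.h>

/* Simplified model of the insertion policy described above. */
enum enqueue_event {
    EV_YIELD,               /* case 1: thread called seL4_Yield                  */
    EV_TIMESLICE_EXPIRED,   /* case 2: timeslice counter reached 0 on a tick     */
    EV_READY_SAME_PRIO,     /* case 3: same prio/domain as the current thread,
                               and the current thread isn't also blocking        */
    EV_UNBLOCKED            /* reply/endpoint/notification wake-up, resume,
                               setPriority, setDomain, preemption by higher prio */
};

/* true: append to the back (FIFO/round-robin); false: insert at the front (LIFO). */
static bool append_to_ready_queue(enum enqueue_event ev)
{
    return ev == EV_YIELD ||
           ev == EV_TIMESLICE_EXPIRED ||
           ev == EV_READY_SAME_PRIO;
}
```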

So deciding to change the implementation to start putting all unblocked threads at the back of the ready queues can potentially have a big impact on the performance of existing systems. The current approach has remained largely unchanged for a while, as it was established before my time...

This is a bad assumption, where is this temporal locality coming from?

From the typically lower response time from the LIFO scheduling behavior and the lower number of context switches between when the task becomes available and is eventually executed.

This is kernel-enforced policy, which we should not have. Locality (and whether it matters) depends on the system architecture, the kernel second-guessing userland is the wrong approach for a microkernel.

But even the 20-year paper claimed that microkernels can't get away from scheduling policy in the kernel :)

@gernotheiser
Member

But even the 20-year paper claimed that microkernels can't get away from scheduling policy in the kernel :)

I don't think that's what it says. It said that it's still unresolved – and we solved it meanwhile with MCS. Yes, that still has a policy in the kernel, but it's a primitive policy on which you can build others, and it's well specified. And nowhere did we justify ad-hoc policies that try to second-guess userland intentions.

In fact, if anything is specified (informally) about the behaviour of our scheduling queues then it's "round robin", which the present implementation violates.

@heb1001

heb1001 commented Aug 24, 2022 via email

@Indanz
Contributor Author

Indanz commented Aug 24, 2022

Not 100% sure this is applicable in your context but I think it is so…

It doesn't seem to be:

  • Your first link is about IRQ coalescing in a slightly different way.
  • Your second link is about a technique to improve the i-cache hit ratio in a data processing system.

Please only post links to scientific papers and not bullshit software patents for obvious solutions to specific problems.

@heb1001

heb1001 commented Aug 24, 2022 via email

@Indanz
Contributor Author

Indanz commented Aug 24, 2022

Both are dependent on appending the notification rather than adding it to the head of the queue.

FIFO ordering is standard for queue handling; it's seL4 that is unusual in sometimes switching to LIFO. The links you posted were not about the reasoning behind FIFO versus LIFO ordering, but about other details of work queues, and as far as I could tell they don't help with this issue.

It's fine to try to help by posting links to possibly relevant literature, but please link to scientific documents instead of patents.

@Indanz
Contributor Author

Indanz commented Aug 24, 2022

So deciding to change the implementation to start putting all unblocked threads at the back of the ready queues can potentially have a big impact to the performance of existing systems.

The problem is that putting unblocked tasks at the head of the queue often makes sense. Of the list of cases you posted, the only problematic one is unblocking because of a notification; the others either make sense or are practically harmless.

It just leads to weird behaviour when unblocked tasks start interrupting and overtaking each other; that can result in unfair scheduling, possibly starvation, and bad performance if no one gets anything done before being interrupted.

It's okay if the kernel decides to give an unblocked task an advantage, but it shouldn't immediately take it away again, and it shouldn't lead to starvation (because an unblocked task that gets another signal isn't put at the head of the queue again, so the longer it's interrupted, the less chance it has to ever catch up). With many tasks, SMP and high load, the problem gets worse.

Lowering scheduling latency for some tasks should not lead to unreasonable worst-case scheduling latency for others.

Conceptually I think sorting the release queue by last run time would probably be best: The task that waited the longest runs first. (To avoid unnecessary scheduling the "last run time" should only be updated when a task blocks or gets pre-empted by an equal priority task.)

On MCS this may be possible to implement without much extra overhead if it's combined with other sorted queues, but for non-MCS this seems impractical.
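
A minimal sketch of what such an ordering could look like (names are hypothetical, and this ignores how "last run time" would actually be tracked):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical ready-queue node carrying the proposed "last run time",
 * updated only when the thread blocks or is pre-empted by an equal-prio task. */
struct tcb {
    struct tcb *next;
    uint64_t    last_run;
};

/* Keep the queue sorted so that the thread that has waited longest
 * (smallest last_run) is at the head and therefore runs first. */
static void insert_by_last_run(struct tcb **head, struct tcb *t)
{
    struct tcb **p = head;
    while (*p != NULL && (*p)->last_run <= t->last_run) {
        p = &(*p)->next;
    }
    t->next = *p;
    *p = t;
}
```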

What guarantees does seL4 want to give about CPU time fairness between equal priority threads and (worst-case) scheduling latency?

@gernotheiser
Member

gernotheiser commented Aug 24, 2022

The present behaviour certainly leads to a DOS attack on non-MCS. Threads A and B both run at prio P:

  • A runs for 90% of its time slice, then signals notification N and waits on notification M
  • B runs for 90% of its time slice, then signals M and waits for N

Together they block all other threads at prio P.
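
A sketch of the two colluding threads (spin_for_most_of_timeslice is a hypothetical busy-wait calibrated to roughly 90% of a timeslice):

```c
#include <sel4/sel4.h>

extern void spin_for_most_of_timeslice(void);  /* hypothetical: burns ~90% of a timeslice */

/* Thread A at prio P: burn CPU, wake the peer, block before the timeslice expires.
 * Thread B at prio P runs the same loop with my_ntfn and peer_ntfn swapped. */
void colluder(seL4_CPtr my_ntfn, seL4_CPtr peer_ntfn)
{
    seL4_Word badge;
    for (;;) {
        spin_for_most_of_timeslice();  /* never lets the tick counter reach 0          */
        seL4_Signal(peer_ntfn);        /* peer is (per the claim above) put at the front */
        seL4_Wait(my_ntfn, &badge);    /* block until the peer signals back             */
    }
}
```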

Of course, the unprincipled pre-MCS scheduling allows other DOS attacks (invoking a higher-prio server in a tight loop)...

@Indanz
Contributor Author

Indanz commented Aug 24, 2022

The present behaviour certainly leads to a DOS attack on non-MCS.

Deliberate DOS is something else, that's usually easily done. I'm more concerned about behaviour under high load with many threads and signals.

How is this problem addressed with MCS? I understand how a DOS attack is harder to do with MCS, as scheduling happens per task period and not per timer tick, but I don't see how the situation is much better with many threads under high load. Even if you make all threads periodic and divide a period between all tasks, the scheduling latency would still be bad and all over the place, with extra scheduling happening overall.

I don't think your example would work, as equal priority notifications don't put the woken-up task at the head of the queue (@kent-mcleod's point 3, the was_runnable else-if of schedule). But if A and B can trigger a higher priority task to notify each other, then it would work.

@gernotheiser
Member

Stepping back...

We have to accept that pre-MCS scheduling is unprincipled and basically impossible to reason about. Ad-hoc policies (aka hacks) try to work around some of the deficiencies but create others. The model is essentially unfixable without major surgery, and that's exactly what MCS does.

In that respect, I'm not overly fussed about pre-MCS where, scheduling wise, we're moving deck chairs on the Titanic. But I really need MCS to be clean and principled. And if there are features that aren't then they need cleaning up.

And one such principle is that threads at the same prio are scheduled round-robin. Any queue-jumping must be principled and only happen in well-justified cases.

One such justification is time-slice donation. If a thread calls a higher-prio passive server and that server returns, then the caller is still on the original budget given to it by the scheduler. It should keep executing until it forfeits (by blocking or yielding) or exhausts its budget. That's pretty much what a budget is about.

In that sense I cannot see any justification for queue jumping in the cases @kent-mcleod lists above, other than

  1. returning from IPC on the original budget
  2. any operation that yields a current budget

(2) is presently only the explicit yield(), but could be extended to include other operations that operate on a thread at the same prio. (Donating budget to a lower-prio thread can only work if the operation blocks the caller.)

@heb1001

heb1001 commented Aug 24, 2022 via email

@gernotheiser
Member

Further to my comments above: I think this can all be done very cleanly. In MCS terminology:

  • the budget is the right to use the CPU for a certain amount of time at the present time
  • yield doesn't change the queue order, but allows donation.

This means

  • yield to high forfeits the budget (caller goes to end of queue)
  • yield to blocked forfeits the budget
  • yield to same donates the time slice without changing the queue (other than putting the caller at the end). If the yieldee is at the front of the queue, it gets to run again. That's the right the budget gives it.
  • yield to low donates the time slice, incl the caller's prio (i.e. the callee gets to run), caller goes to the end of its queue, the callee queue is unaffected

We could add a yield-and-wait invocation, which donates the budget and blocks the caller.

That way, the budget is exactly what it is meant to be: the right to use the CPU for a while at the present time, while adding a clean delegation mechanism without any ad-hocisms. In particular it means that holding a time slice doesn't allow you to mess with scheduling, only with resources you already have.

Non-MCS: replace "budget" with "time slice" (but, as I said, I care less about that version).

This should make it possible to produce, in a principled way, the results the existing hacks are trying to achieve.

@kent-mcleod
Member

kent-mcleod commented Aug 24, 2022

How does yield map onto the existing mechanisms?

E.g. a high priority thread performs a send on an endpoint (or notification) and unblocks a low priority thread. Both are now runnable. Is this a yield operation?

Or is yield only the rendezvous operations which result in the active thread becoming blocked?

@kent-mcleod
Member

I don't think your example would work as equal priority notifications don't put the woken up task at the head of the queue (@kent-mcleod 's point 3, the was_runnable else-if of schedule). But if A and B can trigger a higher priority task to notify each other, then it would work.

A and B wouldn't be able to prevent threads at the same priority from running indefinitely because once their timeslice expires, they go to the back of the ready queue and aren't able to receive notifications until they are scheduled again.

If their time-slice was consumed each time they block and was replenished each time they are unblocked, then they could prevent other threads from running, but the kernel doesn't implement time-slices this way. Currently, they can't block other threads of the same priority for longer than if they were running continuously for their entire time-slice. If they're able to leverage a higher priority thread to use its time-slice to run while they are blocked, and then use a signal to unblock them, that's essentially using privilege escalation to prevent other threads at the old priority from running for longer, because of the time spent running at the higher priority.

@kent-mcleod
Member

Inserting a call to seL4_Yield() before the call to seL4_Wait() would explicitly send the thread to the back of the ready queue before receiving the next notification.
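
For example, a worker loop following that suggestion might look like this (worker_ntfn is a hypothetical notification cap):

```c
#include <sel4/sel4.h>

void notification_loop(seL4_CPtr worker_ntfn)
{
    seL4_Word badge;
    for (;;) {
        /* ... handle the previous wake-up ... */
        seL4_Yield();                   /* explicitly go to the back of this prio's ready queue */
        seL4_Wait(worker_ntfn, &badge); /* then block for the next notification                 */
    }
}
```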

@kent-mcleod
Member

The problem is that putting unblocked tasks at the head of the queue often makes sense. Of the list of cases you posted the only problematic one is unblocking because of a notification, the others either make sense or are practically harmless.

It just leads to weird behaviour when unblocked tasks start interrupting and overtaking each other, then it may lead to unfair scheduling, possibly starvation and bad performance if no one gets anything done before being interrupted.

It's okay if the kernel decides to give an unblocked task an advantage, but it shouldn't immediately take it away again, and it shouldn't lead to starvation (because an unblocked task that gets another signal isn't put ahead of the queue, so the longer it's interrupted, the less chance it has to ever catch up). With many tasks, SMP and high load the problem gets worse.

Lowering scheduling latency for some tasks should not lead to unreasonable worst-case scheduling latency for others.

Maybe the kernel needs an additional scheduler queue...

Suppose there are two types of threads: ones that have more work to perform than their timeslice allows, and ones that usually complete within their time slice (CPU bound vs I/O bound). The kernel doesn't distinguish between them, and you can't always split them by static priority assignment, because whether a thread is CPU bound or I/O bound can be transient.

Non-MCS doesn't have a release queue and so always puts CPU bound threads at the back of the ready queue each time they consume their timeslice. With only one queue, if I/O bound threads are always put at the back each time they're unblocked, it's easy for latency and utilization to degrade compared to LIFO, because they often get stuck behind CPU bound threads that always use their full timeslice while the I/O bound threads hardly use their timeslice each time.

The way the kernel currently schedules threads is effectively LIFO for threads that are I/O bound and moving from blocked to unblocked, and FIFO for threads that are CPU bound and cycling through the back of the ready queue each time. But because the kernel doesn't distinguish, once an I/O bound thread spends enough time on the CPU it will finish its timeslice and be moved to the back of the ready queue, so overall I/O bound threads cannot prevent CPU bound threads from running any more than if the I/O bound threads were CPU bound in the first place. And because the kernel only has one queue, it can only insert I/O bound threads at the front, and thus schedules I/O bound threads at the same priority in LIFO order.

If there were a separate queue for unblocked threads, then the kernel would be able to append threads to it in FIFO order, and only pull off the front of the existing ready queue if this new queue is empty. This is more applicable for non-MCS than MCS, but MCS could also benefit, as it has a similar issue with distinguishing between round robin threads and sporadic threads of the same priority.
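
A rough sketch of that idea, with hypothetical per-priority queues (not a proposed patch, just the insert/pick logic described above):

```c
#include <stddef.h>

struct tcb   { struct tcb *next; };
struct queue { struct tcb *head, *tail; };

struct prio_level {
    struct queue unblocked;  /* new: FIFO of threads that just became runnable */
    struct queue ready;      /* existing ready queue (round-robin)             */
};

static void queue_append(struct queue *q, struct tcb *t)
{
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static struct tcb *queue_pop(struct queue *q)
{
    struct tcb *t = q->head;
    if (t) {
        q->head = t->next;
        if (q->head == NULL) q->tail = NULL;
    }
    return t;
}

/* Unblocked threads are appended in FIFO order ... */
static void thread_unblocked(struct prio_level *p, struct tcb *t)
{
    queue_append(&p->unblocked, t);
}

/* ... and the scheduler only falls back to the existing ready queue when
 * the unblocked queue is empty. */
static struct tcb *pick_next(struct prio_level *p)
{
    struct tcb *t = queue_pop(&p->unblocked);
    return t ? t : queue_pop(&p->ready);
}
```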

One more point is that the kernel would still benefit from all of the direct thread switching it does, which is LIFO-like, but that is more about skipping a reschedule operation altogether. So a separate unblocked queue is mainly about being able to queue in FIFO order when multiple threads are unblocked by signals, ReplyRecvs, NBSendRecvs and other operations that unblock multiple threads at a time; those threads should still execute before the rest of the threads in the ready queue, as they still have budget/timeslice.

@Indanz
Contributor Author

Indanz commented Aug 25, 2022

But I really need MCS to be clean and principled. And if there are features that aren't then they need cleaning up.

And one such principle is that threads at the same prio are scheduled round-robin. Any queue-jumping must be principled, only happen in well-justified cases.

This is what I expect from seL4 as a user.

A and B wouldn't be able to prevent threads at the same priority from running indefinitely because once their timeslice expires, they go to the back of the ready queue and aren't able to receive notifications until they are scheduled again.

From what I understood, they can't cause queue re-ordering because that's not done for same-prio notifications, so they can't stop same prio tasks from running at all.

But if a high prio task sends the signals and there are other tasks being woken up, they can take over and keep starving the non-notified threads. With enough signals and tasks this can go on indefinitely, it seems.

(Also, it's not necessary to block tasks indefinitely for a DOS attack to be successful, just slowing things down enough is sufficient. That's why doing DOS attacks is easy.)

Inserting a call to seL4_Yield() before the call to seL4_Wait() would explicitly send the thread to the back of the ready queue before receiving the next notification.

Is your point that if tasks do a yield before the wait, they can block others indefinitely?

If there was a separate queue for unblocked threads, then the kernel would be able to append threads to this in FIFO order, and only pull off the front of the existing ready queue if this new queue is empty.

I don't think this would solve the original problem, because the just woken up I/O bound thread is being interrupted by a higher prio task that signals another I/O bound thread. So the kernel would need to remember whether a task is I/O bound or not and put (or rather keep) the pre-empted task on this new queue instead of the ready queue. And then you have the black magic of deciding whether a task is I/O bound or not.

A second queue could solve the starvation of I/O bound tasks, but not of CPU bound tasks. It's probably good to realise that at high load I/O bound tasks behave more like CPU bound tasks because of scheduling latency: they may not have had time to block. All in all it doesn't seem robust.

Mixed CPU bound and I/O bound loads can often be solved via priorities, especially with MCS. And with MCS the period can be configured small enough that scheduling latency is acceptable if many tasks want to run at the same time.

operations that make multiple threads at a time unblocked

I don't think seL4 has any operations that unblock more than one task at a time. If e.g. a notification with multiple waiters is signalled, only the task at the head of the notification's wait queue is woken up.

@kent-mcleod
Member

I don't think seL4 has any operations that unblock more than one task at a time. If e.g. a notification with multiple waiters is signalled, only the task at the head of the notification's wait queue is woken up.

seL4_ReplyRecv can unblock 2 threads if there's a thread blocked on reply and a thread blocked on send on the ep from performing seL4_Send instead of seL4_Call. In this case the current thread will also remain unblocked.
There's also cancelBadgedSends and other operations that can cause the entire queue of threads on an ep or notification to become unblocked at the same time, but access to these operations can be somewhat restricted because they require additional caps.
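
For reference, a minimal non-MCS server loop around that call (ep is a hypothetical endpoint cap; on MCS, seL4_ReplyRecv additionally takes a reply cap):

```c
#include <sel4/sel4.h>

void server_loop(seL4_CPtr ep)
{
    seL4_Word badge;
    seL4_MessageInfo_t info = seL4_Recv(ep, &badge);
    for (;;) {
        /* ... inspect info/badge and handle the request ... */
        seL4_MessageInfo_t reply = seL4_MessageInfo_new(0, 0, 0, 0);
        /* Replies to the previous caller (unblocking it) and dequeues the next
         * sender from the endpoint; if that sender used seL4_Send rather than
         * seL4_Call, it stays runnable too, so two threads can become unblocked
         * by this one system call. */
        info = seL4_ReplyRecv(ep, reply, &badge);
    }
}
```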
