
Signal fastpath v2 #793

Merged (2 commits) on Jan 9, 2023

Conversation

@alwin-joshy (Contributor) commented Mar 5, 2022

This is basically a combination of Shane's signal fastpath from last year and Anna's signal/IRQ fastpath.

Shane's fastpath was able to handle signalling to both lower and higher priority threads, but there were some issues around what needs to be done when scheduling contexts are switched in the case of a lower-to-higher-priority signal. I have looked at the schedule() code to try and figure out what needs to be done. I have been unable to check my progress much as of yet, so this first pull request is mainly for feedback.

Shane's code was working on all platforms, but I have only tested on AARCH64 so far, where it passes all of seL4test and performs similarly in seL4bench, though a little worse than Shane's implementation due to the extra logic I add. Before I put more effort into full benchmarking, I want to confirm that what I have done is on the right track.

EDIT: I have revised the model to no longer handle any branch that requires an SC switch. This fastpath is for seL4_Signal (an seL4_Send to a notification capability) and is based on the other fastpaths as well as the sendSignal() function called by the slowpath, but it omits the cases that result in a scheduling-context switch, redirecting those to the slowpath.
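Schematically, that guard could look like the following. This is a hypothetical, heavily simplified C sketch, not the actual seL4 code; all type and function names here (ntfn_state_t, signal_fastpath_ok, etc.) are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the fastpath guard: stay on the fastpath only when
 * signalling cannot force a scheduling-context switch. Names are
 * illustrative, not real seL4 identifiers. */
typedef enum { NtfnIdle, NtfnWaiting, NtfnActive } ntfn_state_t;

typedef struct {
    int prio;
    bool has_sc;       /* thread already has a scheduling context */
} tcb_t;

typedef struct {
    ntfn_state_t state;
    tcb_t *queued;     /* head of the wait queue when state == NtfnWaiting */
    bool donatable_sc; /* notification holds an SC that can be donated */
} ntfn_t;

bool signal_fastpath_ok(const ntfn_t *ntfn, const tcb_t *current)
{
    if (ntfn->state != NtfnWaiting) {
        return true;  /* idle/active: just record the signal, no wakeup */
    }
    const tcb_t *dest = ntfn->queued;
    if (dest->prio > current->prio) {
        return false; /* would require an SC switch: take the slowpath */
    }
    return dest->has_sc || ntfn->donatable_sc;
}
```

The point of the sketch is only the branch structure: every case that would need a scheduling-context switch falls through to the slowpath.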

@alwin-joshy alwin-joshy force-pushed the signal_fp3 branch 2 times, most recently from 62638ae to 813663d Compare March 5, 2022 10:11
(Review comments on src/arch/arm/32/traps.S, include/fastpath/fastpath.h, and src/fastpath/fastpath.c — outdated, resolved)
@lsf37 (Member) commented Mar 6, 2022

Thanks for making a new start with this. On a high level, it still looks fairly large for a fastpath and we should probably look at the design more first before we get into the low-level comments I started with.

@kent-mcleod mentioned that for review and testing, it might be good to concentrate first on one architecture and merge that before we incrementally do the others (which could then also go faster). You're already concentrating on AArch64, which is great. I'd even suggest removing the code for the other architectures for now (e.g. the VTX ifdefs), so it is easier to see what is going on.

Essentially we need to be able to convince ourselves that this is:

  • useful and faster (benchmarking, triggers when we want it)
  • always equivalent to the slowpath (review, arguments and a lot of testing)
  • provable (currently will have to be based on review only, things that help are: small size, low complexity (not too much reordering of actions wrt slowpath etc))
  • maintainable (not too large/complex)

@lsf37 lsf37 added the MCS issues about the mixed-criticality system config label Mar 6, 2022
(Review comments on src/fastpath/fastpath.c and include/kernel/thread.h — outdated, resolved)
@kent-mcleod (Member) commented:

> @kent-mcleod mentioned that for review and testing, it might be good to concentrate first on one architecture and merge that before we incrementally do the others (which could then also go faster). You're already concentrating on AArch64, which is great. I'd even suggest removing the code for the other architectures for now (e.g. the VTX ifdefs), so it is easier to see what is going on.

Also it would be good to outline in text what subset of signal operations need fastpathing.

  • Signalling without context switching?
  • Application signalling with context switching to a higher priority thread?
  • IRQ signalling with context switch to higher priority thread?

@kent-mcleod (Member) commented:

> Application signalling with context switching to a higher priority thread?

If this is a hotpath, then how is it different from fastpathing seL4_Send on an endpoint?

@pingerino (Contributor) commented:

Thanks for raising this for early feedback, this is excellent as it helps us get on the same page before too much time is invested.

Please state explicitly what is being fastpathed and why. If too much is included in the fast path, it won't be faster anymore, and will result in an overall slowdown for the kernel. Similarly, actually verifying a fastpath is a huge amount of proof effort (it's a refinement showing that the fastpath is an exact subset of the slowpath), and this won't be done without a really convincing use case (and even then, there needs to be time and resources available to do it).

@gernotheiser (Member) commented:

Signal needs to be fast whether it wakes the other thread or not; both cases are important for the asynchronous synchronisation protocols we are using for high-throughput device drivers.
I have a hard time with the argument that packing in too much makes it slower overall; the numbers show otherwise. The inherent cost of signal should be little over that of a null syscall if no context switch happens, and somewhat above a fastpath IPC where it does. At the moment it's several times as expensive.

@alwin-joshy (Contributor, Author) commented Mar 21, 2022

> Thanks for raising this for early feedback, this is excellent as it helps us get on the same page before too much time is invested.
>
> Please state explicitly what is being fastpathed and why. If too much is included in the fast path, it won't be faster anymore, and will result in an overall slowdown for the kernel. Similarly, actually verifying a fastpath is a huge amount of proof effort (it's a refinement showing that the fastpath is an exact subset of the slowpath), and this won't be done without a really convincing use case (and even then, there needs to be time and resources available to do it).

Thanks for your comments Anna. The point of this PR was mainly just to clean up Shane's PR a bit while trying to fix some of the issues it has related to the switching of scheduling contexts in a signal to a higher priority thread. The fast-path handles all three states of the notification. These can be further split up into two main paths, the one where we signal an equal or lower priority thread and the one where we signal a higher priority thread. I looked at your implementations of the irq and signal fastpaths as well to try and get some inspiration from different places before ultimately ending up with what I have here.

The former is implemented in much the same way as Shane's PR and your code samples which are linked at the top. It just checks that the thread either already has a scheduling context or can be donated one from the notification; if so, it enqueues the thread before restoring the current thread, otherwise it defers to the slowpath. I think this case is fairly straightforward and seems to work, and the performance increase I observed is basically the same as what was observed by Shane (a fairly significant improvement).
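As a rough illustration of that path, here is a hypothetical sketch (simplified names, not the actual seL4 implementation): ensure the woken thread has a scheduling context, donating the notification's one if needed, then make it runnable without switching away from the signaller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the signal-to-equal-or-lower path: make sure the woken thread
 * ends up with a scheduling context (its own, or one donated from the
 * notification), enqueue it, and keep running the signaller.
 * All names are illustrative. */
typedef struct sc { int budget; } sc_t;
typedef struct { sc_t *sc; bool runnable; } thread_ctx_t;
typedef struct { sc_t *bound_sc; } ntfn_sc_t;

/* Returns false when the case must be deferred to the slowpath. */
bool wake_no_switch(thread_ctx_t *dest, ntfn_sc_t *ntfn)
{
    if (dest->sc == NULL) {
        if (ntfn->bound_sc == NULL) {
            return false;              /* no SC to donate: slowpath */
        }
        dest->sc = ntfn->bound_sc;     /* donate the notification's SC */
        ntfn->bound_sc = NULL;
    }
    dest->runnable = true;             /* enqueue; no context switch */
    return true;
}
```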

The other case is trickier, because when we are signalling a higher prio thread, the correct behavior is to switch to the signalled thread. This was mostly modeled off the existing code for seL4_Call, but it can't be the exact same because of the different conditions. The seL4_Call fastpath does not fastpath the case where a scheduling context switch happens - only where the caller donates its scheduling context to the callee. In a signal to a higher priority thread, there is no case where this happens as the signalling thread is not blocked and always keeps its scheduling context. A scheduling context change will always occur. This is problematic because this is somewhat uncharted territory AFAIK. I noticed that you may have done a prototype for something similar to this in your IRQ fastpath, but this doesn't include anything equivalent to awaken or check_domain_time(), which seem to be necessary steps in schedule().

I'm at a bit of a roadblock here because I'm not sure about some things, like why it is okay for awaken() to be left out of the other fastpaths: is it guaranteed that there will not be anything in the release queue which can be woken up? Aside from that and check_domain_time(), I think I have mostly accounted for the other extra things which need to be done in this path.

@pingerino (Contributor) commented:

@alwin-joshy thanks for the reply.

I am indeed quite familiar with the issues that come up when writing fastpaths for the kernel, especially the scheduling context parts, since I designed this part of the kernel. :)

I'm still not clear on the question "what is being fastpathed". We need to carefully define the fastpath conditions, and potentially benchmark their impact, while considering the verification implications.

Important q's to ask:

  • which signal is being fastpathed here - just irqs? or async send? what about timer irqs for the kernel scheduler?
  • are you fastpathing the case where the irq/signal goes to an endpoint in the idle state?
  • do you fastpath when no context switch is required?
  • do you fastpath when we do a context switch to a higher priority thread?
    • if so, when? The existing fastpath doesn't do this for good reason. If you aren't careful, you will bring in the entire logic of the scheduler. You can assume the thread has enough budget, or it wouldn't be on an endpoint queue, IIRC. I would keep awaken out without further evidence; it's unlikely the threads need to be woken, as we'd be racing with a timer IRQ if that were the case.
  • what about when the thread you switch away from doesn't have enough budget? do you fastpath this too? or count on it being a rare event that doesn't often slow you down?
  • ... there's probably more to consider. One of these questions comes up for every branch in the fastpath. To keep it fast, you need to reduce the number of branches that have no clear direction (i.e. where either side could be taken).

Very long term there's also the argument that by making the slow path faster (lots of opportunity here), we make the entire kernel faster and make life for verification much easier.

@gernotheiser "I have a hard time with the argument that packing in too much makes it slower overall, the numbers show otherwise." -- if I dump the entire kernel into the fast path it will be slower. The nuance here is in how much is fastpathed, and carefully picking the path.

@pingerino (Contributor) commented:

Oh, and on check_domain_time(): you definitely don't want to fastpath the domain switch.

@lsf37 (Member) commented Mar 21, 2022

> @gernotheiser "I have a hard time with the argument that packing in too much makes it slower overall, the numbers show otherwise." -- if I dump the entire kernel into the fast path it will be slower. The nuance here is in how much is fastpathed, and carefully picking the path.

Completely agree. Just putting out there that awaken is a while loop with non-trivial worst case conditions. I don't see this making much sense on a fastpath. You could try to catch the case where the loop collapses to straight-line code. Maybe that would work, but I haven't looked into how feasible it is to determine that up front.
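For context, awaken() is essentially a drain loop over the release queue, which is why its worst case grows with the number of due threads. A rough, hypothetical sketch (simplified names and types, not the real seL4 code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of an awaken()-style loop: pop every thread whose
 * release time has passed. The cost is proportional to the number of due
 * threads, which is why it is a poor fit for a fastpath. */
typedef struct rel_thread {
    unsigned long release_time;
    struct rel_thread *next;
} rel_thread_t;

unsigned awaken_model(rel_thread_t **release_q, unsigned long now)
{
    unsigned woken = 0;
    while (*release_q != NULL && (*release_q)->release_time <= now) {
        *release_q = (*release_q)->next; /* pop head, make it runnable */
        woken++;
    }
    /* The "collapsed" straight-line case mentioned above is when the head
     * is not yet due and the loop body runs zero times. */
    return woken;
}
```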

(Review comment on src/fastpath/fastpath.c — outdated, resolved)
@gernotheiser (Member) commented:

> @gernotheiser "I have a hard time with the argument that packing in too much makes it slower overall, the numbers show otherwise." -- if I dump the entire kernel into the fast path it will be slower. The nuance here is in how much is fastpathed, and carefully picking the path.
>
> Completely agree.

Let's please stay factual and avoid arguing with dodgy assertions. The claim that fastpathing more slows down the kernel is just plain wrong, and the proof is the original L4 kernel [Liedtke, SOSP'93]: it was micro-optimised to the last line, something you could reasonably describe as "dumping the entire kernel into the fastpath". Yet it performed as close to the hardware limit as it gets (despite its overly complex IPC mechanism).

I don't see why, properly done, adding any amount of fastpath code would slow down IPC by more than a few cycles, certainly well below 5% (and I've never seen a realistic use case where 5% IPC slowdown would have a statistically significant effect on system performance).

@gernotheiser (Member) commented:

Let's get back to the actual issue of what makes sense to fastpath on the signal syscall. Clearly, a fastpath that will not be verified for the foreseeable future makes little sense, so the real question is not how much fastpathing an operation may slow down others (it won't) but whether the performance gain justifies the verification cost. That's exactly why I keep stating that proposed changes need to demonstrate significant benefits (which might be performance or security) to realistic use cases.

I did overestimate the performance improvements possible in the signal-to-high case (limited by the inherent complexity of changing the SC during the context switch). The latest data I have (thanks @alwin-joshy for providing these) are (all on Odroid C2):

  • 120 cy null syscall

  • 249 cy IPC fastpath

  • 175/410 cy signal-to-low fastpath/slowpath (diff = 235 cy)

  • 423/692 cy signal-to-high fastpath/slowpath (diff = 269 cy)

Our current use case is high-throughput networking, using fully asynchronous client-driver communication [Leslie et al, 2005].

My estimate from the measurements we have ATM is that the signal-to-low fastpath will improve system performance by about 5% once we peel back overheads introduced by user-level layers (could be more, as I suggest there are other sub-optimal operations), and the signal-to-high fastpath would give a similar benefit (as the slow-fast differences are similar).

However, signal-to-high is a synchronous context switch, and could, in principle, be replaced by IPC, and the difference between fastpath IPC and slowpath signal is 443 cy, getting close to twice the gain. However, the trade-off there is somewhat complex, as signalling gives the high thread (driver) a fresh budget, while with IPC the driver runs on the client's budget, so it's not clear how much is left. This makes reasoning about rate limiting difficult.

In short, I have to agree that the case for fastpathing signal-to-high has not (yet) been made, and we should (for now) proceed with fastpathing just the non-context-switching signal operation.

@Indanz (Contributor) commented Mar 30, 2022

> Very long term there's also the argument that by making the slow path faster (lots of opportunity here), we make the entire kernel faster and make life for verification much easier.

This always puzzled me. Wouldn't almost all of the speedup of the fast-path code be achievable by just reordering the slow-path code? As it is, work is done twice for the slow path, making it slower. The current split makes sense while the fast-path code is not verified, but otherwise it seems a bit strange.

@lsf37 (Member) commented Mar 30, 2022

> This always puzzled me. Wouldn't almost all of the speedup of the fast-path code be achievable by just reordering the slow-path code?

That's really what the fast path is: in some circumstances it is permissible to reorder operations, fuse them, skip certain checks, etc, but not in the general case. An example would be knowing that you don't have to look up the IPC buffer in the fast path. The fast path is mainly establishing that those conditions are true and at the same time is trying to hit the cases that are frequent or important. Pulling too much into the fast path means you aren't really optimising much, because you have not established enough conditions to do much, so you're just executing the same code as the slow path, but you have copped additional maintenance and verification cost.
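The shape of that argument can be sketched as a chain of cheap precondition checks, each of which either establishes a fact the straight-line code relies on or bails to the slowpath. This is an illustrative model only; the names and the specific conditions are hypothetical, not the real fastpath:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PathFast, PathSlow } path_t;

/* Hypothetical request descriptor: each field stands for a condition the
 * fastpath would have to establish before running its optimised code. */
typedef struct {
    bool is_ntfn;          /* invoked cap is a notification */
    bool can_send;         /* cap carries send rights */
    bool needs_sc_switch;  /* signalling would switch scheduling contexts */
} req_t;

path_t dispatch(req_t r)
{
    if (!r.is_ntfn)        return PathSlow;
    if (!r.can_send)       return PathSlow;
    if (r.needs_sc_switch) return PathSlow;
    return PathFast;  /* all conditions hold: straight-line optimised path */
}
```

Every extra case pulled into the fastpath adds checks like these in front of the common path, which is exactly the maintenance and verification cost being weighed in this thread.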

E.g. pulling in the awaken loop into the signal fast path will not make anything faster in waking up threads. It might make some cases faster overall because the code path leading up to awaken has established conditions for optimisation and has benefited from that. So the question is: is that worth the cost. Not necessarily run-time cost, but the cost of slowing down maintenance and development in the future, because you have now duplicated fairly complex logic.

> Now work is done twice for the slow path, making it slower. The current split makes sense while the fast-path code is not verified, but otherwise it seems a bit strange.

The IPC fast path is verified for AArch32 in hyp and non-hyp configs, and the plan is to do the same for AArch64 and MCS RISC-V, including the new fast paths that are coming in.

The slow path does get slower with every check that you need to add in the fast path, but Gernot is right that it is not really a significant fraction over the existing slow path cost. If the aim is to reduce the slow path time overall, it is still going in the wrong direction, but it is not really going to tip the scales.

The main obvious optimisations are already on the slow path, but there are one or two that could still be done there that would speed things up and make the difference between slow/fast path less pronounced. I agree with Anna that this is, long term, the better solution. For instance, looking up the IPC buffer can be reduced and sometimes removed on the slow path as well if we pass more information to some functions and re-organise some things. That is a fairly high one-time verification effort, but I think it'd be worth it, because it makes everything faster and it doesn't duplicate any logic. Especially if you compare this to pulling more loops into the fast path that are just going to have the same execution time. If the lookup path can be made faster on the slow path, you could end up with fairly comparable times for those.

In any case you'll still want a fast path for very frequent operations where you want to squeeze out the last bit of performance that you can, but if the slow path is faster that need should be reduced and I think it would make sense to do that first, before putting in a lot of work for other options.

Just to clarify: I'm not arguing against a signal fast path, I think that makes perfect sense to do now and is overdue to add. I'm arguing against pulling in too many cases into it that will make it hard to change things in the future and that don't necessarily add benefit proportional to the cost.

(edit: the IPC buffer obviously doesn't enter into the signal fast path, that was just a general example)

(Review comments on src/fastpath/fastpath.c and src/object/notification.c — outdated, resolved)
@alwin-joshy (Contributor, Author) commented:

If there are no other comments, I am planning to squash the commits tomorrow morning.

@alwin-joshy alwin-joshy force-pushed the signal_fp3 branch 3 times, most recently from 6142edb to 4c0ec3c Compare December 8, 2022 09:19
@alwin-joshy (Contributor, Author) commented:

I have squashed the git history. The unsquashed version can be found here

@Indanz (Contributor) left a review comment:

Minor nitpicking, feel free to ignore.

(Review comments on src/fastpath/fastpath.c — outdated, resolved)
The signal fastpath aims to optimize the
seL4_Signal operation. In this commit, it is
implemented for MCS AARCH64 (SMP and non-SMP).
The fastpath does not include the case where
signaling results in a higher priority thread
being unblocked and made available for
scheduling (on any core). It does not
fastpath the case where the signaled thread
is donated a scheduling context and has its
FPU state saved in the FPU of a core.

Co-authored-by: Shane Kadish <shane.kadish@csiro.au>
Signed-off-by: Alwin Joshy <joshyalwin@gmail.com>
@lsf37 (Member) left a review comment:

From the verification side I'm now happy with it as well. I can't promise that it's equivalent or fully in the verification subset without actually doing the verification, but afaics it should be.

The changes do not affect current verified configs, so no breakage to be expected (the preprocess differences are harmless).

@lsf37 lsf37 added the hw-bench run sel4bench on this PR label Dec 11, 2022
@lsf37 (Member) commented Dec 11, 2022

Will try to get a full passing HW run and sel4bench run. Is there anything else to be done before we merge this?

@Indanz (Contributor) commented Dec 12, 2022

> Will try to get a full passing HW run and sel4bench run. Is there anything else to be done before we merge this?

I would say add support for x86 and RISC-V before merging, provided that is trivial, and it seems to be: it looks like it's just the fastpath syscall assembly calls for signals that are missing.

@alwin-joshy (Contributor, Author) commented:

I think that's mostly correct, with the exception of some extra stuff required for the VTX configuration on x86. The reason I haven't implemented the other architectures is that it was suggested to complete and merge the implementation on one first and then add the others incrementally.

@lsf37 lsf37 added hw-test sel4test hardware builds + runs for this PR and removed hw-bench run sel4bench on this PR hw-test sel4test hardware builds + runs for this PR labels Dec 18, 2022
@lsf37 lsf37 removed proof-test run C proofs on PR (use when preprocess test failed) hw-test sel4test hardware builds + runs for this PR labels Jan 9, 2023
@lsf37 (Member) commented Jan 9, 2023

Proofs and hardware tests have passed, will merge once the updated version is through the required checks.

@gernotheiser (Member) commented:

👍🏽

@lsf37 lsf37 merged commit 069c937 into seL4:master Jan 9, 2023
@lsf37 (Member) commented Jan 9, 2023

🎉

Ivan-Velickovic pushed a commit to Ivan-Velickovic/seL4 that referenced this pull request Feb 26, 2023
@kent-mcleod kent-mcleod mentioned this pull request Mar 5, 2023
lsf37 pushed a commit that referenced this pull request Aug 13, 2024