AArch64 support status and issues #3234
Comments
This is not important IMO.
The mrs emulation code in arch/arm64/kernel/cpufeature.c needs to obey ARCH_SET_CPUID. This is going to be a pain because they made us put it in arch_prctl on x86 and arm64 doesn't have arch_prctl.
Similarly, cntvct_read_handler should obey PR_SET_TSC.
Ugh, so they made page faults user visible? What a nightmare.
This is mainly annoying when you get randomly pinned to an E core and then stay stuck there for the rest of the run even if the P cores are free to use... I guess some option/default to prefer higher-performance cores might offset this issue.
So looking at the document for arch_prctl, is it a thing on 32bit x86 before 4.12? It seems that the document still says it's x64 only, even though it also says that
Huh, is reading these values always trapping on linux? The ARM description of this register seems to have a branch that doesn't trap in EL0 and I thought that must be what the kernel is doing for better performance...
I don't think I understand the full impact of this myself but for the record, I did ask about this right after SVE came out ..... I'm honestly not sure if the kernel or the hardware part would be harder to deal with...
So it does seem that you added it to x86-32 (torvalds/linux@79170fd, and given the large syscall number I guess I should have guessed...) so at least it should be somewhat similar in this regard?
Perhaps we should only ever pin to P cores?
Yeah, the manpages are wrong about it being amd64 only.
I believe so.
I don't see how we could deal with it. Recreating the precise state of what is paged in or out is not possible.
But then you do need to figure out what to do when there's one cortex-x1, three cortex-a78 and four cortex-a55... For running parallel tests (rr or otherwise) it is also not particularly nice.
I assume this info is available in the kernel so at least it is in principle possible with some kernel patch. Even without it, I think we could ask the kernel to pin all the pages (e.g. with |
I don't think mlocking everything is feasible.
How would we fix this in the kernel? It sounds to me like these SVE instructions don't actually generate page faults?
I mean to make sure the paging for the recording and replaying are identical.
I actually don't understand how these instructions are supposed to work in practice, if they never trigger page-in. Do you have to try the non-faulting instruction and if you don't get any valid data, retry with a faulting instruction?
Or do they trigger faults for the first byte but not the rest, or something like that?
I have never got my hands on actual hardware so I'm not 100% sure, but my understanding is that you would use them in the following pattern.
In the example above, on the next iteration, the first element would be the second element from the previous iteration, and it will trigger a real fault and the kernel will do whatever it needs to do to handle that, either page in some memory or send a signal. There are also non-fault instructions and although I haven't really seen any explicit documentation on how they should be used, I assume they'll be used for loop unrolling, i.e. you can load more than one SVE register per iteration. The first load would use first-fault and the rest would use non-fault.
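To make the first-fault loop pattern described above concrete, here is a minimal sketch in plain C. It is not SVE code and not taken from the thread: `toy_memory`, `ff_load`, `toy_strlen` and the 4-lane `VL` are hypothetical names chosen for illustration, and the "fault" is modeled as a simple bounds check on a toy address space. It only shows the control flow: a fault on the first lane is reported as a real fault, while a fault on any later lane just clears the remaining FFR bits, so the loop retries from the first lane that was not loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VL 4 /* hypothetical vector length, in byte lanes */

/* Toy address space: bytes [0, mapped_len) are readable, anything else "faults". */
typedef struct {
    const unsigned char *mem;
    size_t mapped_len;
} toy_memory;

/* Result of one modeled first-fault load. */
typedef struct {
    unsigned char lane[VL];
    bool ffr[VL];    /* true = this lane holds valid data */
    bool real_fault; /* the first lane itself was unmapped */
} ff_result;

/* Model of a first-fault load: only the first lane may fault for real. */
static ff_result ff_load(const toy_memory *m, size_t addr)
{
    ff_result r = {0};
    if (addr >= m->mapped_len) {
        r.real_fault = true; /* real hardware would trap to the kernel here */
        return r;
    }
    for (size_t i = 0; i < VL; i++) {
        if (addr + i >= m->mapped_len)
            break; /* suppressed fault: remaining FFR lanes stay false */
        r.lane[i] = m->mem[addr + i];
        r.ffr[i] = true;
    }
    return r;
}

/* strlen-style loop built on ff_load; returns -1 if a real fault occurs. */
static long toy_strlen(const toy_memory *m, size_t start)
{
    size_t pos = start;
    for (;;) {
        ff_result r = ff_load(m, pos);
        if (r.real_fault)
            return -1;
        size_t i = 0;
        while (i < VL && r.ffr[i]) {
            if (r.lane[i] == 0)
                return (long)(pos + i - start);
            i++;
        }
        assert(i > 0); /* the first lane is always valid when there is no real fault */
        pos += i;      /* retry from the first lane that was not loaded */
    }
}
```

In this model the FFR contents depend only on `mapped_len`. On real hardware they would also depend on which pages happen to be resident, which is the part that is hard to record and replay.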
Assuming that the fault doesn't happen often, I feel like the simplest way to deal with this is binary instrumentation (which can also be used to record ll-sc)... For an nf or ff SVE load, we could simply replace it with a normal load and catch the segfault. It might also be possible to just record the value of ffr (the register that reports which elements have faulted) but doing that without a usable pointer to memory is going to be difficult. From the code examples ARM has posted in various presentations, it seems that there may be many SVE loops that contain virtually no temporary GP registers that we can overwrite.
Adding binary instrumentation would be a radical change to rr's architecture and one that I don't think we would take.
We've made a lot of tradeoffs to avoid requiring full binary instrumentation during recording. That has benefited us by giving us lower single-threaded recording overhead, and a simpler and more maintainable design that doesn't require work for every new instruction as architectures evolve. I think robustly handling all the stuff rr currently handles (e.g. signals, sandboxes, exotic clone() options) while binary instrumentation runs in the recorded tracees would also be pretty complex. Performing full binary instrumentation during recording is not crazy --- UndoDB does it AFAIK --- and would let us choose very different tradeoffs, but this means ultimately you'd want a very different design. E.g. the way we handle CPUID and RDTSC, the way we handle syscalls, maybe even the way we (don't) handle multiple cores would probably all end up in a very different place. I think we'd probably want to rearchitect rr from the ground up, perhaps reusing some of the existing code. It would be a fun project to work on but it's not something I want to work on right now.
Re the SVE thing. Can we talk to ARM about documenting a mode where those non-/(first-) faulting instructions are turned into regular loads? I believe all chips that support these SVE instructions have hypervisor-accessible patch registers that can change the instructions. It might require some convincing, but it should be technically possible.
I was mainly thinking about whether it's possible to do that with minimal refactoring. I was hoping that these would have minimal interaction with the rest of the code but I'm of course not sure...
You mean some registers that change specific instructions? Or is it something more generic?
Yes: |
K, so I assume pretty much the chicken bits. Though these particular ones seem to be documented as needing to be enabled before the mmu… Theoretically, if there’s a way to have the hardware trap any instruction that we can’t handle (e.g. stxr) or under a condition that we can’t handle (e.g. ldff) then it should be totally fine of course. I’ve just personally never had any experience convincing multiple vendors to get on board with implementing a new feature so far…
@yuyichao On your M1, what is your operating system setup? Linux on bare metal, or in a VM under macOS?
Only bare metal is supported. Apple does not expose the performance counters in VMs.
The website says:
So - are all/some of the issues mentioned here solved now or are those only related to "other ARM CPUs"?
Let's leave this open to track these ARM features we might need in the future.
Is bare metal reasonably usable for browsers? rr allows sandbox escapes, so I don’t know if it is reasonable to use it for a browser that is going to be accessing untrusted web content, which is the usual case.
@rocallahan I can say that in my experience (developer of @QubesOS), there are times when I wanted to use rr, but I was never able to do so because of the performance counter requirement. Xen doesn’t expose performance counters in VMs, and it is a type 1 hypervisor so everything is a VM. Furthermore, rr allows sandbox escapes, so using it for web browsers accessing untrusted web content is ill-advised outside of a test system.
This really belongs in another thread. Filed #3705
May I ask whether there is any ongoing interest in enabling SVE support? Or have the issues associated with handling first-faulting loads made this a no-go?
I think binary instrumentation and |
rr status
Testing on my M1 MBA, there are currently < 30 test failures out of 1311 (40 with syscallbuf, see below). The main missing piece from within rr is the syscallbuf. It's actually quite tricky to implement in a way that satisfies all the requirements we have on x86 (in particular, that it should work without a valid stack....). I have a write-up and a WIP implementation on this and I'll post a draft PR here after some more cleanups.
All tests pass on apple-m1, neoverse-n1/v1, cortex-a77. Syscallbuf is implemented.
Supported hardware
Currently, we support arm-neoverse-n1 and apple-m1. It seems that most of the recent arm cores up to cortex-a78 should also be supported without much issue (a55, a65, a65ae, a75-a78). I assume the upcoming apple-m2 should work fine as well, assuming it's apple-a15 based.
Kernel features required
x86 currently implements three features (that I can tell) that aren't generally implementable on aarch64 without additional kernel support.
mrs instructions that reads the EL1 cpuid registers and AFAICT doesn't include a way for ptracer to catch it yet.
SVE/armv9-a
SVE has a feature whose predictability I have been worried about ever since it came out. To make it easier to vectorize code with complex loop termination conditions, SVE introduced the first-fault (FF) and non-fault (NF) versions of the load instructions. When accessing invalid memory with these, instead of producing a fault (for first-fault loads, on any element other than the first active one), they simply set a mask indicating the fault. Clever use of this then allows vectorization of string functions (e.g. strlen), since one can perform out-of-bounds reads without any visible consequences.
The issue I see with this is that it depends on the OS paging. Even if a page is mapped from the userspace point of view, it may not actually be mapped in the page tables, depending on how lazy the kernel feels like being. This was previously completely transparent to userspace, but now with the SVE instructions one can in principle observe it, and it is therefore something that rr has to keep track of/manage.
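As a rough illustration of the lazy paging described above, the sketch below (not part of the original report) uses the Linux `mincore(2)` call to ask whether a freshly `mmap`ed anonymous page is resident before and after it is first touched. The helper names `page_resident` and `residency_demo` are made up for this example, and it assumes Linux with a libc that exposes `mincore` and `MAP_ANONYMOUS`. On a typical kernel the page is expected to be non-resident until the first write, but that first result is not guaranteed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing `page` is resident, 0 if not, -1 on error.
 * `page` must be page-aligned (mmap results are). */
static int page_resident(void *page)
{
    unsigned char vec = 0;
    if (mincore(page, 1, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Maps one anonymous page and samples its residency before and after the
 * first write. Returns 0 on success, -1 on failure. */
static int residency_demo(int *before, int *after)
{
    assert(before != NULL && after != NULL);
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0)
        return -1;
    unsigned char *p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *before = page_resident(p);      /* mapped, but possibly not yet backed */
    *(volatile unsigned char *)p = 1; /* first touch forces the kernel to back it */
    *after = page_resident(p);
    munmap(p, (size_t)psz);
    return 0;
}
```

A first-fault or non-fault load hitting the page in its "before" state is the situation where the mask could differ between recording and replay.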
It also seems that this could be worse. While we can in principle track and record what the kernel does, the ARM ISA document says that
(Search for MemSingleNF). The exact behavior here is of course implementation dependent and it's of course possible that the vendors are quite reasonable here. However, that's something that at least needs to be tested.
This is relevant for any processor with SVE. The fujitsu-a64fx is probably the one with the highest hope of being able to run rr at the moment (their PMU document doesn't mention the counter we use, but the numbering of the rest agrees with the ARM PMU document, so I think one just needs to check whether the ones we use are implemented...). This is likely going to matter more in the future since SVE and SVE2 are part of the armv9-a requirement and all future ARM processors starting from a510/a710, including neoverse-n2 and neoverse-v1, will have them (neoverse-n2 and v1 are not armv9 but n2 has SVE and v1 has SVE/SVE2). It's also conceivable that a distro would release a new version for armv9-a and binaries in it could be compiled with SVE turned on at compile time, so masking off the feature may or may not work at that point...