Support AMD Ryzen? #2034
Comments
|
I've written a testcase that just creates 100 do-nothing threads and then joins them all. Running 32 basic_test.run copies of that test in parallel usually means a few of them fail. |
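For anyone who wants to see the shape of that workload: below is a minimal Python sketch. It is not the actual rr test case (that source is not shown in this thread and is presumably C), and the names `do_nothing` and `spawn_and_join` are made up for illustration.

```python
import threading


def do_nothing() -> None:
    """Thread body that returns immediately."""
    return None


def spawn_and_join(count: int = 100) -> int:
    """Start `count` do-nothing threads, join them all, and return how many were joined."""
    threads = [threading.Thread(target=do_nothing) for _ in range(count)]
    for t in threads:
        t.start()
    joined = 0
    for t in threads:
        t.join()
        joined += 1
    return joined


if __name__ == "__main__":
    # Same numbers as the comment above: 100 threads, all joined.
    assert spawn_and_join(100) == 100
    print("EXIT-SUCCESS")
```

Running many copies of a script like this in parallel under rr would approximate the "32 parallel tests" setup, though interpreter startup adds a lot of unrelated syscalls compared to a small C test.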
|
Some stats for 8 runs of 32 parallel tests each:
|
|
Here's a different run of 16 x 32 parallel tests:
|
|
FWIW the executed syscall counts for one of those tests: |
|
And just for reference the syscalls before the overcount was detected are 14 |
|
I tried writing a test that does a lot of mmap/mprotect/munmap in a loop and couldn't get it to fail much. When I put that loop in 10 parallel threads I'd still mostly get failures around thread creation. |
|
Turns out |
|
Now a lot of the tests are timing out. Which is weird. I'm also seeing some other issues I didn't see before. |
|
Ah OK. When |
|
I checked whether, if we create a counter with no interrupt set and one with an interrupt set, they always agree. They do, even when they overcount. |
|
I tried creating three counters, one counting user-only events (U), one counting kernel events (K), and one counting both (A). You'd expect U + K = A, and that holds on Intel apparently, but on AMD it never does! A almost always has extra events, usually 8, sometimes 16, sometimes 1, once in a while a lot more... |
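The U + K = A check can be written down as a tiny helper. This is only a sketch of the arithmetic; the function name `extra_events` is hypothetical and the readings in the example are illustrative placeholders, not measured values.

```python
def extra_events(user_only: int, kernel_only: int, both: int) -> int:
    """Return how many events A counted beyond U + K (0 means the counters agree)."""
    return both - (user_only + kernel_only)


if __name__ == "__main__":
    # Placeholder readings chosen to mirror the "usually 8" case described above.
    print(extra_events(1000, 200, 1200))  # counters agree
    print(extra_events(1000, 200, 1208))  # A has 8 extra events
```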
|
FWIW those changes are not correlated with overcounts. So that's a dead end probably. |
|
Also, that implies the problem is probably not an issue of kernel-mode events being incorrectly counted. |
|
Interesting: there are a lot more failures running the 32-bit tests: 117 vs 39 (out of 8 x 32) just now. |
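To put those counts on a common scale, here is the arithmetic as a short Python sketch. It assumes both counts are out of the same 8 x 32 = 256 test runs and that 39 is the count for the non-32-bit tests (presumably 64-bit); `failure_rate` is a hypothetical helper name.

```python
def failure_rate(failures: int, runs: int, tests_per_run: int) -> float:
    """Fraction of failed tests across `runs` batches of `tests_per_run` parallel tests."""
    return failures / (runs * tests_per_run)


rate_32bit = failure_rate(117, 8, 32)
rate_other = failure_rate(39, 8, 32)

if __name__ == "__main__":
    print(f"{rate_32bit:.1%} vs {rate_other:.1%}")  # prints "45.7% vs 15.2%"
```

That is exactly three times as many failures for the 32-bit tests in this sample.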
|
By spraying I'm out of ideas. It appears that the Ryzen PMU just isn't quite accurate enough :-(. rr might work OK for some kinds of usage but I wouldn't recommend it. I'll land the patches I have with a warning for Ryzen users that things won't be reliable. |
|
FYI AMD has posted errata for its Ryzen CPUs, and they include multiple issues with performance counters, namely:
None of them involves PMCx0D1 directly (which I believe is what rr uses). Either way none of them has a planned fix or suggested workaround. |
|
Of those errata:
|
|
The patches in PR #2255 might work on Ryzen. It would be great if someone could test. You'll have to change |
|
Here is the patch to test with: cdf4e27 |
|
I just checked the Bios and Kernel Developers' Guides on the AMD page, and all recent AMD CPUs appear to have PMCs 0xc4 and 0xc6; of course that doesn't mean they're reliable, but it might be worth checking this on all those we can still find users of. |
|
I get 57 test failures on Ryzen ( |
|
Looks like Ryzen still doesn't work. I landed a I get 10+ @pipcet it would be great if you can run that test yourself to make sure you don't get any errors on your Bulldozer machine. |
|
That is run from the rr |
|
The symptoms are similar to the issues I saw with the conditional branches approach before, so it's possible that at some point after Bulldozer AMD introduced a bug that destabilized multiple types of counters. |
|
The Some ideas:
|
|
There's also section 2.1.11.2 in https://developer.amd.com/wp-content/resources/56255_3_03.PDF, which I don't think applies here:
However, that would probably require kernel hacking... |
Again, I run |
|
Looks like your kernel wasn't responsible for any of the effects after all, my bad. Btw you can edit your comments to use this: <details>
<summary>rdpmc-bench output (click to open)</summary>
...
</details>(where the |
|
Hi! Thanks for working on this, this is really awesome!
I can confirm that before running the zen_workaround script, ctests indicated mostly failures. After successfully running the script (it says the workaround is in place), I get only one test failure in 4 runs (details below: there are two intermittent failures). The
Most importantly, I could record and replay a run of a Spidermonkey JS test failure, that included reverse-stepping around voluntarily-buggy JIT assembly code, so I can now do my job with this new RR \o/
Thanks a bunch to y'all!
uname + cpuinfo
`ctest -j$(nproc)`
Seen in 3/4 runs:
Seen in 1/4 runs: |
|
@bnjbvr the output of |
|
Good idea! Failure of 565 (I'll edit this post if I manage to reproduce the second one; might have been a timeout)
Could this be the problem? What is the value of |
It's not conclusive. I've had the test intermittently pass with |
|
There should be a trace in |
Is there anything else you'd like me to test? |
Please spin off this 565 failure into its own issue. EDIT: I did it myself: #2694 |
|
@tuxiqae Not from me, but you should get some guidance from @glandium or @rocallahan about your failures. Btw you didn't need to post the whole logs, just the list of failed tests, at the bottom, e.g. this was the only relevant part from your "latest kernel": FWIW, @mati865 (on Ryzen 1600) and @nagisa (on Ryzen 1700) ran my counter test, and those CPUs don't behave any differently from any other Zen (1, 1+ or 2) ones, so I guess the Okay, @tuxiqae, there is one more thing: if you don't mind me asking, what's your motherboard? Because that's the only source of |
This is essentially the same failure set as #2681. |
My motherboard is |
|
(when quoting it would be a good idea to not include all the log snippets because they add up. though it might not matter, I think I'll copy my comments with information dumps into a larger report or something, eventually, to avoid linking to this thread) |
|
@khuey thanks! I Ctrl+F'd but ofc I didn't have most of the comments loaded because there's been so many. @tuxiqae If you missed it above (#2034 (comment)), @khuey said all your test failures are the ones in #2681, and if I had to guess, they're probably related to your more recent kernel version (doesn't mean you should use an older kernel, I think the plan is for |
|
So as of now it is bound to fail I guess, I'll just wait for an update. |
|
You may still be able to use |
|
Those failures are unlikely to impact a real application. They just affect our tests (which try to exercise the entire syscall space, including syscalls that may not exist or work properly on your kernel/libc). |
|
Just to add to @eddyb's information, |
|
Please file a new issue and give us the verbose logs for those test failures. Thanks! |
|
I see that there is a kernel module available, is there a kernel patch flying around to try out? |
|
Thanks for the excellent work on this ticket earlier! I was wondering if anybody knows what the situation with rr is on Zen 3 and Zen 3+. I'm specifically interested in the Ryzen 6000 series (Zen 3+), which should show up in many 2022 AMD laptops. Are the workarounds related to SpecLockMap still required? Do SpecLockMap and SSB mitigation occupy the same MSR? Does rr by any chance run without any modifications on a Zen 3+ system? Thanks! |
The wiki page lists Zen 3 (Ryzen 5950X and 5800HS), and I doubt Zen 3 -> Zen 3+ changed anything, but we won't know for sure until someone actually tries it and reports back.
I would assume turning off The SSB thing is trickier, I vaguely remember that it might not be needed on Zen 2? That is, the MSR collision might only be on Zen 1. (a short while later) Alright, searching this thread I found #2034 (comment) which claims that there's a separate MSR for SSBD on (likely) Zen 2 (and after), so you might only need |
Ryzen has a conditional branch counter. I have patches to use it here: https://github.com/mozilla/rr/tree/ryzen
To make it work reliably I had to increase the skid counter to 1000. That's pretty high, but OK. The patches make the skid size configurable per-architecture so we don't take that hit on Intel.
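As a rough illustration of "skid size configurable per-architecture", here is a Python sketch. It is not the code from the patches. The Ryzen value of 1000 comes from the comment above; the Intel value is a placeholder assumption, and `PmuConfig`, `PMU_CONFIGS` and `interrupt_point` are hypothetical names. The sketch also assumes the skid is used by requesting the counter interrupt that many events before the target, so that an interrupt arriving late still lands before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmuConfig:
    name: str
    skid_size: int  # events of slack left before the target


# Per-architecture table. Only the Ryzen number is taken from this thread;
# the Intel number is a placeholder, not a value confirmed here.
PMU_CONFIGS = {
    "ryzen": PmuConfig("AMD Ryzen", 1000),
    "intel": PmuConfig("Intel (placeholder skid)", 100),
}


def interrupt_point(target_ticks: int, arch: str) -> int:
    """Tick count at which to request the interrupt: skid_size before the target, never negative."""
    skid = PMU_CONFIGS[arch].skid_size
    return max(0, target_ticks - skid)


if __name__ == "__main__":
    print(interrupt_point(5000, "ryzen"))  # 4000
    print(interrupt_point(5000, "intel"))  # 4900
```

The cost of a larger skid is that more of the distance to the target has to be covered by slower means after the interrupt, which is why keeping it small on Intel matters.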
With these patches, most tests pass and the rest seem to be intermittent. In one run I get 10 failures out of 2068:
It appears that all these failures are due to intermittent overcounting. In most of them, during recording we seem to have overcounted a few conditional branches in the leadup to some syscall. In the rest, we seem to have overcounted during replay.
One interesting thing is that most of the syscalls where we detect the overcount are an `mprotect` (or a syscall following a syscall-buffered `mprotect`) that followed an `mmap`. There are two exceptions, one a `read` syscall and one a `write` syscall. I need to think about what this might mean.