Fix 1st gen AMD Ryzen segfault issue. #1348

Closed
wants to merge 1 commit into from

DiamondLovesYou

XMRig wasn't flushing the instruction cache after JIT code generation, so stale cached instructions were causing this issue. Oddly, this issue never occurred on my 1950X, only on my 1800X.
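The write, flush, then execute sequence this PR is concerned with can be sketched as below. This is a minimal, Linux-only illustration and not XMRig's RandomX JIT: `run_jit_constant` is a hypothetical name, the emitted bytes (`mov eax, 42; ret`) are x86-64 only, and on other hosts (or if the OS refuses an executable mapping) the function simply returns -1. As later comments in this thread point out, `__builtin___clear_cache` compiles to nothing on x86, so on x86-64 the flush step is effectively a no-op.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Hypothetical sketch: JIT-compile a function that returns 42, flush the
// instruction cache for it, and call it. Returns -1 if the host is not
// x86-64 or if the mapping/protection change is refused.
int run_jit_constant() {
#if defined(__x86_64__)
    // x86-64 machine code for: mov eax, 42 ; ret
    const unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
    const std::size_t map_size = 4096;
    assert(sizeof(code) <= map_size);

    // 1. Map writable (not yet executable) memory and emit the code into it.
    void* mem = mmap(nullptr, map_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::memcpy(mem, code, sizeof(code));

    // 2. Switch the page to read+execute before running it.
    if (mprotect(mem, map_size, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, map_size);
        return -1;
    }

    // 3. Flush the instruction cache for the emitted range. On x86 this
    //    builtin emits no instructions; on architectures such as ARM it is
    //    required before executing freshly written code.
    char* begin = static_cast<char*>(mem);
    __builtin___clear_cache(begin, begin + sizeof(code));

    // 4. Call the generated code, then release the mapping.
    auto fn = reinterpret_cast<int (*)()>(mem);
    const int result = fn();
    munmap(mem, map_size);
    return result;
#else
    return -1;
#endif
}
```

On an x86-64 Linux machine that permits executable anonymous mappings, calling `run_jit_constant()` should yield 42.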

@xmrig
Owner

xmrig commented Dec 1, 2019

Can you replace __builtin___clear_cache with xmrig::VirtualMemory::flushInstructionCache from crypto/common/VirtualMemory.h? The builtin is not available with MSVC compilers.
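A portable wrapper of the kind requested here might look like the sketch below. It is a hypothetical helper, not the actual `xmrig::VirtualMemory::flushInstructionCache` (whose signature is not shown in this thread): under MSVC it calls the Win32 `FlushInstructionCache` API, and elsewhere it falls back to the GCC/Clang `__builtin___clear_cache` builtin.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
#include <windows.h>
#endif

// Hypothetical helper (not XMRig's code): make freshly written JIT code in
// [begin, begin + size) visible to instruction fetch.
inline void flush_jit_code(void* begin, std::size_t size) {
    assert(begin != nullptr || size == 0);
#if defined(_MSC_VER)
    // Win32: BOOL FlushInstructionCache(HANDLE, LPCVOID, SIZE_T)
    ::FlushInstructionCache(::GetCurrentProcess(), begin, size);
#else
    // GCC/Clang builtin; it compiles to nothing on x86/x86-64.
    char* p = static_cast<char*>(begin);
    __builtin___clear_cache(p, p + size);
#endif
}
```

Keeping the compiler check inside one helper means JIT call sites stay identical across MSVC and GCC/Clang builds.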
Thank you.

@DiamondLovesYou
Author

@xmrig Done

@xmrig
Owner

xmrig commented Dec 1, 2019

Thank you.
@SChernykh please take a look.

@m-o-e

m-o-e commented Dec 1, 2019

Not working here (on Linux).
Segmentation fault after READY threads, with and without the above patch.

 * ABOUT        XMRig/5.0.1 gcc/5.4.0
 * LIBS         libuv/1.8.0 OpenSSL/1.0.2g hwloc/1.11.2
 * CPU          AMD Ryzen 7 1700X Eight-Core Processor (1) x64 AES
                L2:4.0 MB L3:16.0 MB 8C/16T NUMA:1
 * ASSEMBLY     auto:ryzen

@DiamondLovesYou
Author

DiamondLovesYou commented Dec 1, 2019

@m-o-e Can you use a more recent GCC? I used 9.2.1 (from Ubuntu 19.10). I've been mining for the past 2hrs with this fix.

This "builtin" is actually defined in libgcc when building with GCC, or in compiler-rt when building with clang (though the clang compiler itself doesn't provide it as a builtin). I don't see any recent changes to libgcc's version, but maybe some recent change elsewhere in GCC fixed this?

@Svaag

Svaag commented Dec 1, 2019

The fix isn't working for me either.

 * ABOUT        XMRig/5.0.1 gcc/9.2.0
 * LIBS         libuv/1.33.1 OpenSSL/1.1.1d hwloc/2.1.0
 * CPU          AMD Ryzen 7 1700 Eight-Core Processor (1) x64 AES
                L2:4.0 MB L3:16.0 MB 8C/16T NUMA:1
 * DONATE       0%
 * ASSEMBLY     auto:ryzen
 * POOL #1      frankfurt-1.xmrpool.net:7777 coin monero
 * COMMANDS     hashrate, pause, resume
 * OPENCL       disabled
 * CUDA         disabled
[2019-12-01 04:46:03.035]  net  use pool frankfurt-1.xmrpool.net:7777  195.201.12.107
[2019-12-01 04:46:03.035]  net  new job from frankfurt-1.xmrpool.net:7777 diff 10000 algo rx/0 height 1978852
[2019-12-01 04:46:03.035]  rx   init dataset algo rx/0 (16 threads) seed 993ba25f61d47e1e...
[2019-12-01 04:46:03.307]  rx   allocated 2336 MB (2080+256) huge pages 100% 1168/1168 +JIT (272 ms)
[2019-12-01 04:46:05.955]  rx   dataset ready (2648 ms)
[2019-12-01 04:46:05.955]  cpu  use profile  rx  (8 threads) scratchpad 2048 KB
[2019-12-01 04:46:05.958]  cpu  READY threads 8/8 (8) huge pages 100% 8/8 memory 16384 KB (3 ms)
[2019-12-01 04:46:06.502]  cpu  accepted (1/0) diff 10000 (58 ms)
[2019-12-01 04:46:08.130]  cpu  accepted (2/0) diff 10000 (58 ms)
Segmentation fault (core dumped)

@DiamondLovesYou
Author

What on Earth. I can't reproduce it anymore, i.e. without this PR, my 1800X isn't segfaulting. It was before!

@SChernykh
Contributor

If it's really about the code cache, then this will help: https://github.com/SChernykh/xmrig/tree/ryzen-fix
Note, however, that the x86 architecture doesn't require clearing the code cache; all changes are detected by the CPU automatically. RandomX crashes on some 1st-gen Ryzens are a known hardware problem.

@SChernykh
Contributor

Even more, "clear cache" intrinsics don't do anything when compiled on x86.

@repsac-by

repsac-by commented Dec 1, 2019

@Svaag, try disabling opcache.

@m-o-e

m-o-e commented Dec 1, 2019

Fun fact: xmr-stak-rx also segfaults for me on the same hardware! (also right at startup)

@DiamondLovesYou
I was about to try a newer GCC, but then saw that @Svaag had already done it.

@xmrig
Owner

xmrig commented Dec 1, 2019

@m-o-e maybe because they just use xmrig code without any changes.

@kasualkeef

I am also getting segfaults on two different Ryzen 1700X machines (one X370 board with latest BIOS and one X470 board with latest BIOS).

I have tried the patch above to no avail. The miner is crashing for me on Windows 10 and Fedora 31 (kernel 5.3).

@DiamondLovesYou
Author

@SChernykh Indeed, and upon closer inspection, __clear_cache is a no-op on x86/x86_64.

Anyway, I can't reproduce locally anymore (the 1800x mined overnight w/o a segfault), so I won't be able to find a fix. Closing.

@Svaag

Svaag commented Dec 1, 2019

As suggested by @repsac-by, I turned off Opcache in the BIOS under AMD CBS/CPU Common Options on my ASRock B450 Pro4, and xmrig stopped segfaulting without this patch. Performance seems to have taken a large hit, but at least it works.

@repsac-by

repsac-by commented Dec 1, 2019

> Performance seems to have taken a large hit, but at least it works.

@Svaag
What about hugepages?
sysctl vm/nr_hugepages=1200
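The arithmetic behind that value can be checked against Svaag's log above: with 2 MB huge pages, the 2336 MB RandomX allocation (2080 MB dataset + 256 MB cache) needs 1168 pages, and 8 threads with 2048 KB scratchpads need 8 more, 1176 in total, which fits within 1200. The sketch below assumes a 2048 KB page size (the x86-64 default); `required_huge_pages` and `meminfo_value` are hypothetical helpers, the latter parsing `/proc/meminfo`-style text such as the `HugePages_Total` and `HugePages_Free` lines.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: huge pages needed for the RandomX dataset, cache and
// per-thread scratchpads, rounding each allocation up to whole pages.
std::uint64_t required_huge_pages(std::uint64_t dataset_mb, std::uint64_t cache_mb,
                                  std::uint64_t threads, std::uint64_t scratchpad_kb,
                                  std::uint64_t page_kb = 2048) {
    assert(page_kb > 0);
    auto pages_for_kb = [page_kb](std::uint64_t kb) { return (kb + page_kb - 1) / page_kb; };
    return pages_for_kb(dataset_mb * 1024) + pages_for_kb(cache_mb * 1024) +
           threads * pages_for_kb(scratchpad_kb);
}

// Hypothetical helper: value of a "Key:   value" line in /proc/meminfo-style
// text, or -1 if the key is absent.
long long meminfo_value(const std::string& text, const std::string& key) {
    std::istringstream in(text);
    std::string line;
    const std::string prefix = key + ":";
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            return std::stoll(line.substr(prefix.size()));  // skips leading spaces
        }
    }
    return -1;
}
```

On a live system you could read `/proc/meminfo` into a string and compare `meminfo_value(text, "HugePages_Free")` with `required_huge_pages(2080, 256, 8, 2048)` before starting the miner.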

@m-o-e

m-o-e commented Dec 1, 2019

FYI, with Opcache still enabled, sysctl vm/nr_hugepages=1200 did not fix the segfaults for me.

@Svaag

Svaag commented Dec 1, 2019

> > Performance seems to have taken a large hit, but at least it works.
>
> @Svaag
> What about hugepages?
> sysctl vm/nr_hugepages=1200

Huge pages were already enabled (as seen in the output above), but I realized the issue: late last night I had recompiled xmrig for debugging in gdb and was still running that gimped build. I recompiled a proper release build and got normal numbers, then upgraded to 5.1.0 and got even better ones:

[2019-12-01 17:06:26.895] speed 10s/60s/15m 4560.8 4577.6 n/a H/s max 4641.8 H/s
|    CPU # | AFFINITY | 10s H/s | 60s H/s | 15m H/s |
|        0 |        0 |   554.4 |   557.7 |     n/a |
|        1 |        1 |   532.0 |   530.5 |     n/a |
|        2 |        2 |   570.6 |   568.8 |     n/a |
|        3 |        3 |   568.6 |   566.4 |     n/a |
|        4 |        4 |   587.2 |   586.2 |     n/a |
|        5 |        5 |   590.4 |   590.2 |     n/a |
|        6 |        6 |   589.5 |   589.1 |     n/a |
|        7 |        7 |   589.3 |   589.3 |     n/a |
|        - |        - |  4581.9 |  4578.2 |     n/a |

So for me this issue seems completely fixed with opcache disabled.

@kasualkeef

Disabling opcache in BIOS seems to have done the trick. Thanks!

@m-o-e

m-o-e commented Dec 1, 2019

Is there any way to bypass opcache in software without touching the BIOS?

@theshadowpeople

For me it doesn't work, but the H11DSI only has the option with a modded BIOS, and I am not sure this option works correctly.

@aa-delite

aa-delite commented Dec 3, 2019

Ryzen 1700X segfault. Fixed by disable opcache.
Ryzen 2200G (ASRock AB350M-HDV) segfaults or just shows zero hashrate. Sometimes it's okay after a few reboots.

@scoobybejesus

scoobybejesus commented Dec 15, 2019

I am looking and looking. I dug through the BIOS. I looked through the manual. I'm scouring the internet. I can't figure out how to disable opcache on my AB350 Pro4 mobo. Could it go by another name? Does anyone know how to disable opcache on this board? (I've got a 1700X segfaulting as well.)

Edit: my BIOS is version 5.80

@kasualkeef

kasualkeef commented Dec 15, 2019 via email

@scoobybejesus

I upgraded several times in order to get to the second to newest. The latest says "ASRock do NOT recommend updating this BIOS if Pinnacle, Raven, Summit or Bristol Ridge CPU is being used on your system," so I did not upgrade to that.

I think I'm stuck with no option to disable opcache on this mobo.

I'm toying with just getting another motherboard. The problem is determining which ones have this setting. Someone with a B450 Pro4 (@Svaag, above) managed to disable opcache. Then I found https://www.reddit.com/r/overclocking/comments/btqt2o/asrock_b450m_pro4_33_bios_update_lost_lots_of_oc/ where the poster says a BIOS update removed that capability. I suppose the B450 Pro4 and B450M Pro4 could have different BIOSes. Hm. I was going to get a B450M, but I guess I need to go full ATX? I have the room, I guess...

Any additional input before I end up throwing money at this problem would be great.

@aa-delite

I've disabled Opcache and was happy with 5.1.0.
I've enabled Opcache and tried 5.3.0 with wrmsr, it stops working.
I've disabled Opcache and tried 5.3.0 without wrmsr, it stops working.
I've disabled Opcache and tried 5.2.0, it stops working.
I've disabled Opcache and tried 5.1.0 again, it stops working.
I forgot about it and didn't reboot; a day later I tried 5.1.0 and it works.

So disabling opcache does not always take effect right away.

@SChernykh
Contributor

You need to reboot after running 5.3.0 with MSR to undo the changes from MSR mod.
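The reboot is needed because register values written by the MSR mod stay in effect after the miner exits, until they are rewritten or the machine is reset. A minimal, Linux-only way to inspect a register before and after a run is sketched below, assuming the kernel `msr` driver is loaded (`modprobe msr`) and the program runs as root. `read_msr` is a hypothetical helper, and since this thread does not say which registers the MSR mod touches, the register address is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: read one 64-bit model-specific register on a given
// CPU through the Linux msr driver. Returns false if the device node cannot
// be opened (no driver, no permission, no such CPU) or the read fails.
bool read_msr(unsigned cpu, std::uint32_t reg, std::uint64_t* value) {
    assert(value != nullptr);
    const std::string path = "/dev/cpu/" + std::to_string(cpu) + "/msr";
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    // The msr driver uses the file offset as the register address and
    // returns exactly 8 bytes per read.
    const ssize_t n = ::pread(fd, value, sizeof(*value), static_cast<off_t>(reg));
    ::close(fd);
    return n == static_cast<ssize_t>(sizeof(*value));
}
```

Reading the same register on each core before mining, after mining, and again after a reboot would show whether the modified values persisted.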


10 participants