[PSA] Default motherboard settings set RAM to a low frequency and low VDDCR_SOC voltage; these and other RAM settings may cause a false positive segfault! #23

Open
protox opened this issue Dec 5, 2017 · 5 comments

Comments

@protox

protox commented Dec 5, 2017

If you are getting segfaults with kill-ryzen, double-check your RAM settings. On my Taichi X370 motherboard with my UA1733PGS, loading the default settings drops the RAM to 2133MHz and VDDCR_SOC to 0.880V, with DRAM at 1.2V. If kill-ryzen is run with these settings, it segfaults within a couple of minutes.

However, loading the 2933MHz XMP profile bumps VDDCR_SOC to 1.096V and DRAM voltage to 1.368V, and kill-ryzen is then able to run for many hours with zero issues.

The default voltages, especially VDDCR_SOC, are probably far too low for such an extensive test.
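
For anyone who wants to see what the board actually configured before starting a run, here is a rough sketch (not part of kill-ryzen) that pulls the configured memory speed and DRAM voltage out of the text printed by `dmidecode --type memory`. The function name and the SAMPLE text are made up for illustration; the sample is hand-written, not real output from my machine. As far as I know dmidecode does not report VDDCR_SOC, so that one still has to be read from the BIOS.

```python
import re

# Field names as printed by `dmidecode --type memory`. Older dmidecode
# versions print "Configured Clock Speed" instead of "Configured Memory Speed".
SPEED_RE = re.compile(
    r"^[ \t]*Configured (?:Memory|Clock) Speed:[ \t]*(\d+)[ \t]*(?:MT/s|MHz)",
    re.M,
)
VOLT_RE = re.compile(r"^[ \t]*Configured Voltage:[ \t]*([\d.]+)[ \t]*V", re.M)


def parse_memory_settings(dmidecode_text):
    """Return (speeds, voltages) for every DIMM that reports numeric values.

    Empty slots report "Unknown" and are skipped because the patterns
    only match numbers.
    """
    speeds = [int(s) for s in SPEED_RE.findall(dmidecode_text)]
    volts = [float(v) for v in VOLT_RE.findall(dmidecode_text)]
    return speeds, volts


# Hand-written illustrative input (assumption, not captured output).
SAMPLE = """\
Memory Device
\tSize: 8192 MB
\tSpeed: 2133 MT/s
\tConfigured Clock Speed: 2133 MT/s
\tConfigured Voltage: 1.2 V
Memory Device
\tSize: No Module Installed
\tSpeed: Unknown
\tConfigured Clock Speed: Unknown
\tConfigured Voltage: Unknown
"""

speeds, volts = parse_memory_settings(SAMPLE)
print("configured speeds:", speeds)
print("configured DRAM voltages:", volts)
```

To use it on a real system, feed it the output of `sudo dmidecode --type memory` instead of SAMPLE and note the values next to your kill-ryzen result, so a segfault can be tied to the settings it happened under.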

More info can be found here: https://www.reddit.com/r/Amd/comments/7ho4uv/is_the_culprit_of_linux_segfault_on_ryzen_cpu/

Make sure to double-check your settings, or you might think you have a faulty CPU when you don't, especially if it was made after week 25.

Might be a good idea to add some of this information to the readme? I've read many forum posts recommending running this script at 2133MHz with default settings, which can be a problem.

@suaefar
Owner

suaefar commented Dec 5, 2017

AMD asked me, and probably other clients, during the RMA to specifically try higher SOC voltages. It still crashed. So if you don't check it yourself in advance, AMD will probably ask you to rule out memory issues.

The test is sensitive to failures in different subsystems, including memory corruption.
Generally, I would propose to first get a stable system [1] and then look for the bug in your Ryzen processor.
However, raising the SOC voltage also increases the power consumption (probably beyond the 65W in the case of the R7 1700), which makes it a workaround and not a solution.
Running the memory outside the spec (i.e., overclocking the interconnect) to get a stable system does not seem like a good solution to me.

In my opinion, a system must be stable with the default settings, without any tuning or tweaking.
If it is not, there is a problem and the manufacturers should help you troubleshoot it, be it a processor bug, defective memory, or wrong BIOS settings.

Feel free to propose an addition to the readme and send a pull request.

[1] https://wiki.archlinux.org/index.php/Stress_Test

@protox
Author

protox commented Dec 5, 2017

For AMD there seems to be a major disconnect between motherboard and RAM manufacturers. All I am saying is that running at the lowest possible specification perhaps isn't a good idea for this test, because it is extremely extensive.

I'm not an expert in this field, so I'll leave it for someone else to make pull requests and decide if such a suggestion is a good idea or not.

At the same time, it seems like this test is currently the primary way people determine whether they have a faulty CPU. Could a tiny bump to VDDCR_SOC be enough to make a post-week-25 system stable, avoiding both this issue and an RMA?

@suaefar
Owner

suaefar commented Dec 6, 2017

You are completely right, and objectively it seems to be a good suggestion.
But subjectively, I have run out of patience.

I am utterly frustrated by AMD's communication on this issue and I am not willing to help them any further:

  • AMD should acknowledge (or deny) the problem, i.e., that many of their top-tier parts are defective.
  • AMD should recall all affected parts, if not from users then at least from retailers! WTF, they are still knowingly selling these...
  • AMD should provide a tool to test if your processor is affected (I only share the code that I used)

If they still(!) don't care, I don't mind unsettled users bugging them.
It's up to AMD to say something, to do something, to provide a tool!

@protox
Author

protox commented Dec 7, 2017

I completely agree with you, AMD needs to do FAR more and at least come out and tell us what the issue is and which CPUs are affected. People are jumping through hoops, paying for shipping in certain countries and getting long delays to get their RMA at the moment.

@ghost

ghost commented Feb 13, 2018

If anyone is looking for a solution that does not require an RMA, there seems to be one (I did not test it on many CPUs, just on my 1800X from week 22): I disabled the CPU micro-op cache (called uOpCache/Op Cache/...). On some boards, e.g. ASUS X370, this requires a modded BIOS that enables the AMD CBS menu; other boards may allow it on the stock BIOS.

After disabling uOpCache, the tests (both this one and the Windows 'bzip2 compiler' killer) stopped crashing and ran for hours until terminated manually. The performance loss is negligible (around 3% on 7zip and compilation times; WinRAR seemingly even benefits slightly (~1%) in multi-threaded mode). Given that some people have noticed slight performance drops on 'fixed' Ryzens, these probably just come with uOpCache internally disabled or limited.
