
Can't suspend again after suspending one time. #11810

Closed
TheBeasT15 opened this issue Feb 23, 2019 · 75 comments

Comments

@TheBeasT15

commented Feb 23, 2019

After suspending the system once, I can't suspend it again. Shutdown gives "unmount failed" errors, so I can't shut down either.
System info
System: Host: Capsparrow Kernel: 4.20.10-1-MANJARO x86_64 bits: 64 compiler: gcc v: 8.2.1 Desktop: KDE Plasma 5.15.0 Distro: Manjaro Linux
Machine: Type: Laptop System: HP product: HP 245 G4 Notebook PC v: Type1ProductConfigId serial: <filter> Mobo: HP model: 80C7 v: KBC Version 98.0E serial: <filter> UEFI: Insyde v: F.1C date: 10/29/2015
Battery: ID-1: BAT0 charge: 38.8 Wh condition: 39.7/39.7 Wh (100%) model: Hewlett-Packard Primary status: Charging
CPU: Topology: Quad Core model: AMD A8-7410 APU with AMD Radeon R5 Graphics bits: 64 type: MCP arch: Puma rev: 1 L2 cache: 2048 KiB flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 17572 Speed: 998 MHz min/max: 1000/2200 MHz Core speeds (MHz): 1: 1006 2: 1087 3: 1177 4: 1288
Graphics: Device-1: AMD Mullins [Radeon R4/R5 Graphics] vendor: Hewlett-Packard driver: radeon v: kernel bus ID: 00:01.0 Display: x11 server: X.Org 1.20.3 driver: ati,radeon unloaded: modesetting resolution: 1366x768~60Hz OpenGL: renderer: AMD MULLINS (DRM 2.50.0 4.20.10-1-MANJARO LLVM 7.0.1) v: 4.5 Mesa 18.3.3 direct render: Yes
Audio: Device-1: AMD Kabini HDMI/DP Audio vendor: Hewlett-Packard driver: snd_hda_intel v: kernel bus ID: 00:01.1 Device-2: AMD FCH Azalia vendor: Hewlett-Packard driver: snd_hda_intel v: kernel bus ID: 00:14.2 Sound Server: ALSA v: k4.20.10-1-MANJARO
Network: Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Hewlett-Packard driver: r8168 v: 8.045.08-NAPI port: 2000 bus ID: 01:00.0 IF: eno1 state: down mac: <filter> Device-2: Broadcom and subsidiaries BCM43142 802.11b/g/n vendor: Hewlett-Packard driver: wl v: kernel port: 2000 bus ID: 05:00.0
Drives: Local Storage: total: 465.76 GiB used: 68.75 GiB (14.8%) ID-1: /dev/sda vendor: Toshiba model: MQ01ABF050 size: 465.76 GiB
Partition: ID-1: / size: 147.39 GiB used: 14.49 GiB (9.8%) fs: ext4 dev: /dev/sda1 ID-2: /home size: 294.29 GiB used: 54.26 GiB (18.4%) fs: ext4 dev: /dev/sda2 ID-3: swap-1 size: 4.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/sda3
Sensors: System Temperatures: cpu: 52.9 C mobo: 0.0 C gpu: radeon temp: 54 C Fan Speeds (RPM): N/A
Info: Processes: 176 Uptime: 1h 07m Memory: 3.32 GiB used: 1.74 GiB (52.3%) Init: systemd Compilers: gcc: 8.2.1 Shell: zsh v: 5.7.1 inxi: 3.0.30
Journalctl:

`➜ ~ journalctl -xe -p3 -b

-- Subject: A start job for unit network-suspend.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- A start job for unit network-suspend.service has finished with a failure.

-- The job identifier is 1544 and the job result is failed.
Feb 22 20:06:49 Capsparrow systemd[1]: sleep.target: Failed to set invocation ID for unit: File exists
Feb 22 20:06:49 Capsparrow systemd[1]: Failed to start Sleep.
-- Subject: A start job for unit sleep.target has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- A start job for unit sleep.target has finished with a failure.

-- The job identifier is 1543 and the job result is failed.
Feb 22 20:06:49 Capsparrow systemd[1]: network-resume.service: Failed to set invocation ID for unit: Fi>
Feb 22 20:06:49 Capsparrow systemd[1]: Failed to start Network resume service.
-- Subject: A start job for unit network-resume.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- A start job for unit network-resume.service has finished with a failure.

-- The job identifier is 1628 and the job result is failed.`

@poettering

Member

commented Feb 25, 2019

These failures almost surely are kernel or driver issues. Please contact your downstream distro about this first, and let them escalate issues to us if they are sure this is a systemd issue, which however I think is unlikely.

Also, when you file a bug here, please fill in the form supplied, i.e. provide the systemd version and such. We put that form up for a reason. Thank you for understanding.

poettering closed this Feb 25, 2019

@bl33pbl0p

Contributor

commented Feb 25, 2019

@poettering Someone has reported this in #systemd IRC before. The key thing here is that after the first sleep, systemd returns -EEXIST for unit_set_invocation_id (which uses hashmap operations entirely).

It might still not be a systemd bug, but the upshot is that units end up failing due to this error, so neither suspend nor shutdown works for the reporter. In particular, their logs show multiple jobs failing with "Failed to set invocation ID for unit: File exists".

In their case, writing the value to the sysfs file would work, but making systemd start the target up wouldn't.
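As a toy illustration of how a keyed insert can report -EEXIST (this is not systemd's hashmap code, and it makes no claim about why the IDs collide here): if invocation IDs are kept in a set and a newly assigned ID is already present, the insert fails. The names `id128`, `id_set`, `id_set_put` and the fixed capacity are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit ID, standing in for a unit's invocation ID. */
typedef struct { uint8_t bytes[16]; } id128;

/* Hypothetical fixed-size set of IDs; a real implementation would hash. */
#define ID_SET_CAPACITY 64
typedef struct {
    id128 items[ID_SET_CAPACITY];
    size_t count;
} id_set;

/* Insert an ID. Returns 0 on success, -EEXIST if the ID is already
 * present, or -ENOSPC if the set is full. */
int id_set_put(id_set *set, const id128 *id) {
    assert(set != NULL && id != NULL);
    for (size_t i = 0; i < set->count; i++)
        if (memcmp(set->items[i].bytes, id->bytes, sizeof id->bytes) == 0)
            return -EEXIST;
    if (set->count == ID_SET_CAPACITY)
        return -ENOSPC;
    set->items[set->count++] = *id;
    return 0;
}
```

With a set like this, two units that are handed the same 16 bytes cannot both be registered: the second `id_set_put()` returns -EEXIST, which is reported as "File exists".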

@bl33pbl0p

Contributor

commented Feb 26, 2019

@poettering Some bug reports in other distributions which were filed very recently:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921267

https://bbs.archlinux.org/viewtopic.php?id=244399

Looks like it broke after upgrading to a newer systemd for them...

@poettering

Member

commented Feb 26, 2019

ok

poettering reopened this Feb 26, 2019

@Experimenter


commented Mar 7, 2019

Facing a similar issue, downgrading systemd to v239 solves the problem. The same reappears on v240 & v241. Tested with Manjaro and Debian sid.

@TheBeasT15

Author

commented Mar 8, 2019

@Experimenter Yes, I did that and now it's resolved.

@sandy-8925


commented Mar 10, 2019

I'm also facing this problem all the time, on a fresh Arch Linux install, on a laptop with an AMD CPU.

I have another laptop and a desktop, each with Intel CPUs, and neither has this problem. Any chance that this specifically happens on AMD CPUs?

poettering added the login label Mar 11, 2019

@TheBeasT15

Author

commented Mar 17, 2019

Yep, I am on AMD too. I tried on an Intel CPU and it worked flawlessly.

@Experimenter


commented Mar 17, 2019

@thodnev


commented Mar 28, 2019

I also have an AMD CPU and experienced the same problem after the latest Arch upgrade.
I downgraded the systemd, lib32-systemd and systemd-sysvcompat packages to version 239.6-1 and it started working like a charm.
The problem is that it got packaged and distributed to many users as a stable update.
Probably distro maintainers should be more careful.
sudo lshw
journalctl -b -1

@oyvinds


commented Apr 5, 2019

Got this issue on Fedora 30 with kernel 5.0.6 on an "AMD E1-6010 APU with AMD Radeon R2 Graphics". The systemd version is systemd-241-4.gitcbf14c9.fc30.x86_64.

After suspend, any and all services fail to start with "Failed to set invocation ID for unit: File exists" and the machine is essentially unusable.

I tried using both radeon and amdgpu for the GPU part of the machine's APU; that makes no difference.

I'm not sure where to begin figuring out why systemd breaks after return from suspend. Let me know if/how I can provide more information.

@ChrisJAllan


commented Apr 6, 2019

I have this issue on my laptop with "AMD A6-6310 APU with AMD Radeon R4 Graphics", but not on my desktop with an AMD CPU and an nVidia GPU, so it's probably something on the GPU end.

@oyvinds


commented Apr 6, 2019

Just a little detail, suspend to disk doesn't have this problem on my machine - just suspend to RAM. Thus, using suspend to disk instead is a somewhat acceptable workaround.

@ranjan-purbey


commented Apr 25, 2019

In my case, the problem occurs not only with suspend but also after restarting any of the systemd services. For example, running the following:

$ sudo service systemd-resolved restart

causes any subsequent attempt to start or restart a service to fail.
Yes, I am also on an AMD CPU + Radeon GPU.

@ranjan-purbey


commented May 1, 2019

Even v242 has the same issue. So for the time being, downgrading to v239 seems to be the only workaround.

@mbiebl

Contributor

commented May 1, 2019

Apparently, this issue is easy to trigger, if you have the right hardware (and unfortunately none of the developers seem to have those).
So, if anyone who is affected by this issue can run a git bisect between v239 and v240 to find the first faulty commit, this would be super helpful.

@oyvinds


commented May 1, 2019

Triggering this really is super-easy on the AMD E1-6010; it's one of those happens-every-time bugs. Doing a git bisect on a 1.4 GHz dual-core isn't very tempting, though. That's the only hardware I have where this happens; Ryzen+RX570 doesn't have this problem. I could do it if nobody else, with hardware that doesn't spend days compiling systemd (or anything), does.

@mbiebl

Contributor

commented May 1, 2019

@oyvinds That would be great, thanks! If the AMD E1-6010 is indeed that slow, I would consider compiling it on your faster Ryzen system and copying the binaries/rpm over.

@madhur4127


commented May 1, 2019

@oyvinds @mbiebl, I am also affected by the bug and I want to help, but I am new to Open Source and require guidance as this project is so big!

I have an AMD A8-7410 quad-core at 2.2 GHz, so I think it would suffice.

@sandy-8925


commented May 1, 2019

I haven't used that problematic system recently, will see if I can still reproduce, and if yes then I will try bisecting.

@jimy-byerley


commented May 4, 2019

Hello everyone.
I have the same problem with an AMD E1 and AMD Radeon R2. My laptop can't suspend a second time and then can't shut down either.
I got the same kind of message in journalctl:

May 04 12:09:20 wopr systemd[1]: systemd-suspend.service: Failed to set invocation ID for unit: File exists
May 04 12:09:20 wopr systemd[1]: Failed to start Suspend.
-- Subject: Unit systemd-suspend.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit systemd-suspend.service has failed, with result failed.

@mbiebl

Contributor

commented May 4, 2019

I guess at this point we don't need further "me too's", as this doesn't really help us to find the root cause. Instead we need someone with the appropriate hardware to find the first faulty commit.

@madhur4127


commented May 5, 2019

@mbiebl, I am currently bisecting between v239 and v240.

Using systemd-nspawn -bi image.raw to test images causes the image to be loaded in the terminal itself. How can I test whether the current commit is good or bad?

I thought of using systemctl suspend twice to test the image, but the error displayed is: Failed to suspend system via logind: Sleep verb "suspend" not supported

@mbiebl

Contributor

commented May 5, 2019

@madhur4127 You'll need to run systemd on bare metal and not inside a systemd-nspawn container. You can't suspend a container via systemctl suspend.

@jimy-byerley


commented May 6, 2019

I was trying to compile systemd to do a git bisect, but I ran into an error during ./configure because my version of libmount (used by the mount command) is too old (< 2.30; I use Debian stable for the compilation).
I tried to recompile libmount myself from util-linux, but after that pkg-config still says my libmount version is 2.29. Is there a way to correct it?

@vcaputo

Member

commented May 6, 2019

@jimy-byerley In my experience it's simpler to just modify the meson.build file where the libmount >= 2.30 dependency is specified. However Debian stable's version of Meson is also too old for the systemd meson.build syntax, so I also have a newer Meson version built from source in /usr/local.

With those two changes systemd builds on Debian stable for me, unless I'm forgetting other things it needed.

@mbiebl

Contributor

commented May 7, 2019

> then you punish everyone on intel... we can certainly mask this out for amd cpus but if you ask me this sounds like something the kernel should deal with, not us.

I agree with you that ultimately the kernel should deal with this. That said, I think we should also address this from the systemd side for the time being, until a fixed kernel is available.
The current failure mode on affected systems is simply too nasty to not deal with this.

@poettering

Member

commented May 7, 2019

@tytso generic, classic Linux OSes are usually booted with an initrd, and those then search for the storage to use. These initrds run systemd themselves (as finding the storage might involve plenty of services to make complex storage work). This means in order to find the secret key you need the storage, but to find the storage we start plenty of services and thus want to generate uuids and thus need the secret key... So you have an ordering cycle here.

The good thing about RDRAND is that, when available, it is available without any precondition of having seen storage already.

@tytso btw, is there a way the boot loader could supply some random seed to the kernel to seed its initial pool from? i.e. does the kernel maybe accept a kernel cmdline option with random data? if so it should be easy to teach a boot loader to look for maybe a special raw partition that only contains random seed data, read that, mark it as invalidated and pass it to the kernel via the kernel cmdline. During OS boot the partition could then be filled with new seed data. That way, the kernel would always come up with a full pool, and the "saved random seed" logic would not be something we do during late boot (i.e. too late to be useful).

@mbiebl

Contributor

commented May 7, 2019

Btw, is using getrandom() still problematic during early boot if the kernel has been built with CONFIG_RANDOM_TRUST_CPU=y?

@poettering

Member

commented May 7, 2019

@mbiebl does debian set CONFIG_RANDOM_TRUST_CPU=y? I think doing that is generally a good idea. However we probably still want to support either case in systemd. (And dunno, maybe there's also the theoretical chance that CONFIG_RANDOM_TRUST_CPU stuff gets confused by RDRAND failing if you can make the system suspend so early that the RDRAND feeding thread hasn't completed its work yet...)

@mbiebl

Contributor

commented May 7, 2019

@poettering yes, the Debian kernel does set CONFIG_RANDOM_TRUST_CPU=y.
Which is why I was considering reverting cc83d51 for Debian (at least for buster). Or is there still a benefit of using RDRAND directly?

@poettering

Member

commented May 7, 2019

@mbiebl i think if you set that kernel option getrandom() should generally be fine to always use on systems with RDRAND. But I figure @tytso might know this better. IIUC the kernel will feed RDRAND data into the pool until the pool is initialized, inside a kernel thread specific to that purpose. I would assume that thread is finished with everything by the time userspace is invoked, but not sure if that's guaranteed. If it isn't there might still be value in doing RDRAND from userspace here even on kernels that do have CONFIG_RANDOM_TRUST_CPU=y set, because otherwise systemd would race against that feeder thread...

@tytso

Contributor

commented May 7, 2019

@poettering Sure, the right long-term answer is to make this to be the bootloader's problem. We could extend the boot loader protocol so that in addition to the initrd and boot command line, the boot loader could also pass to the kernel a 32 byte "secure random seed". That just pushes the problem back one level, and raises the question how much do you trust bootloader authors? Or for that matter, if the bootloader is going to be asking for random numbers from the UEFI BIOS, BIOS authors? They probably have the same level of competence as the people who wrote Intel's Management Engine (IME) and look what a disaster that was. Worse, the UEFI BIOS is all closed source, and not auditable. Still, this is the solution that OpenBSD chose, and given that we assume that many bootloaders have access to file systems (although in some cases the quality of that code is also not one that has made, say, the XFS developers very happy), it does mean that we can assume the bootloader can read a secure seed file from some file in the root file system, and then hopefully we can coordinate with systemd or other userspace components to refresh the secure seed file (a) as part of the initial bootup, and (b) also as part of the clean shutdown procedure. The problem is actually implementing this solution. Getting people with both (a) skills, (b) time, and (c) access directly, or indirectly, to get changes pushed into all of the various components, has just not happened up until now. We basically need people to implement an example with patches to grub, the kernel, and some systemd unit files. Maybe a GSOC project?

As far as how getrandom(2) works, we always will mix in RDRAND if it is available. This is separate from using RDRAND to help initialize the CRNG. Without CONFIG_RANDOM_TRUST_CPU, we'll mix in RDRAND since it can never make things worse, and hopefully will make things better ---- but we don't give any credit for the contributions from RDRAND, either. This is essentially to satisfy those people who are convinced that Intel let the NSA insert a back door into RDRAND, and/or looked at the quality of Intel's Management Engine implementation, and decided not to trust RDRAND on grounds of competence, as opposed to theories of secret collaboration with the US government. Even if you think that RDRAND is massively incompetently written, such that it's always returning 0xFFFFFFFF, XOR'ing a constant isn't going to make things any worse.
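A small sketch of that last point, assuming for illustration that mixing is a plain XOR of one pool word with one hardware word (the kernel's real mixing is more involved, and `mix_word` and `mix_demo` are hypothetical names): XOR with any fixed value is reversible, so a stuck RDRAND output cannot collapse two different pool states into one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mixing step: fold one hardware word into one pool word. */
uint64_t mix_word(uint64_t pool, uint64_t hw) {
    return pool ^ hw;
}

/* Checks two properties with a hardware word stuck at all ones. */
void mix_demo(void) {
    const uint64_t stuck = UINT64_MAX;
    uint64_t a = 0x0123456789abcdefULL;
    uint64_t b = 0x0123456789abcdeeULL;

    /* Distinct pool words stay distinct after mixing. */
    assert(mix_word(a, stuck) != mix_word(b, stuck));
    /* Mixing the same constant twice restores the original word. */
    assert(mix_word(mix_word(a, stuck), stuck) == a);
}
```

The same reasoning says nothing about using a stuck value on its own, which is the situation the rest of this thread is about.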

Ultimately, whether or not you enable CONFIG_RANDOM_TRUST_CPU is really more of a political question than anything else. Personally, I figure that if you are willing to use Intel CPUs, you had better darn well trust them. And if you don't trust them to get RDRAND right, how would you be trusting them not to make their CPU massively susceptible to side-channel cache timing attacks? (Oh, wait.... :-)

BTW, programs using RDRAND are supposed to check the Carry Flag (CF); if the CF is 0, that means no random value was available. So if AMD is returning a failure after a suspend resume, that's an unfortunate implementation, but it's not an outright violation of the abstraction contract. If it is returning a CF of 1, *and* always returning 0xFFFFFFFF, that's a complete disaster, and customers should be screaming at AMD for a refund. Was systemd checking the CF flag after calling RDRAND?

@oyvinds


commented May 7, 2019

> Is there a way to detect such a broken RDRAND implementation and fall back to getrandom() automatically?

You could check when RDRAND is used: get a value from RDRAND and, if it's 0xFFFFFFFFFFFFFFFF, fall back to getrandom().

You could also check if the CPU is AMD CPU family 22. Family 21 doesn't have RDRAND and Family 23 (Ryzen) is fine.

@poettering

Member

commented May 7, 2019

> @poettering Sure, the right long-term answer is to make this to be the bootloader's problem. We could extend the boot loader protocol so that in addition to the initrd and boot command line, the boot loader could also pass to the kernel a 32 byte "secure random seed". That just pushes the problem back one level, and raises the question how much do you trust bootloader authors?

well, there's sd-boot (a uefi boot loader we came up with in the systemd context), whose authors I tend to trust... ;-)

> Or for that matter, if the bootloader is going to be asking for random numbers from the UEFI BIOS, BIOS authors? They probably have the same level of competence as the people who wrote Intel's Management Engine (IME) and look what a disaster that was. Worse, the UEFI BIOS is all closed source, and not auditable. Still, this is the solution that OpenBSD chose, and given that we assume that many bootloaders have access to file systems (although in some cases the quality of that code is also not one that has made, say, the XFS developers very happy), it does mean that we can assume the bootloader can read a secure seed file from some file in the root file system, and then hopefully we can coordinate with systemd or other userspace components to refresh the secure seed file (a) as part of the initial bootup, and (b) also as part of the clean shutdown procedure. The problem is actually implementing this solution. Getting people with both (a) skills, (b) time, and (c) access directly, or indirectly, to get changes pushed into all of the various components, has just not happened up until now. We basically need people to implement an example with patches to grub, the kernel, and some systemd unit files. Maybe a GSOC project?

I am certainly willing to work on the systemd side of things. I presume the kernel changes would be small (in particular if this is implemented via a new kernel cmdline arg). Grub otoh...

BTW, I know of some cloud people who are interested in provisioning each image they spawn in their environments with an individually generated random seed. For cases like that it would be perfect if the 32byte value you suggested could be supplied via the kernel cmdline, since they have nice deployment infrastructure for controlling that already, and it needs no boot loader patching. Hence I'd actually prefer supplying the random seed via cmdline arg rather than boot protocol structures.

> As far as how getrandom(2) works, we always will mix in RDRAND if it is available. This is separate from using RDRAND to help initialize the CRNG. Without CONFIG_RANDOM_TRUST_CPU, we'll mix in RDRAND since it can never make things worse, and hopefully will make things better ---- but we don't give any credit for the contributions from RDRAND, either. This is essentially to satisfy those people who are convinced that Intel let the NSA insert a back door into RDRAND, and/or looked at the quality of Intel's Management Engine implementation, and decided not to trust RDRAND on grounds of competence, as opposed to theories of secret collaboration with the US government. Even if you think that RDRAND is massively incompetently written, such that it's always returning 0xFFFFFFFF, XOR'ing a constant isn't going to make things any worse.

BTW, could we please get CONFIG_RANDOM_TRUST_TPM= as well? I am pretty sure people who have a TPM would trust it as much or as little as the CPU itself...

> BTW, programs using RDRAND are supposed to check the Carry Flag (CF); if the CF is 0, that means no random value was available. So if AMD is returning a failure after a suspend resume, that's an unfortunate implementation, but it's not an outright violation of the abstraction contract. If it is returning a CF of 1, *and* always returning 0xFFFFFFFF, that's a complete disaster, and customers should be screaming at AMD for a refund. Was systemd checking the CF flag after calling RDRAND?

Yes, it appears that AMD is not correctly setting CF. We check for that, and are actually entirely fine if it's not set and will fall back to getrandom() in that case.

See: https://github.com/systemd/systemd/blob/master/src/basic/random-util.c#L54

@poettering

Member

commented May 7, 2019

> You could also check if the CPU is AMD CPU family 22. Family 21 doesn't have RDRAND and Family 23 (Ryzen) is fine.

is that verified? i presume cpuid will tell me the family? do you have example code maybe?

@poettering

Member

commented May 7, 2019

i presume i would first call cpuid with eax=0 and check for AuthenticAMD, and then call it with eax=1 and check for the family bits?
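The two-step check described above could look roughly like the sketch below. It only decodes register values, so it runs on any machine; on GCC or Clang the registers themselves would typically come from `__get_cpuid()` in `<cpuid.h>`. The function names are hypothetical, and whether "AMD family 22" is the right criterion is the suggestion made in this thread, not something verified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Build the 12-character vendor string from the registers returned by
 * CPUID leaf 0 (eax=0). The architectural order is EBX, EDX, ECX. */
void cpu_vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13]) {
    const uint32_t regs[3] = { ebx, edx, ecx };
    assert(out != NULL);
    for (int r = 0; r < 3; r++)
        for (int i = 0; i < 4; i++)
            out[r * 4 + i] = (char) ((regs[r] >> (8 * i)) & 0xFF);
    out[12] = '\0';
}

/* Decode the family from EAX of CPUID leaf 1 (eax=1): the base family is
 * in bits 11:8; when it is 0xF, the extended family (bits 27:20) is added. */
unsigned cpu_family_from_eax(uint32_t eax) {
    unsigned base = (eax >> 8) & 0xF;
    unsigned ext = (eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Hypothetical policy from the discussion: distrust RDRAND on AMD family 22. */
bool rdrand_suspect(const char *vendor, unsigned family) {
    return strcmp(vendor, "AuthenticAMD") == 0 && family == 22;
}
```

For example, an EAX value of 0x00730F01 has base family 0xF and extended family 0x07, which decodes to family 22.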

@poettering

Member

commented May 7, 2019

that said i wonder though if it wouldn't be easier to simply refuse accepting ULONG_MAX as random nr, and if we see it consider that equivalent to cf == 0. And do this regardless of the CPU vendor. However, for that it would be good to know if the affected cpus really always return ULONG_MAX and nothing else.

@poettering

Member

commented May 7, 2019

hmm, so here's a patch doing the ULONG_MAX check:

https://paste.fedoraproject.org/paste/Qhao0f9NszPj8K9EgCSbnw

Maybe people with an affected system could give that a whirl? it's a very simple way to hopefully filter out the bogus cases, under the assumption rdrand always returns ULONG_MAX when it fails like this.
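For readers without access to the paste, the idea can be sketched roughly as follows. This is not the linked patch: the hardware source is passed in as a function pointer so the logic can be exercised without an affected CPU, and `hw_rng_step`, `hw_random_filtered`, the `fake_*` helpers and the choice of -EAGAIN are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hardware source: returns true (carry flag set) when it
 * claims to have stored a random value in *value. */
typedef bool (*hw_rng_step)(unsigned long *value);

/* One attempt at a hardware random number with the sanity filter applied.
 * Returns 0 on success, or -EAGAIN if the caller should fall back to
 * another source such as getrandom(). */
int hw_random_filtered(hw_rng_step step, unsigned long *out) {
    unsigned long v;

    assert(step != NULL && out != NULL);
    if (!step(&v))
        return -EAGAIN;   /* CF == 0: no value available */
    if (v == ULONG_MAX)
        return -EAGAIN;   /* CF == 1 but all bits set: treated like CF == 0 */
    *out = v;
    return 0;
}

/* Stand-ins for real hardware. */
bool fake_stuck(unsigned long *value)   { *value = ULONG_MAX; return true; }
bool fake_no_data(unsigned long *value) { (void) value; return false; }
bool fake_good(unsigned long *value)    { *value = 0x1234UL; return true; }
```

The trade-off is that a genuine all-ones value is also rejected, which costs one fallback in roughly 2^64 calls on a 64-bit system with a working RNG.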

@mbiebl

Contributor

commented May 8, 2019

@poettering would it make sense to log about it if we get crap data (say with log_debug())?

@dj-on-github


commented May 8, 2019

Returning RdRand-->{c=0,result=-1} on an RNG underflow is inconsistent with the RdRand instruction spec. On underflow the carry is cleared and 0 is returned. This behaviour can be seen with RdSeed if you pull fast enough. The Intel SDG says to retry RdRand (based on carry) up to 10 times and then infer a failure if you don't get cc=1 by then (thus filtering hypothetical transient underflows). I don't work for AMD so I can't intuit what's going on, but where I work, lots of engineering went into meeting the resume dwell time so that this kind of problem doesn't happen on resume.
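The retry guidance mentioned above can be sketched like this, assuming a hypothetical `rng_try_fn` callback that reports the carry flag as its return value. The limit of 10 attempts comes from the comment; the names, the context pointer and the -EIO error code are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RNG_MAX_ATTEMPTS 10

/* Hypothetical single attempt: returns true (carry set) if *value is valid. */
typedef bool (*rng_try_fn)(void *ctx, uint64_t *value);

/* Retry on a clear carry flag; give up after RNG_MAX_ATTEMPTS tries.
 * Returns 0 on success or -EIO if the source never delivered a value. */
int rng_read_with_retry(rng_try_fn try_once, void *ctx, uint64_t *out) {
    assert(try_once != NULL && out != NULL);
    for (int attempt = 0; attempt < RNG_MAX_ATTEMPTS; attempt++)
        if (try_once(ctx, out))
            return 0;
    return -EIO;
}

/* Stand-in source: fails a configurable number of times, then succeeds. */
struct flaky_source {
    int failures_left;
    uint64_t value;
};

bool flaky_try(void *ctx, uint64_t *value) {
    struct flaky_source *src = ctx;
    if (src->failures_left > 0) {
        src->failures_left--;
        return false;
    }
    *value = src->value;
    return true;
}
```

A loop like this filters transient underflows only; it does not help when the carry flag is set and the returned value is bogus.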

A note of caution on the TPM - If you're using the TPM RNG as the only source of entropy and you're trying to do the secure transaction (I forget the name) on the TPM wires, there is no entropy on the host side for the nonce. You're literally using the RNG within the TPM to provide the entropy for the nonce for both the TPM and the Host messages. A security mess if ever there was one, especially with a MITM between the TPM and host.

@poettering

Member

commented May 9, 2019

> @poettering would it make sense to log about it if we get crap data (say with log_debug())?

Sure, we can do that, if we decide to merge the patch. In its current form it's only an exercise though, to see if this actually works around the issue. Would be great if anyone affected by this could test it and see if it fixes things for them properly. If so, we can rework this and merge it. If no one is willing to test this, it will remain open though, sorry.

@jimy-byerley


commented May 9, 2019

I can test: I have a fast processor to compile for my small one. That is, if @vcaputo's trick of changing the libmount version in the systemd dependencies works, at least.

@mbiebl

Contributor

commented May 9, 2019

I've compiled Debian packages for sid/buster, in case anyone running Debian wants to test the patch from @poettering.
See https://people.debian.org/~biebl/systemd/buster/

@jimy-byerley


commented May 10, 2019

Suspending and then resuming works again, thanks to the packages from @mbiebl!
I tried closing and opening my laptop 4 times and it works well again.
Is there any journalctl information you need?

poettering added a commit to poettering/systemd that referenced this issue May 10, 2019

random-util: eat up bad RDRAND values seen on AMD CPUs
An ugly, ugly work-around for systemd#11810. And no, we shouldn't have to do
this. This is something for AMD, the firmware or the kernel to
fix/work-around, not us. But nonetheless, this should do it for now.

Fixes: systemd#11810

@mbiebl

Contributor

commented May 10, 2019

Thanks for testing, @jimy-byerley.
For completeness' sake, can you attach your lscpu output?

@poettering

Member

commented May 10, 2019

I prepped a version of the earlier patch that should be good enough to commit now in #12536. PTAL.

Note that this is really just an ugly work-around, and we really shouldn't have to do this in userspace. Somebody who cares about this really should look into this, and put some pressure behind getting this fixed properly, i.e. in the Linux kernel: the CPU feature should probably be masked if RDRAND doesn't really work on those faulty CPUs. Maybe that someone should also contact AMD technical folks about this. It's unlikely that systemd is the only userspace tool affected by this, and the linux kernel might be as well if it credits entropy to an RDRAND implementation that has no entropy...

I am not going to be that someone though, my interest in AMD CPUs is relatively limited.

@mbiebl

Contributor

commented May 10, 2019

@tytso do you know someone from AMD who could chime in here?

@Hunman


commented May 14, 2019

@poettering I have an AMD A4 with this issue, I can try that workaround after work

@ignaciocaamanio


commented May 22, 2019

I have this issue with my AMD CPU and wanted to test this, but I can't build systemd. I use Arch Linux, and mkosi fails with a lot of "target not found" errors. Those packages aren't present in Arch, and I don't know how to fix it.
