Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346
We spent a bit of time going back and forth on IRC about this, and it seems that only the scalar setting makes the problem go away.
An update from the original thread:
@aerusso thank you for bringing this to our attention. The reported symptoms are consistent with what we'd expect if the FPU registers were somehow not being restored. We'll see if we can reproduce the issue locally using the 4.19 kernel and the provided test case. Would it be possible to try and reproduce the issue using a 5.2 or newer kernel?
Horrifyingly, I can reproduce this in a Debian buster VM on my Intel Xeon-D. I'm going to guess, since reports of this being on fire haven't otherwise trickled in, there might be a mismerge in Debian, or a missing followup patch?
I did a test with a Manjaro live USB and I could not reproduce this behaviour. Kernel: 5.2.11-1-MANJARO
This is a collection of some of the patches Debian applies to stable. I am hoping that openzfs#9346 can be triggered by a test here, as that would both explain why only Debian is able to reproduce the issue and show that there are already test cases to catch the error. Patches included:
- 2000-increase-default-zcmd-allocation-to-256K.patch
- linux-5.0-simd-compat.patch
- git_fix_mount_race.patch
- Fix-CONFIG_X86_DEBUG_FPU-build-failure.patch
- 3100-remove-libzfs-module-timeout.patch
I can reproduce it with kernel 4.19 and stress-ng too. With kernel 5.2 there are no errors.
Can confirm this too on 5.0. It seems that the assumption from the SIMD patch, that with 5.0 and 5.1 kernels preemption and local IRQ disabling is enough, is wrong:
If one checks out the kernel_fpu_{begin,end} methods from the 5.0 kernel, we can see that those also save the registers. I can fix this issue by doing the same, but my approach was really cumbersome, as the "copy_kernel_to_xregs_err", "copy_kernel_to_fxregs_err" and "copy_kernel_to_fregs_err" methods are not available, only those without "_err"; but as those use the GPL-only symbol "ex_handler_fprestore", I cannot use them here. So for my POC fix I ensured that on begin we always save the fpregs and on end always restore them, and to do so I just hacked over the functionality of those methods from the 5.3 kernel:
Related to the removal of the SIMD patch in the (future) 0.8.2 release #9161
I can reproduce this with
@GamerSource thanks for digging into this, that matches my understanding of the issue. What I don't quite understand yet is why this wasn't observed during the initial patch testing. It's possible it was due to my specific kernel configuration. Regardless, I agree the fix here is going to need to save and restore the registers, similar to the 5.2+ support. @shartge are you absolutely sure you were running a 5.2-based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.
I am 100% sure, as this Kernel
Also the version I copy-pasted was directly from
Edit: Interesting bit: I was not able to reproduce this with
Edit²: Here is the line
This was 10 minutes before my first comment. I let
Edit³: I also checked if the hardware is fine, of course. Without ZFS
@shartge would you mind checking the dkms build directory to verify that

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1
Scratch that, I don't need that specific system to test the build, I can just use any Debian Buster system for that, for example any of my test VMs. Using
I am attaching the whole file in case it may be helpful.
@shartge I had to reduce the CPUs to 18 for stress-ng because scrub was pausing while using all 32 CPUs.
I now did a real test with the VM I used to do the compile test in #9346 (comment) and I am able to reproduce the bug very fast. Using a 4 disk RAIDZ and
CPU for this system is
Kernel and ZFS version can be found in #9346 (comment)
This is a VMware client.
This should not matter, as I can reproduce the same problem on 2 physical systems. But because both of them are production storage systems, it is easier for me to do this in a VM, as long as it shows the same behaviour.
The worst thing is that inside KVM the VMs get FPU errors too, even though they don't use ZFS.
This does not happen with VMware ESX for me. I've been running
Only when I have ZFS active and put I/O load on it do the FPU errors start to occur. The same also happened for me with the two physical systems I used to test this.
@shartge Are you using ZFS on the VMware host?
No! I just quickly created a test VM to test the compilation of the module without the need to use and interrupt my production storages. And I also tried to reproduce this issue here in a VM instead of a physical host, which, as I have shown, I was successful in doing. But, again: the error is reproducible on normal hardware with 5.2 and 0.8.1. (Using a VM is just more convenient.)
Summary:
With Buster and 5.2 I don't get the FPU errors, but it's a dom0 under Xen: #9346 (comment)
Who knows what Xen does with the FPU state in the Dom0. It could be a false negative as well.
Now I checked it with 5.2 and without Xen. No FPU errors.
Switching to AVX2/FMA3-FFT for mprime and using "fastest" (i.e. AVX512) in ZFS also creates errors. Switching ZFS to AVX2 while keeping mprime also at AVX2 creates errors, too. And finally, setting ZFS to "ssse3" and keeping mprime at AVX2 still creates errors. But @Fabian-Gruenbichler was able to reproduce this, so I can finally stop doubting myself.
Interesting observation: If I keep ZFS at fastest/AVX512 and configure mprime to not use any modern SIMD instructions other than SSE2, I am no longer able to reproduce the problem. For local.txt:
And mprime passes all three self tests:
As soon as I enable anything above SSE2, starting with AVX, the errors return.
In SSE modes the XMM registers are used, which are the lower half of the AVX (YMM) registers (or the lowest quarter of the AVX512 ZMM registers). Since this issue seems to be about saving/restoring registers when switching threads, using only the lower part of a register technically shouldn't change anything. If Prime95 is actually using SSE2 instructions, that is... But maybe, just maybe, I'm really speculating here, the kernel actually does save/restore the SSE (XMM) registers, so the problem does not appear when Prime95 is only using XMM registers. It's the upper part of the YMM register that causes the problem; that is, only the SSE registers are saved/restored instead of the whole 256-bit AVX ones. I don't know if this is possible :) Just thought I'd share an idea. EDIT: this could happen if the FXSAVE instruction which is called explicitly by #9406 works as expected, but the XSAVE feature in the kernel doesn't work or isn't called correctly for some reason.
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to unconditionally save and restore the FPU state.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9346
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struct. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads, so we remove the code which was added to use task queues to ensure some functions ran in kernel threads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9346
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struct. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads, so we remove the code which was added to use task queues to ensure some functions ran in kernel threads.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346
Closes #9403
I'd like to throw in a few words just before any final "fix" is committed and routed forward: https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/
On October 25, 2019 1:38 am, vorsich-tiger wrote:
> I'd like to throw in a few words just before any final "fix" is committed and routed forward:
> There has been (or, depending on the source, there maybe still is) a major bug in kernel 5.2 when running KVM VMs.
> I think it might be worth re-evaluating whether the currently suggested patch series over-reacts to non-existing ZFS problems on 5.2+ kernels.
> I.e. any 5.2 host running a VM indicating a ZFS bug just *might* deliver a false positive.
> I guess it is best to re-run any bug-positive-indicative test on a VM not hosted on a 5.2 kernel, just to be sure.
>
> https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/
> i.e.
> https://bugzilla.kernel.org/show_bug.cgi?id=204209
> https://lkml.org/lkml/2019/7/17/758
> etc.

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.
On October 25, 2019 7:41 am, Fabian Grünbichler wrote:
> the issue also occurs on baremetal hosts that have no VMs running
> whatsoever, and on kernels earlier than 5.2.

and the fixed 5.2/5.3 kernels are affected as well (the fix is contained in 5.2.5 and all released 5.3 versions).
@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.
Negative on that. The same problem can be reproduced on baremetal hosts not running KVM, using 5.2+.
I did not misunderstand your post. I am one of the devs who triaged this bug initially, analyzed the old code, verified a workaround on our downstream side, and reviewed the now merged fix 😉 See the detailed testing report (on baremetal!) over at
The approach that was used for 5.2 was in theory sound for < 5.2, but not workable for GPL/license reasons. It was broken for 5.2+ though, as was the approach for < 5.2 on < 5.2 kernels. The only thing that really worked was the kernel-only solution (and a combination of the 5.2+ approach with helper backports on < 5.2 kernels). In other words, it was broken all around, irrespective of other FPU-related breakage on some 5.2 versions.
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struct. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads, so we remove the code which was added to use task queues to ensure some functions ran in kernel threads.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9346
Closes openzfs#9403
PR #9515 contains a 0.8 backport of the fix applied to master.
I will be able to test this on my systems tomorrow GMT morning. |
Is this by any chance the same bug the Go authors found: https://bugzilla.kernel.org/show_bug.cgi?id=205663#c2
@mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.
On December 24, 2019 7:08 pm, Brian Behlendorf wrote:
> @mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

I saw that while triaging (I think I even linked it in one of the issues as a possible culprit?) but quickly ruled it out. Might have affected some user reports though, if they (/their distro) used the affected kernel+gcc versions.
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to unconditionally save and restore the FPU state.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9346
(cherry picked from commit 813fd01)
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Closing. The SIMD patches have been included in the 0.8.3 release.
At the moment we experience bad instabilities with Linux 5.3: openzfs/zfs#9346 as the ZFS-native method of disabling the FPU is buggy.
(cherry picked from commit 96097ab)
- 2000-increase-default-zcmd-allocation-to-256K.patch
- git_fix_mount_race.patch
Remove patch:
- linux-5.0-simd-compat.patch
which causes #940932 under certain conditions. openzfs/zfs#9346
Gbp-Dch: Full
System information
I'm duplicating Debian bug report 940932. Because of the severity of the bug report (claims data corruption), I'm directly posting it here before trying to confirm with the original poster. If this is inappropriate, I apologize, and please close the bug report.
Describe the problem you're observing
Rounding error failure in the mprime torture test that goes away when
/sys/module/zfs/parameters/zfs_vdev_raidz_impl
and /sys/module/zcommon/parameters/zfs_fletcher_4_impl
are set to scalar.
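For anyone wanting to make that workaround persistent across reboots rather than writing to sysfs each time, a minimal modprobe.d sketch (the filename is arbitrary; the parameter names are the ones quoted above):

```
# /etc/modprobe.d/zfs-simd-workaround.conf
# Force the scalar (non-SIMD) implementations until a fixed build is installed.
options zfs zfs_vdev_raidz_impl=scalar
options zcommon zfs_fletcher_4_impl=scalar
```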
Describe how to reproduce the problem
Quoting the bug report:
Include any warning/errors/backtraces from the system logs
mprime: