Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

aerusso · 2019-09-22T16:06:07Z

System information

I'm duplicating Debian bug report 940932. Because of the severity of the bug report (claims data corruption), I'm directly posting it here before trying to confirm with the original poster. If this is inappropriate, I apologize, and please close the bug report.

Type	Version/Name
Distribution Name	Debian
Distribution Version	stable
Linux Kernel	4.19.67
Architecture	amd64 (Ryzen 5 2600X and Ryzen 5 2600 on X470 GAMING PLUS (MS-7B79) BIOS version: 7B79vAC)
ZFS Version	zfs-linux/0.8.1-4~bpo10+1

Describe the problem you're observing

Rounding error failure in mprime torture test that goes away when
/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl are set to scalar.

Describe how to reproduce the problem

Quoting the bug report:

recently I have noticed some instability on one of my machines.
The mprime (https://www.mersenne.org/download/) Torture Tests would occasionaly show errors like

"FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file."

random commands would occasionaly segfault.

While trying to narrow down the problem I have replaced the PSU, RAM and the CPU. Multiple hour long runs of memtest86 did not show any problem.

Finally I was able to narrow down the reads from ZFS volumes as the trigger for the instability.
Scrubbing the volume would cause mprime to error out especially quickly.

As a workaround I switched the SIMD acceleration off by piping "scalar" to

/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl

and that made the system stable again.

Include any warning/errors/backtraces from the system logs

mprime:

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.

The text was updated successfully, but these errors were encountered:

rincebrain · 2019-09-22T22:07:03Z

We spent a bit of time going back and forth on IRC about this, and it seems that only the scalar setting makes the problem go away.

alex-gh · 2019-09-23T04:58:29Z

An update from the original thread:

A quick update:

I have booted up the Debian live USB on another machine and was able to
reproduce this bug with it.

The machine had the Ryzen 5 2600 CPU (the one I swapped with the machine
I have originally found the problem on).

The Mainboard is: ASUS PRIME B350-PLUS
BIOS Version: 5216

Output of uname -a:
Linux debian 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64
GNU/Linux

Output of zfs --version:
zfs-0.8.1-4bpo10+1
zfs-kmod-0.8.1-4bpo10+1

Also here are the steps I'm taking to reproduce the problem:

Start mprime for linux 64 bit

Select Torture Test

Choose 12 torture test threads in case of ryzen 5 (default setting)

Select Test (2) Small FFT

All other settings are set to default settings

Run the test

Read data from zfs by either reading a large file or starting a scrub.
(raidz scrubs are escpecially effective)

Within a few seconds you should see mprime reporting errors.

behlendorf · 2019-09-23T19:26:00Z

@aerusso thank you for bringing this to our attention. The reported symptoms are consistent with what we'd expect if the fpu registered were someone not being restored. We'll see if we can reproduce the issue locally using the 4.19 kernel and the provided test case. Would it be possible to try and reproduce the issue using a 5.2 or newer kernel?

rincebrain · 2019-09-24T00:46:06Z

Horrifyingly, I can reproduce this in a Debian buster VM on my Intel Xeon-D.

I'm going to guess, since reports of this being on fire haven't otherwise trickled in, there might be a mismerge in Debian, or a missing followup patch?

alex-gh · 2019-09-24T02:36:29Z

I did a test with a Manjaro live USB and I could not reproduce this behaviour.

Kernel: 5.2.11-1-MANJARO
ZFS package: archzfs/zfs-dkms-git 2019.09.18.r5411.gafc8f0a6f-1

This is a collection of some of the patches Debian applies to stable. I am hoping that openzfs#9346 can be triggered by a test here, as that would both explain why only Debian is able to reproduce the issue, and that there are already test cases to catch the error. Patches included: 2000-increase-default-zcmd-allocation-to-256K.patch linux-5.0-simd-compat.patch git_fix_mount_race.patch Fix-CONFIG_X86_DEBUG_FPU-build-failure.patch 3100-remove-libzfs-module-timeout.patch

ggzengel · 2019-09-25T06:24:50Z

I can reproduce it with kernel 4.19 and stress-ng too.
I get more than 5 errors per minute.

With kernel 5.2 there are no errors.

root# zpool scrub zpool1
root# stress-ng --vecmath 9 --fp-error 9 -vvv --verify --timeout 3600
stress-ng: debug: [20635] 32 processors online, 32 processors configured
stress-ng: info:  [20635] dispatching hogs: 9 vecmath, 9 fp-error
stress-ng: debug: [20635] cache allocate: default cache size: 20480K
<snip>
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 1.000000 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(-1000000.0) return was 1.000000 (expected 0.000000), errno=0 (expected 34), excepts=0 (expected 16)
stress-ng: fail:  [22389] stress-ng-fp-error: log(0.0) return was 51472868343212123638854435100661726861789564087474337372834924821256607581904275443789550923204262543290261262543297927616110435675714711004645013184740565747574812535257726048857959524537318313055909029913182014561534585350486375714439359868335816704.000000 (expected -0.000000), errno=34 (expected 34), excepts=4 (expected 4)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 0.000000 (expected inf), errno=0 (expected 34), excepts=8 (expected 8)
stress-ng: fail:  [22407] stress-ng-fp-error: exp(-1000000.0) return was -304425543965041899037761188749362776730427289735837064756329392319501601366578319214648354685850550352787929416219211679117562590779680584744448269412872882932591437212235151179776.000000 (expected 0.000000), errno=0 (expected 34), excepts=16 (expected 16)
stress-ng: fail:  [22397] stress-ng-fp-error: exp(DBL_MAX) return was 1.000315 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)

# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  32
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2399.755
BogoMIPS:            4800.04
Hypervisor vendor:   Xen
Virtualization type: none
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-31
Flags:               fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault intel_ppin ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt

# uname -a
Linux server2 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux

ThomasLamprecht · 2019-09-25T11:50:11Z

Can confirm this too on 5.0. It seems that the assumption from the SIMD patch, that with 5.0 and 5.1 kernels preemption and local IRQ disabling is enough, is wrong:

For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used. All non-kernel
threads will restore the preserved user FPU state.
-- commit message of commit e5db313

If one checks out the kernel_fpu_{begin,end} methods from 5.0 kernel we can see that those safe the registers also. I can fix this issue by doing so, but my approach was really cumbersome as the "copy_kernel_to_xregs_err", "copy_kernel_to_fxregs_err" and "copy_kernel_to_fregs_err" methods are not avaialble, only those without "_err", but as those use the GPL symboled "ex_handler_fprestore" I cannot use them here.

So for my POC fix I ensured that on begin we always save the fpregs, and for the end always restore, and to do so I just hacked over the functionally of those methods from the 5.3 Kernel:
(note quite minimal hacky change as a POC fix to show the issue)

diff --git a/include/linux/simd_x86.h b/include/linux/simd_x86.h
index 5f243e0cc..08504ba92 100644
--- a/include/linux/simd_x86.h
+++ b/include/linux/simd_x86.h
@@ -179,7 +180,6 @@ kfpu_begin(void)
        preempt_disable();
        local_irq_disable();
 
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        /*
         * The current FPU registers need to be preserved by kfpu_begin()
         * and restored by kfpu_end().  This is required because we can
@@ -188,32 +188,51 @@ kfpu_begin(void)
         * context switch.
         */
        copy_fpregs_to_fpstate(&current->thread.fpu);
-#elif defined(HAVE_KERNEL_FPU_INITIALIZED)
        /*
         * There is no need to preserve and restore the FPU registers.
         * They will always be restored from the task's stored FPU state
         * when switching contexts.
         */
        WARN_ON_ONCE(current->thread.fpu.initialized == 0);
-#endif
 }
+#ifndef kernel_insn_err
+#define kernel_insn_err(insn, output, input...)                                \
+({                                                                     \
+       int err;                                                        \
+       asm volatile("1:" #insn "\n\t"                                  \
+                    "2:\n"                                             \
+                    ".section .fixup,\"ax\"\n"                         \
+                    "3:  movl $-1,%[err]\n"                            \
+                    "    jmp  2b\n"                                    \
+                    ".previous\n"                                      \
+                    _ASM_EXTABLE(1b, 3b)                               \
+                    : [err] "=r" (err), output                         \
+                    : "0"(0), input);                                  \
+       err;                                                            \
+})
+#endif
+
 
 static inline void
 kfpu_end(void)
 {
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        union fpregs_state *state = &current->thread.fpu.state;
-       int error;
+       int err = 0;
 
        if (use_xsave()) {
-               error = copy_kernel_to_xregs_err(&state->xsave, -1);
+               u32 lmask = -1;
+               u32 hmask = -1;
+               XSTATE_OP(XRSTOR, &state->xsave, lmask, hmask, err);
        } else if (use_fxsr()) {
-               error = copy_kernel_to_fxregs_err(&state->fxsave);
+               struct fxregs_state *fx = &state->fxsave;
+               if (IS_ENABLED(CONFIG_X86_32))
+                       err = kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+               else
+                       err = kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
        } else {
-               error = copy_kernel_to_fregs_err(&state->fsave);
+               copy_kernel_to_fregs(&state->fsave);
        }
-       WARN_ON_ONCE(error);
-#endif
+       WARN_ON_ONCE(err);
 
        local_irq_enable();
        preempt_enable();

Related to the removal of the SIMD patch in the (future) 0.8.2 release #9161

shartge · 2019-09-25T11:51:53Z

With kernel 5.2 there are no errors.

I can reproduce this with mprime -t on Debian Buster running 5.2.9-2~bpo10+1 and zfs-dkms 0.8.1-4~bpo10+1 after ~1 minute of runtime:

[Worker #1 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #6 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #7 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #8 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #5 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #3 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #2 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:43] FATAL ERROR: Rounding was 4.029914356e+80, expected less than 0.4
[Worker #4 Sep 25 13:43] Hardware failure detected, consult stress.txt file.
[Worker #4 Sep 25 13:43] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #4 Sep 25 13:43] Worker stopped.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Stepping:            2
CPU MHz:             1201.117
CPU max MHz:         3600.0000
CPU min MHz:         1200.0000
BogoMIPS:            6999.89
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

behlendorf · 2019-09-25T17:43:49Z

@GamerSource thanks for digging in to this, that matches my understanding of the issue. What I don't quite understand yet is why this wasn't observed during the initial patch testing. It may be possible it was due to my specific kernel configuration. Regardless, I agree the fix here is going to need to be to save and restore the registers similar to the 5.2+ support.

@shartge are you absolutely sure you were running with an 5.2 based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

shartge · 2019-09-25T17:49:16Z

@shartge are you absolutely sure you were running with an 5.2 based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

I am 100% sure, as this Kernel 5.2.9-2~bpo10+1 was the only Kernel installed on that system at that moment.

Also the version I copy-pasted was directly from uname -a.

Edit: Interesting bit: I was not able to reproduce this with stress-ng, as @ggzengel was, but mprime triggered it right away.

Edit²: Here is the line stress-ng logged via syslog:

Sep 25 13:35:46 storage-01 stress-ng: system: 'storage-01' Linux 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64

This was 10 Minutes before my first comment. I let stress-ng run for ~6 minutes with a scrub running a the same time. When that did non show any fails, I retested with mprime -t at 13:42, which immediately hit the problem at 13:43.

Edit³: I also checked if the hardware is fine, of course. Without ZFS mprime -t ran for 2 hours without any errors.

behlendorf · 2019-09-25T18:15:15Z

@shartge would you mind checking the dkms build directory to verify that HAVE_KERNEL_TIF_NEED_FPU_LOAD was defined in the zfs_config.h file.

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1

shartge · 2019-09-25T18:19:58Z

~~I will, but it will have to wait until tomorrow, because right now I have reverted the system back to 4.19 and 0.7.2 and I have to wait until the backup window has finished.~~ See #9346 (comment)

shartge · 2019-09-25T18:32:43Z

Scratch that, I don't need that specific system to test the build, I can just use any Debian Buster system for that, for example any of my test VMs.

Using Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux and zfs-dkms 0.8.1-4~bpo10+1 I get:

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1

I am attaching the whole file in case it may be helpful.
zfs_config.h.txt

ggzengel · 2019-09-25T22:40:37Z

@shartge I had to reduce the CPUs to 18 for stress-ng because scrub was pausing while using all 32 CPUs.
I use n/2+2 CPUs because I have a NUMA system with 2 nodes.

shartge · 2019-09-26T05:37:50Z

I now did a real test with the VM I used to do the compile test in #9346 (comment) and I am able to reproduce the bug very fast.

Using a 4 disk RAIDZ and dd if=/dev/zero of=testdata.dat bs=16M while running mprime -t at the same time quickly results in

[Worker #4 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #3 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #2 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:34] FATAL ERROR: Rounding was 1944592149, expected less than 0.4
[Worker #1 Sep 26 07:34] Hardware failure detected, consult stress.txt file.
[Worker #1 Sep 26 07:34] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #1 Sep 26 07:34] Worker stopped.

CPU for this system is

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:            0
CPU MHz:             3092.734
BogoMIPS:            6185.46
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt arat md_clear flush_l1d arch_capabilities

Kernel and ZFS version can be found in #9346 (comment)

ggzengel · 2019-09-26T08:51:05Z

This is a vmware client.
Does vmware have special FPU&IRQ handling inside the kernel or have a bug?

shartge · 2019-09-26T08:54:29Z

This should not matter, as I can reproduce the same problem on 2 physical systems.

But because both of them are production storages, it is easier for me to do this in a VM, as long as it shows the same behaviour.

ggzengel · 2019-09-26T10:40:56Z

The worst thing is that inside KVM the VMs get FPU errors too even they don't use ZFS.
I started a Debian live CD inside Proxmox, installed stress-ng and get a lot of errors if I start ZFS scrub at the host.

shartge · 2019-09-26T11:40:43Z

This does not happen with VMware ESX for me. I've been running mprime -t in my Test-VM since 07:00 today and got not one single error.

Only when I have ZFS active and put I/O load on it, the FPU errors start to occur.

The same also happened for my with the two physical systems I used to test this.

ggzengel · 2019-09-26T12:34:08Z

@shartge Are you using ZFS on VMware host?

shartge · 2019-09-26T12:59:08Z

No!

I just quickly created a test VM to test the compilation of the module without the need to use and interrupt my production storages.

And I also tried to reproduce this issue here in a VM instead of a physical host, which, as I have show, I was successful in doing.

But, again: The error is reproducible on normal hardware with 5.2 and 0.8.1. (Using a VM is just more convenient.)

ggzengel · 2019-09-26T17:41:26Z

Summary:

This happens only with ZFS 8.X
FPU errors are always with kernel 4.19 - 5.1
It shouldn't be with kernel 5.2 but there are exceptions
3.1. @shartge gets FPU errors even with kernel 5.2 too
3.2. @alex-gh and I didn't get errors with kernel 5.2
I get FPU errors inside KVM-VM with ZFS 8.x and kernel 5.0 running at host side (Proxmox). There is no ZFS code inside the VM.
The workaround is:
5.1 run
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
5.2 for persistence add
zfs.zfs_vdev_raidz_impl=scalar zcommon.zfs_fletcher_4_impl=scalar
to kernel parameter (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on debian and run update_grub)

shartge · 2019-09-26T17:46:26Z

@shartge gets FPU errors even with kernel 5.2 too

Note that this is with the code patched by Debian for both the Kernel and ZFS. I have yet to try the vanilla ZFS code with 5.2.

It could very well be that the inclusion of e5db313 by Debian causes this.

ggzengel · 2019-09-26T17:49:15Z

With Buster and 5.2 I don't get the FPU errors but it's dom0 from XEN: #9346 (comment)

shartge · 2019-09-26T17:55:54Z

Who knows what Xen does with the FPU state in the Dom0. It could be a false negative as well.

ggzengel · 2019-09-26T20:53:32Z

Now I checked it with 5.2 and without XEN. No FPU errors.

# cat /etc/apt/sources.list | grep -vE "^$|^#"
deb http://deb.debian.org/debian/ buster main non-free contrib
deb http://security.debian.org/debian-security buster/updates main contrib non-free
deb http://deb.debian.org/debian/ buster-updates main contrib non-free
deb http://deb.debian.org/debian/ buster-backports main contrib non-free

# dkms status
zfs, 0.8.1, 4.19.0-6-amd64, x86_64: installed
zfs, 0.8.1, 5.2.0-0.bpo.2-amd64, x86_64: installed

# uname -a
Linux xenserver2.donner14.private 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux

# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2599.803
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000
BogoMIPS:            4799.63
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

# modinfo zfs
filename:       /lib/modules/5.2.0-0.bpo.2-amd64/updates/dkms/zfs.ko
version:        0.8.1-4~bpo10+1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
alias:          devname:zfs
alias:          char-major-10-249
srcversion:     FA9BDA7077DD9A40222C4B8
depends:        spl,znvpair,icp,zlua,zunicode,zcommon,zavl
retpoline:      Y
name:           zfs

# apt list | grep zfs | grep installed
libzfs2linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfs-dkms/buster-backports,now 0.8.1-4~bpo10+1 all [installed,automatic]
zfs-zed/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfsutils-linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed]

shartge · 2019-10-07T10:35:08Z

But what might be worth it is disabling AVX512 support in Prime95. So you should see it using "FMA3 FFT" (which means using AVX2) and not "AVX-512 FFT". I don't know if it would make a difference but it seems like a good idea to me, when you using AVX2 code in ZFS.

You can do that by putting "CpuSupportsAVX512F=0" in local.txt (https://www.tomshardware.com/reviews/stress-test-cpu-pc-guide,5461-2.html)

Switching to AVX2/FMA3-FFT for mprime and using "fastest" (i.e AVX512) in ZFS also creates errors.

Switching ZFS to AVX2 while keeping mprime also at AVX2 creates errors, too.

And finally, setting ZFS to "ssse3" and keeping mprime at AVX2 still creates errors.

But @Fabian-Gruenbichler was able to reproduce this, so I can finally stop doubting myself.

shartge · 2019-10-07T11:00:12Z

Interesting observation:

If I keep ZFS at fastest/AVX512 and configure mprime to not use any modern SIMD instructions other than SSE2, I am no longer able to reproduce the problem.

For local.txt:

CpuSupportsAVX512F=0
CpuSupportsAVX2=0
CpuSupportsFMA4=0
CpuSupportsFMA3=0
CpuSupportsAVX=0

And mprime passes all three self tests:


[Worker #1 Oct 7 12:43] Test 1, 3100 Lucas-Lehmer iterations of M21871519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:44] Test 2, 3100 Lucas-Lehmer in-place iterations of M20971521 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:44] Test 3, 3100 Lucas-Lehmer iterations of M20971519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:45] Test 4, 4000 Lucas-Lehmer in-place iterations of M19922945 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:46] Self-test 1120K passed!
[Worker #1 Oct 7 12:46] Test 1, 1600000 Lucas-Lehmer in-place iterations of M83839 using FFT length 4K.
[Worker #1 Oct 7 12:47] Test 2, 1600000 Lucas-Lehmer in-place iterations of M82031 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 3, 1600000 Lucas-Lehmer in-place iterations of M79745 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 4, 1600000 Lucas-Lehmer in-place iterations of M77455 using FFT length 4K.
[Worker #1 Oct 7 12:49] Self-test 4K passed!
[Worker #1 Oct 7 12:49] Test 1, 1120000 Lucas-Lehmer in-place iterations of M107519 using FFT length 5K.
[Worker #1 Oct 7 12:50] Test 2, 1120000 Lucas-Lehmer in-place iterations of M106497 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 3, 1120000 Lucas-Lehmer in-place iterations of M104447 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 4, 1120000 Lucas-Lehmer in-place iterations of M102401 using FFT length 5K.
[Worker #1 Oct 7 12:52] Self-test 5K passed!

As soon as I enable anything above SSE2, starting with AVX, the errors return.

vstax · 2019-10-07T11:18:57Z

If I keep ZFS at fastest/AVX512 and configure mprime to not use any modern SIMD instructions other than SSE2, I am no longer able to reproduce the problem.

In SSE modes XMM registers are used, which are lower half of AVX (YMM) registers (or lower quad of AVX512 ZMM registers). Since this issue seems to be about saving/restoring registers when switching threads, using only lower part of register technically shouldn't change anything. If Prime95 is actually using SSE2 instructions, that is...

But maybe, just maybe, I'm really speculating here, kernel actually does save/restore on SSE (XMM) registers so the problem does not appear when Prime95 is only using XMM registers. It's upper part of YMM register that causes problem, that is, only SSE registers are saved/restored instead of whole 256 bit AVX ones. I don't know if this is possible :) Just thought I'd share an idea.

EDIT: this could happen if FXSAVE instruction which is called explicitly by #9406 works as expected but XSAVE feature in kernel doesn't work or isn't called correctly for some reason.

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to unconditionally save and restore the FPU state. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #9346 Closes #9403

vorsich-tiger · 2019-10-24T23:38:37Z

I'd like to throw in a few words just before any final "fix" is committed and routed forward:
There has been (or depending on the source there maybe still is) a major bug in kernel 5.2 when running kvm VMs.
I think it might be worth to re-evaluate whether the currently suggested patch series actually over-reacts unnecessarily to non-existing zfs-problems in 5.2+ kernels.
I.e. any 5.2 host running a VM indicating a zfs-bug just might deliver a false positive.
I guess it is best to re-run any bug-positive-indicative test on a non 5.2-kernel hosted VM just to be sure.

https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/
i.e.
https://bugzilla.kernel.org/show_bug.cgi?id=204209
https://lkml.org/lkml/2019/7/17/758
etc.

Fabian-Gruenbichler · 2019-10-25T05:42:06Z

On October 25, 2019 1:38 am, vorsich-tiger wrote: I'd like to throw in a few words just before any final "fix" is committed and routed forward: There has been (or depending on the source there maybe still is) a major bug in kernel 5.2 when running kvm VMs. I think it might be worth to re-evaluate whether the currently suggested patch series actually over-reacts unnecessarily to non-existing zfs-problems in 5.2+ kernels. I.e. any 5.2 host running a VM indicating a zfs-bug just *might* deliver a false positive. I guess it is best to re-run any bug-positive-indicative test on a non 5.2-kernel hosted VM just to be sure. https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/ i.e. https://bugzilla.kernel.org/show_bug.cgi?id=204209 https://lkml.org/lkml/2019/7/17/758 etc.

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

Fabian-Gruenbichler · 2019-10-25T06:16:12Z

On October 25, 2019 7:41 am, Fabian Grünbichler wrote: On October 25, 2019 1:38 am, vorsich-tiger wrote: > I'd like to throw in a few words just before any final "fix" is committed and routed forward: > There has been (or depending on the source there maybe still is) a major bug in kernel 5.2 when running kvm VMs. > I think it might be worth to re-evaluate whether the currently suggested patch series actually over-reacts unnecessarily to non-existing zfs-problems in 5.2+ kernels. > I.e. any 5.2 host running a VM indicating a zfs-bug just *might* deliver a false positive. > I guess it is best to re-run any bug-positive-indicative test on a non 5.2-kernel hosted VM just to be sure. > > https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/ > i.e. > https://bugzilla.kernel.org/show_bug.cgi?id=204209 > https://lkml.org/lkml/2019/7/17/758 > etc. the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

and the fixed 5.2/5.3 kernels are affected as well (the fix is contained in 5.2.5 and all released 5.3 versions).

vorsich-tiger · 2019-10-25T10:39:01Z

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.
1.
I wanted to get everybody on the same page in relation to the fact that not only zfs might be "disturbing" the SIMD processing subsystem in the kernel, but that there is a potential that other kernel portions might also be broken - the reference I gave shows that this was actually true.
2.
I am not questioning potentially required zfs SIMD fixes for kernel versions below 5.2
3.
It is my impression that the developers took quite some time thinking to establish certain assumptions that should be safe to be made for kernels starting with 5.2. Within the initial comments of this issue I see developers' statements which assume zfs SIMD for 5.2+ is not broken.
I merely wanted to raise awareness that tests indicating the opposite should be re-evaluated with the info from that reddit post in mind, i.e. just maybe zfs SIMD for 5.2+ is really not broken.

shartge · 2019-10-25T10:56:18Z

i.e. just maybe zfs SIMD for 5.2+ is really not broken.

Negative on that. The same problem can be reproduced on non-KVM running baremetal hosts using 5.2+.

Fabian-Gruenbichler · 2019-10-25T11:23:26Z

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.
1.
I wanted to get everybody on the same page in relation to the fact that not only zfs might be "disturbing" the SIMD processing subsystem in the kernel, but that there is a potential that other kernel portions might also be broken - the reference I gave shows that this was actually true.
2.
I am not questioning potentially required zfs SIMD fixes for kernel versions below 5.2
3.
It is my impression that the developers took quite some time thinking to establish certain assumptions that should be safe to be made for kernels starting with 5.2. Within the initial comments of this issue I see developers' statements which assume zfs SIMD for 5.2+ is not broken.
I merely wanted to raise awareness that tests indicating the opposite should be re-evaluated with the info from that reddit post in mind, i.e. just maybe zfs SIMD for 5.2+ is really not broken.

I did not misunderstand your post. I am one of the devs who triaged this bug initially, analyzed the old code, verified a workaround on our downstream side, and reviewed the now merged fix 😉

see the detailed testing report (on baremetal!) over at
#9406 (comment)

the approach that was used for 5.2 was in theory sound for < 5.2, but not workable for GPL/license reasons. it was broken for 5.2+ though, as was the approach for < 5.2 on < 5.2 kernels. the only thing that really worked was the kernel-only solution (and a combination of 5.2+ approach with helper backports on < 5.2 kernels).

in other words, it was broken all around, irrespective of other FPU-related breakage on some 5.2 versions..

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346 Closes openzfs#9403

behlendorf · 2019-10-28T16:44:50Z

PR #9515 contains an 0.8 backport of the fix applied to master.

shartge · 2019-10-28T17:34:12Z

PR #9515 contains an 0.8 backport of the fix applied to master.

I will be able to test this on my systems tomorrow GMT morning.

shartge · 2019-10-29T21:08:32Z

I've had PR #9515 applied on top of the zfs-0.8-release branch on my test VM and one physical system, both first running for 4 hours on 5.2.0-bpo from Debian and then another 5 hours on 4.19 also from Debian and could no longer reproduce #9346.

From my point of view this looks very very promising.

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346 Closes openzfs#9403

mvrhov · 2019-12-24T12:43:17Z

Is this by any chance the same bug the Go authors found: https://bugzilla.kernel.org/show_bug.cgi?id=205663#c2

behlendorf · 2019-12-24T18:08:13Z

@mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

Fabian-Gruenbichler · 2020-01-02T06:52:00Z

On December 24, 2019 7:08 pm, Brian Behlendorf wrote: @mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

I saw that while triaging (I think I even linked it in one of the issues as possible culprit?) but quickly ruled it out. Might have affected some user reports though, if they (/their distro) used the affected kernel+gcc versions.

Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to unconditionally save and restore the FPU state. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346 (cherry picked from commit 813fd01) Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>

behlendorf · 2020-01-24T01:34:48Z

Closing. The SIMD patches have been included in the 0.8.3 release.

At the moment we experience bad instabilities with linux 5.3: openzfs/zfs#9346 as the zfs-native method of disabling the FPU is buggy. (cherry picked from commit 96097ab)

- 2000-increase-default-zcmd-allocation-to-256K.patch - git_fix_mount_race.patch Remove patch: - linux-5.0-simd-compat.patch which causes #940932 under some certain conditions. openzfs/zfs#9346 Gbp-Dch: Full

behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 23, 2019

aerusso mentioned this issue Sep 24, 2019

DO NOT MERGE Reproduce data corruption bug in Debian #9354

Closed

13 tasks

tonyhutter mentioned this issue Sep 24, 2019

zfs-0.8.2 patchset #9161

Closed

12 tasks

MikeCockrem mentioned this issue Nov 21, 2019

ZOL + encryption - crashes under heavy I/O #9603

Closed

behlendorf closed this as completed Jan 24, 2020

Mic92 added a commit to Mic92/nixpkgs that referenced this issue Feb 13, 2020

linux: update fpu patches for 5.3

451e319

At the moment we experience bad instabilities with linux 5.3: openzfs/zfs#9346 as the zfs-native method of disabling the FPU is buggy. (cherry picked from commit 96097ab)

rincebrain mentioned this issue Jun 1, 2021

use ARM crypto extensions to improve ZFS encryption performance #12171

Open

Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

Comments

aerusso commented Sep 22, 2019

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

rincebrain commented Sep 22, 2019 • edited Loading

alex-gh commented Sep 23, 2019 • edited Loading

behlendorf commented Sep 23, 2019

rincebrain commented Sep 24, 2019 • edited Loading

alex-gh commented Sep 24, 2019

ggzengel commented Sep 25, 2019

ThomasLamprecht commented Sep 25, 2019 • edited Loading

shartge commented Sep 25, 2019 • edited Loading

behlendorf commented Sep 25, 2019

shartge commented Sep 25, 2019 • edited Loading

behlendorf commented Sep 25, 2019

shartge commented Sep 25, 2019 • edited Loading

shartge commented Sep 25, 2019

ggzengel commented Sep 25, 2019

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019 • edited Loading

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019

shartge commented Sep 26, 2019

ggzengel commented Sep 26, 2019

shartge commented Oct 7, 2019

shartge commented Oct 7, 2019

vstax commented Oct 7, 2019 • edited Loading

vorsich-tiger commented Oct 24, 2019

Fabian-Gruenbichler commented Oct 25, 2019 via email

Fabian-Gruenbichler commented Oct 25, 2019 via email

vorsich-tiger commented Oct 25, 2019

shartge commented Oct 25, 2019

Fabian-Gruenbichler commented Oct 25, 2019

behlendorf commented Oct 28, 2019

shartge commented Oct 28, 2019

shartge commented Oct 29, 2019

mvrhov commented Dec 24, 2019

behlendorf commented Dec 24, 2019

Fabian-Gruenbichler commented Jan 2, 2020 via email

behlendorf commented Jan 24, 2020

rincebrain commented Sep 22, 2019 •

edited

Loading

alex-gh commented Sep 23, 2019 •

edited

Loading

rincebrain commented Sep 24, 2019 •

edited

Loading

ThomasLamprecht commented Sep 25, 2019 •

edited

Loading

shartge commented Sep 25, 2019 •

edited

Loading

shartge commented Sep 25, 2019 •

edited

Loading

shartge commented Sep 25, 2019 •

edited

Loading

ggzengel commented Sep 26, 2019 •

edited

Loading

vstax commented Oct 7, 2019 •

edited

Loading