ZFS data corruption #3990

Closed

gkkovacs opened this issue Nov 6, 2015 · 79 comments

Comments

@gkkovacs commented Nov 6, 2015

I have installed Proxmox 4 (ZFS 0.6.5) on a server using ZFS RAID10 in the installer. The disks are brand new (4x 2TB, attached to the Intel motherboard SATA connectors), and there are no SMART errors / reallocated sectors on them. I have run a memtest for 30 minutes, and everything seems fine hardware-wise.

zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             92.8G  3.42T    96K  /rpool
rpool/ROOT        59.8G  3.42T    96K  /rpool/ROOT
rpool/ROOT/pve-1  59.8G  3.42T  59.8G  /
rpool/swap        33.0G  3.45T   144K  -

After restoring a few VMs (a hundred or so gigabytes), the system reported read errors in some files. Scrubbing the pool shows permanent read errors in the recently restored guest files:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2

If I delete the VMs and scrub the pool again, the errors are gone. If I restore new VMs, the errors are back. Anybody have any idea what could be happening here?
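The cycle described above boils down to roughly the following (a sketch; the pool name is as above, and the restores are done through the Proxmox web interface):

    # wipe the logged errors and counters from the previous run
    zpool clear rpool
    # restore one or more VM backups, then scrub
    zpool scrub rpool
    # wait for the scrub to finish, then check counters and affected files
    zpool status -v rpool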

zdb -mcv rpool
Traversing all blocks to verify checksums and verify nothing leaked ...

loading space map for vdev 1 of 2, metaslab 30 of 116 ...
50.1G completed ( 143MB/s) estimated time remaining: 0hr 01min 09sec        zdb_blkptr_cb: Got error 52 reading <50, 61726, 0, 514eb>  -- skipping
59.8G completed ( 145MB/s) estimated time remaining: 0hr 00min 00sec
Error counts:

    errno  count
       52  1

    No leaks (block sum matches space maps exactly)

    bp count:          928688
    ganged count:           0
    bp logical:    115011845632      avg: 123843
    bp physical:   62866980352      avg:  67694     compression:   1.83
    bp allocated:  64258899968      avg:  69193     compression:   1.79
    bp deduped:             0    ref>1:      0   deduplication:   1.00
    SPA allocated: 64258899968     used:  1.61%

    additional, non-pointer bps of type 0:       4844
    Dittoed blocks on same vdev: 297
@kernelOfTruth (Contributor) commented Nov 6, 2015

What's new in Proxmox VE 4.0

  • Debian Jessie 8.2 and 4.2 Linux kernel
  • Linux Containers (LXC)
@kernelOfTruth (Contributor) commented Nov 6, 2015

any mcelog or edac-utils errors ? dmesg errors for ECC RAM ?

are the SATA-cables fine ? (your SMART tests indicate: yes)

any known NCQ and/or firmware issues with the drives ? what drives ?

hardware (mainboard) info ? kernel info ?

update to 0.6.5.2 or 0.6.5.3 available ?
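
Roughly the checks meant here, as a sketch (assuming the mcelog and edac-utils packages are installed; exact flags and log paths can differ by distro):

    # machine-check exceptions logged so far
    mcelog --client            # or: cat /var/log/mcelog
    # EDAC / ECC error counters (only meaningful with ECC RAM)
    edac-util -v
    # anything memory- or hardware-error related in the kernel log
    dmesg | grep -iE 'mce|edac|ecc|hardware error'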

@gkkovacs (Author) commented Nov 6, 2015

I have installed mcelog, and repeated the restore of a couple of VMs until the errors surface again. /var/log/mcelog is empty.

dmesg has nothing suspicious, here is /var/log/messages from last boot:
http://pastebin.com/7PkNUnxr

SATA cables should be fine; I have even initiated SMART self-diagnostics on the drives, and nothing pops up. The drives are brand new 2TB Toshiba DT01ACA200 models. Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU. There is also a 3TB Toshiba HDD (not part of the pool), and a Samsung SSD for ZIL / L2ARC (not used at the moment).

According to messages, ZFS: Loaded module v0.6.5.2-55_g61d75b3
Will try upgrading to a later release and retest.

@tomposmiko commented Nov 6, 2015

You use zfs 0.6.5.

0d00e81

Is that not what you're looking for?

@gkkovacs (Author) commented Nov 6, 2015

@kernelOfTruth I have upgraded to 0.6.5.3, rebooted the system and repeated the restores on an error-free pool.

Nov  6 13:14:28 proxmox3 kernel: [    5.358851] ZFS: Loaded module v0.6.5.3-1, ZFS pool version 5000, ZFS filesystem version 5

Unfortunately, the file corruption is happening again, mcelog is still empty.

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h3m with 1 errors on Fri Nov  6 13:38:47 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/400/vm-400-disk-1.qcow2

Any other ideas?

@tomposmiko commented Nov 6, 2015

OK, I wasn't attentive enough and thought the error was inside the VM... and in addition you use a file image for the VM. Sorry about that.

Are you sure it's not a HW failure? My bet would also be on that.

@gkkovacs (Author) commented Nov 7, 2015

@kernelOfTruth @tomposmiko

After the upgrade to 0.6.5.3 proved ineffective, I have removed the trays and backplane from the server, and connected the drives directly to the mainboard with new, shorter SATA3 cables. Still, the errors come back.

What I don't understand is this: if a cable were bad and only introduced errors occasionally, it still could not corrupt both copies of the data (we are running RAID10, so all data is mirrored). So how is it possible that after copying a single large file to the array there is suddenly a "permanent error" in it, meaning ZFS cannot correct it even though it has TWO COPIES of it? Logically, this issue can't be caused by cables or drives, since the chance of two cables or two drives corrupting the exact same block at the same time is essentially zero (and in that case the corrupted data would be correctable anyway).

On the Proxmox forum people keep saying this could easily be a memory (or even CPU) error, but I kind of doubt that, since this server has been running Windows with Intel RAID for two years without any issues, and nothing has changed apart from the drives.

I'm baffled at the moment, not sure what to do next. If you (or anyone) got any more ideas about what and how to test, I certainly welcome them.

@kernelOfTruth (Contributor) commented Nov 7, 2015

@gkkovacs I had similar errors in the past when my motherboard and processor weren't fully supported by Linux (Haswell and its EDAC components, ECC RAM), and I got errors almost everywhere on mirrors and additional backup media - there were even DIMM errors in dmesg

1-2 kernel releases later it went away and never came back

Are you running the integrated HD Graphics 3000 GPU ?

Please configure your memory and MTRR optimally

like e.g. so via kernel append

enable_mtrr_cleanup mtrr_spare_reg_nr=1 mtrr_gran_size=64K mtrr_chunk_size=16M

so that "lose cover RAM" equals 0G

I doubt that this leads to errors but I've had weird behavior with the iGPU enabled and these errors being shown.

Still you might want to do some memory, CPU or system stress tests,

ZFS is known to stress all components in more intense ways than usual and thus expose hardware defects early on.

Googling for

zdb_blkptr_cb: Got error 52 reading -- skipping

led to a few hits about metadata corruption and some older reports but it wasn't so clear ...

30 minutes of mem test is clearly too short, it needs at least 3 passes (memtest86+ or other) [http://forum.proxmox.com/threads/24358-BUG-ZFS-data-corruption-on-Proxmox-4?p=122583#post122583],

I've read recommendations of 24-72 hours to be on the safe side (depending on the amount of RAM of course and a sane number of passes, stressing)

Other factors:

Check: Mainboard, PSU, RAM, CPU, iommu, dedicated GPU instead of iGPU, driver issues that lead to data or metadata corruption, slub_nomerge memory_corruption_check=1, NCQ, libata driver, ...

@tomposmiko commented Nov 7, 2015

I just noticed what you wrote:

Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU.

Check your memory seriously. But I personally would not use ZFS with non-ECC RAM.

Anyway, I have similar errors in a machine with poor Seagate SATA disks. In that case the disks go bad and need replacing from time to time.

@gkkovacs (Author) commented Nov 10, 2015

@kernelOfTruth @tomposmiko

I have done the MTRR kernel configuration; unfortunately there was no way to get "lose cover RAM" to zero (the CPU only has 7 registers, and would need 9 or 10 for that). Nevertheless, it did not help the data corruption issue.

Here is the latest /var/log/messages
http://pastebin.com/VJRfF3U0

I have tested the RAM with memtest for another 9 hours (no errors, of course), but I have found something: if I only use 2 DIMMs, there is no data corruption. RAM speed (1066 or 1333), size (8 GB, 16 GB, 24 GB, 32 GB) and timings (CL9, CL10) were all tested, but they don't matter; only the number of DIMMs does.

4 DIMMs = data corruption, 2 DIMMs = no problem. The rate is approximately 4-6 checksum errors per 100 GB written (the checksum errors appear on all mirror members, therefore they are uncorrectable). Have you seen anything like this?

Later I read about the H67 SATA port issue that plagued the early versions of this chipset; not sure if it applies here, but I have ordered a new motherboard. Fingers crossed.

@shoeper commented Nov 11, 2015

Does it matter which slots you put the DIMMs in? Maybe one of the 4 slots is malfunctioning while the RAM itself is fine.

@kernelOfTruth (Contributor) commented Nov 12, 2015

Agreed, or perhaps the two pairs are incompatible ? Are they both of the same model ? Is the mainboard chipset (or type) known to have issues with the specific RAM you're using ?

@gkkovacs (Author) commented Nov 26, 2015

@kernelOfTruth @tomposmiko @shoeper

Things are getting more interesting. I have installed a replacement motherboard (Q77 this time), and lo and behold the corruption is still there! So it's not the motherboard / chipset then.

I have done some RAM LEGO again: with 1 or 2 DIMMs (doesn't matter what size or speed or slot, tried several different pairs) there is no corruption, with 4 DIMMs the data gets corrupted. So it's not the RAM then.

What's left, really? The CPU? I started to fiddle with BIOS settings, and when I disabled Turbo and EIST (Enhanced Intel SpeedStep Technology), the corruption became less likely! So it's either a kernel / ZFS regression that happens when the CPU scales up and down AND all the RAM slots are in use, or my CPU is defective. Will test with another CPU this weekend.

@aikudinov commented Nov 27, 2015

About 5-7 years ago I saw a box with 2 identical Kingston DIMMs that caused Windows to crash to a BSOD randomly (maybe once a day, sometimes more or less often); removing either one of them solved the problem. Custom-built PC, but it had a warranty, so they said it was some kind of incompatibility and exchanged both DIMMs for a different brand.

@gkkovacs (Author) commented Nov 27, 2015

@aikudinov

As I wrote above, I have tried several different DIMMs with varying sizes, speeds and manufacturers. All of them produce the data corruption issue if 4 are installed, and none of them if only 1 or 2 are.

@dswartz (Contributor) commented Nov 27, 2015

That has got to be some kind of chipset or BIOS issue?


@gkkovacs (Author) commented Nov 28, 2015

@dswartz

Had you read the thread first, it would have become apparent that I have since replaced the motherboard with another model (first it was H67, now it's Q77), and that I also tried a number of BIOS settings. It's NOT a RAM issue, and it's not a chipset / BIOS issue. We will see if it's a CPU issue...

@fling- (Contributor) commented Nov 28, 2015

I'm experiencing I/O errors with 4.2 but they disappear after a reboot.
No errors with 4.1.

@dswartz (Contributor) commented Nov 28, 2015

I thought it might be a memory boundary issue, but you said you tried different sizes of RAM, so... yeah, at this point a different CPU is the only thing I can think of...


@Stoatwblr commented Nov 28, 2015

These are consumer-grade chipsets, CPUs, etc., and are more likely to have bit-flip errors than server-grade ones.

Have you run full sets of memory checks? (memtest86+ and friends, multiple iterations over a few days)

Then there's the PSU. I've seen a number of issues of random data loss which were fixed by replacing it. The ability of many PSUs to cope with load spikes is surprisingly poor.

@gkkovacs (Author) commented Dec 1, 2015

@Stoatwblr @kernelOfTruth @behlendorf @dswartz

After weeks of testing, I have concluded that this is most likely a software issue: ZFS on the 4.2 kernel produces irreparable checksum errors under the following conditions:

  • Linux kernel 4.2 (4.2.3-2-pve)
  • ZFS version 0.6.5.3 (from pvetest)
  • Proxmox installed on ZFS 4-disk RAID10
  • Sandy Bridge CPU (i5-2500K and i7-2600) and compatible chipset (H67 and Q77)
  • 4 memory modules must be installed in the motherboard
  • several hundred gigabytes of VMs copied to the pool (restored from NFS)
  • scrubbing the pool afterwards shows uncorrectable errors on disk

I can't reproduce it on another, similar box with different drives, but another user reported a very similar issue: http://list.zfsonlinux.org/pipermail/zfs-discuss/2015-November/023883.html

Why it's not a faulty disk / cable:
All the disks are brand new, and self-diagnosed with SMART several times (not a single error), also cables were replaced early on. Please note that checksum errors get created in the same numbers on the mirror members, so the blocks that get written out are already corrupted in memory.

Why it's not a faulty memory module:
I have run 4 hours of SMP and 9 hours of single-core memtest86 on the originally installed memory. Also, the errors occur with any kind of memory modules; I have tested at least 5 different pairs of DDR3 DIMMs (3 manufacturers, 4GB and 8GB sizes) in several configurations.

Why it's not a faulty motherboard / chipset / CPU / PSU:
I have ordered an Intel Q77 motherboard to replace the ASUS H67 motherboard used previously. After it produced the same errors I have tried another, different CPU as well, same result. I even replaced the PSU with another one, no luck.

I have replaced every single piece of hardware in my machine apart from the drives. The only hardware that is connected to this issue for sure is the number of memory modules installed: 4 DIMMs produce the checksum errors, 2 DIMMs do not.

I am out of options, still looking for ideas on what to test. If a ZFS developer wants to look into this system, I can keep it online for a few days, otherwise I will accept defeat and reinstall it with a HW RAID card and ext4/LVM.

@kernelOfTruth (Contributor) commented Dec 2, 2015

@gkkovacs I'm sure the following issue should be fixed by now, right ?

http://techreport.com/news/20326/intel-finds-flaw-in-sandy-bridge-chipsets-halts-shipments

also it was related to the S-ATA bus - so it's highly unlikely that it's that issue

also it occurs on the Q77 chipset and i7-2600 ...

Yes, would be interesting to see what Nemesiz did to fix his problem

@gkkovacs (Author) commented Dec 6, 2015

@kernelOfTruth

Yes, the H67 SATA issue is fixed, not applicable to Q77.
Since last time I tested the following things:

Disable C-states
I have put the intel_idle.max_cstate=0 kernel option into grub, and verified with i7z that the CPU did not go below C1 at all. Unfortunately, the checksum errors still get created.
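
For reference, roughly how that can be verified after a reboot (a sketch; i7z is a separate package, and turbostat works as well if available):

    # confirm the option is on the running kernel's command line
    grep -o 'intel_idle.max_cstate=0' /proc/cmdline
    # confirm the intel_idle driver picked it up
    cat /sys/module/intel_idle/parameters/max_cstate    # should print 0
    # watch per-core C-state residency interactively
    i7z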

Adaptec controller instead of Intel ICH
I have installed an Adaptec SAS RAID controller, configured the disks as simple volumes, and reinstalled Proxmox with ZFS RAID10 (the original setup), so I could eliminate the Intel ICH SATA controller from the mix. Unfortunately, the checksum errors are still there.

@behlendorf (Member) commented Dec 10, 2015

@gkkovacs I know I'm late to the conversation but it definitely looks like you've eliminated almost everything except software. Have you tried using older Linux kernels and older versions of ZoL to determine when and where this issue was introduced?

@gkkovacs (Author) commented Dec 10, 2015

@behlendorf

Regarding kernels: I only used the kernel (4.2) that comes with Proxmox 4, because that's the platform our servers are based on. This particular server has since been reinstalled with LVM/ext4, so I can't test it with ZFS anymore.

However, we have another, very similar server that's going to be reinstalled, and since Proxmox 3.4 also supports ZFS with its RHEL6-based 2.6.32 kernel, I can try to reproduce the problem on it under both kernels. Will report back in a few days.

@kernelOfTruth (Contributor) commented Dec 10, 2015

@gkkovacs the only way to rule out a hardware error (close to 100%; I'm not certain Btrfs would stress all components in a similarly strenuous way) would be to run another checksumming filesystem in a similar configuration, one that checksums not only metadata but also data (e.g. Btrfs)

@behlendorf (Member) commented Dec 10, 2015

@gkkovacs please let us know what you determine because if there is a software bug in ZFS or in the Linux 4.2 kernel we definitely want to know about it. And the best way to determine that is to roll back to an older Linux kernel and/or version of ZFS. Finally, and I'm sure you're aware of this, but if this is a 4.2 kernel issue then you may end up having a similar problem with LVM/ext4 and just no way to detect it.

@gkkovacs (Author) commented Dec 10, 2015

@kernelOfTruth Unfortunately the Proxmox installer does not support Btrfs, and I don't really have the time or the motivation to test it beyond that, since Btrfs has many other issues that exclude it from our use-case.

@behlendorf I have tested the LVM/ext4 setup extensively by copying several hundred gigabytes to the filesystem (just like I did with ZFS), and compared checksums of the files with checksums computed on the source. Not a single checksum difference was detected, while with ZFS there were already dozens on the same size of data.
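
That comparison boils down to something like the sketch below (paths are placeholders):

    # checksum everything on the source
    cd /mnt/source && find . -type f -exec sha256sum {} + | sort -k2 > /tmp/src.sums
    # checksum the copies on the target filesystem
    cd /mnt/target && find . -type f -exec sha256sum {} + | sort -k2 > /tmp/dst.sums
    # any output here means at least one file changed in transit
    diff /tmp/src.sums /tmp/dst.sums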

@priyadarshan commented Dec 25, 2015

I am experiencing similar symptoms on an HP Z620 workstation with 16GB ECC RAM.

We have a 12TB pool, created on FreeBSD 10.2. I scrubbed it, with no issues.
As we need to use Linux, I did the following:

  1. I installed Ubuntu 15.10, with the latest stable ZoL from the PPA, kernel 4.2, on a second boot drive
  2. Imported the pool
  3. Moved some data to it
  4. Scrubbed it again. zpool status reports one drive as unrecoverable. Also, the boot filesystem (ext4) suddenly becomes read-only, killing the SSH connection. Even sudo from a local terminal becomes impossible.
  5. But, after rebooting in single-user mode, running fsck reports no issues at all.
  6. I then rebooted, this time into FreeBSD, and imported the pool again. zpool status reported a resilvering process (a few hundred KB)
  7. Then I scrubbed again; this time zpool status reports no defects.
    sudo zpool status

      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX11D557YNXY  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX41D95PASRY  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX41D95PAHV2  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNKXL  ONLINE       0     0     0

    errors: No known data errors

I repeated this three times (steps 1 to 7), with the exact same results.

For now, we can't use Linux, since ZFS access is a must for us. We need to stay on FreeBSD for a little while longer, but I wanted to report it here for others.
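
For reference, the import / scrub round trip from steps 1-7 is essentially the following (a sketch; pool name as above):

    # on Linux (ZoL): import, write, scrub, inspect, release the pool
    zpool import tank
    # ... copy data onto the pool ...
    zpool scrub tank
    zpool status -v tank
    zpool export tank
    # reboot into FreeBSD, then repeat the import + scrub there
    zpool import tank && zpool scrub tank && zpool status -v tank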

@gkkovacs (Author) commented Jan 23, 2016

@behlendorf @kernelOfTruth @priyadarshan

To test this issue extensively on different ZFS and kernel versions, we pulled this server from production once again, and thanks to the help of Dietmar Maurer from Proxmox we were able to test the following configurations on the same hardware:

  • Proxmox 3.4 / kernel 2.6.32 (-39) / ZFS 0.6.4.1
  • Proxmox 3.4 / kernel 2.6.32 (-44) / ZFS 0.6.5.4
  • Proxmox 4.1 / kernel 4.2.6 / ZFS 0.6.5.4

After a clean install of Proxmox on four-disk ZFS RAID10, we restored a 126GB OpenVZ container backup from NFS, then scrubbed the pool. It looks like the checksum error issue affects all of the above kernel and ZFS versions, so the problem is most likely in ZFS.

Hardware is same as before: Q77 motherboard, Core i7-2600 CPU, 4x 8GB RAM, Adaptec 6805E controller used in JBOD/simple volume mode, 4x 2TB Toshiba HDD.

Linux proxmox 2.6.32-39-pve #1 SMP Fri May 8 11:27:35 CEST 2015 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5

     NAME                                             STATE     READ WRITE CKSUM
     rpool                                            ONLINE       0     0    35
       mirror-0                                       ONLINE       0     0    34
         scsi-SAdaptec_Morphed_JBOD_00FABE6527-part2  ONLINE       0     0    42
         scsi-SAdaptec_Morphed_JBOD_01E1CE6527-part2  ONLINE       0     0    44
       mirror-1                                       ONLINE       0     0    36
         scsi-SAdaptec_Morphed_JBOD_025EDA6527        ONLINE       0     0    48
         scsi-SAdaptec_Morphed_JBOD_0347E66527        ONLINE       0     0    45

Linux proxmox 2.6.32-44-pve #1 SMP Sun Jan 17 15:59:36 CET 2016 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.5.4-1, ZFS pool version 5000, ZFS filesystem version 5

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0    30
      mirror-0  ONLINE       0     0    16
        sda2    ONLINE       0     0    27
        sdb2    ONLINE       0     0    31
      mirror-1  ONLINE       0     0    44
        sdc     ONLINE       0     0    53
        sdd     ONLINE       0     0    54

Linux proxmox 4.2.6-1-pve #1 SMP Thu Jan 21 09:34:06 CET 2016 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.5.4-1, ZFS pool version 5000, ZFS filesystem version 5

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0    11
      mirror-0  ONLINE       0     0     8
        sda2    ONLINE       0     0    11
        sdb2    ONLINE       0     0    11
      mirror-1  ONLINE       0     0    14
        sdc     ONLINE       0     0    18
        sdd     ONLINE       0     0    19
@7Z0t99 commented Jan 28, 2016

It might make sense to contact the Linux kernel or the Btrfs mailing list, especially since you can reproduce it using Btrfs. I guess kernel developers are always a bit skeptical about out-of-tree modules like ZFS.

@shoeper commented Jan 28, 2016

What about row hammering? Could the ZFS workflow possibly lead to RAM bitflips? Maybe you could test it with https://github.com/google/rowhammer-test or some other test.
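
Roughly, assuming the repository's bundled build script is still called make.sh (check its README if it differs):

    git clone https://github.com/google/rowhammer-test.git
    cd rowhammer-test
    ./make.sh           # build script name assumed from the repo; see its README
    ./rowhammer_test    # runs until it detects a bit flip or is interrupted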

@kernelOfTruth (Contributor) commented Jan 28, 2016

@shoeper good idea,

thought about that too, but then discarded the idea (it's just a simple "transfer", right ?)

but let's see what that leads to

There's still the factor of NFS, that also dweezil mentioned

During testing, we tried moving data to the ZFS pool from NFS (restoring virtual machines and containers via Proxmox web interface), file copy with MC from a single disk, and copy through SSH from another server as well.

@gkkovacs so that means you already tried to copy files on the server locally and then verify if the checksums changed, correct ?

and that also led to the issues ?

if not it could be an issue with the network (ethernet adapter) driver ...

@shoeper commented Jan 28, 2016

@kernelOfTruth

if not it could be an issue with the network (ethernet adapter) driver ...

wouldn't ZFS in this case also calculate a checksum of the wrong data and think the file is correct afterwards? How would ZFS know a checksum it never had?

@gkkovacs (Author) commented Jan 28, 2016

@kernelOfTruth I have tried copying files from an external SATA drive to the ZFS and Btrfs pools, thereby completely bypassing the network.

@7Z0t99 Good idea, I might write an essay about this for the Btrfs developers.

@shoeper I have tried rowhammer-test; it exited (detected a bit flip) after 110 seconds on the first run.

 Iteration 98 (after 110.02s)
   Took 101.7 ms per address set
   Took 1.01711 sec in total for 10 address sets
   Took 23.544 nanosec per memory access (for 43200000 memory accesses)
   This gives 339786 accesses per address per 64 ms refresh period
 error at 0x7f0c84d2a808: got 0xffffffffefffffff
   Checking for bit flips took 0.104848 sec
 ** exited with status 256 (0x100)

If I understand correctly, this does not mean that my RAM is defective, since rowhammer puts extremely high stress on a single row of DRAM, hammering it about 100 thousand times a second, which would never happen during a normal workload. According to the information I came across, this is a general vulnerability of DDR3 that was addressed in the DDR4 standard.

@drescherjm commented Jan 28, 2016

Does this memory test also fail with only 2 of the 4 DIMMs installed? You may have to run it much longer.

@kernelOfTruth (Contributor) commented Jan 28, 2016

@shoeper I was thinking of issues in a broader sense:

general memory corruption due to buggy driver,

in the case of file transfers - yeah, there should be checksums & thus correct

slub_nomerge should partially account for buggy components - but in this case it didn't offer any help

@drescherjm

Currently there are 4x 8GB DDR3-1333 modules installed, 2 Kingston and 2 Corsair DIMMs IIRC. I had tried 4x 4GB 1333 configuration (Kingmax), 2x 4GB 1600 (Kingston IIRC), 4x 8GB 1600 (Kingston), and many combinations of these. 1 or 2 DIMMs never showed checksum errors, 4 always.

@gkkovacs how about 3 modules ? (e.g. 2x4 + 1x8, if you have those available)

Are the 2 DIMMs and 4 DIMMs placed in a dual-channel setup each time ?

that's the thing I'm permanently keeping in the back of my head: trouble with dual channel memory with full utilization

@heyjonathan commented Jan 28, 2016

On Fri, Nov 6, 2015 at 6:17 AM, gkkovacs notifications@github.com wrote:

Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU

I admit I know almost nothing here, but want to double-check. All of the testing you've done to eliminate bad RAM has involved swapping various combinations of non-ECC RAM with other combinations of non-ECC RAM?

What happens when you use ECC RAM?

Jonathan

@gkkovacs (Author) commented Jan 28, 2016

@kernelOfTruth I think @drescherjm meant testing rowhammer with 2 DIMMs only. I can certainly try that, although failing rowhammer does not mean your RAM is defective; it simply exploits a weakness in the DDR3 design. What rowhammer does never happens in real-life workloads; it's kind of a DDoS attack against your RAM. Also, running it damages the RAM (overheats some row lines), so I would like to keep it to a minimum and am not going to run it for long.

@heyjonathan There is no ECC support on the H67/Q77 chipsets; I have only tested non-ECC DIMMs, in many configurations.

@kernelOfTruth (Contributor) commented Jan 28, 2016

@gkkovacs you did every test in JBOD configuration ?

does the VIA controller (VIA VT6415) exhibit this behavior ?
(specs: http://www.asus.com/Motherboards/P8H67V/specifications/ )

Could be what @drescherjm meant, but please nonetheless clarify how the DIMMs are installed in relation to dual-channel status,

and if applicable - test 3-DIMM configuration (not rowhammer, copying over data)

@gkkovacs (Author) commented Jan 28, 2016

@kernelOfTruth I have tested with the H67 and Q77 on-board Intel ICH controller in AHCI mode, and with the Adaptec SAS RAID controller in the following modes:

  • simple JBOD mode (disks are passed through as is)
  • morphed JBOD volume mode (simple volumes are created then passed through)
  • RAID mode (created HW RAID10 volume with controller, formatted with ZFS / Btrfs / ext4)

All modes exhibited the checksum errors, although in HW RAID mode there were considerably fewer errors for the same amount of data written (10x fewer with Btrfs, extremely rare with ext4).

I did not test the VIA controller, nor can I, since that motherboard has been replaced.

@drescherjm commented Jan 28, 2016

Yes, I meant to test rowhammer with 2 DIMMs. At work and elsewhere I have seen quite a few RAM problems over the years with all slots populated, especially when using DIMMs of higher density than the system initially supported.

@7Z0t99 commented Jan 28, 2016

I'm afraid I don't see why rowhammer is relevant here, e.g. what would a shorter or longer time until the first error tell us? I mean we know that probably any DDR3 module made on a small process geometry is susceptible to rowhammer and as a countermeasure some vendors updated their BIOSes to refresh the modules more often. Newer processors might have more and better countermeasures though.

@gkkovacs (Author) commented Jan 28, 2016

@7Z0t99 I agree

@drescherjm As I wrote above, passing or failing rowhammer is not indicative of DRAM stability or defects. It simply shows that DDR3 is vulnerable by design to a row overload. I'm not saying this issue is not a memory problem, but rowhammer results won't get us closer to solving it.

I have tried RAM modules of many speeds, sizes and manufacturers while investigating this issue, and I tried underclocking the RAM as well to put much less strain on it; none of these helped.

@drescherjm commented Jan 28, 2016

I saw that but I do not agree with the conclusion.

@7Z0t99 commented Jan 28, 2016

I'm not sure if this has been answered yet, but I would like to know whether the errors get introduced during writing or reading. One way to test this would be to write the data on Linux and reboot into e.g. FreeBSD to do the scrub, and vice versa. Or you could move the disks between the problematic machine and a known working one.

@gkkovacs (Author) commented Jan 29, 2016

@7Z0t99 I'm pretty sure the errors get there during writing. Repeated scrubs turn up the same number of errors on the same disks, at the same places.

@gkkovacs (Author) commented Feb 2, 2016

@kernelOfTruth @7Z0t99 I have been testing the server in production for 4 days now with 2 DIMMs only (2x 8GB DDR3-1333 in dual-channel mode), and it has been rock solid. Before putting the real workload back, I wrote over 2TB of test data to it, and there was not a single checksum error. TBH I'm still baffled by all this, and still have no clue whether it's a hardware or software error.
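
The write-then-scrub loop for this kind of burn-in is roughly the following (a sketch; paths and sizes are only examples):

    # write a large batch of data to the pool (real backups or synthetic data)
    rsync -a /mnt/source/ /rpool/testdata/
    dd if=/dev/urandom of=/rpool/testdata/blob.bin bs=1M count=102400   # ~100 GB
    # scrub and check whether any checksum errors were introduced
    zpool scrub rpool
    zpool status -v rpool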

@7Z0t99 commented Feb 3, 2016

Well, since you tried so many different combinations of hardware, and since you say there are no errors when using FreeBSD, I am leaning more towards a software bug.
I can only reiterate that contacting the Btrfs / kernel mailing list might be a good idea, since there is a lot more expertise there about the internals of the Linux kernel.

@ryao (Member) commented Feb 24, 2016

@gkkovacs The only time I have ever seen something as bizarre as what is described here was when my ATX case had a standoff making contact with the motherboard right behind the DIMM slots. Could something like that be happening here?

@gkkovacs (Author) commented Mar 2, 2016

@ryao The thought has crossed my mind as well, and I remember testing the first (H67) motherboard outside the case to check for this. I haven't tested the second (Q77) board this way yet, but I have ordered an i7-3770 CPU (to see if this is a Sandy Bridge IMC problem or not), and when I replace it, I will certainly do more tests outside the case.

BTW the server has been in production for a month now with 2 DIMMs (2x 8GB DDR3-1333), using ZFS on Adaptec JBOD, and there has not been a single error during that time.

@gkkovacs (Author) commented Mar 22, 2016

@ryao @kernelOfTruth @7Z0t99 @behlendorf

So I have tested the very same server with an i7-3770 (Ivy Bridge) CPU to eliminate Sandy Bridge from the mix. Needless to say, the ZFS checksum errors still happen in newly created files.

Let's recap: on the hardware side, two motherboards (H67 first, Q77 now), three processors (i7-2600 first, i5-2500K after, i7-3770 now), two power supplies, two SATA controllers (motherboard ICH, Adaptec PCIe), SATA cables (backplane, regular cables), a PCIe GPU, and different sets of disks and RAM modules were all tested. The motherboard outside the case was tested again. On the software side: different ZFS versions (atm running the latest), Btrfs, different kernels (2.6.32 and 4.2), and many kernel options were tested.

None of the above made any difference: when using 3 or 4 RAM modules (regardless of 12, 16, 24 or 32 GB), the system creates checksum errors in newly created files. With only 2 RAM modules installed, there are no errors.

This is starting to drive me mad; does anyone have any ideas remaining?

Should I buy an expensive, overclockable kit of 4 identical DDR3-1866 DIMMs? (An inexpensive kit of 4 identical DDR3-1333 DIMMs was already tested.)

@ghfields (Contributor) commented Mar 22, 2016

I am curious if you can indeed make the checksum errors occur with only two modules if you place them both in the same memory channel.

On your Intel motherboard, you usually install memory in matched colored slots (blacks, then blues). This occupies both memory channels evenly. Could you NOT do that and place one in the first black and another in the first blue? This will load them onto a single memory channel. This could help identify if it is related to total quantity of modules or quantity of modules per channel.

(Sorry if you have reported this already and I missed it in the previous 70 comments)
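
One way to confirm which channel each module actually ended up on, without opening the case (run as root; the Locator / Bank Locator strings are board-specific):

    # list the populated slots with their board-specific channel/bank labels
    dmidecode -t memory | grep -E 'Locator|Size|Speed'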

@ryao (Member) commented Mar 22, 2016

@gkkovacs This sounds somehow power related. The only thing that you do not appear to have tried is using a power line conditioner:

https://www.tripplite.com/products/power-conditioners~23

You have not stated that you use a UPS, although your average UPS model does not actually do anything for spikes and drops that last <10ms.

Before you go out and buy another piece of hardware, I would like some more information on the PSU and its replacement. What are their model numbers? What case do you have and how is it mounted inside the case? What do the temperature sensors inside your case read when the system is warm?

This is bizarre enough that I am thinking about how the combination of many different sub-optimal things might combine to cause what you describe. So far, I am thinking maybe you have a combination of bad power, excessive heat and a PSU model that is not designed to supply sufficient voltage on the 3.3V rail.

Lastly, if it is not too much trouble, would you post a list of the exact parts that you used as if you were telling me how to build a complete replica of your system?

@gkkovacs (Author) commented Mar 22, 2016

@ghfields Great idea! Will test the single memory channel setup tonight or tomorrow and report back.

@ryao The server is in a data center, so I believe power is top notch, and they provide UPS and generators for backup; cooling is a constant 20 degrees IIRC. The current PSU supplies 24A on the 3.3V rail; the previous PSU was a Chieftec 400W model (I don't have specifics). The GPU tested was a Sapphire Radeon HD6870.

Current hardware parts list below.

System

Drives
Other manufacturers' drives and many combinations (RAID1, RAID0, RAID10, RAIDZ) were also tested, with and without SSD caching, on both the Intel and Adaptec controllers.

Memory
Most possible combinations were tested. Possibly other kits I have since forgotten.

@sanyo-ok commented Sep 13, 2016

Could it be an electromagnetic issue, with some kind of electromagnetic noise coming from the power line or somewhere else?

Do you have good grounding?

How much would a meter like this one:
http://www.ebay.com/itm/Electromagnetic-Radiation-Tester-Detector-EMF-Meter-Dosimeter-Digital-No-Error-B-/391313457457?hash=item5b1c198131:g:GXEAAOSwGzlTtoMs
indicate on its display?

It should show about zero, or small values like 1-10, for a well-grounded computer.

Otherwise it may show 1500-2000, which can be a reason for badly working Gigabit Ethernet, PCI slots, USB, SATA, etc.

I guess lower values like 1000-1500 may lead to other, less noticeable issues; maybe that is just your situation?

@ethchest commented Aug 24, 2018

Was there ever any update/solution here? @gkkovacs
Personally I would have just invested in a board with ECC RAM, especially as you bought so much new stuff anyway.

@gkkovacs (Author) commented Aug 24, 2018

@ethchest The final conclusion was that the memory modules were either faulty or simply unstable (probably because of power delivery) in a dual-channel, quad-DIMM configuration. Since then we have decommissioned this server and have been using only dual-Xeon motherboards with fully buffered ECC RAM.

@ethchest commented Aug 27, 2018

Thanks for the reply/update!

@ghfields (Contributor) commented Feb 16, 2019

This issue can be closed.
