ZFS fails to import raidz2 pool in degraded state with only 1 faulted disk and no way to issue replace command to recover pool #5567

djdarcy opened this Issue Jan 7, 2017 · 1 comment


djdarcy commented Jan 7, 2017

System information

Type                  Version/Name
Config                File server with 4 GB RAM, 6x 3 TB drives in ZFS raidz2 config
Distribution Name     Gentoo
Distribution Version  Base release 2.2 (updated to latest w/ glibc 2.23-r3)
Linux Kernel          3.15.6-aufs-r1 #1 SMP Thu Aug 7 15:26:08 UTC 2014 x86_64 Dual-Core AMD Opteron(tm) Processor 2214 HE AuthenticAMD GNU/Linux (also tried: 4.5.2-aufs-r1, Jul 3 2016)
Architecture          x86_64
ZFS Version           0.6.5.3-r0-gentoo (also tried: 0.6.5-329_g5c27b29)
SPL Version           0.6.5.3-r0-gentoo (also tried: 0.6.5-63_g5ad98ad)

Describe the problem you're observing

Four days ago my raidz2 pool entered a degraded state with one faulted device:

pool: gutenberg
id: 6853368280468254591
 state: DEGRADED
 status: One or more devices are faulted.
 action: The pool can be imported despite missing or damaged devices.  The
	 fault tolerance of the pool may be compromised if imported.
 config:

	gutenberg                DEGRADED
	  raidz2-0               DEGRADED
	    sda                  ONLINE
	    2761285860633155577  FAULTED  too many errors
	    sdc                  ONLINE
	    sdd                  ONLINE
	    sde                  ONLINE
	    sdf                  ONLINE

Initially, when I issued zpool import -d /dev/; zpool import -f gutenberg, the pool imported just fine. While waiting for the new drive to arrive, I started copying all of the data over to a secondary array to be on the safe side.
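For reference, the rough sequence was as follows (this assumes the pool's default mountpoint of /gutenberg; the /mnt/backup destination is just a placeholder for the secondary array):

zpool import -d /dev/                            # scan /dev for importable pools
zpool import -f gutenberg                        # force-import the degraded pool
rsync -aHAX /gutenberg/ /mnt/backup/gutenberg/   # copy everything to the secondary array (placeholder path)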

Then, the day before yesterday, the system hard-locked out of the blue while the drives were resilvering. This system has had years of uptime in the past with no problems. After rebooting, any zpool import (even an import -f -N -o readonly=on) caused the entire system to become unresponsive. The last message in /var/log/messages doesn't tell me much (it relates to /dev/fd0, which I have since disconnected in case there was some spurious interrupt), though perhaps I'm not looking in the right place.

smartctl shows all drives are passing, though /dev/sda looks like it's starting to have some trouble:
http://sprunge.us/SdcV

I've also tried booting a newer Gentoo LiveDVD (livedvd-amd64-multilib-20160704.iso) to see if it could import the pool, but still nothing.

Running off the LiveDVD, ZFS starts to resilver sda, but it won't let me mount the volume, and after about 10 minutes the entire machine locks up. I ran memtest on the machine; it has ECC RAM and everything comes back 100%. At this point, with no ZFS logs or core dumps, I am at an impasse.

I've run:

zdb -e -p /dev/disk/by-id gutenberg

After about 20 minutes, zdb writes to the console:

loading space map for vdev 0 of 1, metaslab 129 of 130
1.56T completed (141MB/s) estimated time remaining: 30hr 45min

Output of the first 560 lines: pastebin.com

Past the ZIL header, the output is hundreds of MB of ZFS directory ..., ZFS file ..., and similar entries.

Next I'll try the command below to get the counts of each intent log transaction type.

zdb -e -p /dev/disk/by-id -ii gutenberg
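Since the machine may lock up again mid-run, I'll also capture that output to a file (the log path is just my choice):

zdb -e -p /dev/disk/by-id -ii gutenberg 2>&1 | tee /root/zdb-ii-gutenberg.log   # keep a copy in case the console is lost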

I'm hesitant to run modprobe zfs zfs_recover=1 since the comments in spa_misc.c say "zfs_panic_recover() will turn into warning messages. This should only be used as a last resort, as it typically results in leaked space, or worse."
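For the record, if it ever does come to that last resort, my understanding is the tunable would be set either as a one-off module parameter or persistently via modprobe.d, roughly like this (the filename is my choice):

# one-off, with the zfs module not yet loaded
modprobe zfs zfs_recover=1

# or persistently, e.g. in /etc/modprobe.d/zfs.conf (last resort only; may leak space or worse)
options zfs zfs_recover=1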

Whatever's going on, this is critical. Since the pool won't even import, I can't issue a zpool replace to swap out the bad drive for the replacement.

I would be happy at this point if I could just mount the volume and copy over the contents as long as there's some assurance this won't corrupt anything.

Describe how to reproduce the problem

Running zpool import, or any variation thereof, freezes the system.

Include any warning/errors/backtraces from the system logs

zdb -e -p /dev/disk/by-id gutenberg
Output of the first 560 lines: pastebin.com

behlendorf (Member) commented

@djdarcy based on the zdb output you provided, the pool appears to be in good health. This is particularly encouraging because zdb effectively does a read-only import of the pool, but in user space, so the kernel should also be able to import it.

When the system becomes unresponsive, are you able to issue any commands or get to the console? The output from dmesg, or any messages logged to the console, might help explain the issue.

One possibility is that your failing drive, sdb / 2761285860633155577, hasn't completely failed and is still semi-functional: functional enough that zpool import detects it and tries to use it when importing the pool. Very slow I/Os to the device which don't actually fail could slow the entire import process to a crawl and, depending on the exact hardware, cause what appears to be a non-responsive system.
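If you can still reach a shell while an import attempt is running, watching per-device latency might confirm this. Something along these lines (assuming the sysstat package is available in your environment) would show whether the suspect drive is stalling:

iostat -xm 5   # extended per-device stats every 5 seconds; look for very high await on the suspect drive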

My suggestion would be to physically remove the suspect drive from the system and try to import the pool read-only.
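Something like the following once the drive is pulled (the -N just skips mounting datasets until you're ready; adjust the device directory as needed):

zpool import -d /dev/disk/by-id -o readonly=on -N gutenberg   # read-only import, no automatic mounts
zfs mount -a                                                  # then mount the datasets and copy the data off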
