Panic when running 'zpool split' #5565

smaeul opened this Issue Jan 6, 2017 · 2 comments


smaeul commented Jan 6, 2017

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version stable
Linux Kernel 4.7.10-hardened
Architecture amd64
ZFS Version 0.6.5.8-r0-gentoo
SPL Version 0.6.5.8-r0-gentoo

Describe the problem you're observing

I received this panic when trying to run zpool split system blah:

zed[4221]: eid=7 class=config.sync pool=system
zed[4223]: eid=8 class=statechange
zed[4240]: eid=9 class=config.sync pool=blah
kernel: VERIFY(size != 0) failed
kernel: PANIC at range_tree.c:172:range_tree_add()
kernel: Showing stack for process 1299
kernel: CPU: 0 PID: 1299 Comm: txg_sync Tainted: P           O    4.7.10-hardened #1
kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97E-ITX/ac, BIOS P2.00 10/12/2015
kernel:  0000000000000000 ffffffff814b6a3f 0000000000000007 ffffffffa058c3d1
kernel:  ffffc900097b3ba8 ffffffffa02bb8e6 00ff880417cc1700 ffffffff00000028
kernel:  ffffc900097b3bb8 ffffc900097b3b58 7328594649524556 30203d2120657a69
kernel: Call Trace:
kernel:  [<ffffffff814b6a3f>] ? dump_stack+0x47/0x68
kernel:  [<ffffffffa058c3d1>] ? _fini+0x152ec/0x4661a [zfs]
kernel:  [<ffffffffa02bb8e6>] ? spl_panic+0xb6/0xe0 [spl]
kernel:  [<ffffffffa04c7f7b>] ? arc_buf_thaw+0x7b/0xb0 [zfs]
kernel:  [<ffffffffa04d39df>] ? dbuf_dirty+0x48f/0x880 [zfs]
kernel:  [<ffffffff819693d9>] ? mutex_lock+0x9/0x30
kernel:  [<ffffffffa0577e68>] ? _fini+0xd83/0x4661a [zfs]
kernel:  [<ffffffffa04d95a0>] ? dmu_buf_rele_array.part.4+0x30/0x50 [zfs]
kernel:  [<ffffffff811ec899>] ? kfree+0x29/0x170
kernel:  [<ffffffff811ebc15>] ? __slab_free+0x95/0x260
kernel:  [<ffffffffa04d2b31>] ? dbuf_read+0x5c1/0x790 [zfs]
kernel:  [<ffffffff819693d9>] ? mutex_lock+0x9/0x30
kernel:  [<ffffffffa04d39df>] ? dbuf_dirty+0x48f/0x880 [zfs]
kernel:  [<ffffffffa059b4ae>] ? _fini+0x243c9/0x4661a [zfs]
kernel:  [<ffffffffa050b989>] ? range_tree_add+0x269/0x290 [zfs]
kernel:  [<ffffffffa02b6ccf>] ? spl_kmem_zalloc+0x8f/0x160 [spl]
kernel:  [<ffffffffa02b6ccf>] ? spl_kmem_zalloc+0x8f/0x160 [spl]
kernel:  [<ffffffffa050b720>] ? range_tree_destroy+0x60/0x60 [zfs]
kernel:  [<ffffffffa050be2d>] ? range_tree_walk+0x2d/0x50 [zfs]
kernel:  [<ffffffffa05287e9>] ? vdev_dtl_sync+0xd9/0x3a0 [zfs]
kernel:  [<ffffffffa0528b3d>] ? vdev_sync+0x8d/0x110 [zfs]
kernel:  [<ffffffffa051488f>] ? spa_sync+0x3bf/0xae0 [zfs]
kernel:  [<ffffffff81118a62>] ? autoremove_wake_function+0x22/0x40
kernel:  [<ffffffffa0524d09>] ? txg_sync_thread+0x3a9/0x5e0 [zfs]
kernel:  [<ffffffffa0524960>] ? txg_quiesce_thread+0x380/0x380 [zfs]
kernel:  [<ffffffffa02b8dc7>] ? thread_generic_wrapper+0x67/0x80 [spl]
kernel:  [<ffffffffa02b8d60>] ? __thread_exit+0x10/0x10 [spl]
kernel:  [<ffffffff810fa1b8>] ? kthread+0xb8/0xd0
kernel:  [<ffffffff8196b72e>] ? ret_from_fork+0x1e/0x50
kernel:  [<ffffffff810fa100>] ? kthread_worker_fn+0x180/0x180

The zpool split command, all subsequent zfs/zpool commands, and sync hung in uninterruptible sleep (D state).

Describe how to reproduce the problem

The pool I attempted to split is shown below. It is a mirror of two dm-crypt volumes and is less than a year old. It was originally created with a single disk; the second disk was attached several hours later. No further changes were made to the layout until now.

NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
system  3.62T  1.17T  2.46T         -    17%    32%  1.00x  ONLINE  -
  mirror  3.62T  1.17T  2.46T         -    17%    32%
    cryptb      -      -      -         -      -      -
    crypta      -      -      -         -      -      -
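
For reference, a pool with this history can be recreated with file-backed vdevs. This is only a sketch: the pool and file names are made up, it needs root plus a working ZFS install, and it is not guaranteed to reproduce the panic.

```shell
# Create two sparse files to act as vdevs (hypothetical paths).
truncate -s 1G /tmp/vdev-a /tmp/vdev-b

# Start with a single-disk pool, as the original pool was created.
zpool create testpool /tmp/vdev-a

# Attach the second device later, turning the pool into a mirror.
zpool attach testpool /tmp/vdev-a /tmp/vdev-b

# After the resilver completes, attempt the split that triggered the panic.
zpool split testpool testpool2
```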

Include any warning/errors/backtraces from the system logs

See above.

smaeul commented Jan 6, 2017 (edited)

After forcibly rebooting, zpool status showed the following:

  pool: system
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 2h44m with 0 errors on Thu Jan  5 18:46:17 2017
config:

        NAME        STATE     READ WRITE CKSUM
        system      DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            cryptb  ONLINE       0     0     0
            crypta  OFFLINE      0     0     0

errors: No known data errors

And after running zpool online:

  pool: system
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub in progress since Fri Jan  6 06:51:56 2017
    28.3M scanned out of 1.17T at 2.02M/s, 168h0m to go
    0 repaired, 0.00% done
config:

        NAME        STATE     READ WRITE CKSUM
        system      DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            cryptb  ONLINE       0     0     0
            crypta  SPLIT        0     0     0  split into new pool

errors: No known data errors
behlendorf (Member) commented

@smaeul thanks for reporting this issue. It appears the failure was caused by a zero length entry in the dirty time log (DTL) for the mirror when the split was requested. The DTL is used to track the set of transaction groups for which the vdev has less than perfect replication. You should be able to recover full redundancy in the pool by performing a zpool replace as described. Then I'd suggest scrubbing the pool for good measure before retrying the zpool split.
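
The suggested recovery could look like this — a sketch assuming the pool and device names from the report, run as root:

```shell
# Re-mirror onto the device whose label was invalidated by the failed split.
# With no new device given, 'zpool replace' rebuilds the device in place.
zpool replace system crypta

# Wait for the resilver to finish, then verify the data.
zpool status system
zpool scrub system

# Only after a clean scrub, retry the split.
zpool split system blah
```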
