PANIC at dsl_deadlist.c:104:dsl_deadlist_open() for access on specific files/datasets after drive resilvering on Ubuntu 16.04 #5561

Open
pgroe opened this Issue Jan 4, 2017 · 3 comments


@pgroe
pgroe commented Jan 4, 2017

Hi,

I have been using stable zfsonlinux in a production environment since 2013 and never had an issue.
However, a week ago I replaced one drive and had it resilvered from the raidz1.
Now specific tasks (writing to specific files, finishing a scrub, ...) cause a kernel panic, and the user-space program waiting for a response gets stuck in the kernel call.

Besides reporting this bug, I would be glad about any hints on recovering my pool, as it is a production environment.

System information

Type | Low-power server, 8 GB RAM, 7 drives (5×2 TB for ZFS raidz1)
Distribution Name | Ubuntu
Distribution Version | 16.04
Linux Kernel | 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:50:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Architecture | x86_64
ZFS Version | 0.6.5.6-0ubuntu15
SPL Version | 0.6.5.6-0ubuntu4

Describe the problem you're observing

Kernel panic on specific tasks (my guess: accessing a corrupted dataset).
Successfully reproduced by:

  • Writing new large files (the threshold is somewhere around 800 MB)
  • Finishing pool scrub

Describe how to reproduce the problem

I'm not sure. What I did: replaced a drive with unrelocatable sectors, resilvered, then wrote to the pool over SMB -> first kernel panic and stuck smbd processes while writing a large file (about 4 GB).

Include any warning/errors/backtraces from the system logs

$ dmesg

[  461.439155] perf interrupt took too long (10001 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[34073.907986] VERIFY3(0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf)) failed (0 == 2)
[34073.916243] PANIC at dsl_deadlist.c:104:dsl_deadlist_open()
[34073.922075] Showing stack for process 6946
[34073.922097] CPU: 2 PID: 6946 Comm: txg_sync Tainted: P           O    4.4.0-57-generic #78-Ubuntu
[34073.922102] Hardware name: Supermicro X7SPA-H /X7SPA-H , BIOS 1.2a       02/21/12
[34073.922108]  0000000000000286 0000000053965a89 ffff8800a244b988 ffffffff813f6f33
[34073.922128]  ffffffffc066c192 0000000000000068 ffff8800a244b998 ffffffffc03eece2
[34073.922137]  ffff8800a244bb20 ffffffffc03eedab ffff8800a94fe620 0000000000000030
[34073.922145] Call Trace:
[34073.922164]  [<ffffffff813f6f33>] dump_stack+0x63/0x90
[34073.922198]  [<ffffffffc03eece2>] spl_dumpstack+0x42/0x50 [spl]
[34073.922228]  [<ffffffffc03eedab>] spl_panic+0xbb/0xf0 [spl]
[34073.922360]  [<ffffffffc059e796>] ? dmu_buf_rele+0x36/0x40 [zfs]
[34073.922486]  [<ffffffffc059e7ae>] ? dbuf_rele+0xe/0x10 [zfs]
[34073.922593]  [<ffffffffc05ba32e>] ? dnode_hold_impl+0x2fe/0x4e0 [zfs]
[34073.922683]  [<ffffffffc05ba529>] ? dnode_hold+0x19/0x20 [zfs]
[34073.922774]  [<ffffffffc05a91c6>] ? dmu_bonus_hold+0x36/0x280 [zfs]
[34073.922897]  [<ffffffffc05c645b>] dsl_deadlist_open+0xfb/0x180 [zfs]
[34073.922938]  [<ffffffffc03e9b03>] ? spl_kmem_zalloc+0xa3/0x180 [spl]
[34073.923060]  [<ffffffffc05bfb3b>] dsl_dataset_hold_obj+0x1fb/0x890 [zfs]
[34073.923162]  [<ffffffffc05d6121>] dsl_scan_sync+0x421/0xc00 [zfs]
[34073.923271]  [<ffffffffc0645347>] ? zio_wait+0x157/0x1f0 [zfs]
[34073.923368]  [<ffffffffc05ea613>] spa_sync+0x3c3/0xb30 [zfs]
[34073.923380]  [<ffffffff810ac5d2>] ? default_wake_function+0x12/0x20
[34073.923479]  [<ffffffffc05fba4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[34073.923579]  [<ffffffffc05fb690>] ? txg_delay+0x180/0x180 [zfs]
[34073.923600]  [<ffffffffc03ebe31>] thread_generic_wrapper+0x71/0x80 [spl]
[34073.923616]  [<ffffffffc03ebdc0>] ? __thread_exit+0x20/0x20 [spl]
[34073.923623]  [<ffffffff810a0c08>] kthread+0xd8/0xf0
[34073.923631]  [<ffffffff810a0b30>] ? kthread_create_on_node+0x1e0/0x1e0
[34073.923642]  [<ffffffff8183788f>] ret_from_fork+0x3f/0x70
[34073.923648]  [<ffffffff810a0b30>] ? kthread_create_on_node+0x1e0/0x1e0
[34200.524344] INFO: task txg_sync:6946 blocked for more than 120 seconds.
[34200.531188]       Tainted: P           O    4.4.0-57-generic #78-Ubuntu
[34200.538067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[34200.546216] txg_sync        D ffff8800a244b980     0  6946      2 0x00000000
[34200.546228]  ffff8800a244b980 ffff8800a244bc80 ffff880236248cc0 ffff88020b443300
[34200.546236]  ffff8800a244c000 0000000000000068 ffffffffc065b7a0 ffffffffc066c368
[34200.546242]  ffff8800a244bc80 ffff8800a244b998 ffffffff818333f5 ffffffffc066c192
[34200.546249] Call Trace:
[34200.546265]  [<ffffffff818333f5>] schedule+0x35/0x80
[34200.546286]  [<ffffffffc03eedd0>] spl_panic+0xe0/0xf0 [spl]
[34200.546380]  [<ffffffffc059e796>] ? dmu_buf_rele+0x36/0x40 [zfs]
[34200.546464]  [<ffffffffc059e7ae>] ? dbuf_rele+0xe/0x10 [zfs]
[34200.546552]  [<ffffffffc05ba32e>] ? dnode_hold_impl+0x2fe/0x4e0 [zfs]
[34200.546640]  [<ffffffffc05ba529>] ? dnode_hold+0x19/0x20 [zfs]
[34200.546726]  [<ffffffffc05a91c6>] ? dmu_bonus_hold+0x36/0x280 [zfs]
[34200.546816]  [<ffffffffc05c645b>] dsl_deadlist_open+0xfb/0x180 [zfs]
[34200.546832]  [<ffffffffc03e9b03>] ? spl_kmem_zalloc+0xa3/0x180 [spl]
[34200.546922]  [<ffffffffc05bfb3b>] dsl_dataset_hold_obj+0x1fb/0x890 [zfs]
[34200.547015]  [<ffffffffc05d6121>] dsl_scan_sync+0x421/0xc00 [zfs]
[34200.547119]  [<ffffffffc0645347>] ? zio_wait+0x157/0x1f0 [zfs]
[34200.547215]  [<ffffffffc05ea613>] spa_sync+0x3c3/0xb30 [zfs]
[34200.547223]  [<ffffffff810ac5d2>] ? default_wake_function+0x12/0x20
[34200.547322]  [<ffffffffc05fba4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[34200.547421]  [<ffffffffc05fb690>] ? txg_delay+0x180/0x180 [zfs]
[34200.547438]  [<ffffffffc03ebe31>] thread_generic_wrapper+0x71/0x80 [spl]
[34200.547454]  [<ffffffffc03ebdc0>] ? __thread_exit+0x20/0x20 [spl]
[34200.547461]  [<ffffffff810a0c08>] kthread+0xd8/0xf0
[34200.547468]  [<ffffffff810a0b30>] ? kthread_create_on_node+0x1e0/0x1e0
[34200.547474]  [<ffffffff8183788f>] ret_from_fork+0x3f/0x70
[34200.547480]  [<ffffffff810a0b30>] ? kthread_create_on_node+0x1e0/0x1e0

NOTE: the last panic is repeated every 2 minutes, 10 times in total.

$ zpool status

  pool: pool
 state: ONLINE
  scan: scrub in progress since Wed Jan  4 02:02:40 2017
    6,24T scanned out of 6,24T at 152M/s, (scan is slow, no estimated time)
    0 repaired, 100,01% done
config:

        NAME                                                   STATE     READ WRITE CKSUM
        pool                                                   ONLINE       0     0     0
          raidz1-0                                             ONLINE       0     0     0
            ata-WDC_WD20EFRX-...-part16    ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_...-part16  ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_...-part16  ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_...-part16  ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_...-part16  ONLINE       0     0     0

errors: No known data errors

Best Regards,
Patrick

@pgroe
pgroe commented Jan 4, 2017 edited

Just tried $ zfs list -t all

[44124.972929] VERIFY3(0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf)) failed (0 == 2)
[44124.981151] PANIC at dsl_deadlist.c:104:dsl_deadlist_open()
[44124.986967] Showing stack for process 2784
[44124.986977] CPU: 3 PID: 2784 Comm: zfs Tainted: P           O    4.4.0-57-generic #78-Ubuntu
[44124.986983] Hardware name: Supermicro X7SPA-H /X7SPA-H , BIOS 1.2a       02/21/12
[44124.986989]  0000000000000286 00000000ae7596fe ffff88013b527a68 ffffffff813f6f33
[44124.986999]  ffffffffc066c192 0000000000000068 ffff88013b527a78 ffffffffc03eece2
[44124.987007]  ffff88013b527c00 ffffffffc03eedab ffff8800a94fe620 0000000000000030
[44124.987016] Call Trace:
[44124.987035]  [<ffffffff813f6f33>] dump_stack+0x63/0x90
[44124.987069]  [<ffffffffc03eece2>] spl_dumpstack+0x42/0x50 [spl]
[44124.987098]  [<ffffffffc03eedab>] spl_panic+0xbb/0xf0 [spl]
[44124.987224]  [<ffffffffc059e796>] ? dmu_buf_rele+0x36/0x40 [zfs]
[44124.987350]  [<ffffffffc059e7ae>] ? dbuf_rele+0xe/0x10 [zfs]
[44124.987480]  [<ffffffffc05ba32e>] ? dnode_hold_impl+0x2fe/0x4e0 [zfs]
[44124.987618]  [<ffffffffc05ba529>] ? dnode_hold+0x19/0x20 [zfs]
[44124.987743]  [<ffffffffc05a91c6>] ? dmu_bonus_hold+0x36/0x280 [zfs]
[44124.987877]  [<ffffffffc05c645b>] dsl_deadlist_open+0xfb/0x180 [zfs]
[44124.987906]  [<ffffffffc03e9b03>] ? spl_kmem_zalloc+0xa3/0x180 [spl]
[44124.988042]  [<ffffffffc05bfb3b>] dsl_dataset_hold_obj+0x1fb/0x890 [zfs]
[44124.988185]  [<ffffffffc05c098b>] dsl_dataset_stats+0x24b/0x2b0 [zfs]
[44124.988325]  [<ffffffffc05ad30f>] dmu_objset_stats+0x1f/0x60 [zfs]
[44124.988462]  [<ffffffffc0622f15>] zfs_ioc_objset_stats_impl.part.18+0x65/0xe0 [zfs]
[44124.988604]  [<ffffffffc062315d>] zfs_ioc_snapshot_list_next+0x1cd/0x1e0 [zfs]
[44124.988740]  [<ffffffffc062370b>] zfsdev_ioctl+0x44b/0x4e0 [zfs]
[44124.988757]  [<ffffffff8122224f>] do_vfs_ioctl+0x29f/0x490
[44124.988770]  [<ffffffff8106b514>] ? __do_page_fault+0x1b4/0x400
[44124.988780]  [<ffffffff812224b9>] SyS_ioctl+0x79/0x90
[44124.988791]  [<ffffffff818374f2>] entry_SYSCALL_64_fastpath+0x16/0x71

The user process is stuck.

@pgroe pgroe changed the title from Kernel Panic for access on specific files/datasets after drive resilvering on Ubuntu 16.04 to PANIC at dsl_deadlist.c:104:dsl_deadlist_open() for access on specific files/datasets after drive resilvering on Ubuntu 16.04 Jan 4, 2017
@behlendorf
Member

@pgroe thank you for reporting this issue. Unfortunately I don't have an easy answer for you; this appears to be the first time a problem like this has been reported.

For background, the VERIFY indicates that ZFS couldn't open the deadlist for a filesystem and halted the process. Depending on which locks that thread was holding, this can result in other hung processes. The deadlist is where ZFS tracks objects which have been removed from the filesystem but which still exist in a snapshot. It should never be missing as the error code in this case indicates (2 == ENOENT), so that suggests some kind of corruption has occurred.

Can you run zdb -ddddd <pool>/<filesystem> to dump the deadlists from the offending filesystem? You should be able to identify the problematic filesystem by mounting each one individually until the issue reproduces. Alternatively, you can run zdb -ddddd <pool> to dump all of them. This will generate a huge amount of output, so you'll want to redirect it to a file and then search for the 'Deadlist' section. This should give us a better idea of the exact issue.
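
For example, something along these lines; the dataset name and output path here are only placeholders, not taken from your pool:

$ zdb -ddddd pool/somedataset > /tmp/zdb-deadlist.txt 2>&1
$ grep -n -A20 'Deadlist' /tmp/zdb-deadlist.txt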

You should be able to copy the data out of the pool by importing it read-only. However, you'll want to be careful not to run any of the various zfs commands which would result in the deadlist being accessed. Simply mount the filesystem and copy the data.
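
A minimal sketch of that, assuming a backup location of /backup (the destination path and the use of rsync are illustrative choices, not a prescribed recovery procedure):

$ zpool import -o readonly=on pool
# the filesystems should come up read-only; then copy the files out, e.g.:
$ rsync -aHAX /pool/somedataset/ /backup/somedataset/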

@pgroe
pgroe commented Jan 13, 2017 edited

I attached some new disks for backing up the data. After rebooting, the boot process got stuck at the message

a start job is running for import zfs pools by cache file

After deleting the zpool.cache file from a rescue system, the system booted correctly, but (of course) the pool was missing.
Now the (exact same) panic occurs on pool import.

  • Is there any chance to back up the data without importing the pool?
  • Is there any chance to generate the debug data you require to analyze the problem without importing the pool?

Additionally, I tried importing the pool in a degraded state. For 3 of the 5 disks it says

cannot import pool: 'I/O error'

For the other 2 of the 5 disks, the same panic occurs.

EDIT:
Importing the pool using

zpool import -o readonly=on pool

works fine. I will generate the zdb Deadlist output as soon as possible.
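
My rough plan, for reference (the exact dataset names still need to be filled in, and using the -e option to run zdb against an exported pool is my own assumption based on the zdb man page, not something suggested above):

$ zdb -ddddd pool > /tmp/zdb-pool.txt 2>&1
$ grep -n 'Deadlist' /tmp/zdb-pool.txt
# if zdb cannot see the pool via the cache file, 'zdb -e -ddddd pool' may work on the exported pool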
