PANIC: blkptr at ffff881f18601440 has invalid CHECKSUM 0 #6414

Closed
phik opened this issue Jul 27, 2017 · 25 comments

Comments

@phik

phik commented Jul 27, 2017

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.3.1611
Linux Kernel          3.10.0-514.6.1.el7.x86_64
Architecture          Intel x64
ZFS Version           0.7.0-rc5
SPL Version           0.7.0-rc5

Describe the problem you're observing

This particular ZFS filesystem is the backing store for a Lustre MDS. After a Lustre LBUG:

2017-07-26T07:00:11-05:00 hmet0004 kernel: [36445.672990] LustreError: 28630:0:(osd_object.c:1284:__osd_object_create()) ASSERTION( rc == 0 ) failed: sa_buf_hold 18208548 failed: -2
2017-07-26T07:00:11-05:00 hmet0004 kernel: [36445.688811] LustreError: 28630:0:(osd_object.c:1284:__osd_object_create()) LBUG

...and subsequent reboot, any attempt to import the pool now gives this panic:

PANIC: blkptr at ffff881f18601440 has invalid CHECKSUM 0
Showing stack for process 31721
CPU: 3 PID: 31721 Comm: zpool Tainted: P           OE  ------------   3.10.0-514.6.1.el7.x86_64 #1
Hardware name: GIGABYTE D120-C21/MC20-S10, BIOS F02 05/19/2016
 0000000000000003 000000009d6dbe62 ffff881ec47bb1a0 ffffffff816862ac
 ffff881ec47bb1b0 ffffffffa02be234 ffff881ec47bb2d8 ffffffffa02be39c
 0000000000100000 61207274706b6c62 3838666666662074 3431303638316631
Call Trace:
 [<ffffffff816862ac>] dump_stack+0x19/0x1b
 [<ffffffffa02be234>] spl_dumpstack+0x44/0x50 [spl]
 [<ffffffffa02be39c>] vcmn_err+0x6c/0x110 [spl]
 [<ffffffffa03fefb8>] ? vdev_raidz_io_start+0x128/0x320 [zfs]
 [<ffffffffa03fd6c0>] ? vdev_raidz_asize+0x60/0x60 [zfs]
 [<ffffffff811de0c5>] ? kmem_cache_alloc+0x35/0x1e0
 [<ffffffffa02ba319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
 [<ffffffffa02ba319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
 [<ffffffffa03eb079>] zfs_panic_recover+0x69/0x90 [zfs]
 [<ffffffffa03838c0>] ? arc_read+0xa80/0xa80 [zfs]
 [<ffffffffa0442011>] zfs_blkptr_verify+0x311/0x370 [zfs]
 [<ffffffffa04420a4>] zio_read+0x34/0xe0 [zfs]
 [<ffffffff811dd4c1>] ? __kmalloc_node+0x1d1/0x2b0
 [<ffffffffa02b91a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
 [<ffffffffa02b91a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
 [<ffffffffa03838c0>] ? arc_read+0xa80/0xa80 [zfs]
 [<ffffffffa038343f>] arc_read+0x5ff/0xa80 [zfs]
 [<ffffffffa03a5873>] traverse_prefetch_metadata+0xd3/0xe0 [zfs]
 [<ffffffffa03a58f6>] prefetch_dnode_metadata+0x76/0xf0 [zfs]
 [<ffffffffa03a6355>] traverse_visitbp+0x665/0x840 [zfs]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa03a6bb1>] traverse_dnode+0xb1/0x1f0 [zfs]
 [<ffffffffa03a6475>] traverse_visitbp+0x785/0x840 [zfs]
 [<ffffffffa03a6701>] traverse_impl+0x1d1/0x450 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03a6d45>] traverse_dataset_resume+0x55/0x60 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03a6f0b>] traverse_pool+0x19b/0x1d0 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03e0643>] spa_load+0x1983/0x1f60 [zfs]
 [<ffffffffa03e0c7e>] spa_load_best+0x5e/0x290 [zfs]
 [<ffffffffa03e285a>] spa_import+0x1ca/0x6d0 [zfs]
 [<ffffffffa041b404>] zfs_ioc_pool_import+0xf4/0x130 [zfs]
 [<ffffffffa041f346>] zfsdev_ioctl+0x506/0x550 [zfs]
 [<ffffffff81212025>] do_vfs_ioctl+0x2d5/0x4b0
 [<ffffffff81691db1>] ? __do_page_fault+0x171/0x450
 [<ffffffff812122a1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff816968c9>] system_call_fastpath+0x16/0x1b

We've tried every suggested workaround we could find in other issues and mailing list threads, to no avail, including:

  • swapping the disks into a different chassis (because hope springs eternal in the human breast, I suppose) -- panic

  • importing the pool with just a single disk -- panic

  • import -FXn and just import -FX -- panic

  • the same but with zfs_recover=1 -- panic

  • listing the uberblocks with zdb -ul, getting the txg for the last uberblock (by date), and trying a zpool import -T txg -o readonly. This at least did something, which was to read the disks at 2 MB/s. Unfortunately at that rate we were looking at 23 days to read the whole thing, so we gave up after 4 hours.

At one point -- I'm not sure if it was import -FX or zfs_recover=1 -- something changed the nature of the failure from an instant panic at the first "invalid CHECKSUM" error into a series of warnings followed by an assertion:

WARNING: blkptr at ffff881ebe6d5440 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5440 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5440 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d5440 DVA 1 has invalid VDEV 959721836
WARNING: blkptr at ffff881ebe6d5440 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d5740 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5740 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5840 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5840 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5840 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d5840 DVA 1 has invalid VDEV 926298476
WARNING: blkptr at ffff881ebe6d5840 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d5b40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5b40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6140 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6140 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6240 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6240 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6240 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6240 DVA 1 has invalid VDEV 825635180
WARNING: blkptr at ffff881ebe6d6240 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d6540 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6540 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6640 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6640 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6640 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6640 DVA 1 has invalid VDEV 809054572
WARNING: blkptr at ffff881ebe6d6640 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d6940 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6940 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6a40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6a40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6a40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6a40 DVA 1 has invalid VDEV 825375084
WARNING: blkptr at ffff881ebe6d6a40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d6d40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6d40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6e40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6e40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6e40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6e40 DVA 1 has invalid VDEV 925971052
WARNING: blkptr at ffff881ebe6d6e40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d7140 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7140 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7240 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7240 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7240 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7240 DVA 1 has invalid VDEV 875573612
WARNING: blkptr at ffff881ebe6d7240 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d7540 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7540 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7640 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7640 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7640 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7640 DVA 1 has invalid VDEV 876163436
WARNING: blkptr at ffff881ebe6d7640 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d7940 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7940 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7a40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7a40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7a40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7a40 DVA 1 has invalid VDEV 875837548
WARNING: blkptr at ffff881ebe6d7a40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d7d40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7d40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7e40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7e40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7e40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7e40 DVA 1 has invalid VDEV 925905516
WARNING: blkptr at ffff881ebe6d7e40 DVA 2 has invalid VDEV 775303544
VERIFY3(c < (1ULL << 24) >> 9) failed (36028797018963967 < 32768)
PANIC at zio.c:266:zio_buf_alloc()
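If I'm reading zio.c right, that VERIFY is just a bounds check on the buffer-cache index derived from the block size in the damaged block pointer: a garbage size field (effectively all ones) produces an index far beyond the number of I/O buffer caches. A tiny standalone sketch of the arithmetic (my own simplification with made-up macro names, not the ZFS source):

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for SPA_MAXBLOCKSIZE (1 << 24) and SPA_MINBLOCKSHIFT (9). */
#define TOY_MAXBLOCKSIZE  (1ULL << 24)
#define TOY_MINBLOCKSHIFT 9

int main(void)
{
    /* A sane 128K block size vs. a size decoded from a garbage blkptr. */
    uint64_t sizes[] = { 128ULL * 1024, UINT64_MAX };

    for (int i = 0; i < 2; i++) {
        uint64_t c = (sizes[i] - 1) >> TOY_MINBLOCKSHIFT;       /* cache index */
        uint64_t limit = TOY_MAXBLOCKSIZE >> TOY_MINBLOCKSHIFT; /* 32768 */
        printf("size %llu -> index %llu, limit %llu: %s\n",
            (unsigned long long)sizes[i], (unsigned long long)c,
            (unsigned long long)limit,
            c < limit ? "ok" : "VERIFY would fire");
    }
    return 0;
}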

We get the same set of warnings from zdb -AAA -F -e:

Configuration for import:
        vdev_children: 1
        version: 5000
        pool_guid: 15267167644730928330
        name: 'lustre'
        state: 0
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15267167644730928330
            children[0]:
                type: 'raidz'
                id: 0
                guid: 12487255120635568856
                nparity: 3
                metaslab_array: 71
                metaslab_shift: 38
                ashift: 12
                asize: 40003264577536
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 11152527491519449036
                    devid: 'ata-HGST_HUH721010ALE600_7PH35SUC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221100000000-lun-0'
                    whole_disk: 1
                    DTL: 204
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH35SUC1'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12217561655882834813
                    devid: 'ata-HGST_HUH721010ALE600_7PH36B6C-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221103000000-lun-0'
                    whole_disk: 1
                    DTL: 203
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH36B6C1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 14383882701081049619
                    devid: 'ata-HGST_HUH721010ALE600_7PH377MC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221101000000-lun-0'
                    whole_disk: 1
                    DTL: 202
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH377MC1'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 11902102953401244559
                    devid: 'ata-HGST_HUH721010ALE600_7PH395GC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221102000000-lun-0'
                    whole_disk: 1
                    DTL: 201
                    create_txg: 4
                    offline: 1
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH395GC1'
        rewind-policy:
            rewind-request-txg: 18446744073709551615
            rewind-request: 8
(all of the above warnings)
c < (1ULL << 24) >> 9 (0x7fffffffffffff < 0x8000)
ASSERT at ../../module/zfs/zio.c:266:zio_buf_alloc()Aborted

I guess there are two paths forward from here:

  1. A fairly benign kernel panic in a largely-unrelated component somehow leads to an unusable file system? This is terrifying!

  2. Having reached this stage, is there anything at all that can be done to recover from it? It seems like the data must all be there -- after all, the system went from fully operational to unusable in an instant -- we just can't get to it.

I'd be eternally grateful for any advice!

@phik
Author

phik commented Jul 27, 2017

We've used the (truly terrifying) zfs_revert script (from https://gist.github.com/jshoward/5685757). Removing one transaction was not sufficient -- we still got the panic -- but after removing five transactions, we can mount again!

The filesystem seems to be functional at that point -- we're able to create 100,000 files and write a GB of data -- and we've brought Lustre back up.

We still have the original dd copies of the raw corrupted disks, so we can still try to learn what's happened here and how to "properly" recover.
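For the record, my rough mental model of why that works (just my understanding, not the script's actual logic): each label keeps a ring of uberblocks, and import normally starts from the one with the highest txg that still checksums correctly, so blanking the newest entries pushes the import back to an older, hopefully sane, txg. In toy form:

#include <stdint.h>
#include <stdio.h>

/* Toy model of an uberblock ring entry: a txg plus a "checksum ok" flag. */
struct toy_ub { uint64_t txg; int valid; };

/* Pick the newest entry that still verifies -- roughly what import does. */
static const struct toy_ub *pick_best(const struct toy_ub *ring, int n)
{
    const struct toy_ub *best = NULL;
    for (int i = 0; i < n; i++)
        if (ring[i].valid && (best == NULL || ring[i].txg > best->txg))
            best = &ring[i];
    return best;
}

int main(void)
{
    struct toy_ub ring[] = {
        { 1000, 1 }, { 1001, 1 }, { 1002, 1 }, { 1003, 1 }, { 1004, 1 },
    };

    printf("import starts from txg %llu\n",
        (unsigned long long)pick_best(ring, 5)->txg);

    /* "Removing" the newest transactions == invalidating their uberblocks. */
    ring[4].valid = 0;
    ring[3].valid = 0;
    printf("after blanking two entries: txg %llu\n",
        (unsigned long long)pick_best(ring, 5)->txg);
    return 0;
}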

@behlendorf
Contributor

That is a scary script! I'm glad you were able to use it to roll back to a point where you could import the pool and mount the filesystems. This is exactly the sort of thing zpool import -FXn should have been able to do, but clearly couldn't because of the panic you're seeing.

As for the issue itself, we saw something very similar on a test system while testing Lustre 2.10 prior to its official release. Based on the pool damage we observed, our best guess is that Lustre somehow overwrote some memory containing block pointers; valid checksums were then generated for the garbage block pointers, and they were written to disk.

This is why the zpool import -FXn fails with a panic. ZFS trusts the contents of the blocks because the checksum they were written with is valid. One of the planned development items is to add some additional sanity checks to the ZFS code to verify certain data structures read from disk before using them. The damaged pool you saved would be a great test case for this.

To my knowledge we haven't had any reports of this sort of thing happening when using ZFS without Lustre. Perhaps @adilger knows if any issues like this were recently resolved in Lustre?

@phik
Author

phik commented Jul 28, 2017

When I went looking for this error in the code:

WARNING: blkptr at ffff881ebe6d5440 has invalid CHECKSUM 0

...what I found suggested that the checksum "type" field holds an invalid value, and that ZFS therefore didn't even know which algorithm to use to validate the checksum?

Or have I misinterpreted?

@behlendorf
Contributor

You've got it exactly right. The other warnings you reported show variations on similar forms of damage (unknown compression types, impossible DVAs, etc.), all of which suggest the block pointer somehow got stomped on.
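To make that concrete: those warnings come from range checks performed before the block is even read, and the block pointer can only be checked field by field because the checksum covering it is stored one level up. Very roughly (a simplified sketch with illustrative limits, not the actual zfs_blkptr_verify() code):

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a block pointer; the real blkptr_t packs these
 * (and more) into a 128-byte on-disk structure. */
struct toy_bp {
    uint64_t dva_vdev[3]; /* top-level vdev index for each copy of the block */
    uint8_t  checksum;    /* checksum algorithm used for the pointed-to block */
    uint8_t  compress;    /* compression algorithm for the pointed-to block */
};

/* Illustrative limits; 0 means "inherit" and is never a legal on-disk value. */
#define TOY_CKSUM_FUNCS 10
#define TOY_COMP_FUNCS  16

static void toy_blkptr_verify(const struct toy_bp *bp, uint64_t nvdevs)
{
    if (bp->checksum == 0 || bp->checksum >= TOY_CKSUM_FUNCS)
        printf("blkptr has invalid CHECKSUM %u\n", (unsigned)bp->checksum);
    if (bp->compress == 0 || bp->compress >= TOY_COMP_FUNCS)
        printf("blkptr has invalid COMPRESS %u\n", (unsigned)bp->compress);
    for (int d = 0; d < 3; d++)
        if (bp->dva_vdev[d] >= nvdevs)
            printf("DVA %d has invalid VDEV %llu\n", d,
                (unsigned long long)bp->dva_vdev[d]);
}

int main(void)
{
    /* Values taken from the warnings above; the pool has one top-level vdev. */
    struct toy_bp stomped = {
        .dva_vdev = { 1969487872, 959721836, 775303544 },
        .checksum = 0,
        .compress = 0,
    };
    toy_blkptr_verify(&stomped, 1);
    return 0;
}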

@phik
Author

phik commented Jul 28, 2017

Then I don't know how to square that with your previous statement:

This is why the zpool import -FXn fails with a panic. ZFS trusts the contents of the blocks because the checksum they were written with is valid.

How can the block checksum be valid, if it doesn't know which algorithm to use to validate?

@behlendorf
Contributor

Exactly! ZFS inherently trusts that if the checksum for a block is good, which it was, then the contents of that block are sane. Somehow blocks were written to the pool with valid checksums and nonsense values. That should be impossible; the only plausible way it can happen that I'm aware of is memory corruption. The block pointers got damaged in memory after being created but before they were checksummed and written.

@phik
Author

phik commented Jul 28, 2017

I'm still confused about how it can know that the checksum is valid if it doesn't even know which checksum algorithm was used -- but I trust you. :)

@behlendorf
Contributor

behlendorf commented Jul 28, 2017

The checksum for the damaged block and its type are stored in the parent. The block pointer it's complaining about is for one of its children. :)
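Schematically it looks something like this (a simplified model, not the real on-disk structures):

#include <stdint.h>

/* A parent block -- an indirect block, a dnode, and so on up to the
 * uberblock -- embeds the block pointers for its children.  Each embedded
 * pointer records where the CHILD lives, which checksum algorithm was used
 * for it, and the checksum of the child's contents. */
struct toy_child_ptr {
    uint64_t child_vdev;      /* location of the child block */
    uint64_t child_offset;
    uint8_t  child_cksum_alg; /* algorithm used to checksum the child */
    uint64_t child_cksum[4];  /* 256-bit checksum of the child's data */
};

/* The parent's payload is essentially an array of child pointers.  The
 * checksum covering THIS array is stored one level further up, in the
 * parent's own block pointer.  That higher-level checksum verified fine
 * here, which is why ZFS trusted a payload whose embedded child pointers
 * were already garbage when they were written. */
struct toy_indirect_block {
    struct toy_child_ptr children[128];
};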

@phik
Author

phik commented Jul 28, 2017

Ah ha! You've drawn the veil from my eyes. I'm now very slightly less ignorant. Thank you!

@sdm900

sdm900 commented Jul 29, 2017

Thanks for the insight. How do you want the pool? It's currently 3 x 10T raw disk images (dd of the raw device). I suspect only ~2.1T of each disk is really in use.

@sdm900

sdm900 commented Aug 1, 2017

I've just read this again and wondered: when this corruption is identified, why isn't the previous txg used, continually dropping back until a sane txg is found? Surely almost any issue should either cause the FS not to mount and allow the user to take alternative action, or drop back (as requested by the user) to a sane block?

@Lalufu
Contributor

Lalufu commented Aug 1, 2017

My guess would be that no one anticipated that this could happen, and the code to handle it isn't present.

@behlendorf behlendorf mentioned this issue Aug 1, 2017
@behlendorf
Contributor

@phik we think we've identified the root cause here and opened #6439 with a proposed fix for review. My suggestion for the moment is to set dnodesize=legacy on your MDS as a workaround until we finalize the fix and get it merged.

Continually dropping back until a sane txg is found

Yes, this is exactly how the zpool import -FX logic is implemented. The trouble here is that the block has a valid checksum, which is what ZFS uses to decide whether it should trust something read from disk. As mentioned above, we're looking into adding additional layers of sanity checking to more gracefully handle situations like this one, which shouldn't be possible unless there's a bug. And we think we found that bug.
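For the curious, the shape of the bug (a rough sketch with made-up helper names; #6439 has the real patch): the allocation retry path tested whether the dnode pointer was still non-NULL instead of checking the return code from the hold, and the pointer wasn't reset between retries, so a failed hold could be mistaken for a successful one and two allocators could end up with overlapping large dnodes:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for dnode_hold_impl(): returns 0 and sets *dnp on
 * success; returns an errno and leaves *dnp untouched on failure. */
static int toy_hold(int attempt, int **dnp)
{
    static int the_dnode = 42;

    if (attempt == 0)
        return (ENOENT); /* lost the race; note *dnp is not touched */
    *dnp = &the_dnode;
    return (0);
}

int main(void)
{
    int stale = 7;
    int *dn = &stale; /* left over from an earlier, released hold */

    /* Buggy shape: test the (possibly stale) pointer, not the return code. */
    for (int attempt = 0; attempt < 2; attempt++) {
        (void) toy_hold(attempt, &dn);
        if (dn != NULL) { /* stale pointer makes a failed hold look good */
            printf("buggy loop accepted attempt %d\n", attempt);
            break;
        }
    }

    /* Fixed shape: reset the pointer every retry and trust the return code. */
    for (int attempt = 0; attempt < 2; attempt++) {
        int *dn2 = NULL;
        if (toy_hold(attempt, &dn2) == 0) {
            printf("fixed loop accepted attempt %d\n", attempt);
            break;
        }
    }
    return (0);
}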

@behlendorf behlendorf added this to TODO in 0.7.1 Aug 1, 2017
@phik
Author

phik commented Aug 1, 2017

Innnteresting. I definitely don't understand the internals well enough to comment on the patch, but your description would certainly seem to fit the symptom.

Thanks very much for the prompt followup! I'll discuss your suggestion with the team ASAP.

@behlendorf behlendorf removed this from TODO in 0.7.1 Aug 5, 2017
@sdm900

sdm900 commented Aug 7, 2017

FWIW we just got this panic again on a 0.7.0-rc3 system

[375528.898326] PANIC: blkptr at ffff881785cfc440 has invalid CHECKSUM 0
[375528.906484] Showing stack for process 221
[375528.912237] CPU: 3 PID: 221 Comm: spl_system_task Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
[375528.924617] Hardware name: GIGABYTE D120-C21/MC20-S10, BIOS F02 05/19/2016
[375528.933127] 0000000000000003 000000003bbe38b8 ffff881fe100f3a0 ffffffff816862ac
[375528.942221] ffff881fe100f3b0 ffffffffa02be234 ffff881fe100f4d8 ffffffffa02be39c
[375528.951631] 0000000000100000 61207274706b6c62 3838666666662074 3463666335383731
[375528.960947] Call Trace:
[375528.965138] [] dump_stack+0x19/0x1b
[375528.971914] [] spl_dumpstack+0x44/0x50 [spl]
[375528.979440] [] vcmn_err+0x6c/0x110 [spl]
[375528.986601] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375528.995011] [] ? kmem_cache_alloc+0x193/0x1e0
[375529.002551] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375529.010875] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375529.019175] [] zfs_panic_recover+0x69/0x90 [zfs]
[375529.026943] [] ? arc_read+0xa30/0xa30 [zfs]
[375529.034226] [] zfs_blkptr_verify+0x311/0x370 [zfs]
[375529.042129] [] zio_read+0x34/0xe0 [zfs]
[375529.049037] [] ? __kmalloc_node+0x1d1/0x2b0
[375529.056289] [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[375529.064032] [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[375529.072051] [] ? arc_read+0xa30/0xa30 [zfs]
[375529.079230] [] arc_read+0x5ff/0xa30 [zfs]
[375529.086195] [] traverse_prefetcher+0x14c/0x180 [zfs]
[375529.094123] [] traverse_visitbp+0x4e9/0x840 [zfs]
[375529.101733] [] traverse_dnode+0xb1/0x1f0 [zfs]
[375529.109096] [] traverse_visitbp+0x6bb/0x840 [zfs]
[375529.116641] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.124188] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.131715] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.139194] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.146601] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.154036] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.161398] [] traverse_dnode+0xb1/0x1f0 [zfs]
[375529.168496] [] traverse_visitbp+0x785/0x840 [zfs]
[375529.175806] [] traverse_prefetch_thread+0x108/0x180 [zfs]
[375529.183814] [] ? prefetch_needed.isra.5+0x40/0x40 [zfs]
[375529.191620] [] taskq_thread+0x246/0x470 [spl]
[375529.198513] [] ? wake_up_state+0x20/0x20
[375529.204969] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[375529.212370] [] kthread+0xcf/0xe0
[375529.218115] [] ? kthread_create_on_node+0x140/0x140
[375529.225444] [] ret_from_fork+0x58/0x90
[375529.231689] [] ? kthread_create_on_node+0x140/0x140

the machine was still up and running... hopefully it comes back painlessly after a reboot...

@sdm900

sdm900 commented Aug 7, 2017

Rebooted just fine.

@behlendorf
Contributor

@sdm900 the exact issue reported here was introduced in 0.7.0-rc5 and will be fixed in 0.7.1.

@sdm900

sdm900 commented Aug 7, 2017

That concerns me a little.

tonyhutter pushed a commit that referenced this issue Aug 8, 2017
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only later when
  dnode_allocate() is called will it transition to the new
  type.  This means there's a window when allocating where
  it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
@behlendorf
Contributor

This issue was resolved in the v0.7.1 release tagged on August 8th.

@phik
Author

phik commented Aug 10, 2017

I have a few lingering questions, if I may...

  • Out of curiosity, how reliable is that test that you committed? Did that reproduce the problem 100% of the time on rc5?

  • Stu says that we had this issue again on an rc3 system (in his comment of Aug 6). How can that be, if the problem was introduced in rc5?

  • We still have the fact that zpool import -FX couldn't import this filesystem -- should this be filed separately?

  • There's also the (separate?) issue that zpool import -T couldn't do in finite time what the horrifying python script did very quickly. import -T seemed to be reading the entire filesystem at 2 MB/s. Should this be filed separately?

@behlendorf
Contributor

Out of curiosity, how reliable is that test that you committed?

It reproduced the issue 100% of the time in about 5 seconds. After applying the fix I left it running in a loop for about 8 hours without issue. We were able to reproduce the issue on one of our Lustre test systems although it took longer to reproduce, after applying the fix a week ago we haven't seen it since.

Stu says that we had this issue again on an rc3 system (in his comment of Aug 6). How can that be, if the problem was introduced in rc5?

I agree, that is very troubling. The PANIC only indicates that a damaged block pointer was encountered; it doesn't reveal how it ended up damaged. The issue fixed here was caused by two dnodes stomping on each other and is understood; unfortunately, the root cause of @sdm900's PANIC must be different.
@sdm900 if possible it would be great if you could scrub the pool and see if anything is reported.

We still have the fact that zpool import -FX couldn't import this filesystem -- should this be filed separately?

Yes. We've talked about doing this development work for quite a while but I don't think we've ever opened an issue for it. It's a fairly large development effort.

There's also the (separate?) issue that zpool import -T couldn't do in finite time what the horrifying python script did very quickly. import -T seemed to be reading the entire filesystem at 2 MB/s. Should this be filed separately?

Yes, it would be great to open an issue for that too. We're going to need to investigate where exactly zpool import -T ends up spending all that time.

@sdm900

sdm900 commented Aug 11, 2017

I'd love to scrub the file system (and I'll set it going)... but I don't think it will ever finish. It is on a production file system that is heavily used and we will no doubt lose power before it finishes :(

@sdm900

sdm900 commented Aug 14, 2017

As far as I can tell, zfs scrub is useless...

  scan: scrub in progress since Sat Aug 12 20:01:57 2017  
        540G scanned out of 6.32T at 6.26M/s, 269h20m to go
        0 repaired, 8.35% done

and this file system has gone quiet over the weekend. I expect the rate to drop to almost 0 when people are in the office.

@behlendorf
Contributor

@sdm900 for filesystems with lots of small files or heavy fragmentation I agree it can be unreasonably slow. There are patches in #6256 under development to improve the situation. But it sounds like you may want to just cancel, or pause, the scrub for now if it's not going to complete in a reasonable amount of time.

behlendorf pushed a commit that referenced this issue Sep 5, 2017
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396 
Closes #6522 
Closes #6414 
Closes #6564
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this issue Sep 6, 2017
@dioni21
Contributor

dioni21 commented Jul 17, 2018

@behlendorf Sorry to ask here, but...

Is there currently any tool to fix the invalid CHECKSUM panic?

I'm not sure, but the error could have been caused by something related to the kernel bug reported at #7723 (https://bugzilla.redhat.com/show_bug.cgi?id=1598462)

Sadly, the host panicked a few seconds after loading the zfs driver, so I could never even try to run zpool import -F. Since I had two disks, I isolated one for a later post-mortem and restarted the send/recv from scratch to a single disk. That's when I found the kernel bug mentioned above. 😭

FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019