PANIC: blkptr at ffff881f18601440 has invalid CHECKSUM 0 #6414

Closed
phik opened this issue Jul 27, 2017 · 25 comments

Comments

@phik

phik commented Jul 27, 2017

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.3.1611
Linux Kernel          3.10.0-514.6.1.el7.x86_64
Architecture          Intel x64
ZFS Version           0.7.0-rc5
SPL Version           0.7.0-rc5

Describe the problem you're observing

This particular ZFS filesystem is the backing store for a Lustre MDS. After a Lustre LBUG:

2017-07-26T07:00:11-05:00 hmet0004 kernel: [36445.672990] LustreError: 28630:0:(osd_object.c:1284:__osd_object_create()) ASSERTION( rc == 0 ) failed: sa_buf_hold 18208548 failed: -2
2017-07-26T07:00:11-05:00 hmet0004 kernel: [36445.688811] LustreError: 28630:0:(osd_object.c:1284:__osd_object_create()) LBUG

...and subsequent reboot, any attempt to import the pool now gives this panic:

PANIC: blkptr at ffff881f18601440 has invalid CHECKSUM 0
Showing stack for process 31721
CPU: 3 PID: 31721 Comm: zpool Tainted: P           OE  ------------   3.10.0-514.6.1.el7.x86_64 #1
Hardware name: GIGABYTE D120-C21/MC20-S10, BIOS F02 05/19/2016
 0000000000000003 000000009d6dbe62 ffff881ec47bb1a0 ffffffff816862ac
 ffff881ec47bb1b0 ffffffffa02be234 ffff881ec47bb2d8 ffffffffa02be39c
 0000000000100000 61207274706b6c62 3838666666662074 3431303638316631
Call Trace:
 [<ffffffff816862ac>] dump_stack+0x19/0x1b
 [<ffffffffa02be234>] spl_dumpstack+0x44/0x50 [spl]
 [<ffffffffa02be39c>] vcmn_err+0x6c/0x110 [spl]
 [<ffffffffa03fefb8>] ? vdev_raidz_io_start+0x128/0x320 [zfs]
 [<ffffffffa03fd6c0>] ? vdev_raidz_asize+0x60/0x60 [zfs]
 [<ffffffff811de0c5>] ? kmem_cache_alloc+0x35/0x1e0
 [<ffffffffa02ba319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
 [<ffffffffa02ba319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
 [<ffffffffa03eb079>] zfs_panic_recover+0x69/0x90 [zfs]
 [<ffffffffa03838c0>] ? arc_read+0xa80/0xa80 [zfs]
 [<ffffffffa0442011>] zfs_blkptr_verify+0x311/0x370 [zfs]
 [<ffffffffa04420a4>] zio_read+0x34/0xe0 [zfs]
 [<ffffffff811dd4c1>] ? __kmalloc_node+0x1d1/0x2b0
 [<ffffffffa02b91a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
 [<ffffffffa02b91a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
 [<ffffffffa03838c0>] ? arc_read+0xa80/0xa80 [zfs]
 [<ffffffffa038343f>] arc_read+0x5ff/0xa80 [zfs]
 [<ffffffffa03a5873>] traverse_prefetch_metadata+0xd3/0xe0 [zfs]
 [<ffffffffa03a58f6>] prefetch_dnode_metadata+0x76/0xf0 [zfs]
 [<ffffffffa03a6355>] traverse_visitbp+0x665/0x840 [zfs]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa02b9037>] ? spl_kmem_alloc+0xc7/0x170 [spl]
 [<ffffffffa03a5fe5>] traverse_visitbp+0x2f5/0x840 [zfs]
 [<ffffffffa03a6bb1>] traverse_dnode+0xb1/0x1f0 [zfs]
 [<ffffffffa03a6475>] traverse_visitbp+0x785/0x840 [zfs]
 [<ffffffffa03a6701>] traverse_impl+0x1d1/0x450 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03a6d45>] traverse_dataset_resume+0x55/0x60 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03a6f0b>] traverse_pool+0x19b/0x1d0 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03d9c50>] ? spa_async_suspend+0x80/0x80 [zfs]
 [<ffffffffa03e0643>] spa_load+0x1983/0x1f60 [zfs]
 [<ffffffffa03e0c7e>] spa_load_best+0x5e/0x290 [zfs]
 [<ffffffffa03e285a>] spa_import+0x1ca/0x6d0 [zfs]
 [<ffffffffa041b404>] zfs_ioc_pool_import+0xf4/0x130 [zfs]
 [<ffffffffa041f346>] zfsdev_ioctl+0x506/0x550 [zfs]
 [<ffffffff81212025>] do_vfs_ioctl+0x2d5/0x4b0
 [<ffffffff81691db1>] ? __do_page_fault+0x171/0x450
 [<ffffffff812122a1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff816968c9>] system_call_fastpath+0x16/0x1b

We've tried every suggested workaround we could find in other issues and mailing list threads, to no avail, including:

  • swapping the disks into a different chassis (because hope springs eternal in the human breast, I suppose) -- panic

  • importing the pool with just a single disk -- panic

  • import -FXn and just import -FX -- panic

  • the same but with zfs_recover=1 -- panic

  • listing the uberblocks with zdb -ul, getting the txg for the last uberblock (by date), and trying a zpool import -T txg -o readonly. This at least did something, which was to read the disks at 2 MB/s. Unfortunately at that rate we were looking at 23 days to read the whole thing, so we gave up after 4 hours.

At one point -- I'm not sure if it was import -FX or zfs_recover=1 -- something changed the nature of the failure from an instant panic at the first "invalid CHECKSUM" error into a series of warnings followed by an assertion:

WARNING: blkptr at ffff881ebe6d5440 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5440 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5440 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d5440 DVA 1 has invalid VDEV 959721836
WARNING: blkptr at ffff881ebe6d5440 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d5740 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5740 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5840 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5840 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d5840 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d5840 DVA 1 has invalid VDEV 926298476
WARNING: blkptr at ffff881ebe6d5840 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d5b40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d5b40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6140 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6140 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6240 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6240 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6240 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6240 DVA 1 has invalid VDEV 825635180
WARNING: blkptr at ffff881ebe6d6240 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d6540 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6540 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6640 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6640 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6640 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6640 DVA 1 has invalid VDEV 809054572
WARNING: blkptr at ffff881ebe6d6640 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d6940 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6940 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6a40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6a40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6a40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6a40 DVA 1 has invalid VDEV 825375084
WARNING: blkptr at ffff881ebe6d6a40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d6d40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6d40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6e40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d6e40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d6e40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d6e40 DVA 1 has invalid VDEV 925971052
WARNING: blkptr at ffff881ebe6d6e40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d7140 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7140 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7240 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7240 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7240 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7240 DVA 1 has invalid VDEV 875573612
WARNING: blkptr at ffff881ebe6d7240 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d7540 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7540 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7640 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7640 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7640 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7640 DVA 1 has invalid VDEV 876163436
WARNING: blkptr at ffff881ebe6d7640 DVA 2 has invalid VDEV 775303544
WARNING: blkptr at ffff881ebe6d7940 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7940 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7a40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7a40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7a40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7a40 DVA 1 has invalid VDEV 875837548
WARNING: blkptr at ffff881ebe6d7a40 DVA 2 has invalid VDEV 1680750129
WARNING: blkptr at ffff881ebe6d7d40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7d40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7e40 has invalid CHECKSUM 0
WARNING: blkptr at ffff881ebe6d7e40 has invalid COMPRESS 0
WARNING: blkptr at ffff881ebe6d7e40 DVA 0 has invalid VDEV 1969487872
WARNING: blkptr at ffff881ebe6d7e40 DVA 1 has invalid VDEV 925905516
WARNING: blkptr at ffff881ebe6d7e40 DVA 2 has invalid VDEV 775303544
VERIFY3(c < (1ULL << 24) >> 9) failed (36028797018963967 < 32768)
PANIC at zio.c:266:zio_buf_alloc()
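If I'm reading zio.c right, that VERIFY is just a bounds check on the buffer-cache index derived from the block size in the damaged block pointer: a garbage size field (effectively all ones) produces an index far beyond the number of I/O buffer caches. A tiny standalone sketch of the arithmetic (my own simplification with made-up macro names, not the ZFS source):

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for SPA_MAXBLOCKSIZE (1 << 24) and SPA_MINBLOCKSHIFT (9). */
#define TOY_MAXBLOCKSIZE  (1ULL << 24)
#define TOY_MINBLOCKSHIFT 9

int main(void)
{
    /* A sane 128K block size vs. a size decoded from a garbage blkptr. */
    uint64_t sizes[] = { 128ULL * 1024, UINT64_MAX };

    for (int i = 0; i < 2; i++) {
        uint64_t c = (sizes[i] - 1) >> TOY_MINBLOCKSHIFT;       /* cache index */
        uint64_t limit = TOY_MAXBLOCKSIZE >> TOY_MINBLOCKSHIFT; /* 32768 */
        printf("size %llu -> index %llu, limit %llu: %s\n",
            (unsigned long long)sizes[i], (unsigned long long)c,
            (unsigned long long)limit,
            c < limit ? "ok" : "VERIFY would fire");
    }
    return 0;
}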

We get the same set of warnings from zdb -AAA -F -e:

Configuration for import:
        vdev_children: 1
        version: 5000
        pool_guid: 15267167644730928330
        name: 'lustre'
        state: 0
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15267167644730928330
            children[0]:
                type: 'raidz'
                id: 0
                guid: 12487255120635568856
                nparity: 3
                metaslab_array: 71
                metaslab_shift: 38
                ashift: 12
                asize: 40003264577536
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 11152527491519449036
                    devid: 'ata-HGST_HUH721010ALE600_7PH35SUC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221100000000-lun-0'
                    whole_disk: 1
                    DTL: 204
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH35SUC1'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12217561655882834813
                    devid: 'ata-HGST_HUH721010ALE600_7PH36B6C-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221103000000-lun-0'
                    whole_disk: 1
                    DTL: 203
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH36B6C1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 14383882701081049619
                    devid: 'ata-HGST_HUH721010ALE600_7PH377MC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221101000000-lun-0'
                    whole_disk: 1
                    DTL: 202
                    create_txg: 4
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH377MC1'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 11902102953401244559
                    devid: 'ata-HGST_HUH721010ALE600_7PH395GC-part1'
                    phys_path: 'pci-0000:05:00.0-sas-0x4433221102000000-lun-0'
                    whole_disk: 1
                    DTL: 201
                    create_txg: 4
                    offline: 1
                    path: '/dev/mapper/snap_ata-HGST_HUH721010ALE600_7PH395GC1'
        rewind-policy:
            rewind-request-txg: 18446744073709551615
            rewind-request: 8
(all of the above warnings)
c < (1ULL << 24) >> 9 (0x7fffffffffffff < 0x8000)
ASSERT at ../../module/zfs/zio.c:266:zio_buf_alloc()Aborted

I guess there are two paths forward from here:

  1. A fairly benign kernel panic in a largely-unrelated component somehow leads to an unusable file system? This is terrifying!

  2. Having reached this stage, is there anything at all that can be done to recover from it? It seems like the data must all be there -- after all, the system went from fully operational to unusable in an instant -- we just can't get to it.

I'd be eternally grateful for any advice!

@phik
Author

phik commented Jul 27, 2017

We've used the (truly terrifying) zfs_revert script (from https://gist.github.com/jshoward/5685757). Removing one transaction was not sufficient -- we still got the panic -- but after removing five transactions, we can mount again!

The filesystem seems to be functional at that point -- we're able to create 100,000 files and write a GB of data -- and we've brought Lustre back up.

We still have the original dd copies of the raw corrupted disks, so we can still try to learn what's happened here and how to "properly" recover.
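For the record, my rough mental model of why that works (just my understanding, not the script's actual logic): each label keeps a ring of uberblocks, and import normally starts from the one with the highest txg that still checksums correctly, so blanking the newest entries pushes the import back to an older, hopefully sane, txg. In toy form:

#include <stdint.h>
#include <stdio.h>

/* Toy model of an uberblock ring entry: a txg plus a "checksum ok" flag. */
struct toy_ub { uint64_t txg; int valid; };

/* Pick the newest entry that still verifies -- roughly what import does. */
static const struct toy_ub *pick_best(const struct toy_ub *ring, int n)
{
    const struct toy_ub *best = NULL;
    for (int i = 0; i < n; i++)
        if (ring[i].valid && (best == NULL || ring[i].txg > best->txg))
            best = &ring[i];
    return best;
}

int main(void)
{
    struct toy_ub ring[] = {
        { 1000, 1 }, { 1001, 1 }, { 1002, 1 }, { 1003, 1 }, { 1004, 1 },
    };

    printf("import starts from txg %llu\n",
        (unsigned long long)pick_best(ring, 5)->txg);

    /* "Removing" the newest transactions == invalidating their uberblocks. */
    ring[4].valid = 0;
    ring[3].valid = 0;
    printf("after blanking two entries: txg %llu\n",
        (unsigned long long)pick_best(ring, 5)->txg);
    return 0;
}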

@behlendorf
Contributor

That is a scary script! I'm glad you were able to use it to roll back to a point where you could import the pool and mount the filesystems. This is exactly the sort of thing zpool import -FXn should have been able to do, but clearly couldn't because of the panic you're seeing.

As for the issue itself, we saw something very similar on a test system while testing Lustre 2.10 prior to its official release. Based on the pool damage we observed, our best guess is that Lustre somehow overwrote some memory containing block pointers; valid checksums were then generated for the garbage block pointers, and they were written to disk.

This is why the zpool import -FXn fails with a panic. ZFS trusts the contents of the blocks because the checksum they were written with is valid. One of the planned development items is to add some additional sanity checks to the ZFS code to verify certain data structures read from disk before using them. The damaged pool you saved would be a great test case for this.

To my knowledge we haven't had any reports of this sort of thing happening when using ZFS without Lustre. Perhaps @adilger knows if any issues like this were recently resolved in Lustre?

@phik
Author

phik commented Jul 28, 2017

When I went looking for this error in the code:

WARNING: blkptr at ffff881ebe6d5440 has invalid CHECKSUM 0

...what I found suggested that the checksum "type" field holds an invalid value, and that ZFS therefore didn't even know which algorithm to use to validate the checksum?

Or have I misinterpreted?

@behlendorf
Contributor

You've got it exactly right. The other warnings you reported show variations on similar forms of damage (unknown compression types, impossible DVAs, etc.), all of which suggest the block pointer somehow got stomped on.
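To make that concrete: those warnings come from range checks performed before the block is even read, and the block pointer can only be checked field by field because the checksum covering it is stored one level up. Very roughly (a simplified sketch with illustrative limits, not the actual zfs_blkptr_verify() code):

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a block pointer; the real blkptr_t packs these
 * (and more) into a 128-byte on-disk structure. */
struct toy_bp {
    uint64_t dva_vdev[3]; /* top-level vdev index for each copy of the block */
    uint8_t  checksum;    /* checksum algorithm used for the pointed-to block */
    uint8_t  compress;    /* compression algorithm for the pointed-to block */
};

/* Illustrative limits; 0 means "inherit" and is never a legal on-disk value. */
#define TOY_CKSUM_FUNCS 10
#define TOY_COMP_FUNCS  16

static void toy_blkptr_verify(const struct toy_bp *bp, uint64_t nvdevs)
{
    if (bp->checksum == 0 || bp->checksum >= TOY_CKSUM_FUNCS)
        printf("blkptr has invalid CHECKSUM %u\n", (unsigned)bp->checksum);
    if (bp->compress == 0 || bp->compress >= TOY_COMP_FUNCS)
        printf("blkptr has invalid COMPRESS %u\n", (unsigned)bp->compress);
    for (int d = 0; d < 3; d++)
        if (bp->dva_vdev[d] >= nvdevs)
            printf("DVA %d has invalid VDEV %llu\n", d,
                (unsigned long long)bp->dva_vdev[d]);
}

int main(void)
{
    /* Values taken from the warnings above; the pool has one top-level vdev. */
    struct toy_bp stomped = {
        .dva_vdev = { 1969487872, 959721836, 775303544 },
        .checksum = 0,
        .compress = 0,
    };
    toy_blkptr_verify(&stomped, 1);
    return 0;
}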

@phik
Author

phik commented Jul 28, 2017

Then I don't know how to square that with your previous statement:

This is why the zpool import -FXn fails with a panic. ZFS trusts the contents of the blocks because the checksum they were written with is valid.

How can the block checksum be valid, if it doesn't know which algorithm to use to validate?

@behlendorf
Contributor

Exactly! ZFS inherently trusts that if the checksum for a block is good, which it was, then the contents of that block are sane. Somehow blocks were written to the pool with valid checksums and nonsense values. That should be impossible; the only plausible way it can happen that I'm aware of is memory corruption. The block pointers got damaged in memory after being created but before they were checksummed and written.

@phik
Author

phik commented Jul 28, 2017

I'm still confused about how it can know that the checksum is valid if it doesn't even know which checksum algorithm was used -- but I trust you. :)

@behlendorf
Contributor

behlendorf commented Jul 28, 2017

The checksum for the damaged block and its type are stored in the parent. The block pointer it's complaining about is for one of its children. :)
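Schematically it looks something like this (a simplified model, not the real on-disk structures):

#include <stdint.h>

/* A parent block -- an indirect block, a dnode, and so on up to the
 * uberblock -- embeds the block pointers for its children.  Each embedded
 * pointer records where the CHILD lives, which checksum algorithm was used
 * for it, and the checksum of the child's contents. */
struct toy_child_ptr {
    uint64_t child_vdev;      /* location of the child block */
    uint64_t child_offset;
    uint8_t  child_cksum_alg; /* algorithm used to checksum the child */
    uint64_t child_cksum[4];  /* 256-bit checksum of the child's data */
};

/* The parent's payload is essentially an array of child pointers.  The
 * checksum covering THIS array is stored one level further up, in the
 * parent's own block pointer.  That higher-level checksum verified fine
 * here, which is why ZFS trusted a payload whose embedded child pointers
 * were already garbage when they were written. */
struct toy_indirect_block {
    struct toy_child_ptr children[128];
};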

@phik
Author

phik commented Jul 28, 2017

Ah ha! You've drawn the veil from my eyes. I'm now very slightly less ignorant. Thank you!

@sdm900

sdm900 commented Jul 29, 2017

Thanks for the insight. How do you want the pool? It's currently 3 x 10T raw disk images (dd of the raw device). I suspect only ~2.1T of each disk is really in use.

@sdm900

sdm900 commented Aug 1, 2017

I've just read this again and wondered: when this corruption is identified, why isn't the previous txg used, continually dropping back until a sane txg is found? Surely almost any issue should either cause the FS not to mount and allow the user to take alternative action, or drop back (as requested by the user) to a sane block?

@Lalufu
Contributor

Lalufu commented Aug 1, 2017

My guess would be that no one anticipated that this could happen, and the code to handle it isn't present.

@behlendorf behlendorf mentioned this issue Aug 1, 2017
@behlendorf
Contributor

@phik we think we've identified the root cause here and opened #6439 with a proposed fix for review. My suggestion for the moment is to set dnodesize=legacy on your MDS as a workaround until we finalize the fix and get it merged.

Continually dropping back until a sane txg is found

Yes, this is exactly how the zpool import -FX logic is implemented. The trouble here is that the block has a valid checksum, which is what ZFS uses to decide whether it should trust something read from disk. As mentioned above, we're looking into adding additional layers of sanity checking to more gracefully handle situations like this one, which shouldn't be possible unless there's a bug. And we think we found that bug.
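For the curious, the shape of the bug (a rough sketch with made-up helper names; #6439 has the real patch): the allocation retry path tested whether the dnode pointer was still non-NULL instead of checking the return code from the hold, and the pointer wasn't reset between retries, so a failed hold could be mistaken for a successful one and two allocators could end up with overlapping large dnodes:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for dnode_hold_impl(): returns 0 and sets *dnp on
 * success; returns an errno and leaves *dnp untouched on failure. */
static int toy_hold(int attempt, int **dnp)
{
    static int the_dnode = 42;

    if (attempt == 0)
        return (ENOENT); /* lost the race; note *dnp is not touched */
    *dnp = &the_dnode;
    return (0);
}

int main(void)
{
    int stale = 7;
    int *dn = &stale; /* left over from an earlier, released hold */

    /* Buggy shape: test the (possibly stale) pointer, not the return code. */
    for (int attempt = 0; attempt < 2; attempt++) {
        (void) toy_hold(attempt, &dn);
        if (dn != NULL) { /* stale pointer makes a failed hold look good */
            printf("buggy loop accepted attempt %d\n", attempt);
            break;
        }
    }

    /* Fixed shape: reset the pointer every retry and trust the return code. */
    for (int attempt = 0; attempt < 2; attempt++) {
        int *dn2 = NULL;
        if (toy_hold(attempt, &dn2) == 0) {
            printf("fixed loop accepted attempt %d\n", attempt);
            break;
        }
    }
    return (0);
}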

@behlendorf behlendorf added this to TODO in 0.7.1 Aug 1, 2017
@phik
Author

phik commented Aug 1, 2017

Innnteresting. I definitely don't understand the internals well enough to comment on the patch, but your description would certainly seem to fit the symptom.

Thanks very much for the prompt followup! I'll discuss your suggestion with the team ASAP.

@behlendorf behlendorf removed this from TODO in 0.7.1 Aug 5, 2017
@sdm900

sdm900 commented Aug 7, 2017

FWIW we just got this panic again on a 0.7.0-rc3 system

[375528.898326] PANIC: blkptr at ffff881785cfc440 has invalid CHECKSUM 0
[375528.906484] Showing stack for process 221
[375528.912237] CPU: 3 PID: 221 Comm: spl_system_task Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
[375528.924617] Hardware name: GIGABYTE D120-C21/MC20-S10, BIOS F02 05/19/2016
[375528.933127] 0000000000000003 000000003bbe38b8 ffff881fe100f3a0 ffffffff816862ac
[375528.942221] ffff881fe100f3b0 ffffffffa02be234 ffff881fe100f4d8 ffffffffa02be39c
[375528.951631] 0000000000100000 61207274706b6c62 3838666666662074 3463666335383731
[375528.960947] Call Trace:
[375528.965138] [] dump_stack+0x19/0x1b
[375528.971914] [] spl_dumpstack+0x44/0x50 [spl]
[375528.979440] [] vcmn_err+0x6c/0x110 [spl]
[375528.986601] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375528.995011] [] ? kmem_cache_alloc+0x193/0x1e0
[375529.002551] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375529.010875] [] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[375529.019175] [] zfs_panic_recover+0x69/0x90 [zfs]
[375529.026943] [] ? arc_read+0xa30/0xa30 [zfs]
[375529.034226] [] zfs_blkptr_verify+0x311/0x370 [zfs]
[375529.042129] [] zio_read+0x34/0xe0 [zfs]
[375529.049037] [] ? __kmalloc_node+0x1d1/0x2b0
[375529.056289] [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[375529.064032] [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[375529.072051] [] ? arc_read+0xa30/0xa30 [zfs]
[375529.079230] [] arc_read+0x5ff/0xa30 [zfs]
[375529.086195] [] traverse_prefetcher+0x14c/0x180 [zfs]
[375529.094123] [] traverse_visitbp+0x4e9/0x840 [zfs]
[375529.101733] [] traverse_dnode+0xb1/0x1f0 [zfs]
[375529.109096] [] traverse_visitbp+0x6bb/0x840 [zfs]
[375529.116641] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.124188] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.131715] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.139194] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.146601] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.154036] [] traverse_visitbp+0x2f5/0x840 [zfs]
[375529.161398] [] traverse_dnode+0xb1/0x1f0 [zfs]
[375529.168496] [] traverse_visitbp+0x785/0x840 [zfs]
[375529.175806] [] traverse_prefetch_thread+0x108/0x180 [zfs]
[375529.183814] [] ? prefetch_needed.isra.5+0x40/0x40 [zfs]
[375529.191620] [] taskq_thread+0x246/0x470 [spl]
[375529.198513] [] ? wake_up_state+0x20/0x20
[375529.204969] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[375529.212370] [] kthread+0xcf/0xe0
[375529.218115] [] ? kthread_create_on_node+0x140/0x140
[375529.225444] [] ret_from_fork+0x58/0x90
[375529.231689] [] ? kthread_create_on_node+0x140/0x140

the machine was still up and running... hopefully it comes back painlessly after a reboot...

@sdm900

sdm900 commented Aug 7, 2017

Rebooted just fine.

@behlendorf
Contributor

@sdm900 the exact issue reported here was introduced in 0.7.0-rc5 and will be fixed in 0.7.1.

@sdm900

sdm900 commented Aug 7, 2017

That concerns me a little.

tonyhutter pushed a commit that referenced this issue Aug 8, 2017
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only later when
  dnode_allocate() is called will it transition to the new
  type.  This means there's a window when allocating where
  it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
@behlendorf
Contributor

This issue was resolved in the v0.7.1 release tagged on August 8th.

@phik
Author

phik commented Aug 10, 2017

I have a few lingering questions, if I may...

  • Out of curiosity, how reliable is that test that you committed? Did that reproduce the problem 100% of the time on rc5?

  • Stu says that we had this issue again on an rc3 system (in his comment of Aug 6). How can that be, if the problem was introduced in rc5?

  • We still have the fact that zpool import -FX couldn't import this filesystem -- should this be filed separately?

  • There's also the (separate?) issue that zpool import -T couldn't do in finite time what the horrifying python script did very quickly. import -T seemed to be reading the entire filesystem at 2 MB/s. Should this be filed separately?

@behlendorf
Contributor

Out of curiosity, how reliable is that test that you committed?

It reproduced the issue 100% of the time in about 5 seconds. After applying the fix I left it running in a loop for about 8 hours without issue. We were able to reproduce the issue on one of our Lustre test systems although it took longer to reproduce, after applying the fix a week ago we haven't seen it since.

Stu says that we had this issue again on an rc3 system (in his comment of Aug 6). How can that be, if the problem was introduced in rc5?

I agree, that is very troubling. The PANIC only indicates that a damaged block pointer was encountered; it doesn't reveal how it ended up damaged. The issue fixed here was caused by two dnodes stomping on each other and is understood; unfortunately, the root cause of @sdm900's PANIC must be different.
@sdm900 if possible it would be great if you could scrub the pool and see if anything is reported.

We still have the fact that zpool import -FX couldn't import this filesystem -- should this be filed separately?

Yes. We've talked about doing this development work for quite a while but I don't think we've ever opened an issue for it. It's a fairly large development effort.

There's also the (separate?) issue that zpool import -T couldn't do in finite time what the horrifying python script did very quickly. import -T seemed to be reading the entire filesystem at 2 MB/s. Should this be filed separately?

Yes, it would be great to open an issue for that too. We're going to need to investigate where exactly zpool import -T ends up spending all that time.

@sdm900

sdm900 commented Aug 11, 2017

I'd love to scrub the file system (and I'll set it going)... but I don't think it will ever finish. It is on a production file system that is heavily used and we will no doubt lose power before it finishes :(

@sdm900

sdm900 commented Aug 14, 2017

As far as I can tell, zfs scrub is useless...

  scan: scrub in progress since Sat Aug 12 20:01:57 2017  
        540G scanned out of 6.32T at 6.26M/s, 269h20m to go
        0 repaired, 8.35% done

and this file system has gone quiet over the weekend. I expect the rate to drop to almost 0 when people are in the office.

@behlendorf
Contributor

@sdm900 for filesystems with lots of small files or heavy fragmentation I agree it can be unreasonably slow. There are patches in #6256 under development to improve the situation. But it sounds like you may want to just cancel, or pause, the scrub for now if it's not going to complete in a reasonable amount of time.

behlendorf pushed a commit that referenced this issue Sep 5, 2017
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396 
Closes #6522 
Closes #6414 
Closes #6564
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this issue Sep 6, 2017
@dioni21
Contributor

dioni21 commented Jul 17, 2018

@behlendorf Sorry to ask here, but...

Is there currently any tool to fix the invalid CHECKSUM panic?

I'm not sure, but the error could have been caused by something related to the kernel bug reported at #7723 (https://bugzilla.redhat.com/show_bug.cgi?id=1598462)

Sadly, the host panicked a few seconds after loading the zfs driver, so I could never even try to run zpool import -F. Since I had two disks, I isolated one for a later post-mortem and restarted the send/recv from scratch to a single disk. That's when I found the kernel bug mentioned above. 😭

FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019