
VERIFY(dn->dn_type != DMU_OT_NONE) failed #6522

Closed
AndCycle opened this issue Aug 16, 2017 · 3 comments
@AndCycle

System information

Type Version/Name
Distribution Name Gentoo
Linux Kernel 4.12.5-gentoo
Architecture x86_64
ZFS Version 0.7.1-r0-gentoo
SPL Version 0.7.1-r0-gentoo

Describe the problem you're observing

The system stopped working after this panic.

Describe how to reproduce the problem

It happened after running the new ZFS 0.7.1 on my little home server for about a day.

Include any warning/errors/backtraces from the system logs

[114432.726444] VERIFY(dn->dn_type != DMU_OT_NONE) failed
[114432.727750] PANIC at dbuf.c:2308:dbuf_create()
[114432.729099] Showing stack for process 28557
[114432.729102] CPU: 3 PID: 28557 Comm: mongod Tainted: P        W  O    4.12.5-gentoo #1
[114432.729103] Hardware name: Intel Corporation S1200RP/S1200RP, BIOS S1200RP.86B.03.03.0003.121820151104 12/18/2015
[114432.729103] Call Trace:
[114432.729110]  dump_stack+0x4d/0x67
[114432.729115]  spl_dumpstack+0x3d/0x40 [spl]
[114432.729117]  spl_panic+0xc3/0x110 [spl]
[114432.729120]  ? getrawmonotonic64+0x82/0xc0
[114432.729124]  ? mutex_lock+0xd/0x30
[114432.729155]  ? refcount_remove_many+0x1e5/0x2d0 [zfs]
[114432.729171]  ? refcount_remove+0x11/0x20 [zfs]
[114432.729185]  ? dbuf_rele_and_unlock+0x1bf/0x4d0 [zfs]
[114432.729198]  dbuf_create+0x68c/0x800 [zfs]
[114432.729211]  ? dbuf_rele+0x46/0x80 [zfs]
[114432.729226]  ? dnode_hold_impl+0x60a/0xb50 [zfs]
[114432.729239]  dbuf_create_bonus+0x39/0xa0 [zfs]
[114432.729253]  dmu_bonus_hold+0x16c/0x210 [zfs]
[114432.729271]  sa_buf_hold+0x9/0x10 [zfs]
[114432.729286]  zfs_zget+0x10e/0x2d0 [zfs]
[114432.729300]  ? zio_rewrite+0x2e/0x30 [zfs]
[114432.729314]  ? zil_lwb_write_init+0x220/0x220 [zfs]
[114432.729316]  ? spl_kmem_cache_free+0x153/0x270 [spl]
[114432.729332]  zfs_get_data+0x57/0x430 [zfs]
[114432.729346]  zil_commit.part.8+0x7d0/0xd50 [zfs]
[114432.729365]  ? rrw_enter_read_impl+0x125/0x220 [zfs]
[114432.729379]  zil_commit+0x12/0x20 [zfs]
[114432.729394]  zpl_writepages+0xd1/0x160 [zfs]
[114432.729397]  do_writepages+0x17/0x60
[114432.729399]  __filemap_fdatawrite_range+0xa5/0xe0
[114432.729402]  filemap_write_and_wait_range+0x3c/0x90
[114432.729426]  zpl_fsync+0x37/0x100 [zfs]
[114432.729428]  vfs_fsync_range+0x44/0xa0
[114432.729430]  ? find_vma+0x63/0x70
[114432.729432]  SyS_msync+0x178/0x1f0
[114432.729434]  entry_SYSCALL_64_fastpath+0x17/0x98
[114432.729435] RIP: 0033:0x7f958f6ee5ed
[114432.729436] RSP: 002b:00007f958d8ce200 EFLAGS: 00000293 ORIG_RAX: 000000000000001a
[114432.729438] RAX: ffffffffffffffda RBX: 000000082b317808 RCX: 00007f958f6ee5ed
[114432.729438] RDX: 0000000000000004 RSI: 0000000001000000 RDI: 00007f95768cc000
[114432.729439] RBP: 00007f958d8ce6d0 R08: 0000000000000000 R09: 0000000000000000
[114432.729440] R10: 0000000000000001 R11: 0000000000000293 R12: 000000082b317808
[114432.729441] R13: 00007f958d8ce4f0 R14: 00007f958d8ce4d0 R15: 000000082b317640
@AndCycle
Author

I have reverted back to 0.6.5.11 for a stable system,
so there is probably not much more I can help with.

@behlendorf
Contributor

Sorry you had to rollback. This is a duplicate of #5396 which has been hard to reproduce, but it looks like you found a way. Was there anything special about your workload?

@AndCycle
Author

AndCycle commented Aug 16, 2017

I am not sure whether I have any special workload;
it's a mixed bag, except that I don't use ZVOLs.
In general:

(Yes) Single Volume
(Yes) RAIDZ
(No) ZVOL
(Yes) L2ARC device
(No) ZIL device

(Yes) heavy metadata workload
(Maybe) weird database access pattern
(Yes) a little memory pressure

So I'll lay out my storage first:


zroot 444G 339G
Intel 730 SSD, attached to an LSI SAS HBA
single boot volume, housing the root filesystem and /var
used for DNS/routing/email/web/database,
pretty low utilization


zassist 222G 143G
single volume
Intel 240 SSD, attached to an LSI SAS HBA
Minecraft server, low usage, but heavy metadata due to web-based maps (tons of little chunked JPEGs)
CrashPlan backup client-side software, which is inefficient with its local database


ztank 32.5T 12.5T
6TBx6 raidz2, attached through the S1200RP onboard controller
L2ARC on Intel 240 SSD partition 2, attached to an LSI SAS HBA
used for /home and syncoid backups of zroot/zaside
large volume of personal data,
plus millions of small files under 10k from my little web-crawler cron job that scrapes Twitter for analysis
2 virtual machines under QEMU, accessed through QEMU raw images, no zvol
one running a pretty old FreeBSD, the other a test-bench Windows 10


zaside 10.9T 4.69T
4TBx3 raidz1
temporary data storage, mostly accessed through Samba


zmess 27.2T 16.4T
6TBx6 raidz2, attached through external USB3
L2ARC on Intel 240 SSD partition 3, attached to an LSI SAS HBA
shared storage through Samba


zarchive 29T 20.6T
4TBx8 raidz2, attached through external USB3
mostly inactive, used as an archive of filmed video data


I didn't attach the L2ARC previously because, as I recall, I had filed a bug report about leaking stats.
I thought I would attach it again on 0.7.1 for another go;
unfortunately I couldn't get any useful results.


Although the system has 32 GB of ECC memory, I do have memory pressure;
there is constantly some swap activity due to:

  1. nearly 16 GB used by the CrashPlan backup service,
  2. 6 GB by the virtual machines,
  3. BOINC (network computing), which dispatches random jobs that use all idle resources.

Swap resides on Intel 240 SSD partition 4, attached to an LSI SAS HBA.

The constant workload is mostly caused by the CrashPlan backup service,
which does a full filesystem walk for backup, scanning for differences, and it prefers the newest small files.
Other than that, it's hard to say there is anything special.

There is a daily syncoid backup,
and a daily cron job that updates the locate database, which walks through millions of files.

This crash happened on Aug 17 at 04:19:13.

My daily cron jobs are scheduled at 12:00 AM,
and the CrashPlan scan is scheduled at 3:00 AM.

So yes, I do suspect the workload caused by CrashPlan;
it does both the file scan and the local database reads and updates.

The database used by CrashPlan is a closed, proprietary format.
Some simple strace runs showed that its writes are not aligned at all, causing heavy write amplification.

Here is a simple IOPS history / disk I/O graph;
you can easily tell CrashPlan is doing something really inefficient with its database,
which is why I moved it to the SSD.

But it's hard to tell for sure, because CrashPlan also scans the full filesystem in the meantime.

Hope this info helps.

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 25, 2017
As part of commit 50c957f this check was pulled up before the
call to dnode_create().  This is racy since the dnode_phys_t
in the dbuf could be updated after the check passed but before
it's created by dnode_create().  Close the race by adding the
original check back to detect this unlikely case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#5396
Closes openzfs#6522
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 25, 2017
As part of commit 50c957f this check was pulled up before the
call to dnode_create().  This is racy since the dnode_phys_t
in the dbuf could be updated after the check passed but before
it's created by dnode_create().  Close the race by adding the
original check back to detect this unlikely case.

TEST_ZFSSTRESS_RUNTIME=7200

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#5396
Closes openzfs#6522
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 25, 2017
As part of commit 50c957f this check was pulled up before the
call to dnode_create().  This is racy since the dnode_phys_t
in the dbuf could be updated after the check passed but before
it's created by dnode_create().  Close the race by adding the
original check back to detect this unlikely case.

TEST_XFSTESTS_SKIP="yes"
TEST_ZFSSTRESS_RUNTIME=3000

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#5396
Closes openzfs#6522
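The commit message above describes a classic check-then-create race. As a rough illustration only (a minimal sketch with hypothetical names, not the actual OpenZFS code in dnode.c or dbuf.c), the racy pattern and the re-check that closes it look roughly like this:

```c
/*
 * Hedged sketch of the race described above.  All identifiers here
 * (slot_t, create_dnode, hold_racy, hold_fixed) are made-up
 * placeholders, not the real OpenZFS symbols.
 */
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define TYPE_NONE 0   /* stands in for DMU_OT_NONE */

typedef struct slot {
	pthread_mutex_t lock;
	int             phys_type;  /* stands in for the on-disk dnode type */
	void           *dnode;      /* in-memory object, created lazily */
} slot_t;

/* Stand-in for dnode_create(); the real function builds an in-memory dnode. */
static void *
create_dnode(slot_t *s)
{
	(void) s;
	return (malloc(1));
}

/* Racy pattern (pre-fix): the type is checked only *before* creation, so a
 * concurrent free that resets phys_type to TYPE_NONE in the window between
 * the check and the create goes unnoticed until a later VERIFY trips. */
int
hold_racy(slot_t *s)
{
	if (s->phys_type == TYPE_NONE)
		return (ENOENT);            /* check ... */

	pthread_mutex_lock(&s->lock);
	s->dnode = create_dnode(s);         /* ... then create */
	pthread_mutex_unlock(&s->lock);
	return (0);
}

/* Fixed pattern: repeat the check after the object has been created, so the
 * unlikely concurrent free is reported as ENOENT instead of panicking later. */
int
hold_fixed(slot_t *s)
{
	if (s->phys_type == TYPE_NONE)
		return (ENOENT);

	pthread_mutex_lock(&s->lock);
	s->dnode = create_dnode(s);
	if (s->phys_type == TYPE_NONE) {    /* the check the fix adds back */
		free(s->dnode);
		s->dnode = NULL;
		pthread_mutex_unlock(&s->lock);
		return (ENOENT);
	}
	pthread_mutex_unlock(&s->lock);
	return (0);
}
```

Re-checking after creation turns the rare lost race into a clean ENOENT for the caller instead of a VERIFY panic later in dbuf_create().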
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR openzfs#6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes openzfs#5396 
Closes openzfs#6522 
Closes openzfs#6414 
Closes openzfs#6564
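As a rough illustration of the per-CPU object-ID cursor mentioned in the first bullet of the commit message above (a minimal sketch with made-up names, not the actual dmu_object_alloc_dnsize() code), concurrent allocators mostly touch only their own CPU's cursor and take a global lock only when a chunk runs out:

```c
/*
 * Hedged sketch of a per-CPU allocation cursor; all identifiers are
 * hypothetical, not the real OpenZFS ones.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define NCPUS      64
#define CHUNK_SIZE 128          /* object IDs handed to one CPU at a time */

static struct cursor {
	pthread_mutex_t lock;       /* zero-initialized; fine for glibc mutexes */
	uint64_t        next_obj;   /* next free object ID in this CPU's chunk */
	uint64_t        chunk_end;  /* first ID past the current chunk */
} percpu_cursor[NCPUS];

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t global_next_chunk;

/*
 * Allocate an object ID.  Most calls only touch the calling CPU's cursor;
 * the global lock is taken only when that CPU's chunk is exhausted, which
 * is what keeps contention low for highly concurrent creates.
 */
uint64_t
alloc_object_id(void)
{
	int cpu = sched_getcpu();
	uint64_t obj;

	if (cpu < 0)
		cpu = 0;
	cpu %= NCPUS;

	pthread_mutex_lock(&percpu_cursor[cpu].lock);
	if (percpu_cursor[cpu].next_obj >= percpu_cursor[cpu].chunk_end) {
		/* Refill this CPU's chunk from the global cursor. */
		pthread_mutex_lock(&global_lock);
		percpu_cursor[cpu].next_obj  = global_next_chunk;
		percpu_cursor[cpu].chunk_end = global_next_chunk + CHUNK_SIZE;
		global_next_chunk += CHUNK_SIZE;
		pthread_mutex_unlock(&global_lock);
	}
	obj = percpu_cursor[cpu].next_obj++;
	pthread_mutex_unlock(&percpu_cursor[cpu].lock);
	return (obj);
}
```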