zfs receive -> unable to handle kernel NULL pointer dereference #1518

Closed
edillmann opened this issue Jun 14, 2013 · 21 comments

@edillmann
Contributor

From time to time I get hit by the following BUG:

The kernel is vanilla 3.8.13 with the latest zfs commit plus the following patches:
20ad617
9bdf8c5
9e300f3

[158179.598272] ---[ end trace 0a7e959cc3d5f499 ]---
[158194.436990] BUG: unable to handle kernel NULL pointer dereference at (null)
[158194.438576] IP: [] dmu_objset_pool+0x6/0x30 [zfs]
[158194.440137] PGD 159d01067 PUD 1a23ab067 PMD 0
[158194.441630] Oops: 0000 [#5] SMP
[158194.443068] Modules linked in: xt_nat(F) iptable_filter(F) ipt_MASQUERADE(F) iptable_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) ip_tables(F) x_tables(F) ip_gre(F) gre(F) bridge(F) stp(F) llc(F) microcode(F) aesni_intel(F) aes_x86_64(F) xts(F) lrw(F) gf128mul(F) ablk_helper(F) ghash_clmulni_intel(F) cryptd(F) i915(F) kvm_intel(F) kvm(F) video(F) drm_kms_helper(F) drm(F) lp(F) i2c_algo_bit(F) parport(F) psmouse(F) serio_raw(F) gpio_ich(F) lpc_ich(F) coretemp(F) zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate(F) raid10(F) raid456(F) async_raid6_recov(F) async_memcpy(F) async_pq(F) async_xor(F) xor(F) async_tx(F) ahci(F) libahci(F) e1000e(F) raid6_pq(F) raid1(F) raid0(F) multipath(F) linear(F) [last unloaded: ipmi_msghandler]
[158194.454385] CPU 0
[158194.454404] Pid: 87148, comm: zfs Tainted: PF D O 3.8.13-lxc #1 /DH67BL
[158194.457678] RIP: 0010:[] [] dmu_objset_pool+0x6/0x30 [zfs]
[158194.459374] RSP: 0018:ffff8801d4fdfb80 EFLAGS: 00010202
[158194.461028] RAX: ffff88017e459710 RBX: ffff8802029d0000 RCX: 00000000000000fe
[158194.462695] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[158194.464372] RBP: ffff8801d4fdfbd0 R08: 0000000000000000 R09: 0000000000000000
[158194.466057] R10: 0000000000000000 R11: ffffffffa00e683b R12: 0000000000000010
[158194.467733] R13: ffff880203ac2c00 R14: 0000000000000001 R15: ffff8802029d0400
[158194.469401] FS: 00007f3359da3b80(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
[158194.471078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[158194.472755] CR2: 0000000000000000 CR3: 0000000195439000 CR4: 00000000000407f0
[158194.474450] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[158194.476128] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[158194.477792] Process zfs (pid: 87148, threadinfo ffff8801d4fde000, task ffff88017e459710)
[158194.479473] Stack:
[158194.481138] ffff8801d4fdfbd0 ffffffffa0215aed ffffffffa01b5b81 ffff8802029d0000
[158194.482868] ffff8802029d0378 ffff8802029d0000 0000000000000010 ffff880203ac2c00
[158194.484612] ffff88017e459710 ffff8802029d0400 ffff8801d4fdfbf8 ffffffffa0216120
[158194.486351] Call Trace:
[158194.488093] [] ? zfs_sb_teardown+0x4d/0x370 [zfs]
[158194.489856] [] ? dsl_dataset_own+0x71/0x80 [zfs]
[158194.491646] [] zfs_umount+0x30/0xc0 [zfs]
[158194.493448] [] zfs_resume_fs+0xfe/0x2a0 [zfs]
[158194.495247] [] zfs_ioc_recv+0x9ee/0xce0 [zfs]
[158194.497044] [] ? dsl_prop_get_ds+0x16e/0x230 [zfs]
[158194.498819] [] ? __kmalloc+0x55/0x170
[158194.500598] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[158194.502389] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[158194.504168] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[158194.505942] [] ? dsl_dataset_drop_ref+0x12/0x20 [zfs]
[158194.507678] [] ? dsl_dataset_rele+0x42/0x120 [zfs]
[158194.509347] [] zfsdev_ioctl+0x493/0x4c0 [zfs]
[158194.510968] [] do_vfs_ioctl+0x99/0x560
[158194.512567] [] sys_ioctl+0x91/0xb0
[158194.514166] [] ? sys_write+0x52/0xa0
[158194.515794] [] system_call_fastpath+0x16/0x1b
[158194.517376] Code: f8 04 00 00 48 c7 c6 80 9c 23 a0 48 c7 c7 58 1f 24 a0 e8 ee 5c f4 ff e9 5c ff ff ff 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 <48> 8b 07 48 89 e5 48 85 c0 74 0f 48 8b 00 48 85 c0 74 07 48 8b
[158194.520965] RIP [] dmu_objset_pool+0x6/0x30 [zfs]
[158194.522694] RSP
[158194.524346] CR2: 0000000000000000

@dweeezil
Contributor

Any idea what the system was doing at the time? Got a reliable way to reproduce it? Unfortunately, dsl_dataset_rele is called in a ton of places within zfsdev_ioctl and the +0x493 offset isn't helping me find much of anything. The reason I'm curious is because I'm the one who ported the 9bdf8c5 commit you referenced above.

@edillmann
Contributor Author

The only purpose of this system is receiving zfs streams at regular intervals.
I rolled back the last applied patch (20ad617). So far (3 days), I have had no more problems.

@edillmann
Contributor Author

I spoke too soon :-(

[240595.185428] ---[ end trace 38f5ebf420aad262 ]---
[242963.738292] BUG: unable to handle kernel NULL pointer dereference at (null)
[242963.739891] IP: [] dmu_objset_pool+0x6/0x30 [zfs]
[242963.741452] PGD 1ddb15067 PUD 14cba6067 PMD 0
[242963.742944] Oops: 0000 [#6] SMP
[242963.744382] Modules linked in: xt_nat(F) iptable_filter(F) ipt_MASQUERADE(F) iptable_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) ip_tables(F) x_tables(F) ip_gre(F) gre(F) bridge(F) stp(F) llc(F) i915(F) video(F) drm_kms_helper(F) microcode(F) aesni_intel(F) drm(F) aes_x86_64(F) xts(F) lrw(F) gf128mul(F) psmouse(F) ablk_helper(F) ghash_clmulni_intel(F) cryptd(F) gpio_ich(F) i2c_algo_bit(F) serio_raw(F) lpc_ich(F) kvm_intel(F) kvm(F) coretemp(F) lp(F) parport(F) zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate(F) raid10(F) raid456(F) async_raid6_recov(F) async_memcpy(F) async_pq(F) async_xor(F) xor(F) async_tx(F) ahci(F) libahci(F) e1000e(F) raid6_pq(F) raid1(F) raid0(F) multipath(F) linear(F) [last unloaded: ipmi_msghandler]
[242963.755722] CPU 0
[242963.755739] Pid: 35948, comm: zfs Tainted: PF D O 3.8.13-lxc #1 /DH67BL
[242963.759009] RIP: 0010:[] [] dmu_objset_pool+0x6/0x30 [zfs]
[242963.760699] RSP: 0018:ffff8801427cbb80 EFLAGS: 00010202
[242963.762350] RAX: ffff880201354530 RBX: ffff880201e64000 RCX: 0000000000000046
[242963.764014] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[242963.765689] RBP: ffff8801427cbbd0 R08: 000000000000019e R09: 0000000000000000
[242963.767373] R10: 0000000000000000 R11: ffffffffa00e683b R12: 0000000000000010
[242963.769049] R13: ffff880201d50000 R14: 0000000000000001 R15: ffff880201e64400
[242963.770715] FS: 00007fb8153acb80(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
[242963.772390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[242963.774063] CR2: 0000000000000000 CR3: 0000000139647000 CR4: 00000000000407f0
[242963.775758] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[242963.777431] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[242963.779092] Process zfs (pid: 35948, threadinfo ffff8801427ca000, task ffff880201354530)
[242963.780786] Stack:
[242963.782449] ffff8801427cbbd0 ffffffffa0215aed ffffffffa01b5b81 ffff880201e64000
[242963.784173] ffff880201e64378 ffff880201e64000 0000000000000010 ffff880201d50000
[242963.785906] ffff880201354530 ffff880201e64400 ffff8801427cbbf8 ffffffffa0216120
[242963.787642] Call Trace:
[242963.789382] [] ? zfs_sb_teardown+0x4d/0x370 [zfs]
[242963.791146] [] ? dsl_dataset_own+0x71/0x80 [zfs]
[242963.792938] [] zfs_umount+0x30/0xc0 [zfs]
[242963.794741] [] zfs_resume_fs+0xfe/0x2a0 [zfs]
[242963.796535] [] zfs_ioc_recv+0x9ee/0xce0 [zfs]
[242963.798326] [] ? dsl_prop_get_ds+0x16e/0x230 [zfs]
[242963.800097] [] ? __kmalloc+0x55/0x170
[242963.801873] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[242963.803660] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[242963.805441] [] ? kmem_alloc_debug+0x198/0x3b0 [spl]
[242963.807211] [] ? dsl_dataset_drop_ref+0x12/0x20 [zfs]
[242963.808944] [] ? dsl_dataset_rele+0x42/0x120 [zfs]
[242963.810609] [] zfsdev_ioctl+0x493/0x4c0 [zfs]
[242963.812220] [] do_vfs_ioctl+0x99/0x560
[242963.813814] [] sys_ioctl+0x91/0xb0
[242963.815408] [] ? sys_write+0x52/0xa0
[242963.817013] [] system_call_fastpath+0x16/0x1b
[242963.818588] Code: f8 04 00 00 48 c7 c6 80 9c 23 a0 48 c7 c7 58 1f 24 a0 e8 ee 5c f4 ff e9 5c ff ff ff 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 <48> 8b 07 48 89 e5 48 85 c0 74 0f 48 8b 00 48 85 c0 74 07 48 8b
[242963.822058] RIP [] dmu_objset_pool+0x6/0x30 [zfs]
[242963.823773] RSP
[242963.825421] CR2: 0000000000000000
[242963.833237] ---[ end trace 38f5ebf420aad263 ]---

I will roll back 9bdf8c5 to see what happens...

@dweeezil
Contributor

@edillmann The stack trace above following the zfsdev_ioctl+0x493 looks suspiciously bogus to me. I've got to wonder if something in the zfs_ioc_recv call chain isn't smashing the stack. The 0x493 offset is, itself, suspiciously large compared to the size of the zfsdev_ioctl function. I'll check out all the lower function calls and see if there's any obvious stack reduction that needs to be done.

@dweeezil
Contributor

@edillmann I think the problem has to do with some conflicts relative to the Linux-ization introduced by 3558fd7. I'm going to investigate further.

@edillmann
Contributor Author

@dweeezil thank you for tracking that ...

@dweeezil
Contributor

@edillmann It appears this problem occurs during an "online" receive; one in which the file system on the receiving side is mounted. Does your receiving system keep its received file systems mounted? The code handling this immediately follows the comment "online recv" in zfs_ioctl.c. I still can't duplicate your problem; however, there's clearly something going wrong with the "wrap-up" code for online receives, in which the mounted file system (on the receiving side) is briefly unmounted and re-mounted in order to make it match the received file system. I think the call to zfs_resume_fs is hitting a NULL pointer and that's what's causing your panic.

Possibly related, I can trigger an ASSERT rather easily in my debug-enabled module when doing the equivalent rollback operation (which similarly requires the remounting operation).

I'm continuing to investigate.

@edillmann
Contributor Author

@dweeezil I confirm that the receiving filesystems are mounted.

@dweeezil
Contributor

It seems that issue #1536 is not related to this problem and now that I've got a small patch to deal with it, I can continue investigating this NULL dereference (which, so far, I've not been able to duplicate).

@dweeezil
Contributor

@edillmann I've tried to make sense out of this stack and as far as I can tell, the actual trace is:

dmu_objset_pool
zfs_sb_teardown
zfs_umount
zfs_resume_fs
zfs_ioc_recv

The problem occurs as part of the remounting in zfs_ioc_recv. The actual work is performed in zfs_resume_fs by code that was added in 7b3e34b. Apparently, prior to this commit, rollbacks on mounted file systems didn't work properly. Receives to a mounted file system (so-called "online recvs") likely shared a similar problem because they both rely on much of the same code.

After the stream is received during an online recv, I/O is suspended to the receiving file system while the DMU finishes dealing with the stream. Then the I/O is resumed. That's where your NULL dereference is occurring. For some reason, an error is occurring during the resuming phase that causes it to try to unmount the file system. The panic occurs during the unmounting.

I've got no idea why this is happening to you. The patch does, however, add a bunch of new pool logging that might be useful. You should grab the output of a "zpool history -i" on the receiving side. The logging in there might give us a clue as to what happened and/or how far it got.

I can also whip up a patch to log the cause of the failure that is triggering the umount operation. That would at least tell us what part is failing that presumably shouldn't be.
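
A rough sketch of what such a logging patch might look like (purely hypothetical; the error variable, dataset name, and message are placeholders, not code from any actual patch in this thread):

        if (err) {
                /*
                 * Record why the resume failed before the forced unmount
                 * so the failing step shows up in the kernel log.
                 */
                cmn_err(CE_WARN, "ZFS: resume of '%s' after recv failed "
                    "(error %d), forcing unmount", fsname, err);
                (void) zfs_umount(zsb->z_sb);
        }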

@behlendorf
Contributor

@dweeezil Regarding the stack above, you can usually ignore the lines which begin with ?'s. They are purely speculative and just indicate that the address for that particular symbol was found on the stack. They are likely return addresses from a call path recently walked and not yet overwritten. So the stack really looks like this:

[242963.738292] BUG: unable to handle kernel NULL pointer dereference at (null)
[242963.759009] RIP: 0010:[] [] dmu_objset_pool+0x6/0x30 [zfs]
[242963.789382] [] zfs_sb_teardown+0x4d/0x370 [zfs]
[242963.792938] [] zfs_umount+0x30/0xc0 [zfs]
[242963.794741] [] zfs_resume_fs+0xfe/0x2a0 [zfs]
[242963.796535] [] zfs_ioc_recv+0x9ee/0xce0 [zfs]
[242963.810609] [] zfsdev_ioctl+0x493/0x4c0 [zfs]
[242963.812220] [] do_vfs_ioctl+0x99/0x560
[242963.813814] [] sys_ioctl+0x91/0xb0

I also think there's a very good chance that zsb->z_os is NULL, which is why you're getting a NULL dereference in dmu_objset_pool. That can happen if there's a concurrent umount, as described in the comments slightly further into zfs_sb_teardown. I also notice there's no protection for this case around the following code chunk, which is Linux-specific.

        /*
         * Drain the iput_taskq to ensure all active references to the
         * zfs_sb_t have been handled only then can it be safely destroyed.
         */
        taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));

I suspect wrapping the above code in an if (zsb->z_os) { ... } check would resolve the issue. You may not have been able to trigger this in your testing because it requires a concurrent umount. These umounts may be happening on @edillmann's system if he's accessing the .zfs/snapshot directories, which will auto-unmount after being idle for 5 minutes.
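
As a minimal sketch of that suggestion (reusing the zfs_sb_teardown() snippet quoted above), the guarded version would look something like:

        if (zsb->z_os) {
                /*
                 * Only drain the iput_taskq while we still hold a valid
                 * objset; if z_os is already NULL there is nothing left
                 * to wait for.
                 */
                taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));
        }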

The first ASSERT() you mentioned is also somewhat troubling to me; I'd expect there to be a reference. However, I suppose since the umount isn't originating from the VFS, we have no reason to believe the inode cache has already been pruned. Normally the VFS will take care of this before calling zfs_umount().

@edillmann
Contributor Author

Hi,

For now I have unmounted the file systems (and set canmount=false) on the receiving side, and the problem has not occurred again.

To help with the analysis of this problem I ran zpool history -i.

The following pattern is repeating:

2013-06-12.09:27:17 zfs rollback -r bfs/vms/lxc@zo20130612013742
2013-06-12.09:27:17 [internal replay_inc_sync txg:1104307] dataset = 8358
2013-06-12.09:27:17 [internal snapshot txg:1104309] dataset = 8362
2013-06-12.09:27:17 [internal destroy txg:1104310] dataset = 8358
2013-06-12.09:27:17 [internal property set txg:1104310] reservation=0 dataset = 8358
2013-06-12.09:27:22 zfs receive -v bfs/vms/lxc
2013-06-12.09:27:23 [internal destroy txg:1104312] dataset = 13999
2013-06-12.09:27:24 zfs destroy bfs/vms/lxc@zo20130612001251
2013-06-12.09:27:24 [internal create txg:1104313] dataset = 8373
2013-06-12.09:27:24 [internal destroy txg:1104315] dataset = 8373
2013-06-12.09:27:24 [internal property set txg:1104315] reservation=0 dataset = 8373

Thank you for the time you are spending tracking this problem down.

Regards,
Eric

@dweeezil
Contributor

@edillmann Were you still getting the NULL dereference after rolling back 9bdf8c5? FYI, that commit is an older version of my pull request #1496 but the only difference is the commit log message and authorship (the code should be the same).

@dweeezil
Contributor

@behlendorf I find it somewhat coincidental that if dmu_objset_own() fails in zfs_resume_fs, it does zsb->z_os = NULL and then later calls zfs_umount(zsb->z_sb). Then zfs_umount calls zfs_sb_teardown with z_sb->s_fs_info, and that's what's passed to dmu_objset_pool as s_fs_info->z_os. Could the s_fs_info be pointing back to the original zsb in this case? If so, that's where the NULL z_os would be coming from.

I noticed that in both Illumos and FreeBSD, the umount call has a protective "if" statement:

                if (vn_vfswlock(zfsvfs->z_vfs->vfs_vnodecovered) == 0)
                        (void) dounmount(zfsvfs->z_vfs, MS_FORCE, CRED());

Do we need something similar? Maybe a comparison of zsb->z_sb->s_fs_info against zsb?
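
For illustration only, a hypothetical Linux-side analogue of that guard (the fix actually suggested below checks zsb->z_os instead) might look like:

        /*
         * Hypothetical sketch: only force the unmount if the superblock
         * still points back at this zfs_sb_t.
         */
        if (zsb->z_sb && zsb->z_sb->s_fs_info == zsb)
                (void) zfs_umount(zsb->z_sb);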

@edillmann
Contributor Author

@dweeezil I'm not getting the NULL dereference any more since I set canmount to false and unmounted the receiving side filesystems.

@behlendorf
Contributor

Could the s_fs_info be pointing back to the original zsb in this case? If so, that's where the NULL z_os would be coming from.

Yes, it absolutely does. By definition zsb->z_sb points to the VFS struct super_block and sb->s_fs_info points to the private zfs_sb_t. So you're right, that's likely where the NULL is coming from.

Adding a similar check around the zfs_umount() should be a pretty nice clean fix. I'd suggest something like this to stay as close to the FreeBSD/Illumos implementation as possible. I'd also suggest we add a similar check in zfs_sb_teardown around the taskq_wait. If there's no valid object set this can't happen anyway.

@dweeezil Can you put together a patch for this and @edillmann can you verify it solves the issue?

        if (err) {
                /*
                 * Since we couldn't reopen zfs_sb_t, or
                 * setup the sa framework force unmount this file system.
                 */
                if (zsb->z_os)
                        (void) zfs_umount(zsb->z_sb);
        }

@dweeezil
Contributor

@edillmann Could you please try dweeezil/zfs@e5a4d3c? It's a small patch which should protect against this type of NULL dereference in the few places it can occur. Of course, you'll have to re-mount your file systems on the receiving system while testing this. This should layer on top of the 9bdf8c5 you had been using, or you can use its replacement commit, dweeezil/zfs@8592f00, which differs only in the log message.

@behlendorf If this works, should I roll it into my pull request? It's not clear to me why my #1496 patch should have caused this problem, in which case this patch ought to be turned into a separate pull request based on master.

@behlendorf
Contributor

@dweeezil This does look like a long standing issue so I think a separate pull request is the most appropriate.

@dweeezil
Contributor

@behlendorf I just submitted this as pull request #1543 and backed it out of #1496.

@edillmann You can either grab the original patch I mentioned or dweeezil/zfs@47a432c, which is from the standalone pull request #1543.

Urgh... I hate typos in commit logs but I also hate re-writing commits. The log needs s/NULl/NULL/.

@edillmann
Contributor Author

@dweeezil thanks for this work; the version is deployed and I will wait several days before closing this issue.

@behlendorf
Contributor

This was merged long ago and since we haven't heard anything I'm closing this as fixed.
