Oops during zfs send / recv on 3.17.4-gentoo #2946

Closed
Space2Man opened this issue Dec 5, 2014 · 36 comments

Comments

@Space2Man

Hi,

while trying to migrate one datapool to a different pool via zfs send / receive I get the following Oops:

[17649.786174] Oops: 0000 [#1] SMP
[17649.786202] Modules linked in: uas w83627ehf hwmon_vid ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables bonding ftdi_sio stv6110x(O) lnbp21(O) zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) stv090x(O) x86_pkg_temp_thermal coretemp lpc_ich mfd_core ddbridge(O) cxd2099(O) dvb_core(O) fuse
[17649.786501] CPU: 1 PID: 8477 Comm: txg_sync Tainted: P O 3.17.4-gentoo #1
[17649.786543] Hardware name: /DQ77MK, BIOS MKQ7710H.86A.0064.2013.1003.1058 10/03/2013
[17649.786592] task: ffff88079cfe6d60 ti: ffff8807a1804000 task.ti: ffff8807a1804000
[17649.786659] RIP: 0010:[] [] zap_create_claim+0x4b/0x2d0 [zfs]
[17649.786728] RSP: 0018:ffff8807a1807b68 EFLAGS: 00010282
[17649.786762] RAX: 000000000000001d RBX: ffff8807de0ec800 RCX: 000000000000001e
[17649.786806] RDX: 000000000000001d RSI: 000000000000001c RDI: ffff8807de0ec800
[17649.786850] RBP: ffff8807a1807bd8 R08: 0000000000000000 R09: 0000000000000002
[17649.786894] R10: ffff8807a1807978 R11: ffffffffa00c6b71 R12: 000000000000001c
[17649.786937] R13: 000000000000001d R14: ffff88076474c000 R15: 0000000000000000
[17649.786981] FS: 0000000000000000(0000) GS:ffff88081e240000(0000) knlGS:0000000000000000
[17649.787031] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17649.787067] CR2: 0000000000000018 CR3: 0000000002014000 CR4: 00000000001427e0
[17649.787110] Stack:
[17649.787126] ffff8803145677a0 ffff8807a1807c00 ffff880110ff3140 0000000000000000
[17649.787179] ffff8807a1807be8 ffffffffa0162804 ffff8801ad946720 0000000000000001
[17649.787233] ffff8807a1807bd8 ffff88076474c000 0000000000000000 ffff880358fa5240
[17649.787287] Call Trace:
[17649.787317] [] ? dmu_bonus_hold+0xe4/0x960 [zfs]
[17649.787375] [] spa_feature_decr+0x4b/0xc0 [zfs]
[17649.787426] [] ? bptree_is_empty+0x82/0x90 [zfs]
[17649.787488] [] dsl_scan_sync+0x7ea/0xa10 [zfs]
[17649.787524] [] ? spl_kmem_cache_free+0x11f/0x430 [spl]
[17649.787572] [] spa_sync+0x4e7/0xb30 [zfs]
[17649.787616] [] ? autoremove_wake_function+0x11/0x40
[17649.787658] [] ? __wake_up_common+0x55/0x90
[17649.787698] [] ? ktime_get_ts64+0x49/0xf0
[17649.787760] [] txg_sync_start+0x6be/0x8c0 [zfs]
[17649.787812] [] ? txg_sync_start+0x3b0/0x8c0 [zfs]
[17649.787861] [] spl_kmem_fini+0xa3/0xc0 [spl]
[17649.787900] [] ? spl_kmem_fini+0x30/0xc0 [spl]
[17649.787940] [] kthread+0xc4/0xe0
[17649.787973] [] ? kthread_worker_fn+0x100/0x100
[17649.788013] [] ret_from_fork+0x7c/0xb0
[17649.788049] [] ? kthread_worker_fn+0x100/0x100
[17649.788086] Code: 55 48 89 d0 48 89 e5 48 83 ec 70 48 89 5d d8 48 89 fb 4c 89 65 e0 49 89 f4 4c 89 6d e8 49 89 d5 4c 89 7d f8 4d 89 c7 4c 89 75 f0 <41> 8b 70 18 48 89 4d b8 b9 08 00 00 00 49 8b 50 08 44 89 4d ac
[17649.788334] RIP [] zap_create_claim+0x4b/0x2d0 [zfs]
[17649.788388] RSP
[17649.788411] CR2: 0000000000000018
[17649.799934] ---[ end trace e90bc72ab65cd506 ]---

The following versions are used:

sys-fs/zfs-0.6.3-r2
sys-fs/zfs-kmod-0.6.3-r1
sys-kernel/spl-0.6.3-r1

Kernel version is

Linux playstation 3.17.4-gentoo #1 SMP Tue Dec 2 14:05:11 CET 2014 x86_64 Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux

Is there any further information I can provide?

Thanks and best regards,

Jochen

@Space2Man
Author

I tested again with a different kernel (3.15.10-gentoo), where the zfs-kmod did not contain the patches from zfs-kmod-0.6.3-r1.ebuild. There, the same action works. So I guess the cause is one of the zfs kernel module patches from 0.6.3-r1.ebuild. The userland was not changed between tests.

Best regards,

Jochen

@markovendelin

I am on Gentoo as well and I see something similar during boot:

[ 375.282116] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 375.282370] IP: [] zap_create_claim+0x59/0x2d0 [zfs]
[ 375.282560] PGD 0
[ 375.282720] Oops: 0000 [#1] SMP
[ 375.282956] Modules linked in: bonding x86_pkg_temp_thermal xhci_hcd zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) acpi_cpufreq video thermal processor fan backlight ipmi_si thermal_sys ipmi_msghandler hwmon battery button
[ 375.284686] CPU: 2 PID: 2744 Comm: txg_sync Tainted: P O 3.16.5-gentoo #1
[ 375.284788] Hardware name: Supermicro X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F, BIOS 1.1 07/19/2013
[ 375.294009] task: ffff8807ee3848e0 ti: ffff8807e4a50000 task.ti: ffff8807e4a50000
[ 375.294111] RIP: 0010:[] [] zap_create_claim+0x59/0x2d0 [zfs]
[ 375.294314] RSP: 0018:ffff8807e4a53bb8 EFLAGS: 00010296
[ 375.294411] RAX: 000000000000001d RBX: ffff8807fb6f1000 RCX: 0000000000000008
[ 375.294509] RDX: 000000000000001d RSI: 000000000000001c RDI: ffff8807fb6f1000
[ 375.294610] RBP: 000000000000001c R08: 0000000000000000 R09: 0000000000000002
[ 375.294708] R10: 0000000000000000 R11: 0000000100001d5c R12: 000000000000001d
[ 375.294807] R13: 000000000000001e R14: 0000000000000000 R15: 0000000000000000
[ 375.294908] FS: 0000000000000000(0000) GS:ffff88081fc80000(0000) knlGS:0000000000000000
[ 375.295013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 375.295112] CR2: 0000000000000018 CR3: 0000000001811000 CR4: 00000000001407e0
[ 375.295214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 375.295318] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 375.295419] Stack:
[ 375.295505] 0000000000000202 ffffffffa031362e ffff8807e3cb3108 0000000000000001
[ 375.295887] ffff8807ee3848e0 ffff8807e4fb5410 ffff8807e3cb3108 ffff8800d7edc000
[ 375.296270] 0000000000000000 ffff8807ecc6f280 ffff8800d7edc000 0000000000000000
[ 375.296655] Call Trace:
[ 375.296750] [] ? dmu_bonus_hold+0xfe/0x300 [zfs]
[ 375.296861] [] ? spa_feature_decr+0x4a/0xd0 [zfs]
[ 375.296972] [] ? dsl_scan_sync+0x82a/0xa30 [zfs]
[ 375.297081] [] ? spl_kmem_cache_free+0x11f/0x170 [spl]
[ 375.297187] [] ? spa_sync+0x4e2/0xb20 [zfs]
[ 375.297293] [] ? ktime_get_ts+0x3d/0xe0
[ 375.297395] [] ? txg_init+0x57a/0xa00 [zfs]
[ 375.297505] [] ? txg_init+0x250/0xa00 [zfs]
[ 375.297603] [] ? __thread_create+0x1d5/0x1f0 [spl]
[ 375.297702] [] ? __thread_create+0x160/0x1f0 [spl]
[ 375.297805] [] ? kthread+0xbc/0xe0
[ 375.297901] [] ? __kthread_parkme+0x70/0x70
[ 375.298001] [] ? ret_from_fork+0x7c/0xb0
[ 375.298100] [] ? __kthread_parkme+0x70/0x70
[ 375.298200] Code: fb 48 89 6c 24 40 48 89 f5 4c 89 64 24 48 49 89 d4 4c 89 6c 24 50 49 89 cd b9 08 00 00 00 4c 89 7c 24 60 4d 89 c7 4c 89 74 24 58 <41> 8b 70 18 44 89 4c 24 14 4c 8d 4c 24 28 49 8b 50 08 41 b8 01
[ 375.302980] RIP [] zap_create_claim+0x59/0x2d0 [zfs]
[ 375.303161] RSP
[ 375.303253] CR2: 0000000000000018
[ 375.303348] ---[ end trace 05a9ff27e7f91ba8 ]---

Packages are the same as above:

sys-fs/zfs-0.6.3-r2
sys-fs/zfs-kmod-0.6.3-r1
sys-kernel/spl-0.6.3-r1

Kernel is a different one:

Linux cens-backup 3.16.5-gentoo #1 SMP Tue Nov 11 09:20:39 EET 2014 x86_64 Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz GenuineIntel GNU/Linux

Best wishes,

Marko

@markovendelin

And I can confirm that by masking

/etc/portage/package.mask
=sys-fs/zfs-kmod-0.6.3-r1
=sys-kernel/spl-0.6.3-r1

I can revert to the system that boots without Oops. Installed packages are

sys-fs/zfs-0.6.3-r2
sys-fs/zfs-kmod-0.6.3
sys-kernel/spl-0.6.3

Marko

@dharrop

dharrop commented Dec 9, 2014

Similar Oops here as well after upgrading to:

sys-fs/zfs-kmod-0.6.3-r1
sys-kernel/spl-0.6.3-r1

Version 0.6.3 with additional repository patches applied.
Patch URL: http://dev.gentoo.org/~ryao/dist/zfs-0.6.3-patches-r1.tar.xz

The issue manifested after issuing zfs destroy. The system didn't halt, but it degraded over time until a reboot was required. I couldn't restore the pool until I reverted the kernel modules like Marko.

Linux nas1 3.16.5-gentoo #1 SMP Mon Dec 8 15:39:19 MST 2014 x86_64 Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz GenuineIntel GNU/Linux

gcc 4.8.3

[285511.517924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[285511.517955] IP: [] feature_do_action+0x23/0x2b0 [zfs]
[285511.517989] PGD 74ad3067 PUD 74ad6067 PMD 0
[285511.518005] Oops: 0000 [#1] SMP
[285511.518017] Modules linked in: xt_nat nfsd bonding iptable_nat nf_nat_ipv4 nf_nat vboxnetflt(O) vboxnetadp(O) vboxdrv(O) e1000e x86_pkg_temp_thermal coretemp crc32c_intel zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) sha256_generic fuse znvpair(PO) spl(O) multipath linear raid0 dm_raid raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid10 md_mod dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod
[285511.518172] CPU: 3 PID: 1621 Comm: txg_sync Tainted: P O 3.16.5-gentoo #2
[285511.518195] Hardware name: ASUS All Series/H87I-PLUS, BIOS 0306 04/15/2013
[285511.518216] task: ffff8804067a2c10 ti: ffff8800c63b4000 task.ti: ffff8800c63b4000
[285511.518238] RIP: 0010:[] [] feature_do_action+0x23/0x2b0 [zfs]
[285511.518273] RSP: 0018:ffff8800c63b7c28 EFLAGS: 00010286
[285511.518289] RAX: 000000000000001d RBX: ffff880409d25800 RCX: 000000000000001e
[285511.518310] RDX: 000000000000001d RSI: 000000000000001c RDI: ffff880409d25800
[285511.518331] RBP: ffff8800c63b7c80 R08: 0000000000000000 R09: 0000000000000002
[285511.518352] R10: ffff8800c63b7a90 R11: 0000000000000000 R12: 000000000000001c
[285511.518372] R13: 000000000000001d R14: 0000000000000002 R15: 0000000000000000
[285511.518393] FS: 0000000000000000(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
[285511.518417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[285511.518434] CR2: 0000000000000018 CR3: 0000000304b5d000 CR4: 00000000001427e0
[285511.518455] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[285511.518476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[285511.518496] Stack:
[285511.518503] ffff8803f8768b70 0000000000000000 0000000000000202 ffff8800c63b7c88
[285511.518528] ffffffffa01b419f ffffffffa01aaf69 ffff88040cc4e000 0000000000000000
[285511.518552] ffff88001f46b540 ffff88040cc4e000 0000000000000000 ffff8800c63b7cb0
[285511.518577] Call Trace:
[285511.518592] [] ? dmu_bonus_hold+0xcf/0x2d0 [zfs]
[285511.518616] [] ? dbuf_rele_and_unlock+0x179/0x250 [zfs]
[285511.518643] [] spa_feature_decr+0x44/0xb0 [zfs]
[285511.518670] [] dsl_scan_sync+0x7ba/0x9d0 [zfs]
[285511.518690] [] ? spl_kmem_cache_free+0x10f/0x170 [spl]
[285511.518716] [] ? zio_wait+0x12d/0x1c0 [zfs]
[285511.518742] [] spa_sync+0x4a2/0xae0 [zfs]
[285511.518761] [] ? autoremove_wake_function+0xd/0x30
[285511.518781] [] ? ktime_get_ts+0x43/0xe0
[285511.518807] [] txg_sync_thread+0x2e2/0x4d0 [zfs]
[285511.518833] [] ? txg_delay+0xf0/0xf0 [zfs]
[285511.518852] [] thread_generic_wrapper+0x6c/0x80 [spl]
[285511.518873] [] ? __thread_exit+0x20/0x20 [spl]
[285511.518893] [] kthread+0xc4/0xe0
[285511.518909] [] ? kthread_create_on_node+0x170/0x170
[285511.518929] [] ret_from_fork+0x7c/0xb0
[285511.518946] [] ? kthread_create_on_node+0x170/0x170
[285511.518965] Code: 00 00 00 00 00 0f 1f 00 55 48 89 d0 48 89 e5 41 57 4d 89 c7 41 56 45 89 ce 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <41> 8b 70 18 49 8b 50 08 44 89 4d c8 4c 8d 4d d0 48 89 4d c0 41
[285511.519072] RIP [] feature_do_action+0x23/0x2b0 [zfs]
[285511.519101] RSP
[285511.519111] CR2: 0000000000000018
[285511.528638] ---[ end trace c452a74fac0a0def ]---

Let me know if I can provide any further info.

Don

@alexanderhaensch

Same error here with 3.14.25-hardened-r1 and 0.6.3-r1 while recursively deleting a tree of filesystems.

I guess one of the backported patches is doing harm? I will retest with 0.6.3 and the same kernel.

https://bugs.gentoo.org/show_bug.cgi?id=532248

@ryao
Contributor

ryao commented Dec 19, 2014

I was travelling when this issue was filed. These reports suggest that a regression might have occurred when I did my backports. It is possible that head is also affected given that the backports were strictly changes from head. I will look into this on the weekend.

@ryao
Contributor

ryao commented Dec 19, 2014

Anyone affected by this should downgrade to sys-kernel/spl-0.6.3, sys-fs/zfs-kmod-0.6.3 and sys-fs/zfs-0.6.3 while I work on a fix. My apologies for the inconvenience.

@alexanderhaensch

In my case this is only happening on a dual CPU machine. On my hexacore AMD single CPU system it is not happening. The workload is surely smaller on the second system.

@devsk

devsk commented Feb 1, 2015

Any updates on this issue? This is a blocker for us to move forward to kernels > 3.17. 3.18.x is already 5 releases old and I want to skip now EOL'ed 3.17.x series and move to 3.18.x, but I can't because of this bug and compilation errors on 3.18.x. The kernel series that I am on right now (3.16.x) is also EOLed upstream at 3.16.7. So, I need to move on. Please help.

@behlendorf
Contributor

@devsk can you verify whether this is a regression in the ZoL master branch or just in the Gentoo packages?

@alexanderhaensch

Installed the current masters:
repository: https://github.com/zfsonlinux/spl.git
at the commit: 086476f
repository: https://github.com/zfsonlinux/zfs.git
at the commit: 33b4de5

and running some tests...... hope it works

@devsk

devsk commented Feb 6, 2015

Brian: do you suggest I update my setup to master? Is it reasonably stable? 0.6.3 is super stable for me and the only reason I have to move is to get current with supported kernel. I know Turbo argues for master being more stable than 0.6.3 but I want to hear it from you...:)

Also, I know there is no guarantee of stability with any piece of software. It changes from install to install. So maybe this is all moot and I should just bite the bullet on master.

As for your question, I only wanted to stay 0.6.3 and yes, there, the 0.6.3-r1 gentoo release with extra patches on top of 0.6.3 was unstable.

@FransUrbo
Contributor

Turbo argues for master being more stable than 0.6.3

I NEVER SAID THAT!! And the fact that you thought I did just proves people don't read AND understand...

@alexanderhaensch

I can confirm that the master is not stable on my system. After 10 hours of heavy reading, the system became "slow". The load went through the roof over the 60s, and services started to become unresponsive.
Going back to stable..
I think if there is a 'beta' or release candidate, I will try again :)

@FransUrbo
Contributor

I can confirm that the master is not stable on my system.

Going back to stable..

I think if there will be a 'beta' or release candidate, i will try again :)

If Brian releases today, that version won't work for you either… Because you've already tried it (basically 'master' will be exactly like the 'released'). Or 'stable' as YOU call it.

And this is why I really, REALLY don't like to use the word 'stable' and why it should NEVER be used (unless it's actually PROVEN to be stable over a course of several months)!!

@alexanderhaensch

I fully agree. Nothing is really stable, but some versions of software are
better tested than others.
0.6.3 worksforme :)

@FransUrbo
Contributor

I fully agree. Nothing is really stable, but some versions of software are
better tested than others.

Not at the time of release. 0.6.3 is just more tested NOW because more people
use it (and it's been frozen for over seven months).

But if you want 0.6.4 to work for you (and others), you should really do some more
testing with it (possibly with extra PRs if there are any that might help). Otherwise, you
will have to wait for 0.6.5...

@devsk

devsk commented Feb 8, 2015

@alexanderhaensch: did you grab stacks of all the tasks when that happened? Maybe you ran into a known issue (like 3050).

I am reluctant to try master right now because I am sort of very tight on time on my work side and haven't got the cycles to troubleshoot and debug. I feel guilty about it but there are only 24 hours in a day....:(

@alexanderhaensch

Unfortunately I didn't catch any stacks.
I hope I have the time to set up a test system.
Is there a guide about what information is the most helpful?

@behlendorf
Contributor

@devsk my suggestion would be to stick with the released version for your distribution for your normal usage, particularly if it's stable for your workload. What I was trying to determine by having you try the master branch is if this is a Gentoo specific regression or something which is also in the master branch.

Normally I'll assume that it's in master too until shown otherwise. But if I'm reading the comment thread correctly, everyone who has reported this issue is running Gentoo, and that's a little suspicious. So my question is, has anyone observed this on a non-Gentoo platform?

@ivecera

ivecera commented Feb 16, 2015

The problem seems to be in the patch 0021-Illumos-4390-I-O-errors-can-corrupt-space-map-when-d.patch from the sys-fs/zfs-kmod-0.6.3-r1 package.
The corresponding ZoL upstream commit uses the new-style spa_feature_* API, but 0.6.3 provides only the old one. @ryao backported this commit and converted the new API usage to the old one, but missed one call to spa_feature_decr in dsl_scan.c.
There the function is called as spa_feature_decr(spa, SPA_FEATURE_ASYNC_DESTROY, tx); but the old API declares it as void spa_feature_decr(spa_t *spa, zfeature_info_t *feature, dmu_tx_t *tx);.
SPA_FEATURE_ASYNC_DESTROY is an enum constant with value 0, so the feature parameter ends up as a NULL pointer and is later dereferenced.

The following patch should fix this problem in sys-fs/zfs-kmod-0.6.3-r1:

diff --git a/module/zfs/dsl_scan.c b/module/zfs/dsl_scan.c
index ec1378a..f70577d 100644
--- a/module/zfs/dsl_scan.c
+++ b/module/zfs/dsl_scan.c
@@ -1499,10 +1499,11 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
                }

                if (bptree_is_empty(dp->dp_meta_objset, dp->dp_bptree_obj)) {
+                       zfeature_info_t *feat = &spa_feature_table
+                           [SPA_FEATURE_ASYNC_DESTROY];
                        /* finished; deactivate async destroy feature */
-                       spa_feature_decr(spa, SPA_FEATURE_ASYNC_DESTROY, tx);
-                       ASSERT(!spa_feature_is_active(spa,
-                           &spa_feature_table[SPA_FEATURE_ASYNC_DESTROY]));
+                       spa_feature_decr(spa, feat, tx);
+                       ASSERT(!spa_feature_is_active(spa, feat));
                        VERIFY0(zap_remove(dp->dp_meta_objset,
                            DMU_POOL_DIRECTORY_OBJECT,
                            DMU_POOL_BPTREE_OBJ, tx));


@behlendorf
Contributor

@ivecera Thanks for getting to the bottom of this. @ryao since this is Gentoo specific, could you comment on it?

@ari

ari commented Mar 20, 2015

@ryao are you still working on an update to the Gentoo packaging to avoid this (pretty serious) bug?

@ryao
Contributor

ryao commented Apr 1, 2015

@ivecera Thanks. I will correct this in the next ebuild revision.

@alexanderhaensch

@ryao I can report that the 9999 ebuild has been working for 2 weeks now. A new ebuild would be great!

@ari

ari commented Apr 27, 2015

@ryao Likewise, I was about to move from Gentoo to ZFS on a better supported Linux, but the 9999 has worked well. I think it does ZFS a disservice on Gentoo to have a fundamental part of it broken for such a long time. Perhaps all users should be advised that only 9999 is usable at this time for recent kernels?

@alexanderhaensch

@ari what distro is better supported? I was using 2 machines with Ubuntu, but after some problems with recent updates, I am testing FreeBSD now :(

@ari

ari commented Apr 27, 2015

Well, actually I'm a veteran (15 years and more) FreeBSD user. I have exactly one Linux box in all my server farms and I picked Gentoo since it looked a little bit like FreeBSD :-) All this stuff "just works" on FreeBSD, partially because the ZFS implementation there is much older and more tested. It is also part of the core OS distribution, so it gets more eyes on it.

I'm still struggling to get Gentoo booting from ZFS, but I hope to get there eventually so that I can create and destroy ZFS pools with version 9999.

@alexanderhaensch

As a 'veteran' Gentoo user (since 2.6.3, so 11 years), I would not encourage you to use an out-of-kernel filesystem for your root/boot. The main issue is the licensing, as you might know.
The genkernel infrastructure is much too GNU-like, and everything that is not GNU is not really wanted. I never found a smooth way to reinstall the out-of-kernel modules.

@FransUrbo
Contributor

I'm still struggling to get Gentoo booting from ZFS, but I hope to get there eventually so that I can create and destroy ZFS pools with version 9999.

"I don't want to brag" (well, I guess I do after all :), but on Debian GNU/Linux Wheezy, this has "just worked" for quite some time. WITH a dedicated, proper installer with ZFS support.

And then, if Wheezy feels a little old, one could quite easily upgrade to Jessie, which was released yesterday…

@lkateley

If you are running on FreeBSD, I would recommend posting to the freebsd-fs mailing list.

Just a comment, but I still haven't been able to install ZoL on CentOS after days of trying. Piece of cake on Debian/Ubuntu. On FreeBSD or Omni, it's just there.

@alexanderhaensch

As the discussion is fully off topic now, I would encourage @ari to use the master branch. It works well at the moment.

@behlendorf
Contributor

In fact, let me close this issue out entirely. This issue never impacted the upstream ZFS on Linux source code from Github. It was a Gentoo specific issue which can be tracked downstream with Gentoo. It will be resolved when the Gentoo repositories are updated.

To avoid this sort of issue in the future I strongly suggest that the version of ZoL in Gentoo track the official upstream zfs-x.y.z-release branch. The sole intention of this branch will be to provide a current version of ZFS on Linux with only build fixes for new kernels, critical security fixes, and stability fixes.

@ari

ari commented Apr 27, 2015

Thanks everyone for your help. I'd just like to confirm that version 9999 (which is hard masked) on Gentoo tracking the master branch, works perfectly for my use case. In particular I can now destroy pools.

I advise all gentoo users finding this thread to try that ebuild. It would be nice if Gentoo tracked the stable tags rather than master, but this will do for now.

Thanks
