SPL PANIC when creating a pool on top of a Ceph RBD #241
Comments
hvenzke
commented
May 28, 2013
|
use the real physical name /dev/rbd1 |
tdb
commented
May 28, 2013
|
It makes no difference I'm afraid. The panic is identical. |
hvenzke
commented
May 28, 2013
|
Well, then the bug is basically in Ceph RBD's logic, as that provides the storage. ZFS on Linux is known to work fine with native DRBD. Ceph RBD's snapshot features are overkill anyway, as ZFS does that itself. Can you try making a GFS cluster or a Lustre filesystem on it? |
tdb
commented
May 29, 2013
|
Ceph RBD works fine with other file systems for me, and ZFS works fine with other underlying storage. So it's hard to be precise about where the problem lies. In any case, ZFS shouldn't panic, surely? That's a bug. Ceph provides a distributed file system which is why I want to use it. ZFS also has some great features for managing multiple file systems within a single pool including snapshots. |
|
@tdb You're hitting a VERIFY in the code while attempting to sync out the history buffer to disk. For some reason the buffer lengths aren't being correctly updated. Since this only happens on top of a Ceph RBD I suspect their block device is behaving slightly differently than the rest of the Linux block drivers. For the purposes of a test you could try commenting out the VERIFY like this, although my suspicion is that you'll quickly hit another issue. However, that failure may shed some more light on exactly what's going wrong.
diff --git a/module/zfs/spa_history.c b/module/zfs/spa_history.c
index 9fb75f3..2d45266 100644
--- a/module/zfs/spa_history.c
+++ b/module/zfs/spa_history.c
@@ -272,8 +272,8 @@ spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
NV_ENCODE_XDR, KM_PUSHPAGE) == 0);
mutex_enter(&spa->spa_history_lock);
- if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
- VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);
+// if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
+// VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);
/* write out the packed length as little endian */
le_len = LE_64((uint64_t)reclen);
Related to this: most people usually think about putting Ceph on top of ZFS, not vice-versa. That configuration was recently fixed in master, so you might try it. It won't get you features like distributed snapshots, but it will bring many of ZFS's other benefits. |
tdb
commented
Jun 7, 2013
|
@behlendorf Thanks for the reply. I made the change suggested (against 0.6.1) and saw the following:
# zpool create pool1 /dev/rbd1
cannot open 'pool1': dataset does not exist
So that's the same as before. Checking zpool status afterwards showed a good pool, but zfs list didn't show any filesystems. No panic though. Then I tried to repeat it. This time I got a panic after creating the pool, and zpool status hung. The panic was:
[ 183.924160] divide error: 0000 [#1] SMP
[ 183.924349] Modules linked in: coretemp(F) microcode(F) psmouse(F) ppdev(F) vmw_balloon(F) serio_raw(F) i2c_piix4(F) vmwgfx(F) mac_hid(F) ttm(F) shpchp(F) drm(F) parport_pc(F) rbd(F) libceph(F) lp(F) parport(F) zfs(POF) zcommon(POF) znvpair(POF) zavl(POF) zunicode(POF) spl(OF) floppy(F) e1000(F) mptspi(F) mptscsih(F) mptbase(F) btrfs(F) zlib_deflate(F) libcrc32c(F)
[ 183.926033] CPU 0
[ 183.926100] Pid: 2019, comm: txg_sync Tainted: PF O 3.8.0-23-generic #34~precise1-Ubuntu VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[ 183.926385] RIP: 0010:[] [] spa_history_write+0x82/0x1d0 [zfs]
[ 183.926631] RSP: 0018:ffff88003c549ab8 EFLAGS: 00010246
[ 183.926742] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 183.926878] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000000
[ 183.927015] RBP: ffff88003c549b28 R08: ffff88003cfb4b40 R09: 0000000000000003
[ 183.927151] R10: ffff880037062303 R11: 316462722f766564 R12: ffff88003c496600
[ 183.927287] R13: ffff88003be36000 R14: ffff88003cf9a000 R15: 0000000000000008
[ 183.927424] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 183.927574] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 183.927690] CR2: 00007f3b12ef0000 CR3: 000000003b141000 CR4: 00000000000007f0
[ 183.927924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 183.928132] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 183.928312] Process txg_sync (pid: 2019, threadinfo ffff88003c548000, task ffff88003bc8ae80)
[ 183.928535] Stack:
[ 183.928633] 0000000000000002 ffffffffa01e3360 ffff88003cfb4b40 ffff88003c549ba0
[ 183.929007] ffff88003cf9a000 0000000000000008 ffff88003be36000 0000000068163d54
[ 183.929382] ffff88003b8a2cc0 ffff88003b8a2cc0 ffff88003be36000 ffff88003cfb4b40
[ 183.929757] Call Trace:
[ 183.929903] [] spa_history_log_sync+0x221/0x610 [zfs]
[ 183.930106] [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[ 183.930312] [] dsl_pool_sync+0x41b/0x530 [zfs]
[ 183.930507] [] spa_sync+0x3a8/0xa50 [zfs]
[ 183.930667] [] ? ktime_get_ts+0x4c/0xe0
[ 183.930852] [] txg_sync_thread+0x2df/0x540 [zfs]
[ 183.931049] [] ? txg_init+0x250/0x250 [zfs]
[ 183.931219] [] thread_generic_wrapper+0x78/0x90 [spl]
[ 183.931397] [] ? __thread_create+0x310/0x310 [spl]
[ 183.931568] [] kthread+0xc0/0xd0
[ 183.936038] [] ? flush_kthread_worker+0xb0/0xb0
[ 183.936149] [] ret_from_fork+0x7c/0xb0
[ 183.936251] [] ? flush_kthread_worker+0xb0/0xb0
[ 183.936360] Code: 55 b0 48 89 fa 48 29 f2 48 01 c2 48 39 55 b8 0f 82 bc 00 00 00 4c 8b 75 b0 41 bf 08 00 00 00 48 29 c8 31 d2 49 8b b5 70 08 00 00 <48> f7 f7 4c 8d 45 c0 4c 89 f7 48 01 ca 48 29 d3 48 83 fb 08 49
[ 183.938433] RIP [] spa_history_write+0x82/0x1d0 [zfs]
[ 183.938599] RSP
[ 183.938710] ---[ end trace f7a46262c37aea79 ]---
If I had a more concrete idea of what was happening I'd be happy to file a bug with Ceph. |
|
Divide by zero, now that's interesting. Can you dump the exact code for your build as follows? It should look something like this, but the exact line might differ. I want to know where that divide by zero occurred.
|
tdb
commented
Jun 8, 2013
|
I've been building the module using dkms, but it appears to be stripping the module or not building it with symbols in the first place. Is there a way to modify that behaviour? Or am I going to need to ditch that and build it myself? I've tried setting the relevant things in /etc/default/zfs. |
|
Given the previously failing VERIFY:
...static analysis suggests:
|
|
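The code elided above comes from the ring-buffer arithmetic in spa_history.c. As a hedged reconstruction (paraphrased from the ZFS source of that era, so the exact line may differ), the history offsets are translated through a modulo whose divisor is computed from two on-disk header fields; if bogus (all-zero) data is read back from the device, that divisor is 0 and the '%' becomes an integer divide by zero, matching the oops in spa_history_write:
static uint64_t
spa_history_log_to_phys(uint64_t log_off, spa_history_phys_t *shpp)
{
	uint64_t phys_len;

	/* Both fields come straight from the on-disk history header. */
	phys_len = shpp->sh_phys_max_off - shpp->sh_pool_create_len;

	/* phys_len == 0 here is the suspected divide-by-zero site. */
	return ((log_off - shpp->sh_pool_create_len) % phys_len +
	    shpp->sh_pool_create_len);
}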
@tdb It depends on your kernel and what the default build options are. For example, the Ubuntu kernels will always strip the symbols. It may also not be needed since @chrisrd has likely spotted the offending line here. It seems likely that we're somehow reading bogus data from the ceph rbd. It would be useful to see what those values are. If you're still interested in chasing this can you try the following patch. It will log the offending value to the console before the crash. It would be useful to run it several times to see if the values remain constant or change.
|
tdb
commented
Jun 18, 2013
|
@behlendorf It looks like, either through fiddling or other updates, I've managed to move the error:
[ 422.936633] rbd1: unknown partition table
[ 422.936705] rbd: rbd1: added with size 0x10000000000
[ 441.362250] SPL: using hostid 0x007f0101
[ 441.470098] SPLError: 1682:0:(zap_micro.c:301:mze_find()) VERIFY3(mze->mze_cd == (&(zn->zn_zap)->zap_u.zap_micro.zap_phys->mz_chunk[(mze)->mze_chunkid])->mze_cd) failed (0 == 1635019877)
[ 441.470418] SPLError: 1682:0:(zap_micro.c:301:mze_find()) SPL PANIC
[ 441.470544] SPL: Showing stack for process 1682
[ 441.470552] Pid: 1682, comm: txg_sync Tainted: PF O 3.8.0-25-generic #37~precise1-Ubuntu
[ 441.470554] Call Trace:
[ 441.470579] [] spl_debug_dumpstack+0x27/0x40 [spl]
[ 441.470589] [] spl_debug_bug+0x82/0xe0 [spl]
[ 441.470636] [] mze_find+0x13a/0x270 [zfs]
[ 441.470677] [] zap_lookup_norm+0x9e/0x1c0 [zfs]
[ 441.470685] [] ? kmem_free_debug+0x4b/0x150 [spl]
[ 441.470725] [] zap_lookup+0x33/0x40 [zfs]
[ 441.470765] [] spa_feature_is_active+0x8a/0xf0 [zfs]
[ 441.470799] [] dsl_scan_active+0x76/0xc0 [zfs]
[ 441.470833] [] dsl_scan_sync+0x4f/0xe30 [zfs]
[ 441.470873] [] ? zio_wait+0x23d/0x480 [zfs]
[ 441.470910] [] ? bpobj_enqueue_cb+0x20/0x20 [zfs]
[ 441.470947] [] spa_sync+0x417/0xcd0 [zfs]
[ 441.470968] [] ? ktime_get_ts+0x4c/0xe0
[ 441.471007] [] txg_sync_thread+0x30a/0x640 [zfs]
[ 441.471016] [] ? kmem_free_debug+0x4b/0x150 [spl]
[ 441.471054] [] ? txg_quiesce_thread+0x540/0x540 [zfs]
[ 441.471062] [] thread_generic_wrapper+0x78/0x90 [spl]
[ 441.471070] [] ? __thread_create+0x310/0x310 [spl]
[ 441.471080] [] kthread+0xc0/0xd0
[ 441.471084] [] ? flush_kthread_worker+0xb0/0xb0
[ 441.471096] [] ret_from_fork+0x7c/0xb0
[ 441.471100] [] ? flush_kthread_worker+0xb0/0xb0
If that's of no use to you, let me know and I'll try to get the machine back how it was. I notice the kernel version has changed, and I'm fairly sure a ceph update got pulled in too. |
|
@tdb This just looks like garbage data from disk as well. One thing which did catch my eye from the above log, however, was the size of the rbd device: 0x10000000000 is a surprisingly round number for the partition; is this expected? Also, are you creating a partition table for zfs manually, or allowing it to partition the device?
|
tdb
commented
Jun 19, 2013
|
@behlendorf I noticed that size too. It's a 1TB image (0x10000000000 bytes is 2^40 = 1024GB), so it's actually correct:
# rbd ls -l
NAME SIZE PARENT FMT PROT LOCK
cephzfs 1024G 1
I was giving the raw device to ZFS, rather than creating a partition. If I use fdisk to put a partition table on the disk, but without adding any partitions, I get the following when creating a pool:
# zpool create pool1 /dev/rbd1
internal error: Invalid argument
Aborted (core dumped)
If I create a partition on it I get the same errors as I mentioned previously (mze_find) when creating a pool on /dev/rbd1p1. Just for comparison, here's the output creating an ext4 filesystem on the same partition:
root@ubuntu:~# mkfs.ext4 /dev/rbd1p1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
67108864 inodes, 268434432 blocks
13421721 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
8192 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
root@ubuntu:~# mount /dev/rbd1p1 /mnt
root@ubuntu:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/rbd1p1 1008G 72M 957G 1% /mnt
root@ubuntu:~#
|
|
Strange. Well, the only way these failures make sense is if something odd is happening at the block device layer. My next suggestion would be to use blktrace to grab a trace log for the rbd device. That would allow us to look for something unusual in the way rbd or zfs is behaving. http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html |
tdb
commented
Jun 20, 2013
|
@behlendorf Does this output help? |
|
@tdb That's exactly the log I wanted to see, but unfortunately it doesn't really show anything strange. All the I/O looks reasonable and is doing what I'd expect. That's got me wondering if the rbd driver might be modifying parts of the pages in the bvecs during the write. That could explain this issue, but we'd need to put a debug patch together to see. |
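A minimal sketch of such a debug patch, assuming the pre-3.14 bio accessors used by the kernels in this thread (bio_data_crc() is a hypothetical helper, not existing ZFS or kernel code): checksum every bvec before submit_bio() and recompute the checksum in the bi_end_io completion callback; for a write, any mismatch means the pages were modified in flight.
#include <linux/bio.h>
#include <linux/crc32.h>
#include <linux/highmem.h>

/* Hypothetical debug helper: CRC32 of all data mapped into a bio. */
static u32
bio_data_crc(struct bio *bio)
{
	struct bio_vec *bv;
	u32 crc = ~0U;
	int i;

	bio_for_each_segment(bv, bio, i) {
		void *p = kmap_atomic(bv->bv_page);
		crc = crc32(crc, p + bv->bv_offset, bv->bv_len);
		kunmap_atomic(p);
	}
	return (crc);
}
Recording the value before submit_bio() and comparing it in the completion callback would show whether rbd touches the bvec pages during a write.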
|
@tdb Based on little more than the mention of modifying bvecs, this commit, which touches
If your kernel doesn't have that patch already it could be worthwhile trying a kernel including it. It looks to have been introduced some time between v3.9 and v3.10-rc1. Possibly even worth trying v3.10-rc6 which has pulled in a bunch of |
tdb
commented
Jun 21, 2013
|
@chrisrd Using the Ubuntu mainline kernels I tried v3.9.7, but it behaved the same. I checked and it doesn't contain the commit you mentioned above. So I tried v3.10-rc6, and I get the following build error in spl:
Making all in module
make[2]: Entering directory `/var/lib/dkms/spl/0.6.1/build/module'
make -C /lib/modules/3.10.0-031000rc6-generic/build SUBDIRS=`pwd` CONFIG_SPL=m modules
make[3]: Entering directory `/usr/src/linux-headers-3.10.0-031000rc6-generic'
CC [M] /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-debug.o
CC [M] /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.o
In file included from /var/lib/dkms/spl/0.6.1/build/include/sys/kmem.h:38:0,
from /var/lib/dkms/spl/0.6.1/build/include/sys/kstat.h:32,
from /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:28:
/var/lib/dkms/spl/0.6.1/build/include/sys/vmsystm.h:77:8: error: redefinition of ‘struct vmalloc_info’
include/linux/vmalloc.h:173:8: note: originally defined here
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_match’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1126:15: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1129:32: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_find’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1137:16: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1137:37: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entries’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1150:16: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1150:37: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘spl_proc_init’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1177:2: error: implicit declaration of function ‘create_proc_entry’ [-Werror=implicit-function-declaration]
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1177:21: warning: assignment makes pointer from integer without a cast [enabled by default]
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1181:27: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_match’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1130:1: warning: control reaches end of non-void function [-Wreturn-type]
cc1: some warnings being treated as errors
make[5]: *** [/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.o] Error 1
make[4]: *** [/var/lib/dkms/spl/0.6.1/build/module/spl] Error 2
make[3]: *** [_module_/var/lib/dkms/spl/0.6.1/build/module] Error 2
make[3]: Leaving directory `/usr/src/linux-headers-3.10.0-031000rc6-generic'
make[2]: *** [modules] Error 2
make[2]: Leaving directory `/var/lib/dkms/spl/0.6.1/build/module'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/lib/dkms/spl/0.6.1/build'
make: *** [all] Error 2
Have spl/zfs been tested with v3.10 yet? |
|
@tdb There are pull requests open for 3.10 support, but they are still undergoing review before getting merged. They should be safe to use; the only real questions around them are whether they accidentally break builds on older kernels and whether they are as clean as they can be. @chrisrd I don't think the referenced commit will help, but it wouldn't hurt to try. We'll probably need to instrument the zfs |
tdb
commented
Aug 25, 2013
|
Just a quick update on this. I've tried again with 0.6.2 and the following two kernels:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10.9-saucy/
Same problem:
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.393672] SPLError: 2851:0:(zap_micro.c:301:mze_find()) VERIFY3(mze->mze_cd == (&(zn->zn_zap)->zap_u.zap_micro.zap_phys->mz_chunk[(mze)->mze_chunkid])->mze_cd) failed (0 == 825307184)
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394034] SPLError: 2851:0:(zap_micro.c:301:mze_find()) SPL PANIC
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394160] SPL: Showing stack for process 2851
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394164] CPU: 0 PID: 2851 Comm: txg_sync Tainted: PF O 3.11.0-031100rc6-generic #201308181835
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394166] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/22/2012
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394169] ffff88003c59da00 ffff88003c4ab9c8 ffffffff81720b9b 0000000000000007
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394173] 0000000000000000 ffff88003c4ab9d8 ffffffffa018f4d7 ffff88003c4aba18
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394176] ffffffffa01907a2 ffffffffa01a4b4d ffff880036998880 ffff88003c59da00
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394179] Call Trace:
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394203] [] dump_stack+0x46/0x58
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394221] [] spl_debug_dumpstack+0x27/0x40 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394246] [] spl_debug_bug+0x82/0xe0 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394314] [] mze_find+0x13a/0x270 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394359] [] zap_lookup_norm+0x9e/0x1c0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394368] [] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394410] [] zap_lookup+0x33/0x40 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394451] [] spa_feature_is_active+0x8a/0xf0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394485] [] dsl_scan_active+0x76/0xc0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394520] [] dsl_scan_sync+0x4f/0xe30 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394559] [] ? zio_wait+0x23d/0x4a0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394596] [] ? bpobj_enqueue_cb+0x20/0x20 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394633] [] spa_sync+0x48a/0xd60 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394649] [] ? ktime_get_ts+0x4c/0xe0
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394687] [] txg_sync_thread+0x30a/0x640 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394696] [] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394733] [] ? txg_quiesce_thread+0x540/0x540 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394742] [] thread_generic_wrapper+0x78/0x90 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394750] [] ? __thread_create+0x310/0x310 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394759] [] kthread+0xc0/0xd0
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394763] [] ? flush_kthread_worker+0xb0/0xb0
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394771] [] ret_from_fork+0x7c/0xb0
Aug 25 00:51:29 ubuntu-12042 kernel: [ 142.394776] [] ? flush_kthread_worker+0xb0/0xb0 |
tdb
commented
Nov 21, 2013
|
Using 0.6.2 and the linux-image-generic-lts-saucy 3.11.0.13.12 kernel on Ubuntu precise I now get the following:
# zpool create pool2 /dev/rbd1
internal error: Invalid argument
Aborted (core dumped)
The core file contains:
#0 0x00007ffa1abad425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffa1abb0b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffa1b383782 in ?? () from /lib/libzfs.so.2
#3 0x00007ffa1b383b70 in zfs_standard_error_fmt () from /lib/libzfs.so.2
#4 0x00007ffa1b364a1e in zfs_open () from /lib/libzfs.so.2
#5 0x000000000040bc98 in zpool_do_create (argc=, argv=) at ../../cmd/zpool/zpool_main.c:1057
#6 0x0000000000404d26 in main (argc=4, argv=0x7fffecdc5178) at ../../cmd/zpool/zpool_main.c:5709
And this in the log:
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240529] SPLError: 1688:0:(spa.c:6190:spa_sync()) VERIFY3(bpobj_iterate(defer_bpo, spa_free_sync_cb, zio, tx) == 0) failed (22 == 0)
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240786] SPLError: 1688:0:(spa.c:6190:spa_sync()) SPL PANIC
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240899] SPL: Showing stack for process 1688
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240910] CPU: 0 PID: 1688 Comm: txg_sync Tainted: PF O 3.11.0-13-generic #20~precise2-Ubuntu
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240912] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240915] 0000000000000005 ffff88003c6f9c48 ffffffff8173a05d 0000000000000007
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240919] 0000000000000000 ffff88003c6f9c58 ffffffffa01794d7 ffff88003c6f9c98
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240922] ffffffffa017a7a2 ffffffffa018ebed ffff88003b804000 0000000000000005
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240925] Call Trace:
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240943] [] dump_stack+0x46/0x58
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240971] [] spl_debug_dumpstack+0x27/0x40 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.240979] [] spl_debug_bug+0x82/0xe0 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241024] [] spa_sync+0x9f7/0xdb0 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241080] [] txg_sync_thread+0x364/0x6a0 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241122] [] ? txg_quiesce_thread+0x520/0x520 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241131] [] thread_generic_wrapper+0x78/0x90 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241139] [] ? __thread_create+0x310/0x310 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241145] [] kthread+0xc0/0xd0
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241149] [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241158] [] ret_from_fork+0x7c/0xb0
Nov 21 23:08:22 ubuntu-12042 kernel: [ 116.241162] [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.848936] INFO: task txg_sync:1688 blocked for more than 120 seconds.
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849220] txg_sync D ffff880036a5ece0 0 1688 2 0x00000000
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849226] ffff88003c6f9c48 0000000000000046 ffffffff81ae70b3 ffff88003fc14580
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849230] ffff88003c6f9fd8 ffff88003c6f9fd8 ffff88003c6f9fd8 0000000000014580
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849233] ffff88003cd69770 ffff88003cd6aee0 0000000000000000 0000000000000000
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849236] Call Trace:
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849252] [] schedule+0x29/0x70
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849299] [] spl_debug_bug+0xb5/0xe0 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849346] [] spa_sync+0x9f7/0xdb0 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849387] [] txg_sync_thread+0x364/0x6a0 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849427] [] ? txg_quiesce_thread+0x520/0x520 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849445] [] thread_generic_wrapper+0x78/0x90 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849454] [] ? __thread_create+0x310/0x310 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849460] [] kthread+0xc0/0xd0
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849464] [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849468] [] ret_from_fork+0x7c/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [ 240.849471] [] ? flush_kthread_worker+0xb0/0xb0
Further zpool commands generate the following:
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182141] SPLError: 2064:0:(zap_micro.c:1292:zap_cursor_retrieve()) VERIFY3(mze->mze_cd == mzep->mze_cd) failed (0 == 1635019877)
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182264] SPLError: 2064:0:(zap_micro.c:1292:zap_cursor_retrieve()) SPL PANIC
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182329] SPL: Showing stack for process 2064
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182335] CPU: 0 PID: 2064 Comm: zpool Tainted: PF O 3.11.0-13-generic #20~precise2-Ubuntu
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182338] ffff88003c25b640 ffff88003bdebac8 ffffffff8173a05d 0000000000000007
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182341] 0000000000000000 ffff88003bdebad8 ffffffffa01794d7 ffff88003bdebb18
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182343] ffffffffa017a7a2 ffffffffa018ebed ffff88003bdebbf8 ffff88003c25b640
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182345] Call Trace:
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182352] [] dump_stack+0x46/0x58
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182363] [] spl_debug_dumpstack+0x27/0x40 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182367] [] spl_debug_bug+0x82/0xe0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182400] [] zap_cursor_retrieve+0x24a/0x480 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182414] [] ? default_spin_lock_flags+0x9/0x10
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182441] [] ? zap_unlockdir+0x108/0x1a0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182466] [] spa_add_feature_stats+0x213/0x440 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182471] [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182476] [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182482] [] ? nvlist_remove_all+0x8f/0xd0 [znvpair]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182506] [] ? spa_config_held+0xb9/0xd0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182531] [] ? spa_add_l2cache+0x29/0x3f0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182555] [] ? spa_add_spares+0x25/0x360 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182579] [] spa_get_stats+0x10f/0x330 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182584] [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182610] [] zfs_ioc_pool_stats+0x31/0x70 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182636] [] zfsdev_ioctl+0x53b/0x5b0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182646] [] ? ftrace_raw_event_do_sys_open+0x100/0x110
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182651] [] do_vfs_ioctl+0x7c/0x2f0
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182653] [] SyS_ioctl+0x91/0xb0
Nov 21 23:17:44 ubuntu-12042 kernel: [ 678.182657] [] system_call_fastpath+0x1a/0x1f |
|
@tdb This was due to an ABI change between the user utilities and kmods. It is unrelated to your original issue. Make sure you rebuild everything such that the utilities exactly match the kmods. |
tdb
commented
Nov 22, 2013
|
@behlendorf Ah, ok, sorry for the noise. I actually just installed the binary packages from launchpad, which built the kmods with dkms. So I would have expected that to stay in sync. Anyway - as you say, not related to this issue. |
rbraddy
commented
Dec 18, 2013
|
Is this still an issue or has a resolution been found? |
tdb
commented
Dec 18, 2013
|
@rbraddy I'm not aware of a resolution yet, but @behlendorf can confirm. Are you seeing the same problem? It'd be good to know it's not just me! |
rbraddy
commented
Dec 18, 2013
|
Yes, we are seeing the same issue when creating a ZFS storage pool on top of a Ceph RBD block device - zpool create failure and kernel panic. Creating an ext4 filesystem on RBD works perfectly. RBD is used extensively today by various cloud stacks (e.g., OpenStack, CloudStack and others), so there seems to be no issue with how it presents itself as a block device for those file systems. Having RBD work well with ZFS is very important, as it addresses one of the major drawbacks to ZFS - a single point of failure on direct-attached storage - plus the ability to scale out. RADOS is very impressive technology, and combined with ZFS it promises to be the most powerful filesystem around. Ceph's own filesystem is not ready for prime time, so it just makes sense for these two technologies to work well together and be supported (the way ZFS is supported underneath Ceph OSDs today). I agree that it's odd that: a) ZFS is the only major file system that is not working on top of RBD today, and b) ZFS panics instead of failing gracefully in the face of whatever incompatibility exists. Having said that, from what I have seen, ZFS does work a bit differently than many other file systems. In our testing, we also encountered strange behavior from PARTED when trying to delete an existing ext4 partition that we had initially configured on the RBD, in an attempt to create an empty GPT partition in preparation for use with ZFS. ZFS creates its own partitioning scheme from what I have seen, so this may be a clue. We are still investigating, but at this point lack the deep kernel expertise required to reconcile the issue between these two complex systems. In reading through this thread, I see that about six months ago Brian proposed gathering more information as a next step, which does not appear to have happened yet. I'm wondering if it makes sense to pursue that line of analysis next. From @behlendorf: It seems likely that we're somehow reading bogus data from the ceph rbd. It would be useful to see what those values are. If you're still interested in chasing this can you try the following patch. It will log the offending value to the console before the crash. It would be useful to run it several times to see if the values remain constant or change.
diff --git a/module/zfs/spa_history.c b/module/zfs/spa_history.c
index 9fb75f3..700f364 100644
--- a/module/zfs/spa_history.c
+++ b/module/zfs/spa_history.c
@@ -223,6 +223,13 @@ spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
*/
VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
shpp = dbp->db_data;
+#ifdef _KERNEL
+ printk("sh_pool_create_len = %llu\n", shpp->sh_pool_create_len);
+ printk("sh_phys_max_off = %llu\n", shpp->sh_phys_max_off);
+ printk("sh_bof = %llu\n", shpp->sh_bof);
+ printk("sh_eof = %llu\n", shpp->sh_eof);
+ printk("sh_records_losts = %llu\n", shpp->sh_records_lost);
+#endif
dmu_buf_will_dirty(dbp, tx); |
|
I just wanted to post a note here to say that I've started actively looking into this problem. I'm occasionally able to reproduce problems similar to the original report, but my general observation is that all manner of other chaos can result from running ZFS atop RBD. Unfortunately, I got sidetracked while looking into this and burned a ton of time tracking down the problem described in zfsonlinux/zfs#2010. With that out of the way, hopefully I'm back on track now. Also, I should mention that this should likely be a ZFS issue rather than an SPL issue. |
hvenzke
commented
Dec 30, 2013
|
@dweeezil Tim, some of the logs I have read about ZFS + Ceph RBD said that the partition table ZFS uses is not supported by PARTED?!
Also, did you try fdisk on the Ceph RBD disk with partition type "bf"? To the best of my ZFS knowledge "bf" (Solaris) is the default; someone is welcome to correct me. :-) |
|
I'm still trying to get a grip on the actual problem. So far, I'm fairly certain the problem is not simply that the rbd block device behaves differently than other block devices do. @remsnet For my current testbed, I'm generally creating my ZFS pool on a single pre-created partition on the rbd device (actually, my preferred testbed is to dd a known good pool onto my rbd and test from there). I'm hoping to narrow down the problem a bit more within the next day or so once I get more time to look at it. The failures I'm seeing when performing normal filesystem operations are many and varied. I'm concerned that zfs+rbd is exceeding Linux's kernel stack limit, but I've not been able to prove it. I do plan on building a 16K stack kernel as part of my further testing to try to rule it out. Using debugfs' stack_trace feature has been very iffy, with wild pointer (NULL or close-to-null) dereferences typically occurring in the ftrace_call() function itself. I also plan on doing some instrumenting of rbd by itself to get a handle on its "base" stack utilization. The failures I'm seeing are typical of those you'd see when memory (the stack in particular) is overwritten. I'll post more information as I get it. |
|
@dweeezil It's great to see you looking into this. Stack overrun is certainly one possible explanation; I could easily believe that the ceph rbd is more stack-heavy than other block devices in the kernel. As you said, rebuilding your kernel with 16k stacks would be the easiest way to check for this. |
|
I realize a quick update on this may be in order, since my last communication has been directly to @rbraddy off-list. I've pretty much ruled out stack overflow; in fact, the stack usage seems to be very well controlled, with my zpool commands being the greatest consumer but still having over 11K free when using a kernel with 16K stacks. What I'm seeing when trying simple ZPL operations on a known-good pool (dd'ed to a Ceph RBD and md5-checksummed for verification) is somewhat random memory corruption. I briefly experimented with kmemcheck, but I don't think it's going to be too helpful; the kernel runs too slowly and it has a bad habit of finding false positives. Instead, I'm trying to find a way to reliably trip specific types of assertions. My best success has been that it's very easy to corrupt the microZAPs when creating small files. In my last bit of detective work, it looks like they're being clobbered by, of all things, the in-memory debugging output from the various dprintf...() debugging functions. I've taken a few days off of looking at this, mainly due to my discovery of zfsonlinux/zfs#2010. I'll post more information on this issue as it becomes available. |
hvenzke
commented
Jan 7, 2014
|
The last real kernel stack overrun I heard of was back when Linus left the 1.x kernel series behind. |
|
@dweeezil What version of kernel rbd are you using? A whole bunch of rbd changes, some of which could conceivably cause corruptions, hit master in v3.12-rc2 and are slated for the next round of stable updates (v3.10.26 etc). |
|
@chrisrd Since I'm doing my testing under Ubuntu 13.10, I'm running their stock 3.11.0-14-generic kernel, compiled by myself and fetched from their git repo. The last commit in the fs/ceph directory is 494ddd1 from upstream. I did think about running a newer kernel (even a stock upstream one) but decided to try to stay as standard as possible for the time being. I did also check out the commit logs going forward to the latest master code in Linus' tree, and none of them looked like total show-stoppers to me, but that was just after a cursory inspection. Should I be running something newer? |
|
@dweeezil Given you're using rbd, the relevant kernel parts would be drivers/block/rbd*, net/ceph and include/linux/ceph rather than fs/ceph (that's for the ceph file system, which is not directly related to rbd, except that they both use the ceph underpinnings). But I agree, looking at the logs from torvalds/linux@494ddd1 onwards doesn't show anything that looks particularly suspicious, except perhaps if you're using rbd snapshots:
|
|
@chrisrd Thanks for the confirmation. Yes, that was a booboo for me to refer to fs/ceph rather than drivers/block/rbd* (which is actually what I was looking at the first time I scanned the commits a week ago). In any case, it seems that I should be safe staying with the kernel I'm working with. |
|
It seems the rbd driver will overwrite memory in some cases when passed odd-sized bio requests from ZFS. I gave up trying to track down the memory corruption directly because I couldn't get anything reproducible, so I compared ZFS's block I/O to that of other file systems (ext4 and xfs) and discovered that the others only presented nice evenly-sized requests (usually 8 sectors or multiples of 8), whereas ZFS presents "unusual" sizes such as 5, 13 and 15 sectors. Apparently the rbd driver doesn't like some aspect of these requests, and the result is memory corruption. I've not yet determined whether the transfer size or the buffer alignment is the problem. In any case, a trivial workaround is to use an ashift=12 pool. Those appear to work just fine on rbd and present it with nice evenly-sized bios. I've asked @rbraddy to do some testing on his Ceph rigs. |
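To make the observation concrete, here is a hedged sketch of the kind of check this comparison implies (hypothetical code, using the pre-3.14 bi_size/bi_sector fields that match the kernels used in this thread); ext4 and xfs never trip it, while an ashift=9 ZFS pool does:
#include <linux/bio.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/* Hypothetical debug check for zfs's bio submit path: warn about
 * transfers that are not a whole number of 4K pages, e.g. the
 * 5-, 13- and 15-sector bios seen from ashift=9 pools. */
static inline void
zfs_warn_odd_bio(struct bio *bio)
{
	if (bio->bi_size & (PAGE_SIZE - 1))
		printk(KERN_WARNING "zfs: odd bio: %u bytes at sector %llu\n",
		    bio->bi_size, (unsigned long long)bio->bi_sector);
}
With ashift=12 every transfer is a multiple of 4K, so the check stays quiet; that is consistent with the workaround (something like zpool create -o ashift=12 pool1 /dev/rbd1).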
|
I'm sure the Ceph people would like to know about this: would you like to enter whatever details you have into the ceph bug tracker? |
|
@dweeezil That makes good sense, since most, maybe all, Linux filesystems will always submit IO in 4k page-sized chunks from the page cache. I could easily see an alignment issue like this in the ceph rbd not being caught. |
dweeezil
referenced this issue
in zfsonlinux/zfs
Mar 18, 2014
Closed
Kernel panic when creating pools on CEPH RBD devices #2193
aieri
commented
Mar 19, 2014
|
@dweeezil if anybody with a fair understanding of what's going on could submit a bug on the Ceph tracker, I would be happy to help push things along. We have a support contract with Inktank and could bring this issue to their attention. |
|
Issue opened on ceph bug tracker: http://tracker.ceph.com/issues/7790 |
aieri
commented
Mar 28, 2014
|
Unfortunately the word from Inktank is that since ZFS on RBD is not supported, they will not handle the bug as part of our support contract, but it will simply be a community effort. |
|
@aieri I don't think that simply misaligning standard filesystems will make a difference. With ext4, maybe try |
|
You shouldn't be able to cause this with a filesystem like ext or xfs, since they will be strictly page aligned (4k). We could probably work around it fairly easily on the ZFS side by detecting that it's a ceph rbd and automatically setting the ashift to 12. That said, it would be better to fix this on the ceph side, of course. |
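A rough sketch of what that detection might look like (entirely hypothetical; vdev_disk_is_rbd() is not an existing function, and matching on the gendisk name is only one possible heuristic):
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/string.h>

/* Hypothetical: is this vdev backed by a Ceph RBD device ("rbd0",
 * "rbd1", ...)?  Checked by name on the underlying gendisk. */
static int
vdev_disk_is_rbd(struct block_device *bdev)
{
	return (bdev->bd_disk != NULL &&
	    strncmp(bdev->bd_disk->disk_name, "rbd", 3) == 0);
}
The vdev open path could then report a 4096-byte minimum transfer size for such devices, which would make zpool create default to ashift=12.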
|
I never did determine whether it was the alignment or the transfer size that caused the problem. It would be nice to detect Ceph RBD under ZFS and set ashift=12 as a workaround. |
|
Hi all: At first with spl-master, zfs-master and linux-3.14.1 Then I tried again with the following patch: I still haven't figured out why, but it seems that rbd cannot handle vmalloc'd buffers properly. |
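For context on why a vmalloc'd buffer is a plausible suspect (a general kernel fact, not something confirmed for rbd here): vmalloc memory is only virtually contiguous, and its addresses must be translated to struct page with vmalloc_to_page(); virt_to_page() on a vmalloc address silently returns the wrong page, so any driver walking the buffer with the wrong translation reads or writes unrelated memory. A minimal sketch of the correct translation:
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Translate a kernel buffer address to its backing struct page.
 * virt_to_page() is only valid for the linear (kmalloc) mapping;
 * vmalloc addresses need vmalloc_to_page(). */
static struct page *
buf_to_page(void *addr)
{
	if (is_vmalloc_addr(addr))
		return (vmalloc_to_page(addr));
	return (virt_to_page(addr));
}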
|
@tuxoko Very interesting! Have you tried setting "c" to 16 if it were less, and continuing to let it use
|
Hi @dweeezil Then, I tried with the one below: So I'm now more inclined to believe that there's a subtle bug in the spl slab that happens to show up 100% of the time when using a 512 byte slab with ashift=9 on rbd. By the way, I get |
behlendorf
referenced this issue
in zfsonlinux/zfs
Apr 17, 2014
Closed
Fix zfsdev_ioctl() kmem leak warning #2262
|
So I noticed this in zio.c:
I commented out this part, and the result is also successful. There's definitely a bug in the spl slab, especially with KMC_KMEM. Update:
|
@tuxoko Wondering if there could be a tie-in to this code in
When I was trying to track this down, I did all sorts of hackery to |
|
I'm a bit confused. The real mystery is that at first I thought there was a problem with vmalloc, so I used kmalloc and it worked. But it turned out that it wasn't using vmalloc in the beginning, and changing it to vmalloc also "fixed" it. |
|
@dweeezil |
|
@tuxoko I should be able to get to this later this evening. The most interesting clue I can remember from my previous research into the problem is that the data being overwritten is frequently the in-memory SPL log (the one you can view after triggering the |
|
@tuxoko @dweeezil Sorry I've been slow to comment on this, but I think I can shed some light on the code referenced above. The zio buffers will be a mix of kmem and vmem backed memory depending on the size of the object. We want to minimize the overhead of an allocation, so we avoid vmem backed buffers if possible. Ideally they should all be backed by individual pages, which would make them simpler to hand off to the block layer, which expects individual pages. That's why we need the Now exactly why small slab objects would cause a problem isn't clear to me. The only thing which immediately occurs to me is that they are likely to share pages in the slab. These pages will be mapped into the bio, and perhaps that is leading to an issue. |
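That page-sharing point is easy to demonstrate in isolation. A hedged sketch (ordinary kernel-module code, not ZFS source): two small kmem slab allocations frequently land on the same physical page, so mapping each into a bio hands the block layer one struct page twice at different offsets.
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/kernel.h>

static void
demo_shared_slab_page(void)
{
	void *a = kmalloc(512, GFP_KERNEL);
	void *b = kmalloc(512, GFP_KERNEL);

	/* Frequently true: both 512-byte objects come from one slab
	 * page, so bio_add_page(bio, virt_to_page(a), ...) and the
	 * same call for b reference a single struct page twice. */
	if (a && b && virt_to_page(a) == virt_to_page(b))
		printk(KERN_INFO "a and b share a slab page\n");

	kfree(b);
	kfree(a);
}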
|
Unfortunately, I've let my ceph testing rig go stale and it's not working right now so I can't easily do any more testing on this issue for the moment. |
|
I built a DEBUG_PAGEALLOC kernel and got the following result with ashift=9 on unpatched master:
This oops can be reproduced very reliably, which is a bit of a surprise since the buffer is slab backed. |
|
This is the dump_stack where the page is freed.
So rbd (or the network stack beneath it) calls put_page() on a zfs buffer, and that results in this bug. |
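That would explain the corruption: libceph can send bio pages zero-copy over the socket, the network stack takes and later drops its own page reference, and a slab-backed page whose refcount is manipulated that way can end up freed while the slab allocator still owns it. A purely hypothetical mitigation sketch (not a proposed patch), using the pre-3.14 bio accessors: pin every mapped page for the life of the bio so a stray put_page() cannot free it.
#include <linux/bio.h>
#include <linux/mm.h>

/* Hypothetical: take an extra reference on each page mapped into
 * the bio before submission. */
static void
bio_pin_pages(struct bio *bio)
{
	struct bio_vec *bv;
	int i;

	bio_for_each_segment(bv, bio, i)
		get_page(bv->bv_page);
}
A matching loop in the bi_end_io completion callback would drop these references with put_page() once the I/O is truly complete.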
|
@aieri @tdb
|
tdb commented May 24, 2013
I'm trying to create a pool on top of a Ceph RBD. My setup is:
I can create a pool on top of a local disk without any problems. But when I put it on top of a Ceph RBD (block device) I get the following error:
And this panic:
And then the following repeats after that until I reboot:
I'm happy to provide any further information required or do testing as needed.
Thank you.
Tim.