zpool import of raidz1 volume freezes #1488

Closed
ewheelerinc opened this Issue May 31, 2013 · 16 comments


@ewheelerinc

Hello,

I have a 4-disk raidz1 pool with a zvol as zd0 that I have been testing with. Generally speaking the system has been very solid, but an import now freezes the system hard without any dmesg output.

I dd'ed the four 500 MB disk images and packed them into this 900 MB tar file:
http://www.globallinuxsecurity.pro/tmp/zpool-zvol-import-freeze-phydisk1-4.tar.gz

(I will leave this up for a while, hopefully long enough for someone to download and test. There is nothing confidential on the images, just a Debian install.)

For reference, I was able to freeze the system during import today as of commit 0377189. Note that "zpool import" does list the available pools, but the system hangs as soon as I specify a pool to import (or use -a).

Hopefully someone can reproduce the problem with these volumes and pinpoint the issue. Unfortunately the kernel gives no output when it locks up; it just freezes hard.

This is what vmstat looked like at the time of import. Hopefully the amount of data read suggests what is happening at the moment it locks up:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 0  0      0 414920   6948  18620    0    0     0     0  469   21  0  0 100  0
 0  0      0 414944   6948  18620    0    0     0     8  478   36  0  1 99  0
 0  0      0 414944   6948  18620    0    0     0     0  471   23  1  0 99  0
 0  0      0 414960   6956  18620    0    0     0    36  491   43  0  1 98  1
 0  1      0 413524   6964  18620    0    0 13410    32 1177  593  1 18 61 20
 1  1      0 413696   6964  18620    0    0  9376     0  866 1171  0 71  0 29
 0  1      0 413540   6964  18620    0    0  6624    96  887 1133  1 67  0 32
 0  1      0 406592   6964  18620    0    0  4272  1056  850  535  0 30  0 70
 0  1      0 406592   6964  18620    0    0     0   480  699  189  0 16  0 84
 1  1      0 406592   6964  18620    0    0     0   480  738  216  0 21  0 79
 0  1      0 406592   6972  18620    0    0     0  1956  666  161  0 28  0 72
 0  1      0 406452   6984  18624    0    0     4   772  647  249  0 17  0 83
 2  1      0 406452   6984  18624    0    0     0   496  678  174  0 18  0 82
 0  1      0 406476   6984  18624    0    0     0   488  816  263  0 23  0 77
Write failed: Broken pipe
Connection to 192.168.102.53 closed.
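For what it's worth, the stall pattern is visible in the columns above: bi (blocks read in) spikes as the import scans the disks, then drops to zero while wa (I/O wait) stays high, which matches tasks stuck in the D state. A rough illustration of that check in Python, using three representative lines from the paste (column indices assume the default 16-column vmstat output shown here):

```python
# Three lines from the vmstat sample above: idle, reading, then stalled.
SAMPLE = """\
0  0      0 414920   6948  18620    0    0     0     0  469   21  0  0 100  0
0  1      0 413524   6964  18620    0    0 13410    32 1177  593  1 18 61 20
0  1      0 406592   6964  18620    0    0     0   480  699  189  0 16  0 84
"""

def stalled(line):
    f = line.split()
    bi, wa = int(f[8]), int(f[15])   # bi = blocks read in, wa = % I/O wait
    return bi == 0 and wa > 50       # no reads coming back, yet heavy I/O wait

flags = [stalled(l) for l in SAMPLE.splitlines()]
print(flags)  # [False, False, True] -- only the last line shows the stall
```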
@ewheelerinc

SysRq-T output (show task states):

[   91.492385] SysRq : Show State
[   91.495465]   task                PC stack   pid father
[   91.500685] init            R running      0     1      0 0x00000000
[   91.507239] kthreadd        S c03785b4     0     2      0 0x00000000
[   91.513668] ksoftirqd/0     S c03785b4     0     3      2 0x00000000
[   91.520124] kworker/0:0     S c03785b4     0     4      2 0x00000000
[   91.526559] kworker/0:0H    S c03785b4     0     5      2 0x00000000
[   91.532989] kworker/u:0     S c03785b4     0     6      2 0x00000000
[   91.539420] kworker/u:0H    S c03785b4     0     7      2 0x00000000
[   91.545850] khelper         S c03785b4     0     8      2 0x00000000
[   91.552279] kdevtmpfs       S c03785b4     0     9      2 0x00000000
[   91.558720] netns           S c03785b4     0    10      2 0x00000000
[   91.565153] kworker/0:1     R running      0    11      2 0x00000000
[   91.571579] bdi-default     S c03785b4     0    12      2 0x00000000
[   91.578059] kblockd         S c03785b4     0    13      2 0x00000000
[   91.584493] khubd           S c03785b4     0    14      2 0x00000000
[   91.590949] rpciod          S c03785b4     0    15      2 0x00000000
[   91.597382] khungtaskd      S c03785b4     0    16      2 0x00000000
[   91.603843] kswapd0         S c03785b4     0    17      2 0x00000000
[   91.610281] ksmd            S c03785b4     0    18      2 0x00000000
[   91.616726] fsnotify_mark   S c03785b4     0    19      2 0x00000000
[   91.623168] nfsiod          S c03785b4     0    20      2 0x00000000
[   91.629600] crypto          S c03785b4     0    21      2 0x00000000
[   91.636031] kthrotld        S c03785b4     0    30      2 0x00000000
[   91.642461] VCHIQ-0         S c03785b4     0    31      2 0x00000000
[   91.649008] VCHIQr-0        S c03785b4     0    32      2 0x00000000
[   91.655531] VCHIQs-0        S c03785b4     0    33      2 0x00000000
[   91.662059] iscsi_eh        S c03785b4     0    34      2 0x00000000
[   91.668496] dwc_otg         S c03785b4     0    35      2 0x00000000
[   91.674926] DWC Notificatio S c03785b4     0    36      2 0x00000000
[   91.681355] kworker/u:1     S c03785b4     0    37      2 0x00000000
[   91.687784] deferwq         S c03785b4     0    38      2 0x00000000
[   91.694214] kworker/u:2     R running      0    39      2 0x00000000
[   91.707090] kworker/0:1H    R running      0    93      2 0x00000000
[   91.713522] spl_kmem_cache/ S c03785b4     0   149      2 0x00000000
[   91.720313] spl_system_task R running      0   150      2 0x00000000
[   91.727028] arc_adapt       R running      0   151      2 0x00000000
[   91.735829] l2arc_feed      R running      0   152      2 0x00000000
[   91.744549] z_unmount/0     S c03785b4     0   153      2 0x00000000
[   91.751279] zvol/0          S c03785b4     0   154      2 0x00000000
[   91.757995] zvol/1          S c03785b4     0   155      2 0x00000000
[   91.764705] zvol/2          S c03785b4     0   156      2 0x00000000
[   91.771396] zvol/3          S c03785b4     0   157      2 0x00000000
[   91.778106] zvol/4          S c03785b4     0   158      2 0x00000000
[   91.784813] zvol/5          S c03785b4     0   159      2 0x00000000
[   91.791527] zvol/6          S c03785b4     0   160      2 0x00000000
[   91.798240] zvol/7          S c03785b4     0   161      2 0x00000000
[   91.804934] zvol/8          S c03785b4     0   162      2 0x00000000
[   91.811646] zvol/9          S c03785b4     0   163      2 0x00000000
[   91.818355] zvol/10         S c03785b4     0   164      2 0x00000000
[   91.825068] zvol/11         S c03785b4     0   165      2 0x00000000
[   91.831780] zvol/12         S c03785b4     0   166      2 0x00000000
[   91.838469] zvol/13         S c03785b4     0   167      2 0x00000000
[   91.845178] zvol/14         S c03785b4     0   168      2 0x00000000
[   91.851891] zvol/15         S c03785b4     0   169      2 0x00000000
[   91.858606] zvol/16         S c03785b4     0   170      2 0x00000000
[   91.865313] zvol/17         S c03785b4     0   171      2 0x00000000
[   91.872005] zvol/18         S c03785b4     0   172      2 0x00000000
[   91.878714] zvol/19         S c03785b4     0   173      2 0x00000000
[   91.885423] zvol/20         S c03785b4     0   174      2 0x00000000
[   91.892131] zvol/21         S c03785b4     0   175      2 0x00000000
[   91.898840] zvol/22         S c03785b4     0   176      2 0x00000000
[   91.905534] zvol/23         S c03785b4     0   177      2 0x00000000
[   91.912243] zvol/24         S c03785b4     0   178      2 0x00000000
[   91.918950] zvol/25         S c03785b4     0   179      2 0x00000000
[   91.925664] zvol/26         S c03785b4     0   180      2 0x00000000
[   91.932372] zvol/27         S c03785b4     0   181      2 0x00000000
[   91.939059] zvol/28         S c03785b4     0   182      2 0x00000000
[   91.945772] zvol/29         S c03785b4     0   183      2 0x00000000
[   91.952488] zvol/30         S c03785b4     0   184      2 0x00000000
[   91.959202] zvol/31         S c03785b4     0   185      2 0x00000000
[   91.965913] md              S c03785b4     0   190      2 0x00000000
[   91.972356] jbd2/mmcblk0p4- R running      0   205      2 0x00000000
[   91.978827] ext4-dio-unwrit S c03785b4     0   206      2 0x00000000
[   91.985261] udevd           R running      0   347      1 0x00000000
[   91.998104] udevd           S c03785b4     0   464    347 0x00000000
[   92.004550] udevd           S c03785b4     0   465    347 0x00000000
[   92.010986] flush-179:0     R running      0  1319      2 0x00000000
[   92.017468] kworker/0:2     S c03785b4     0  1580      2 0x00000000
[   92.023904] dhclient        S c03785b4     0  1591      1 0x00000000
[   92.030402] rsyslogd        S c03785b4     0  1835      1 0x00000000
[   92.036892] rs:main Q:Reg   S c03785b4     0  1838      1 0x00000000
[   92.043394] rsyslogd        S c03785b4     0  1841      1 0x00000000
[   92.049889] rsyslogd        S c03785b4     0  1842      1 0x00000000
[   92.056409] cron            S c03785b4     0  1902      1 0x00000000
[   92.062876] sshd            S c03785b4     0  1932      1 0x00000000
[   92.069371] sshd            S c03785b4     0  1944   1932 0x00000000
[   92.075860] bash            S c03785b4     0  1952   1944 0x00000000
[   92.082298] sshd            S c03785b4     0  1962   1932 0x00000000
[   92.088788] vmstat          R running      0  1964   1962 0x00000000
[   92.095244] ntpd            R running      0  2133      1 0x00000001
[   92.101734] getty           S c03785b4     0  2175      1 0x00000000
[   92.108277] getty           S c03785b4     0  2176      1 0x00000000
[   92.114805] sshd            S c03785b4     0  2177   1932 0x00000000
[   92.121299] vmstat          R running      0  2179   2177 0x00000000
[   92.127756] flush-253:0     S c03785b4     0  2183      2 0x00000000
[   92.134210] zpool           R running      0  2185   1952 0x00000006
[   92.156184] z_null_iss/0    S c03785b4     0  2348      2 0x00000000
[   92.163018] z_null_int/0    S c03785b4     0  2349      2 0x00000000
[   92.169722] z_rd_iss/0      S c03785b4     0  2350      2 0x00000000
[   92.176436] z_rd_iss/1      S c03785b4     0  2351      2 0x00000000
[   92.183147] z_rd_iss/2      S c03785b4     0  2352      2 0x00000000
[   92.189853] z_rd_iss/3      S c03785b4     0  2353      2 0x00000000
[   92.196561] z_rd_iss/4      S c03785b4     0  2354      2 0x00000000
[   92.203257] z_rd_iss/5      S c03785b4     0  2355      2 0x00000000
[   92.209970] z_rd_iss/6      S c03785b4     0  2356      2 0x00000000
[   92.216684] z_rd_iss/7      S c03785b4     0  2357      2 0x00000000
[   92.223395] z_rd_int/0      S c03785b4     0  2358      2 0x00000000
[   92.230106] z_wr_iss/0      S c03785b4     0  2359      2 0x00000000
[   92.236796] z_wr_iss_h/0    S c03785b4     0  2360      2 0x00000000
[   92.243505] z_wr_iss_h/1    S c03785b4     0  2361      2 0x00000000
[   92.250217] z_wr_iss_h/2    S c03785b4     0  2362      2 0x00000000
[   92.256925] z_wr_iss_h/3    S c03785b4     0  2363      2 0x00000000
[   92.263634] z_wr_iss_h/4    S c03785b4     0  2364      2 0x00000000
[   92.270323] z_wr_int/0      S c03785b4     0  2365      2 0x00000000
[   92.277034] z_wr_int/1      S c03785b4     0  2366      2 0x00000000
[   92.283747] z_wr_int/2      S c03785b4     0  2367      2 0x00000000
[   92.290453] z_wr_int/3      S c03785b4     0  2368      2 0x00000000
[   92.297163] z_wr_int/4      S c03785b4     0  2369      2 0x00000000
[   92.303863] z_wr_int/5      S c03785b4     0  2370      2 0x00000000
[   92.310572] z_wr_int/6      S c03785b4     0  2371      2 0x00000000
[   92.317288] z_wr_int/7      S c03785b4     0  2372      2 0x00000000
[   92.324003] z_wr_int/8      S c03785b4     0  2373      2 0x00000000
[   92.330715] z_wr_int/9      S c03785b4     0  2374      2 0x00000000
[   92.337406] z_wr_int/10     S c03785b4     0  2375      2 0x00000000
[   92.344115] z_wr_int/11     S c03785b4     0  2376      2 0x00000000
[   92.350828] z_wr_int/12     S c03785b4     0  2377      2 0x00000000
[   92.357537] z_wr_int/13     S c03785b4     0  2378      2 0x00000000
[   92.364246] z_wr_int/14     S c03785b4     0  2379      2 0x00000000
[   92.370935] z_wr_int/15     S c03785b4     0  2380      2 0x00000000
[   92.377648] z_wr_int_h/0    S c03785b4     0  2381      2 0x00000000
[   92.384360] z_wr_int_h/1    S c03785b4     0  2382      2 0x00000000
[   92.391071] z_wr_int_h/2    S c03785b4     0  2383      2 0x00000000
[   92.397778] z_wr_int_h/3    S c03785b4     0  2384      2 0x00000000
[   92.404472] z_wr_int_h/4    S c03785b4     0  2385      2 0x00000000
[   92.411188] z_fr_iss_0/0    S c03785b4     0  2386      2 0x00000000
[   92.417895] z_fr_iss_0/1    S c03785b4     0  2387      2 0x00000000
[   92.424606] z_fr_iss_0/2    S c03785b4     0  2388      2 0x00000000
[   92.431321] z_fr_iss_0/3    S c03785b4     0  2389      2 0x00000000
[   92.438011] z_fr_iss_1/0    S c03785b4     0  2390      2 0x00000000
[   92.444718] z_fr_iss_1/1    S c03785b4     0  2391      2 0x00000000
[   92.451431] z_fr_iss_1/2    S c03785b4     0  2392      2 0x00000000
[   92.458142] z_fr_iss_1/3    S c03785b4     0  2393      2 0x00000000
[   92.464853] z_fr_iss_2/0    S c03785b4     0  2394      2 0x00000000
[   92.471543] z_fr_iss_2/1    S c03785b4     0  2395      2 0x00000000
[   92.478251] z_fr_iss_2/2    S c03785b4     0  2396      2 0x00000000
[   92.484963] z_fr_iss_2/3    S c03785b4     0  2397      2 0x00000000
[   92.491673] z_fr_iss_3/0    S c03785b4     0  2398      2 0x00000000
[   92.498374] z_fr_iss_3/1    S c03785b4     0  2399      2 0x00000000
[   92.505067] z_fr_iss_3/2    S c03785b4     0  2400      2 0x00000000
[   92.511773] z_fr_iss_3/3    S c03785b4     0  2401      2 0x00000000
[   92.518478] z_fr_iss_4/0    S c03785b4     0  2402      2 0x00000000
[   92.525188] z_fr_iss_4/1    S c03785b4     0  2403      2 0x00000000
[   92.531896] z_fr_iss_4/2    S c03785b4     0  2404      2 0x00000000
[   92.538589] z_fr_iss_4/3    S c03785b4     0  2405      2 0x00000000
[   92.545298] z_fr_iss_5/0    S c03785b4     0  2406      2 0x00000000
[   92.552010] z_fr_iss_5/1    S c03785b4     0  2407      2 0x00000000
[   92.558718] z_fr_iss_5/2    S c03785b4     0  2408      2 0x00000000
[   92.565426] z_fr_iss_5/3    S c03785b4     0  2409      2 0x00000000
[   92.572114] z_fr_iss_6/0    S c03785b4     0  2410      2 0x00000000
[   92.578823] z_fr_iss_6/1    S c03785b4     0  2411      2 0x00000000
[   92.585533] z_fr_iss_6/2    S c03785b4     0  2412      2 0x00000000
[   92.592239] z_fr_iss_6/3    S c03785b4     0  2413      2 0x00000000
[   92.598952] z_fr_iss_7/0    S c03785b4     0  2414      2 0x00000000
[   92.605639] z_fr_iss_7/1    S c03785b4     0  2415      2 0x00000000
[   92.612348] z_fr_iss_7/2    S c03785b4     0  2416      2 0x00000000
[   92.619057] z_fr_iss_7/3    S c03785b4     0  2417      2 0x00000000
[   92.625767] z_fr_int/0      S c03785b4     0  2418      2 0x00000000
[   92.632475] z_cl_iss/0      S c03785b4     0  2419      2 0x00000000
[   92.639173] z_cl_int/0      S c03785b4     0  2420      2 0x00000000
[   92.645883] z_ioctl_iss/0   S c03785b4     0  2421      2 0x00000000
[   92.652600] z_ioctl_int/0   S c03785b4     0  2422      2 0x00000000
[   92.659309] zfs_iput_taskq/ S c03785b4     0  2428      2 0x00000000
[   92.666019] txg_quiesce     S c03785b4     0  2429      2 0x00000000
[   92.675260] txg_sync        D c03785b4     0  2430      2 0x00000000
[   92.690261] spa_async       D c03785b4     0  2431      2 0x00000000
@ewheelerinc

Blocked D-state tasks (SysRq-W):

[  395.165639] SysRq : Show Blocked State
[  395.169415]   task                PC stack   pid father
[  395.174735] txg_sync        D c03785b4     0  2430      2 0x00000000
[  395.189919] spa_async       D c03785b4     0  2431      2 0x00000000
@ewheelerinc

More info on the problem:

It looks like someone on this thread was having the same problem; their strace matches mine:
http://osdir.com/ml/zfs-discuss/2012-12/msg00095.html

This is the tail end of the strace:

read(6, "127.0.0.1\tlocalhost rpi\n::1\t\tloc"..., 4096) = 160
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0xb6fa2000, 4096)                = 0
open("/usr/share/locale/locale.alias", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, 0x5a02

Note that the rest of the ioctl() line never printed; the connection hung at that point.

@ewheelerinc

Hmm, in zfs.h the enum value 0x5a02 is ZFS_IOC_POOL_IMPORT:

#define ZFS_IOC   ('Z' << 8)

typedef enum zfs_ioc {
    ZFS_IOC_POOL_CREATE = ZFS_IOC,  // 0x5a00
    ZFS_IOC_POOL_DESTROY, // 0x5a01
    ZFS_IOC_POOL_IMPORT, // 0x5a02

so the previous post may not be directly related; both just happened to hang during import.
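The arithmetic checks out: 'Z' is 0x5A, so the enum range starts at 0x5A00, and ZFS_IOC_POOL_IMPORT, two entries in, lands on 0x5A02. A quick check in Python:

```python
# 'Z' << 8 gives the base of the ZFS ioctl command range; successive
# enum members increment from there, so ZFS_IOC_POOL_IMPORT is base + 2.
ZFS_IOC = ord('Z') << 8             # 0x5A00
ZFS_IOC_POOL_CREATE = ZFS_IOC       # 0x5A00
ZFS_IOC_POOL_DESTROY = ZFS_IOC + 1  # 0x5A01
ZFS_IOC_POOL_IMPORT = ZFS_IOC + 2   # 0x5A02

print(hex(ZFS_IOC_POOL_IMPORT))  # 0x5a02, matching the hung ioctl in the strace
```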

@ewheelerinc

This patch fixes the problem:

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index b41eeb2..43df7cc 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -1363,7 +1363,7 @@ __zvol_create_minor(const char *name, boolean_t ignore_snapdev)
    zv->zv_volsize = volsize;
    zv->zv_objset = os;

- set_capacity(zv->zv_disk, zv->zv_volsize >> 9);
+ set_capacity(zv->zv_disk, 0);

    blk_queue_max_hw_sectors(zv->zv_queue, UINT_MAX);
    blk_queue_max_segments(zv->zv_queue, UINT16_MAX);
@@ -1397,6 +1397,12 @@ out:
    if (error == 0) {
      zvol_insert(zv);
      add_disk(zv->zv_disk);
+    /* set_capacity() must be passed 0 here; the real capacity can only
+       be set *after* add_disk(), according to these references:
+         http://osdir.com/ml/linux.enbd.general/2002-10/msg00176.html
+         http://osdir.com/ml/linux.enbd.general/2002-10/msg00177.html
+         http://stackoverflow.com/questions/13518404/add-disk-hangs-on-insmod */
+    set_capacity(zv->zv_disk, zv->zv_volsize >> 9);
    }

    return (error);
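For reference, the `>> 9` in the restored call converts the zvol size in bytes to the 512-byte sector count that set_capacity() expects. A quick sanity check in Python (the 1 GiB size is just an example):

```python
# set_capacity() takes a count of 512-byte sectors; zv_volsize is in
# bytes, so the kernel shifts right by 9 (2**9 == 512).
volsize_bytes = 1 << 30      # example: a 1 GiB zvol
sectors = volsize_bytes >> 9
print(sectors)               # 2097152 sectors of 512 bytes each
assert sectors * 512 == volsize_bytes
```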
@ryao
Member
ryao commented Jun 3, 2013

Which distribution and kernel version do you have?

@ewheelerinc

3.8.13+ latest git for bcm2708 on arm.

@ewheelerinc

distribution is debian and zfs build is from git.

I was able to create a test case which reliably failed---and after applying this fix it worked reliably.

@behlendorf
Member

@ewheelerinc Can you post the test case, or better yet the stack trace from the process while it's hung? I'd like to understand why this is required. It appears that many of the in-kernel block devices are able to safely call set_capacity() before add_disk(). That said, this could still very well be the right fix.

@ewheelerinc

@behlendorf
The pull request to address this bug describes my test case: #1491 , reposted here:

echo "=========== Succeeds ======="
zpool import

echo "=========== Succeeds ======="
zpool export zpool

echo "=========== Fails ======="
rmmod zfs zunicode zavl zcommon znvpair spl
modprobe zfs
zpool import
zpool import -f zpool # fails here
sleep 1
zpool list

I'm not sure whether this is specific to the data on the volumes I had configured, but the test passed reliably with the patch and failed reliably without it.

Without the zfs set_capacity patch, I was able to trace the kernel deadlock to this point in the 3.8.13+ kernel tree (the rpi-3.8.y branch of the Raspberry Pi kernel tree). The kernel hangs during the call to add_disk():

diff --git a/block/genhd.c b/block/genhd.c
index 7dcfdd8..638a7de 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -613,7 +613,9 @@ void add_disk(struct gendisk *disk)

    blk_register_region(disk_devt(disk), disk->minors, NULL,
            exact_match, exact_lock, disk);
+printk(KERN_CRIT "block/genhd.c: add_disk() line %d: fails next line:  problem register_disk()?\n", __LINE__);
    register_disk(disk);
+printk(KERN_CRIT "block/genhd.c: add_disk() line %d: never gets here.  problem in register_disk()\n", __LINE__);
    blk_register_queue(disk);

    /*

I didn't dig any deeper than this.

It would be nice if the kernel printed something like "warning: call set_capacity() after add_disk()" before hanging, but perhaps it has no way to detect this case.

@behlendorf
Member

@ewheelerinc With the proposed patch zconfig.sh test 3 fails for me under RHEL6.x (2.6.32). Because the capacity is set to 0 during add_disk() the code to create the partitions in register_disk() is never run. Running partprobe after the test results in the partitions being created as expected.

We're going to need to dig in to exactly why it deadlocks on your rpi kernel because I can't merge this as is. Can you get a back trace from the deadlocked process?

@ewheelerinc

Interesting, ok. I'll see if I can recreate the problem and get a better trace.

@behlendorf
Member

@ewheelerinc If you've still got your reproducer set up, can you verify the problem still exists when running with the patches in #1564 applied?

@behlendorf
Member

Bumping to the 0.6.3 tag; this change can't be merged until we get to the root cause of the deadlock.

@FransUrbo
Member

@ewheelerinc Is this still a problem (over a year later) with latest git HEAD?

@behlendorf
Member

Closing as stale.

@behlendorf behlendorf closed this Oct 16, 2014