
Pool frozen (bad hdd), running processes stuck, zpool not working anymore #3233

Closed · basic6 opened this issue Mar 28, 2015 · 24 comments

basic6 commented Mar 28, 2015

Earlier today, processes using the storage got stuck, and zpool commands now hang as well, except zpool status, which shows no errors (all zeros).
The trigger might be a bad hdd (smart error) plus a second, almost-as-bad hdd (first bad sector found). It's a raidz3 pool, so two failing drives should not be a problem.
However, since ZFS isn't reporting any errors, it is not clear which hdd triggered this (or whether it was unrelated to the hdds at all).

Trying to zpool offline the bad drive (the one with the smart error) got stuck, but zpool status in another terminal shows that it has been offlined. Trying to replace it with the new drive also got stuck, and nothing is happening (the status output still shows the old drive, now offline).

iostat -dmx 1 did not show any drive at a constant 100% utilization.

Since no error message can be found (ZFS is not saying what's going on), "echo w > /proc/sysrq-trigger" was issued. Here is the relevant part of what was logged (the stack traces): https://gist.github.com/basic6/a1b6c6a27d81a173966e

It looks like ZFS is stuck so badly that a hard reboot will be necessary (losing some work, because the running processes would have to be killed).
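
For reference, this is roughly the sequence that hung (a sketch; the pool name "datapool" and the device names are placeholders, not the real ones on this system):

# Offline the failing disk and swap in the spare (placeholder names):
zpool offline datapool ata-OLD_BAD_DRIVE
zpool replace datapool ata-OLD_BAD_DRIVE ata-NEW_DRIVE

# Both commands hung, so dump the kernel stacks of all blocked tasks:
echo w > /proc/sysrq-trigger
dmesg | tail -n 300    # the blocked-task traces end up in the kernel log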

basic6 (Author) commented Mar 28, 2015

The host is Debian 8 with kernel 3.16.0-4-amd64, with ZoL installed from the repository.

SPL: Loaded module v0.6.3-296_g353be12
ZFS: Loaded module v0.6.3-766_gfde0d6d, ZFS pool version 5000, ZFS filesystem version 5

kernelOfTruth (Contributor) commented:

Mea culpa, I myself posted on the wrong issue.

basic6 (Author) commented Mar 28, 2015

@kernelOfTruth why wrong issue?

kernelOfTruth (Contributor) commented:

@basic6 I wanted to post in #3232; I confused your issue (with its one comment) with mine,

wanted to write that you had answered on the wrong issue,

but then realized that it was actually me who got it wrong 😏

Your issue could be related to #3215 or #3175,

so taking the advice from those issues:

if you want a fast answer, take a look at the mailing list or the IRC channels;

otherwise it could take some hours (or, in the worst case, days) here.

basic6 (Author) commented Mar 28, 2015

Oh, I see, no problem. Issue #3215 mentions corrupted data, but zpool status does not show that error in my case; according to the status output, everything is perfectly fine, zero errors. In #3175, the user gets a "pool is busy" error, which I don't get.
(I have to agree with dpickford, I've also had many Seagate drives die recently, one after the other.)

I have already asked on IRC; DeHackEd did not know what's going on (so it can't be an easy one).
I'll wait a bit before I reboot...

basic6 (Author) commented Mar 28, 2015

After rebooting, all zfs/zpool commands get stuck (but there is a lot of hdd activity):

[ 600.164080] INFO: task zpool:2516 blocked for more than 120 seconds.
[ 600.164084] Tainted: P O 3.16.0-4-amd64 #1
[ 600.164085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.164086] zpool D ffff88079444f7c8 0 2516 1 0x00000000
[ 600.164090] ffff88079444f370 0000000000000086 0000000000012f00 ffff8807674d7fd8
[ 600.164093] 0000000000012f00 ffff88079444f370 ffffffffa0bbd140 ffff8807674d7db0
[ 600.164095] ffffffffa0bbd144 ffff88079444f370 00000000ffffffff ffffffffa0bbd148
[ 600.164097] Call Trace:
[ 600.164109] [] ? schedule_preempt_disabled+0x25/0x70
[ 600.164111] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[ 600.164117] [] ? kmem_alloc_debug+0x19e/0x3c0 [spl]
[ 600.164120] [] ? mutex_lock+0x1b/0x2a
[ 600.164136] [] ? spa_open_common+0x54/0x3e0 [zfs]
[ 600.164148] [] ? spa_get_stats+0x38/0x2b0 [zfs]
[ 600.164159] [] ? zfs_ioc_pool_stats+0x24/0x60 [zfs]
[ 600.164168] [] ? zfsdev_ioctl+0x464/0x4e0 [zfs]
[ 600.164172] [] ? dput+0x1f/0x170
[ 600.164174] [] ? do_vfs_ioctl+0x2cf/0x4b0
[ 600.164178] [] ? task_work_run+0x9c/0xd0
[ 600.164180] [] ? SyS_ioctl+0x81/0xa0
[ 600.164183] [] ? page_fault+0x28/0x30
[ 600.164186] [] ? system_call_fast_compare_end+0x10/0x15

Also, the ZFS volumes cannot be mounted, although they do show up as directories under /datapool. However, running df on one of them shows that it's on the root disk, not on ZFS.

I recently updated a few packages on the system, maybe this is related. There are now updates available for the zfs/zfs-dkms/spl packages, so I'll update the system and try again.
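
A quick way to confirm whether such a directory is really a mounted dataset, without touching the (hanging) zfs tools, is to read the mount table directly; a sketch:

# List any ZFS mounts the kernel actually knows about:
grep zfs /proc/mounts
# df on the supposed mountpoint; on this system it reports the root disk instead of a ZFS dataset:
df /datapool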

basic6 (Author) commented Mar 28, 2015

System update did not fix it.

After rebooting, iostat shows a lot of activity for all the drives in the pool, except the ones offlined (the bad drive and the drive with the bad sector):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    5.00     0.00     0.03    11.20     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    1.00    5.00     0.02     0.03    14.67     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00  731.00    0.00    18.47     0.00    51.74     0.25    0.34    0.34    0.00   0.21  15.60
sdd               0.00     0.00  705.00    0.00    18.56     0.00    53.92     0.69    0.98    0.98    0.00   0.39  27.20
sde               0.00     0.00  709.00    0.00    19.02     0.00    54.94     0.26    0.36    0.36    0.00   0.20  14.00
sdf               0.00     0.00  686.00    0.00    19.07     0.00    56.94     0.50    0.73    0.73    0.00   0.33  22.80
sdi               0.00     0.00  695.00    0.00    18.49     0.00    54.49     0.27    0.39    0.39    0.00   0.25  17.20
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00  659.00    0.00    18.53     0.00    57.58     0.78    1.20    1.20    0.00   0.57  37.60
sdl               0.00     0.00  657.00    0.00    18.84     0.00    58.72     0.50    0.77    0.77    0.00   0.35  22.80

Not sure what ZFS is doing; nothing is logged in dmesg, so ZFS isn't saying what's going on. Given this hdd activity, it does seem to be doing something...

zpool status gets stuck, just like all other zfs/zpool commands.

basic6 (Author) commented Mar 28, 2015

Looks like it's working again!

After one of several reboots, iostat showed 100% util for sdk (which is a pool member) and 0% for all the other drives. I have physically removed it and rebooted. At first, zpool status took a very long time (iostat showed changing util values, as shown in the last comment). After what felt like 15 minutes, zpool status returned. No read/write/checksum errors.
Access to the ZFS volumes is working again. The first missing drive (the one with the smart error) is already being replaced.

So, first of all: the zpool is alive again. I still don't really know what happened, but it's working.
Given that iostat showed 100% for sdk at one point, it might look like this drive was the root cause of all of this (i.e. responsible for freezing the pool in the first place). However, it is not the drive with the smart error, nor the drive with a bad block (those will be replaced, of course).
In fact, it is a WD RED drive with no smart errors logged, and its smart values are good (mostly 0). It's hard to believe that this drive would cause any error at all. I'll run badblocks on it.

basic6 (Author) commented Mar 31, 2015

My pool is frozen again!

So I highly doubt that the WD RED drive I removed (which looks healthy) had anything to do with it.
Update: Even badblocks confirmed this drive to be healthy, so it's back in the pool again.

The first bad hdd (smart error) has already been replaced; the second bad drive is still connected but marked as "FAULTED". I'll remove it (and maybe put the WD RED drive back in).

Kernel: 3.16.0-4-amd64

ii  debian-zfs                            7~jessie                            amd64        Native ZFS filesystem metapackage for Debian.
ii  libzfs2                               0.6.3-1.3-1~jessie                  amd64        Native ZFS filesystem library for Linux
ii  zfs-dkms                              0.6.3-1.3-1~jessie                  all          Native ZFS filesystem kernel modules for Linux
ii  zfsonlinux                            4                                   all          archive.zfsonlinux.org trust package
ii  zfsutils                              0.6.3-1.3-1~jessie                  amd64        command-line tools to manage ZFS filesystems

basic6 (Author) commented Mar 31, 2015

# cat /proc/spl/kstat/zfs/arcstats 
5 1 0x01 85 4080 11757122744 289290776918431
name                            type data
hits                            4    12651226
misses                          4    3147960
demand_data_hits                4    3476632
demand_data_misses              4    355053
demand_metadata_hits            4    4573316
demand_metadata_misses          4    996418
prefetch_data_hits              4    436339
prefetch_data_misses            4    930194
prefetch_metadata_hits          4    4164939
prefetch_metadata_misses        4    866295
mru_hits                        4    3918224
mru_ghost_hits                  4    336347
mfu_hits                        4    5900958
mfu_ghost_hits                  4    277469
deleted                         4    1535952
recycle_miss                    4    15274
mutex_miss                      4    83
evict_skip                      4    1171689
evict_l2_cached                 4    0
evict_l2_eligible               4    111581419008
evict_l2_ineligible             4    149577219072
hash_elements                   4    1111700
hash_elements_max               4    1169005
hash_collisions                 4    2986053
hash_chains                     4    327493
hash_chain_max                  4    13
p                               4    2071007744
c                               4    14857951232
c_min                           4    4194304
c_max                           4    16864739328
size                            4    14857934320
hdr_size                        4    402479336
data_size                       4    3315299328
meta_size                       4    9549903360
other_size                      4    1590252296
anon_size                       4    827431424
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    2495046144
mru_evict_data                  4    512
mru_evict_metadata              4    0
mru_ghost_size                  4    12362868736
mru_ghost_evict_data            4    10437326336
mru_ghost_evict_metadata        4    1925542400
mfu_size                        4    9542725120
mfu_evict_data                  4    2560
mfu_evict_metadata              4    8979083264
mfu_ghost_size                  4    2495077376
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    2495077376
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    67554
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    54
arc_meta_used                   4    11542634992
arc_meta_limit                  4    12648554496
arc_meta_max                    4    12649190888

basic6 (Author) commented Mar 31, 2015

Stack traces after "echo w > /proc/sysrq-trigger":
https://gist.githubusercontent.com/basic6/6272ce5397045f1f58c7/raw/gistfile1.txt
sdh is the other bad drive that was still connected, but unused ("FAULTED"); it's disconnected now.

DeHackEd mentioned a zfs rename operation which appears to be in progress. The only thing that might run zfs rename (afaik) is the snapshot rotation cronjob. I might disable it temporarily and maybe delete a few snapshots.
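
For context, a snapshot-rotation job of this kind usually boils down to something like the sketch below (dataset and snapshot names are only illustrative); the relevant point is that it issues zfs rename and zfs snapshot back to back:

# Rotate the nightly snapshots: drop the oldest, shift the rest, take a new one.
zfs destroy  -r z-main/user@nightly.2
zfs rename   -r z-main/user@nightly.1 z-main/user@nightly.2
zfs rename   -r z-main/user@nightly.0 z-main/user@nightly.1
zfs snapshot -r z-main/user@nightly.0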

Also, using the daily build was suggested (with a warning that it might introduce other issues). This is the enabled line in /etc/apt/sources.list.d/zfsonlinux.list:
deb [arch=amd64] http://archive.zfsonlinux.org/debian jessie main

basic6 (Author) commented Mar 31, 2015

iostat shows no hdd activity:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

basic6 (Author) commented Mar 31, 2015

Again, ZFS commands (like zfs list, and after a reboot probably zpool status as well) are just locking up without throwing any error messages, leaving me to guess what might or might not be the issue.

The pool is not running out of space either; there's more than 5 TB of space available.

basic6 (Author) commented Mar 31, 2015

On a side note: after switching from debian to debian-daily (in the zfsonlinux repo file), the installation got stuck at this point:
Installing new version of config file /etc/init.d/zfs-mount ...
And "df /datapool" already returns /dev/sda1 (like last time, after the reboot).

The hanging installation was resolved by killing (SIGKILL) the child process "/bin/systemctl start zfs.target".
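
For the record, a sketch of how that stuck child can be found and killed (the unit name is taken from the process list above):

# Locate the stuck unit start and send it SIGKILL:
pgrep -af 'systemctl start zfs.target'
pkill -9 -f 'systemctl start zfs.target'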

After installing from debian-daily, this version is now installed:

# dpkg -l | grep -i zfs
ii  debian-zfs                            7~jessie                            amd64        Native ZFS filesystem metapackage for Debian.
ii  libzfs2                               0.6.3-40-0f7d2a                     amd64        Native ZFS filesystem library for Linux
ii  libzpool2                             0.6.3-40-0f7d2a                     amd64        Native ZFS pool library for Linux
ii  zfs-dkms                              0.6.3-40-0f7d2a                     all          Native ZFS filesystem kernel modules for Linux
ii  zfsonlinux                            5                                   all          archive.zfsonlinux.org trust package
ii  zfsutils                              0.6.3-40-0f7d2a                     amd64        command-line tools to manage ZFS filesystems

This is what almost any zfs command looks like when it gets stuck:

# zfs list &
[2] 1064
# cat /proc/1064/stack
[<ffffffffa0798285>] cv_wait_common+0x115/0x1b0 [spl]
[<ffffffff810a7820>] autoremove_wake_function+0x0/0x30
[<ffffffffa0ac20eb>] rrw_enter_read+0x3b/0x140 [zfs]
[<ffffffffa0ab4865>] dsl_pool_hold+0x45/0x50 [zfs]
[<ffffffffa0a968d2>] dmu_objset_hold+0x22/0xb0 [zfs]
[<ffffffffa0afcd2e>] zfs_ioc_objset_stats+0x1e/0x50 [zfs]
[<ffffffffa0affb14>] zfsdev_ioctl+0x464/0x4e0 [zfs]
[<ffffffff811b9ecf>] do_vfs_ioctl+0x2cf/0x4b0
[<ffffffff811ba131>] SyS_ioctl+0x81/0xa0
[<ffffffff81511fa8>] page_fault+0x28/0x30
[<ffffffff8150ff6d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
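
(The same check works for any hanging command without noting the PID by hand; a small sketch using the shell's $! variable:)

# Background the hanging command, give it a moment, then read its kernel stack:
zfs list &
sleep 2
cat /proc/$!/stack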

Running cat on a file on the storage does actually print its contents, but then gets stuck as well, and the stack looks similar (also waiting in cv_wait_common):

[<ffffffffa0798285>] cv_wait_common+0x115/0x1b0 [spl]
[<ffffffff810a7820>] autoremove_wake_function+0x0/0x30
[<ffffffffa0a9e93b>] dmu_tx_wait+0x7b/0x2d0 [zfs]
[<ffffffffa0a9ec1c>] dmu_tx_assign+0x8c/0x4b0 [zfs]
[<ffffffffa0b0c801>] zfs_dirty_inode+0x91/0x270 [zfs]
[<ffffffff8117a185>] free_pages_and_swap_cache+0x95/0xb0
[<ffffffff8116493c>] tlb_flush_mmu_free+0x2c/0x50
[<ffffffff8116537f>] tlb_finish_mmu+0xf/0x40
[<ffffffff8116ca11>] unmap_region+0xb1/0xf0
[<ffffffff811ce562>] __mark_inode_dirty+0x172/0x270
[<ffffffffa0b263ad>] zpl_release+0x2d/0x60 [zfs]
[<ffffffff811a95aa>] __fput+0xca/0x1d0
[<ffffffff81085027>] task_work_run+0x97/0xd0
[<ffffffff81012ea9>] do_notify_resume+0x69/0xa0
[<ffffffff8151022a>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

To be clear, I would have replaced the other bad drive (marked as "FAULTED") by now if ZFS wasn't frozen.

basic6 (Author) commented Apr 1, 2015

Memory usage (although that's probably not it, because otherwise it would have worked right after the reboot last time this happened):

# free -m
             total       used       free     shared    buffers     cached
Mem:         32166      31761        405         80          2       2414
-/+ buffers/cache:      29345       2821
Swap:        71525          0      71525

# cat /proc/spl/kstat/zfs/arcstats | grep -E '(max|arc)'
hash_elements_max               4    1169005
hash_chain_max                  4    13
c_max                           4    16864739328
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    54
arc_meta_used                   4    10477741952
arc_meta_limit                  4    12648554496
arc_meta_max                    4    12649190888
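
(A quick way to read just those two numbers in GiB, as a sketch:)

# Compare ARC metadata usage against its limit:
awk '$1 == "arc_meta_used" || $1 == "arc_meta_limit" {printf "%s: %.1f GiB\n", $1, $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats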

basic6 (Author) commented Apr 1, 2015

As discussed with ryao, here's a paste of all running process stacks: http://pastee.co/pMIM8t
<+ryao> basic6__: This isn't a stuck IO. You have a deadlock.
<+ryao> txg_sync is waiting for zvol_state_lock while /sbin/zfs snapshot -r z-main/user nightly.0 has it and is waiting on dp->dp_config_rwlock.
<+ryao> basic6__: And I am going to guess that whoever has dp->dp_config_rwlock is doing something to complete the circle.
<+ryao> basic6__: It is one of the things participating in the race.
<+ryao> basic6__: zfs rename has it.
<+ryao> basic6__: It looks like there might be some ? missing from the stack frames here...
<+ryao> basic6__: It looks like you have a bonafide deadlock between the sync thread, zfs rename and zfs snapshot.
<+ryao> basic6__: Okay. zfs_ioc_rename() should have the pool rrwlock for writing.
<+ryao> basic6__: It waits on the transaction group sync. Then the snapshot thread has the zvol state lock and waits on the pool rwlock. Then the syncthread wants the zvol state lock.
<+ryao> basic6__: And everything stops.
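
Summarizing the cycle ryao describes above (a schematic only, not literal code):

# Deadlock cycle, paraphrasing the IRC analysis:
#   zfs rename    holds the pool rrwlock (write)  and waits on the txg sync
#   txg_sync      (the sync thread)               waits on zvol_state_lock
#   zfs snapshot  holds zvol_state_lock           and waits on the pool rrwlock
# Each task waits on a lock held by the next one, so none of them can make progress.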

basic6 (Author) commented Apr 2, 2015

After rebooting (as suggested, to try the daily build), the system won't even boot anymore: apparently it now tries to access the pool before I get a shell, but the pool isn't accessible due to this deadlock (probably until I remove another healthy drive, like last time).
"A start job is running for import zfs pools" ("no limit")
However, the activity lights of all (online) drives in the pool are flashing, so something is happening.

FransUrbo (Contributor) commented:

Try booting with init=/bin/bash and then disable the ZFS init scripts: chmod -x /etc/init.d/zfs*.

Also make sure there's no ZFS stuff in your initrd: lsinitramfs /boot/initrd.img-$(uname -r) | grep zfs. That would at least let you boot the machine later without ZFS being started…

If you're really paranoid, you can also remove the modules (or at least move them out of the way so they don't get loaded automatically): mv /lib/modules/$(uname -r)/extra/{spl,zfs} /root/.
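
Put together as one rough sequence (a sketch, assuming a shell obtained via init=/bin/bash; the remount step is an assumption, since the root filesystem is usually still read-only at that point):

mount -o remount,rw /                                  # assumed necessary: make the root fs writable
chmod -x /etc/init.d/zfs*                              # keep the init scripts from starting ZFS
lsinitramfs /boot/initrd.img-$(uname -r) | grep zfs    # verify the initrd doesn't pull ZFS in either
mv /lib/modules/$(uname -r)/extra/{spl,zfs} /root/     # optional: move the modules out of the way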

basic6 (Author) commented Apr 2, 2015

Correction: I was too impatient. After probably 5-10 minutes, the system did boot and is now running. ZFS is working again at the moment (using v0.6.3-40-0f7d2a, according to dmesg).
Based on what ryao said about the rare deadlock, I have disabled my snapshot rotation cronjob. I am currently replacing the other bad drive with a new one.

FransUrbo (Contributor) commented:

Good fortune then!

basic6 (Author) commented Apr 2, 2015

Thanks. I'll try a few things, put some load on the system, keep the snapshot rotation cronjob disabled for a few days and report back if something happens.

FransUrbo (Contributor) commented:

Might be time to do a full backup to other hardware… Just in case...

basic6 (Author) commented Apr 4, 2015

Good idea, the most important volumes already have an up-to-date backup, but I should run a full backup for the other volumes.

The system is running normally at this point, except that I still have the snapshot rotation cronjob disabled, which apparently somehow triggered the deadlock.

behlendorf (Contributor) commented:

@basic6 the hang you observed was caused by #3082. A fix for this, 417104b, was committed a few weeks ago and is in the newly released 0.6.4 tag.

behlendorf added this to the 0.6.4 milestone on Apr 9, 2015