Fix deadlocks in DMU #726

ryao · 2012-05-07T20:54:22Z

DMU operations perform allocations using KM_SLEEP, which is mapped to GFP_KERNEL in the SPL. GFP_KERNEL permits paging to occur in low memory situations. When swap operations are done in zvols, memory allocations in the DMU will often trigger additional swap operations, which deadlocks.

I observed a nonsensical behavior where swap worked with the loop device, but not zvols. Commit cfc9a5c appears to have disabled the direct reclaim path on the filesystem operations that the loop device exercises, which explains the cause of that behavior. This lead me to replace all instances of KM_SLEEP with KM_PUSHPAGE in files containing functions listed in the backtrace in the commit message. That enabled swap on zvols to work. I tested swap on a zvol with KM_SLEEP redefined to be GFP_NOIO in the SPL code, but swap operations failed to function. As such, it seems that the use of KM_PUSHPAGE is necessary to address this issue.

In theory, this change should also prevent the original deadlock, but I have not yet tested it to verify that this prevents it. I have also not verified that all allocations that swap can trigger have been converted to KM_PUSHPAGE.

I am opening this pull request so that others can provide their thoughts while I continue to work on this. This will close #342 when it is merged.

behlendorf · 2012-05-10T22:57:00Z

My initial testing of these changes look promising. I pulled the change in to a testing branch and ran it through the test suite for 12 different Linux distributions. The test suite includes xfstests and I was unable to hit the original issue on any of those 12 runs. I'll run it through a few more times for better coverage since I remember this issue not always reproducing. Also IIRC rtorrent which does a lot of mmap'ed I/O was particularly good to causing this deadlock.

ryao · 2012-05-14T16:17:49Z

I have pushed an additional patch that resolves a few theroetical deadlocks that were uncaught by my previous patch set, including one that could trigger the deadlock commit cfc9a5c was meant to address.

There is still more code that I need to review, but I am making steady progress. I hope to have another update in a few days.

behlendorf · 2012-05-14T17:37:32Z

Great. Yes, the trick as I'm sure your aware here will be find all the potential call point where KM_PUSHPAGE needs to be used.

ryao · 2012-05-17T04:54:35Z

@behlendorf The last patch I pushed addressed all of the page allocation failures that were being output to dmesg during my stress tests as well as the consumption of pages from ZONE_DMA. The only known issue at this point are the following messages in dmesg, which are fairly harmless:

[ 239.786572] hrtimer: interrupt took 11339 ns
[ 1027.551953] CE: hpet increased min_delta_ns to 20113 nsec
[ 8825.579004] CE: hpet increased min_delta_ns to 30169 nsec

I have also finished reviewing code. It should be safe to merge these. They should close issue #342, issue #540, issue #746 and possibly a few that I missed.

behlendorf · 2012-05-18T16:56:29Z

Excellent, I'll review the patch today and run it through my test suites. If everything looks good I'll get it merged. Thanks for working on this!

Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The z_wr_* threads * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726

Commit eec8164 worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking thing to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726

The commit, cfc9a5c, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726

This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0cc eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726

behlendorf · 2012-05-21T18:48:27Z

@ryao I've taken your work and massaged it a little bit on the following branch. Basically just squashing a few of the patches together and reworking the commit messages. So far it's been testing well for me in the VM and if it holds up for another day or so on some real test systems I'll merge it in to master. I just wanted to make sure your OK with having this branch merged as is:

https://github.com/zfsonlinux/zfs/commits/swap

ryao · 2012-05-21T18:50:29Z

@behlendorf I am fine with that.

pyavdr · 2012-05-22T06:31:02Z

I tried to use swap on zfs with the latest commits/swap branch. Stressing suse 12.1 kernel 3.1.9 (15GB RAM, 50 GB SWAP on zvol) with pythons gives a deadlock as soon as the system starts to use the zvol/swap.

behlendorf · 2012-05-22T16:40:40Z

Can you provide a stack?

ryao · 2012-05-22T21:44:24Z

@pyavdr Did you use primarycache=all on your zvol? I use primarycache=metadata secondarycache=none on my system's swap zvol.

ryao · 2012-05-22T22:15:20Z

@behlendorf I had a deadlock last night. A 5GB memory leak occurred in knotify4 while I was recompiling my desktop. My desktop has 8GB of RAM with 8GB of swap. Nearly all RAM was in use and 5 to 6GB of swap were in use. I observed periods where X would freeze, the system would appear to do nothing and then disk activity would occur, restoring X. These periods increased in length as the system became more overloaded, reaching approximately 15 seconds, at which point, the system appeared to deadlock.

When I was reviewing code, I spotted the potential for direct reclaim to cause deadlocks in the tx_commit threads. At the time, I was unable to prove that deadlocks were either possible or impossible. Since swap did not appear to deadlock under my load tests, I left it alone.

Having thought about it in the context of last night's deadlock, I believe that the potential for deadlocks that I spotted in the tx_commit threads is the cause. Direct reclaim in a tx_commit thread will cause swap operations, which will be dispatched to other tx_commit threads. In addition, direct reclaim should not permit the original tx_commit thread to continue until the other threads have finished. Those threads can also enter direct reclaim and the tx_commit threads form a dependency tree. During that time, Linux will try to free memory and if enough is purged, then subsequent transaction group commits by other tx_commit threads will suceed, pruning the tree. If we should be so unlucky that all threads enter direct reclaim, we deadlock. The ability for us to form a dependency tree between the tx_commit threads explains why the system would hang and then suddenly become responsive upon IO activity. The ability for increasingly elaborate dependency trees could be formed also explains why the duration of the hangs become increasingly worse as my system's swap approached capacity.

I will push a patch for this issue either today or tomorrow.

pyavdr · 2012-05-23T04:39:36Z

@behlendorf
3 weeks ago i tried to catch a stack trace, as ryao provided the first patches for zfs on zvol, but never succeeded, even compiling ZFS source with DEBUG. If there is an easy "how to" to catch stacktraces when a system deadlocks and we need to hard reset, where nothing is written into the system log files, i will follow that.

pyavdr · 2012-05-23T04:49:53Z

@ryao
First time i used primarycache=all, now i set primarycache=metadata and secondarycache=none, but there is no change, fast deadlock when running 10 times python -c "print 2**10**10" , needed a hard reset.

behlendorf · 2012-05-23T16:10:08Z

@ryao I think your observations are correct, and they pretty well sum up my major fear about these patches. While I agree that we need to stop using PF_MEMALLOC and shift to KM_PUSHPAGE we absolutely need to make sure we get all instances of memory allocations under these four critical contexts before these changes are merged.

The z_wr_* threads
The txg_sync thread
The zvol write/discard threads
The zpl_putpage() VFS callback

I was thinking about how to ensure we get all the allocations in these contexts, and how to make sure we don't accidentally introduce new ones

The best way I can think of is to take the current places in the code where PF_MEMALLOC is set/cleared, and change it to a harmless no-op PF_* flag. Then we can add a patch to the spl kmem/vmem_alloc()'s to check for the flags and issue a warning to the console when the flag is set and KM_PUSHPAGE is not used. For debug builds we could make this fatal to ensure we halt the thread and can get a back trace. We'd of course need to be very careful about which PF_ bit was selected, there are some unused areas but we really can't safely count on that going forward without an autoconf check.

This should allow us to pretty easily catch all the allocations. We would no longer need to trigger the unlikely deadlock, or find it through code inspection, we can just run the code and see what gets flags. This should certainly find all the common places in the major code paths. We just wouldn't get coverage for the more unlikely error paths.

ryao · 2012-05-30T22:20:45Z

@behlendorf I like your idea to use a no-op PF_* flag so that we can track these issues. My time to work on ZFS things is currently very thin, but I will try to do this when I get a chance.

With that said, I have an additional patch that will switch to KM_PUSHPAGE in some low level areas that I had previously overlooked because I could not find clear evidence that KM_SLEEP would cause problems. I will push it to this pull request soon.

ryao · 2012-08-24T19:31:05Z

@behlendorf has made dramatic improvements upon my previous work in pull request #883. This pull request is now obsolete.

Commit eec8164 worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking thing to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

The commit, cfc9a5c, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0cc eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

Commit eec8164 worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking thing to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

The commit, cfc9a5c, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0cc eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#726

This was referenced May 16, 2012

z_wr_int can be interrupted when CONFIG_PREEMPT_NONE=y is set #746

Closed

ZIO writer threads steal pages from ZONE_DMA #747

Closed

Revert ce90208cf9e04df966429f115d8831371ea9afce #710

Closed

This was referenced May 19, 2012

Task hang 100% cpu with lots of memory #695

Closed

Exhaustion of ZONE_DMA openzfs/spl#116

Closed

ryao closed this Aug 24, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlocks in DMU #726

Fix deadlocks in DMU #726

ryao commented May 7, 2012

behlendorf commented May 10, 2012

ryao commented May 14, 2012

behlendorf commented May 14, 2012

ryao commented May 17, 2012

behlendorf commented May 18, 2012

behlendorf commented May 21, 2012

ryao commented May 21, 2012

pyavdr commented May 22, 2012

behlendorf commented May 22, 2012

ryao commented May 22, 2012

ryao commented May 22, 2012

pyavdr commented May 23, 2012

pyavdr commented May 23, 2012

behlendorf commented May 23, 2012

ryao commented May 30, 2012

ryao commented Aug 24, 2012

Fix deadlocks in DMU #726

Fix deadlocks in DMU #726

Conversation

ryao commented May 7, 2012

behlendorf commented May 10, 2012

ryao commented May 14, 2012

behlendorf commented May 14, 2012

ryao commented May 17, 2012

behlendorf commented May 18, 2012

behlendorf commented May 21, 2012

ryao commented May 21, 2012

pyavdr commented May 22, 2012

behlendorf commented May 22, 2012

ryao commented May 22, 2012

ryao commented May 22, 2012

pyavdr commented May 23, 2012

pyavdr commented May 23, 2012

behlendorf commented May 23, 2012

ryao commented May 30, 2012

ryao commented Aug 24, 2012