Fix mmap / libaio deadlock #7339

Merged: 1 commit into openzfs:master on Mar 28, 2018

Conversation

@behlendorf (Contributor, Author) commented Mar 24, 2018

Description

Calling uiomove() in mappedread() can result in deadlock if the
user space page needs to be faulted in.

The issue is that uiomove() must be called with the page lock held
in order to safely populate the page data. If the page needs to be
faulted in by filemap_page_mkwrite(), it will also take the page
lock, resulting in a double-lock.

Normally this isn't an issue since the pages are very likely to be
already faulted in. This patch makes sure that is always the case
by prefaulting in the user space pages.
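
Illustratively, the problematic call chain looks like this (a simplified sketch of the analysis above, not actual code):

mappedread()
  lock_page(pp)                         <- page lock taken
  uiomove()                             <- copy to the user buffer
    page fault (user page not resident)
      filemap_page_mkwrite()
        lock_page(pp)                   <- already held: deadlock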

Motivation and Context

Issue #7335

How Has This Been Tested?

Added the test case provided by @ColinIanKing. Without the patch it deadlocks immediately, as described in #7335; with the patch applied it passes 100 iterations of the test.

The test case has been added to the ZTS to prevent any future regression.
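
For reference, a minimal sketch of this style of reproducer in C (an assumption pieced together from the #7335 description and the discussion below, not the exact ZTS test; the file path is hypothetical; build with -laio):

#include <fcntl.h>
#include <libaio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ 4096

int
main(void)
{
	/* File on the ZFS dataset under test (hypothetical path). */
	int fd = open("/tank/aio-mmap-file", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, SZ) != 0)
		return (1);

	/* Writable shared mapping of the same file. */
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return (1);

	io_context_t ctx = 0;
	if (io_setup(1, &ctx) != 0)
		return (1);

	/*
	 * The aio write sources its data from the file's own mapped page,
	 * so copying the user buffer inside the write path must fault that
	 * page in: the condition that triggered the hang.
	 */
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, SZ, 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return (1);

	struct io_event ev;
	io_getevents(ctx, 1, 1, &ev, NULL);	/* hangs on affected kernels */

	io_destroy(ctx);
	munmap(buf, SZ);
	close(fd);
	return (0);
}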

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@codecov bot commented Mar 26, 2018

Codecov Report

Merging #7339 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7339      +/-   ##
==========================================
+ Coverage   76.33%   76.36%   +0.02%     
==========================================
  Files         329      329              
  Lines      104191   104191              
==========================================
+ Hits        79537    79564      +27     
+ Misses      24654    24627      -27
Flag      Coverage Δ
#kernel   76.24% <100%> (-0.08%) ⬇️
#user     65.68% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5152a74...e265265. Read the comment docs.

@ColinIanKing (Contributor)

How do we guarantee the page is not evicted after the pre-fault, leaving us in the same page-lock condition?

@ColinIanKing (Contributor)

Just to add: this fix works fine, but I'm concerned about the unlikely event that the page is evicted after the pre-fault, leaving us in the same page-lock condition.

@behlendorf (Contributor, Author) commented Mar 26, 2018

How do we guarantee the page is not evicted after the pre-fault, leaving us in the same page-lock condition?

@ColinIanKing good question. Technically this is possible, but we're a bit more protected than it initially appears. In this case, where we're faulting in a shared page, the newly faulted page will immediately be dirtied in fault_dirty_shared_page(). Part of the reason for this is to avoid exactly the kind of race you're describing, as noted in the comment above do_wp_page():

 * We also mark the page dirty at this point even though the page will
 * change only once the write actually happens. This avoids a few races,
 * and potentially makes it more efficient.

This means that for the page to be evicted, at a minimum PG_Writeback needs to be cleared as part of the normal process of writing out the page. It would then need to migrate to the head of the inactive list before it becomes eligible for eviction. And all of this would need to happen before mappedread() can take the needed page locks and copy the pages.

We could probably close this race entirely by registering our own custom mkwrite handler, but I wanted to keep the fix as minimal as possible for inclusion in the release branch.

	cnt -= incr;

	if (rw == UIO_READ)
		error = -fault_in_pages_writeable(p, cnt);
@tuxoko (Contributor) commented on the hunk above:

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?
Also, fault_in_pages_writeable seems to fault in at most 2 pages.

@tuxoko (Contributor) commented Mar 26, 2018

@behlendorf
So I think the correct way would be to unlock the page, or to not bother with the lock at all. Do you know why we lock it in the first place?

@behlendorf (Contributor, Author) commented Mar 26, 2018

Do you know why we lock it in the first place?

@tuxoko the contents of the page are supposed to be updated under the page lock to prevent page tearing when racing with, for example, writepage(). Obviously that only works at page boundaries, but really that's all we can guarantee for mapped pages anyway; the rest is up to the application. We probably could relax that here, but I'd be worried about potential side effects in poorly written applications, so I opted to be conservative with the suggested change.

Probably the right way to handle this long term is to register our own zpl_page_mkwrite callback. Then we could directly handle populating the page when it's being faulted in; this is what xfs and ext do. But again, that's a much bigger change, and I was hoping to keep this fix small and safe.
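
For context, a hedged sketch of what such a callback registration could look like (the handler name, its body, and the ops struct here are illustrative assumptions, not the actual ZFS code; signatures match roughly 4.11-era kernels):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Illustrative only: populate and dirty the page under its lock. */
static int zpl_page_mkwrite_sketch(struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	/* Fill the page contents here, while the page lock is held. */
	set_page_dirty(page);

	/* VM_FAULT_LOCKED tells the fault path the page is returned locked. */
	return VM_FAULT_LOCKED;
}

static const struct vm_operations_struct zpl_file_vm_ops_sketch = {
	.fault		= filemap_fault,
	.page_mkwrite	= zpl_page_mkwrite_sketch,
};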

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?

How do you mean? This does only cover one specific case and I agree it would be nice to update the test case to cover a few more variants of potential mmap / libaio races.

Also, fault_in_pages_writeable seems to fault in at most 2 pages.

Are you sure? It looks to me like it'll fault in the entire range passed.

@tuxoko (Contributor) commented Mar 26, 2018

the contents of the page are supposed to be updated under the page lock to prevent page tearing when racing with, for example, writepage().

Yes, but here we are reading the content of the page, not filling or modifying it. There's also an ASSERT(PageUptodate(pp)); I'm not sure the assert is correct, but if PageUptodate is a given, I don't think there's any point in locking the page. Even if PageUptodate is false, what we need to do is fill the page under the lock; the copy operation should always be outside the lock.

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?

What I meant is that the destination and source of the uiomove() are the same page, so if you prefault by writing zeros to the destination, wouldn't that modify the source as well?

Also, fault_in_pages_writeable seems to fault in at most 2 pages.

It seems it depends on the kernel version: https://elixir.bootlin.com/linux/v4.0.9/source/include/linux/pagemap.h#L542
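
For reference, the v4.0.9 version linked above only writes to the first and last byte of the range, so it faults in at most two pages (lightly abridged from that link):

static inline int fault_in_pages_writeable(char __user *uaddr, int size)
{
	int ret;

	if (unlikely(size == 0))
		return 0;

	/* Writing a zero byte faults the first page in. */
	ret = __put_user(0, uaddr);
	if (ret == 0) {
		char __user *end = uaddr + size - 1;

		/* Touch the last page only if it differs from the first. */
		if (((unsigned long)uaddr & PAGE_MASK) !=
				((unsigned long)end & PAGE_MASK))
			ret = __put_user(0, end);
	}
	return ret;
}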

@behlendorf (Contributor, Author)

Yes, but here we are reading the content of the page, not filling or modifying it.

Yes, good point. Additionally, we do hold the inode's range lock when accessing the page via zfs_read() or zfs_write(), so that does seem sufficient. I've updated the patch to only take a hold on the page, not the lock. Time permitting, I'll see about updating the test case to stress this area a bit more.

What I meant is that the destination and source of the uiomove() are the same page, so if you prefault by writing zeros to the destination, wouldn't that modify the source as well?

Ahh, I see what you're getting at. I've dropped the prefault code as cleanup, since it was non-functional anyway and the updated version wasn't going to be 100% reliable.

It seems it depends on which version.

That's unfortunate, but it appears we can do without.

@tuxoko (Contributor) commented Mar 26, 2018

@behlendorf
Please update the commit message; otherwise the code looks fine.
Also, can you clarify what you mean by the old prefault code being non-functional?

@tuxoko (Contributor) commented Mar 27, 2018

@behlendorf
It seems removing the prefault might have caused another deadlock.

[<ffffffffc06704a1>] cv_wait_common+0x101/0x140 [spl]
[<ffffffffc06704f5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffc0812035>] dmu_buf_hold_array_by_dnode+0x1b5/0x470 [zfs]
[<ffffffffc081241b>] dmu_read_impl+0xab/0x170 [zfs]
[<ffffffffc0812538>] dmu_read+0x58/0x90 [zfs]
[<ffffffffc08c76bc>] zfs_getpage+0x10c/0x1b0 [zfs]
[<ffffffffc08e46a8>] zpl_readpage+0x58/0xc0 [zfs]
[<ffffffff92bbeb1f>] filemap_fault+0x34f/0x650
[<ffffffff92bf6294>] __do_fault+0x24/0xd0
[<ffffffff92bfa5ef>] __handle_mm_fault+0xb7f/0x1080
[<ffffffff92bfabbc>] handle_mm_fault+0xcc/0x1c0
[<ffffffff92a706d8>] __do_page_fault+0x258/0x4f0
[<ffffffff92a70992>] do_page_fault+0x22/0x30
[<ffffffff934022bc>] page_fault+0x2c/0x60
[<ffffffff9330ea9c>] copy_user_generic_string+0x2c/0x40
[<ffffffffc0686e89>] uiomove+0x179/0x2a0 [zcommon]
[<ffffffffc0813736>] dmu_write_uio_dnode+0x86/0x140 [zfs]
[<ffffffffc081383e>] dmu_write_uio_dbuf+0x4e/0x70 [zfs]
[<ffffffffc08c72c6>] zfs_write+0xbe6/0xcf0 [zfs]
[<ffffffffc08e43dc>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[<ffffffffc08e4fc8>] zpl_iter_write+0xa8/0xf0 [zfs]
[<ffffffff92cac908>] aio_write+0x118/0x170
[<ffffffff92cad0f0>] do_io_submit+0x390/0x6c0
[<ffffffff92cadca0>] SyS_io_submit+0x10/0x20
[<ffffffff934001a1>] entry_SYSCALL_64_fastpath+0x24/0xab
[<ffffffffffffffff>] 0xffffffffffffffff

@behlendorf (Contributor, Author)

By non-functional I meant that it wasn't actually needed, although clearly that isn't the case given that stack trace. I've updated the commit message and made the minimal change to mappedread(), leaving the existing prefault implementation as-is.

@trisk (Contributor) left a review:

Pretty straightforward since we don't actually need to lock the page for uiomove.

It might be worth creating a separate issue for the (remote) possibility of a page being evicted after prefault, which would exhibit the same symptoms as removing the prefault code (for a partial write?).

@tuxoko (Contributor) commented Mar 27, 2018

@behlendorf
Actually, I think PageUptodate can't be guaranteed if we never lock the page.
Let's do a lock cycle here; it should ensure PageUptodate, provided there's no I/O error.

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 72224a7..14caa80 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -397,8 +397,11 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
 	for (start &= PAGE_MASK; len > 0; start += PAGE_SIZE) {
 		bytes = MIN(PAGE_SIZE - off, len);
 
-		pp = find_get_page(mp, start >> PAGE_SHIFT);
+		pp = find_lock_page(mp, start >> PAGE_SHIFT);
 		if (pp) {
+			ASSERT(PageUptodate(pp));
+			unlock_page(pp);
+
 			pb = kmap(pp);
 			error = uiomove(pb + off, bytes, UIO_READ, uio);
 			kunmap(pp);

@behlendorf (Contributor, Author)

Refreshed.

Calling uiomove() in mappedread() can result in deadlock if the
user space page needs to be faulted in.

Resolve the issue by only taking a reference on the page when
copying it and not the page lock.  The inode range lock protects
against concurrent updates via zfs_read() and zfs_write().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#7335
behlendorf merged commit b2ab468 into openzfs:master on Mar 28, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 16, 2018
Calling uiomove() in mappedread() under the page lock can result
in a deadlock if the user space page needs to be faulted in.

Resolve the issue by dropping the page lock before the uiomove().
The inode range lock protects against concurrent updates via
zfs_read() and zfs_write().

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#7335
Closes openzfs#7339
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 4, 2018 (same commit message as above)
tonyhutter pushed a commit that referenced this pull request May 10, 2018 (same commit message as above)
behlendorf deleted the mmap branch on April 19, 2021