Fix mmap / libaio deadlock #7339

Merged: 1 commit into openzfs:master on Mar 28, 2018

Conversation

@behlendorf (Contributor, Author) commented Mar 24, 2018

Description

Calling uiomove() in mappedread() can result in deadlock if the
user space page needs to be faulted in.

The issue is that uiomove() must be called with the page lock held
in order to safely populate the page data. If the page needs to be
faulted in by filemap_page_mkwrite(), it will also take the page
lock, resulting in a double-lock.

Normally this isn't an issue since the pages are very likely to be
already faulted in. This patch makes sure that is always the case
by prefaulting in the user space pages.
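
Illustratively, the problematic call chain looks like this (a simplified sketch of the analysis above, not actual code):

mappedread()
  lock_page(pp)                         <- page lock taken
  uiomove()                             <- copy to the user buffer
    page fault (user page not resident)
      filemap_page_mkwrite()
        lock_page(pp)                   <- already held: deadlock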

Motivation and Context

Issue #7335

How Has This Been Tested?

Added the test case provided by @ColinIanKing. Without the patch it deadlocks immediately, as described in #7335; with the patch applied it passes 100 iterations of the test.

The test case has been added to the ZTS to prevent any future regression.
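
For reference, a minimal sketch of this style of reproducer in C (an assumption pieced together from the #7335 description and the discussion below, not the exact ZTS test; the file path is hypothetical; build with -laio):

#include <fcntl.h>
#include <libaio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ 4096

int
main(void)
{
	/* File on the ZFS dataset under test (hypothetical path). */
	int fd = open("/tank/aio-mmap-file", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, SZ) != 0)
		return (1);

	/* Writable shared mapping of the same file. */
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return (1);

	io_context_t ctx = 0;
	if (io_setup(1, &ctx) != 0)
		return (1);

	/*
	 * The aio write sources its data from the file's own mapped page,
	 * so copying the user buffer inside the write path must fault that
	 * page in: the condition that triggered the hang.
	 */
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, SZ, 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return (1);

	struct io_event ev;
	io_getevents(ctx, 1, 1, &ev, NULL);	/* hangs on affected kernels */

	io_destroy(ctx);
	munmap(buf, SZ);
	close(fd);
	return (0);
}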

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@codecov bot commented Mar 26, 2018

Codecov Report

Merging #7339 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7339      +/-   ##
==========================================
+ Coverage   76.33%   76.36%   +0.02%     
==========================================
  Files         329      329              
  Lines      104191   104191              
==========================================
+ Hits        79537    79564      +27     
+ Misses      24654    24627      -27
Flag      Coverage Δ
#kernel   76.24% <100%> (-0.08%) ⬇️
#user     65.68% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5152a74...e265265. Read the comment docs.

@ColinIanKing (Contributor)

How do we guarantee the page is not evicted after the pre-fault, leaving us in the same page-lock condition?

@ColinIanKing (Contributor)

Just to add: this fix works fine, but I'm concerned about the unlikely event that the page is evicted after the pre-fault, leaving us in the same page-lock condition.

@behlendorf (Contributor, Author) commented Mar 26, 2018

How do we guarantee the page is not evicted after the pre-fault, leaving us in the same page-lock condition?

@ColinIanKing good question. Technically this is possible, but we're a bit more protected than it initially appears. In this case, where we're faulting in a shared page, the newly faulted page will immediately be dirtied in fault_dirty_shared_page(). Part of the reason for this is to avoid exactly the kind of race you're describing, as noted in the comment above do_wp_page():

 * We also mark the page dirty at this point even though the page will
 * change only once the write actually happens. This avoids a few races,
 * and potentially makes it more efficient.

This means that for the page to be evicted, at a minimum PG_Writeback needs to be cleared as part of the normal process of writing out the page. It would then need to migrate to the head of the inactive list before it becomes eligible for eviction. And all of this would need to happen before mappedread() can take the needed page locks and copy the pages.

We could probably close this race entirely by registering our own custom mkwrite handler, but I wanted to keep the fix as minimal as possible for inclusion in the release branch.

	cnt -= incr;

	if (rw == UIO_READ)
		error = -fault_in_pages_writeable(p, cnt);
@tuxoko (Contributor) commented on the hunk above:

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?
Also, fault_in_pages_writeable seems to fault in at most 2 pages.

@tuxoko (Contributor) commented Mar 26, 2018

@behlendorf
So I think the correct way would be to unlock the page, or to not bother with the lock at all. Do you know why we lock it in the first place?

@behlendorf (Contributor, Author) commented Mar 26, 2018

Do you know why we lock it in the first place?

@tuxoko the contents of the page are supposed to be updated under the page lock to prevent page tearing when racing with, for example, writepage(). Obviously that only works at page boundaries, but really that's all we can guarantee for mapped pages anyway; the rest is up to the application. We probably could relax that here, but I'd be worried about potential side effects in poorly written applications, so I opted to be conservative with the suggested change.

Probably the right way to handle this long term is to register our own zpl_page_mkwrite callback. Then we could directly handle populating the page when it's being faulted in; this is what xfs and ext do. But again, that's a much bigger change, and I was hoping to keep this fix small and safe.
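
For context, a hedged sketch of what such a callback registration could look like (the handler name, its body, and the ops struct here are illustrative assumptions, not the actual ZFS code; signatures match roughly 4.11-era kernels):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Illustrative only: populate and dirty the page under its lock. */
static int zpl_page_mkwrite_sketch(struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	/* Fill the page contents here, while the page lock is held. */
	set_page_dirty(page);

	/* VM_FAULT_LOCKED tells the fault path the page is returned locked. */
	return VM_FAULT_LOCKED;
}

static const struct vm_operations_struct zpl_file_vm_ops_sketch = {
	.fault		= filemap_fault,
	.page_mkwrite	= zpl_page_mkwrite_sketch,
};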

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?

How do you mean? This does only cover one specific case and I agree it would be nice to update the test case to cover a few more variants of potential mmap / libaio races.

Also, fault_in_pages_writeable seems to fault in at most 2 pages.

Are you sure? It looks to me like it'll fault in the entire range passed.

@tuxoko (Contributor) commented Mar 26, 2018

the contents of the page are supposed to be updated under the page lock to prevent page tearing when racing with, for example, writepage().

Yes, but here we are reading the content of the page, not filling or modifying it. There's also an ASSERT(PageUptodate(pp)); I'm not sure the assert is correct, but if PageUptodate is a given, I don't think there's any point in locking the page. Even if PageUptodate is false, what we need to do is fill the page under the lock; the copy operation should always be outside the lock.

In the provided reproducer, the destination page is the same as the source page. Wouldn't doing this modify the result?

What I meant is that the destination and source of the uiomove() are the same page, so if you prefault by writing zeros to the destination, wouldn't that modify the source as well?

Also, fault_in_pages_writeable seems to fault in at most 2 pages.

It seems it depends on the kernel version: https://elixir.bootlin.com/linux/v4.0.9/source/include/linux/pagemap.h#L542
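
For reference, the v4.0.9 version linked above only writes to the first and last byte of the range, so it faults in at most two pages (lightly abridged from that link):

static inline int fault_in_pages_writeable(char __user *uaddr, int size)
{
	int ret;

	if (unlikely(size == 0))
		return 0;

	/* Writing a zero byte faults the first page in. */
	ret = __put_user(0, uaddr);
	if (ret == 0) {
		char __user *end = uaddr + size - 1;

		/* Touch the last page only if it differs from the first. */
		if (((unsigned long)uaddr & PAGE_MASK) !=
				((unsigned long)end & PAGE_MASK))
			ret = __put_user(0, end);
	}
	return ret;
}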

@behlendorf (Contributor, Author)

Yes, but here we are reading the content of the page, not filling or modifying it.

Yes, good point. Additionally, we do hold the inode's range lock when accessing the page via zfs_read() or zfs_write(), so that does seem sufficient. I've updated the patch to only take a hold on the page, not the lock. Time permitting, I'll see about updating the test case to stress this area a bit more.

What I meant is that the destination and source of the uiomove() are the same page, so if you prefault by writing zeros to the destination, wouldn't that modify the source as well?

Ahh, I see what you're getting at. I've dropped the prefault code as cleanup, since it was non-functional anyway and the updated version wasn't going to be 100% reliable.

It seems it depends on which version.

That's unfortunate, but it appears we can do without.

@tuxoko (Contributor) commented Mar 26, 2018

@behlendorf
Please update the commit message; otherwise the code looks fine.
Also, can you clarify what you mean by the old prefault code being non-functional?

@tuxoko (Contributor) commented Mar 27, 2018

@behlendorf
It seems removing the prefault might have caused another deadlock.

[<ffffffffc06704a1>] cv_wait_common+0x101/0x140 [spl]
[<ffffffffc06704f5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffc0812035>] dmu_buf_hold_array_by_dnode+0x1b5/0x470 [zfs]
[<ffffffffc081241b>] dmu_read_impl+0xab/0x170 [zfs]
[<ffffffffc0812538>] dmu_read+0x58/0x90 [zfs]
[<ffffffffc08c76bc>] zfs_getpage+0x10c/0x1b0 [zfs]
[<ffffffffc08e46a8>] zpl_readpage+0x58/0xc0 [zfs]
[<ffffffff92bbeb1f>] filemap_fault+0x34f/0x650
[<ffffffff92bf6294>] __do_fault+0x24/0xd0
[<ffffffff92bfa5ef>] __handle_mm_fault+0xb7f/0x1080
[<ffffffff92bfabbc>] handle_mm_fault+0xcc/0x1c0
[<ffffffff92a706d8>] __do_page_fault+0x258/0x4f0
[<ffffffff92a70992>] do_page_fault+0x22/0x30
[<ffffffff934022bc>] page_fault+0x2c/0x60
[<ffffffff9330ea9c>] copy_user_generic_string+0x2c/0x40
[<ffffffffc0686e89>] uiomove+0x179/0x2a0 [zcommon]
[<ffffffffc0813736>] dmu_write_uio_dnode+0x86/0x140 [zfs]
[<ffffffffc081383e>] dmu_write_uio_dbuf+0x4e/0x70 [zfs]
[<ffffffffc08c72c6>] zfs_write+0xbe6/0xcf0 [zfs]
[<ffffffffc08e43dc>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[<ffffffffc08e4fc8>] zpl_iter_write+0xa8/0xf0 [zfs]
[<ffffffff92cac908>] aio_write+0x118/0x170
[<ffffffff92cad0f0>] do_io_submit+0x390/0x6c0
[<ffffffff92cadca0>] SyS_io_submit+0x10/0x20
[<ffffffff934001a1>] entry_SYSCALL_64_fastpath+0x24/0xab
[<ffffffffffffffff>] 0xffffffffffffffff

@behlendorf (Contributor, Author)

By non-functional I meant that it wasn't actually needed, although clearly that isn't the case given that stack trace. I've updated the commit message and made the minimal change to mappedread(), leaving the existing prefault implementation as-is.

@trisk (Contributor) left a review:

Pretty straightforward since we don't actually need to lock the page for uiomove.

It might be worth creating a separate issue for the (remote) possibility of a page being evicted after prefault, which would exhibit the same symptoms as removing the prefault code (for a partial write?).

@tuxoko (Contributor) commented Mar 27, 2018

@behlendorf
Actually, I think PageUptodate can't be guaranteed if we never lock the page.
Let's do a lock cycle here; it should ensure PageUptodate, provided there's no I/O error.

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 72224a7..14caa80 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -397,8 +397,11 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
 	for (start &= PAGE_MASK; len > 0; start += PAGE_SIZE) {
 		bytes = MIN(PAGE_SIZE - off, len);
 
-		pp = find_get_page(mp, start >> PAGE_SHIFT);
+		pp = find_lock_page(mp, start >> PAGE_SHIFT);
 		if (pp) {
+			ASSERT(PageUptodate(pp));
+			unlock_page(pp);
+
 			pb = kmap(pp);
 			error = uiomove(pb + off, bytes, UIO_READ, uio);
 			kunmap(pp);

@behlendorf (Contributor, Author)

Refreshed.

Calling uiomove() in mappedread() can result in deadlock if the
user space page needs to be faulted in.

Resolve the issue by only taking a reference on the page when
copying it and not the page lock.  The inode range lock protects
against concurrent updates via zfs_read() and zfs_write().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#7335
behlendorf merged commit b2ab468 into openzfs:master on Mar 28, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 16, 2018
Calling uiomove() in mappedread() under the page lock can result
in a deadlock if the user space page needs to be faulted in.

Resolve the issue by dropping the page lock before the uiomove().
The inode range lock protects against concurrent updates via
zfs_read() and zfs_write().

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#7335
Closes openzfs#7339
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 4, 2018 (same commit message as above)
tonyhutter pushed a commit that referenced this pull request May 10, 2018 (same commit message as above)
behlendorf deleted the mmap branch on April 19, 2021