async I/O on a zfs file system causes system hang #7335

ColinIanKing · 2018-03-22T20:43:55Z

System information

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	Bionic, Artful, Xenial
Linux Kernel	4.15.0-13-generic, 4.13.0-37-generic, 4.4.0-116-generic
Architecture	x86-64
ZFS Version	0.7.5-1ubuntu6, 0.6.5.11-1ubuntu3, 0.6.5.6-0ubuntu16
SPL Version	0.7.5-1ubuntu1, 0.6.5.11-1ubuntu1, 0.6.5.6-0ubuntu4

Running the reproducer program shown below will cause a soft lockup. This occurs after the async I/O read occurs. strace on the program shows it hangs on the read io_submit. If the read/write buffer being passed is NOT mmap'd then the hang does not happen.

io_setup(1024, [0x7f0a46d58000]) = 0
openat(AT_FDCWD, "testfile", O_RDWR|O_CREAT, 04130) = 3
ftruncate(3, 512) = 0
brk(NULL) = 0x55645fa55000
brk(0x55645fa76000) = 0x55645fa76000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f0a46d7a000
io_submit(0x7f0a46d58000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., aio_nbytes=512, aio_offset=0}]) = 1
io_getevents(0x7f0a46d58000, 0, 1, [{data=0, obj=0x7ffda2080710, res=512, res2=0}], {tv_sec=30, tv_nsec=0}) = 1
io_submit(0x7f0a46d58000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7f0a46d7a000, aio_nbytes=512, aio_offset=0}]

Mar 22 17:06:39 ubuntu kernel: [  605.143429] INFO: task 5.p:4208 blocked for more than 120 seconds.
Mar 22 17:06:39 ubuntu kernel: [  605.145306]       Tainted: P           O     4.15.0-13-generic #14-Ubuntu
Mar 22 17:06:39 ubuntu kernel: [  605.147111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 17:06:39 ubuntu kernel: [  605.149019] 5.p             D    0  4208   4206 0x00000003
Mar 22 17:06:39 ubuntu kernel: [  605.149025] Call Trace:
Mar 22 17:06:39 ubuntu kernel: [  605.149095]  __schedule+0x297/0x8b0
Mar 22 17:06:39 ubuntu kernel: [  605.149112]  schedule+0x2c/0x80
Mar 22 17:06:39 ubuntu kernel: [  605.149137]  io_schedule+0x16/0x40
Mar 22 17:06:39 ubuntu kernel: [  605.149153]  __lock_page+0xff/0x140
Mar 22 17:06:39 ubuntu kernel: [  605.149159]  ? page_cache_tree_insert+0xe0/0xe0
Mar 22 17:06:39 ubuntu kernel: [  605.149163]  filemap_page_mkwrite+0xb3/0xc0
Mar 22 17:06:39 ubuntu kernel: [  605.149171]  do_page_mkwrite+0x35/0x90
Mar 22 17:06:39 ubuntu kernel: [  605.149187]  do_wp_page+0x22b/0x600
Mar 22 17:06:39 ubuntu kernel: [  605.149193]  ? mutex_lock+0x12/0x40
Mar 22 17:06:39 ubuntu kernel: [  605.149197]  handle_pte_fault+0x69d/0xdb0
Mar 22 17:06:39 ubuntu kernel: [  605.149201]  ? mutex_lock+0x12/0x40
Mar 22 17:06:39 ubuntu kernel: [  605.149210]  __handle_mm_fault+0x47b/0x5c0
Mar 22 17:06:39 ubuntu kernel: [  605.149216]  handle_mm_fault+0xb1/0x1f0
Mar 22 17:06:39 ubuntu kernel: [  605.149226]  __do_page_fault+0x250/0x4d0
Mar 22 17:06:39 ubuntu kernel: [  605.149230]  do_page_fault+0x2e/0xe0
Mar 22 17:06:39 ubuntu kernel: [  605.149240]  do_async_page_fault+0x51/0x80
Mar 22 17:06:39 ubuntu kernel: [  605.149245]  async_page_fault+0x25/0x50
Mar 22 17:06:39 ubuntu kernel: [  605.149252] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
Mar 22 17:06:39 ubuntu kernel: [  605.149254] RSP: 0018:ffffb1e28848fb80 EFLAGS: 00010206
Mar 22 17:06:39 ubuntu kernel: [  605.149257] RAX: 00007f2e2648c200 RBX: 0000000000000200 RCX: 0000000000000200
Mar 22 17:06:39 ubuntu kernel: [  605.149260] RDX: 0000000000000200 RSI: ffff88ffdf518000 RDI: 00007f2e2648c000
Mar 22 17:06:39 ubuntu kernel: [  605.149261] RBP: ffffb1e28848fb88 R08: ffffb1e288490000 R09: ffffdd05047d4600
Mar 22 17:06:39 ubuntu kernel: [  605.149263] R10: 0000000000000040 R11: ffffb1e28848fb68 R12: ffffb1e28848fc60
Mar 22 17:06:39 ubuntu kernel: [  605.149265] R13: 0000000000000000 R14: ffff88ffdf518000 R15: ffffb1e28848fd70
Mar 22 17:06:39 ubuntu kernel: [  605.149279]  ? _copy_to_user+0x26/0x30
Mar 22 17:06:39 ubuntu kernel: [  605.149291]  uiomove+0x116/0x260 [zcommon]
Mar 22 17:06:39 ubuntu kernel: [  605.149388]  zfs_read+0x1dd/0x3f0 [zfs]
Mar 22 17:06:39 ubuntu kernel: [  605.149465]  zpl_read_common_iovec+0x80/0xd0 [zfs]
Mar 22 17:06:39 ubuntu kernel: [  605.149537]  zpl_iter_read+0x9c/0xe0 [zfs]
Mar 22 17:06:39 ubuntu kernel: [  605.149562]  aio_read+0xee/0x150
Mar 22 17:06:39 ubuntu kernel: [  605.149567]  ? _cond_resched+0x19/0x40
Mar 22 17:06:39 ubuntu kernel: [  605.149572]  ? ptrace_stop+0x1a6/0x260
Mar 22 17:06:39 ubuntu kernel: [  605.149576]  ? _cond_resched+0x19/0x40
Mar 22 17:06:39 ubuntu kernel: [  605.149579]  do_io_submit+0x46a/0x670
Mar 22 17:06:39 ubuntu kernel: [  605.149583]  ? do_io_submit+0x46a/0x670
Mar 22 17:06:39 ubuntu kernel: [  605.149587]  ? ptrace_notify+0x5b/0x90
Mar 22 17:06:39 ubuntu kernel: [  605.149592]  SyS_io_submit+0x10/0x20
Mar 22 17:06:39 ubuntu kernel: [  605.149595]  ? SyS_io_submit+0x10/0x20
Mar 22 17:06:39 ubuntu kernel: [  605.149606]  do_syscall_64+0x73/0x130
Mar 22 17:06:39 ubuntu kernel: [  605.149612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Describe how to reproduce the problem

Compile the C source listed below and run:

gcc test.c -laio
run on a ZFS file system:

./a.out

..and it will hang and get soft lockup timeouts after 120 seconds on Ubuntu.

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <libaio.h>
#include <fcntl.h>
#include <err.h>

io_context_t io_ctx;

void do_sync_io(struct iocb *iocb)
{
        struct io_event event;
        struct iocb *iocbs[] = { iocb };
        struct timespec ts = { 30, 0 };

        if (io_submit(io_ctx, 1, iocbs) != 1)
                err(1, "io_submit failed");
        if (io_getevents(io_ctx, 0, 1, &event, &ts) != 1)
                err(1, "io_getevents failed");
}

int main(void)
{
        char *buf;
        int rwfd, status = 0, res, page_size = getpagesize();
        struct iocb iocb;

        if (io_queue_init(1024, &io_ctx))
                err(1, "io_queue_init failed");
        rwfd = open("testfile", O_RDWR | O_CREAT);
        if (rwfd < 0)
                err(1, "open failed");
        if (ftruncate(rwfd, 512) < 0)
                err(1, "ftruncate failed");
        buf = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, rwfd, 0);
        if (buf == MAP_FAILED)
                err(1, "mmap failed");

        (void)io_prep_pwrite(&iocb, rwfd, buf, 512, 0);
        do_sync_io(&iocb);

        (void)io_prep_pread(&iocb, rwfd, buf, 512, 0);
        do_sync_io(&iocb);
}

The text was updated successfully, but these errors were encountered:

ColinIanKing · 2018-03-23T11:29:55Z

I suspect this happens because the page is being locked in mappedread and again in filemap_page_mkwrite. Since the page is already locked, the locking of the page in filemap_page_mkwrite causes the task to block indefinitely.

ColinIanKing · 2018-03-23T12:41:16Z

Touching the mmap'd buffer (e.g. *buf = 0;) before the read on the final do_sync_io() call causes the page to be faulted and then we don't get the hang

ColinIanKing · 2018-03-23T12:42:47Z

issue occurs with latest (today's) zfs + spl on 4.15

ColinIanKing · 2018-03-23T13:41:17Z

This helps, but is clearly sub-optimal.

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index a2d7d7b24..ad4ce7eb6 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -391,6 +391,11 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
        int len = nbytes;
        int error = 0;
        void *pb;
+       void *p;
+
+       p = kmalloc(PAGE_SIZE, GFP_KERNEL);
+       if (!p)
+               return ENOMEM;
 
        start = uio->uio_loffset;
        off = start & (PAGE_SIZE-1);
@@ -402,7 +407,7 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
                        ASSERT(PageUptodate(pp));
 
                        pb = kmap(pp);
-                       error = uiomove(pb + off, bytes, UIO_READ, uio);
+                       memcpy(p, pb + off, bytes);
                        kunmap(pp);
 
                        if (mapping_writably_mapped(mp))
@@ -411,6 +416,8 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
                        mark_page_accessed(pp);
                        unlock_page(pp);
                        put_page(pp);
+
+                       error = uiomove(p, bytes, UIO_READ, uio);
                } else {
                        error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl),
                            uio, bytes);
@@ -421,6 +428,8 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
                if (error)
                        break;
        }
+
+       kfree(p);
        return (error);
 }

ColinIanKing · 2018-03-23T13:47:28Z

Clearly it's a double-lock on the page causing the problem, if I unlock it before the uiomove we don't get the lockup hang:

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index a2d7d7b24..3baa04e7a 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -402,7 +402,9 @@ mappedread(struct inode *ip, int nbytes, uio_t *uio)
                        ASSERT(PageUptodate(pp));
 
                        pb = kmap(pp);
-                       error = uiomove(pb + off, bytes, UIO_READ, uio);
+                       unlock_page(pp);
+                       uiomove(pb + off, bytes, UIO_READ, uio);
+                       lock_page(pp);
                        kunmap(pp);
 
                        if (mapping_writably_mapped(mp))

Any ideas of a suitable fix?

Calling uiomove() in mappedread() can result in deadlock if the user space page needs to be faulted in. This issue is that uiomove() must be called with the page lock held in order to safely populate the page date. If the page needs to be faulted in by filemap_page_mkwrite() then it will also take the page lock resulting in a double-lock. Normally this isn't an issue since the pages are very likely to be already faulted in. This patch makes sure that is always the case by prefaulting in the user space pages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#7335

behlendorf · 2018-03-24T01:25:27Z

@ColinIanKing thanks for getting to the heart of the matter for putting together such a nice test case! I've opened #7339 with a proposed fix for your review and verification. What I'm suggesting is that we simply always ensure the page is prefaulted.

ColinIanKing · 2018-03-24T10:55:10Z

Thanks for the speedy turnaround on this fix. I'll test it out on Monday with our own Canonical file system tests in the mix to give it a thorough test.

Calling uiomove() in mappedread() can result in deadlock if the user space page needs to be faulted in. This issue is that uiomove() must be called with the page lock held in order to safely populate the page date. If the page needs to be faulted in by filemap_page_mkwrite() then it will also take the page lock resulting in a double-lock. Normally this isn't an issue since the pages are very likely to be already faulted in. This patch makes sure that is always the case by prefaulting in the user space pages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#7335

Calling uiomove() in mappedread() can result in deadlock if the user space page needs to be faulted in. Resolve the issue by only taking a reference on the page when copying it and not the page lock. The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#7335

Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#7335 Closes openzfs#7339

Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7335 Closes #7339

behlendorf mentioned this issue Mar 24, 2018

Fix mmap / libaio deadlock #7339

Merged

13 tasks

behlendorf added this to TODO in 0.7.9 Mar 27, 2018

behlendorf closed this as completed in b2ab468 Mar 28, 2018

tonyhutter moved this from TODO to In progress in 0.7.9 Apr 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

async I/O on a zfs file system causes system hang #7335

async I/O on a zfs file system causes system hang #7335

ColinIanKing commented Mar 22, 2018

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018 •

edited

Loading

ColinIanKing commented Mar 23, 2018 •

edited

Loading

behlendorf commented Mar 24, 2018

ColinIanKing commented Mar 24, 2018

async I/O on a zfs file system causes system hang #7335

async I/O on a zfs file system causes system hang #7335

Comments

ColinIanKing commented Mar 22, 2018

System information

Describe how to reproduce the problem

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018

ColinIanKing commented Mar 23, 2018 • edited Loading

ColinIanKing commented Mar 23, 2018 • edited Loading

behlendorf commented Mar 24, 2018

ColinIanKing commented Mar 24, 2018

ColinIanKing commented Mar 23, 2018 •

edited

Loading

ColinIanKing commented Mar 23, 2018 •

edited

Loading