pwrite returns EFAULT (bad address) for zfs raid under high load #8640

Closed
peter-roesch opened this issue Apr 18, 2019 · 11 comments

@peter-roesch peter-roesch commented Apr 18, 2019

System information

Type                  Version/Name
Distribution Name     CentOS Linux
Distribution Version  7.6.1810
Linux Kernel          3.10.0-957.1.3.el7.x86_64
Architecture          x86_64
ZFS Version           0.7.13-1
SPL Version           0.7.13-1

Describe the problem you're observing

We are running our BeeGFS file system, version 7.1.2, on top of ZFS.
Under high load, our storage server receives a 'Bad address' (EFAULT) error from pwrite.
The access pattern is many 128k writes from multiple threads within one process.
The ZFS raid has been set up as a raidz2:
zpool create storage01 raidz2 sda sdn sdo sdp sdq sdr sds sdt sdu sdv
but we were also able to reproduce the issue with a simpler setup:
zpool create storage01 sda sdn sdo sdp sdq sdr sds sdt sdu sdv
If we use just a single disk with zfs, the issue does not occur.

Also, if we use other file systems (e.g. xfs, ext4), the issue does not occur;
it is only reproducible with zfs.

We tested this with v0.7.13, v0.7.12, v0.7.11 and v0.6.5.11.

Describe how to reproduce the problem

Set up BeeGFS with an underlying ZFS raid with many disks.

Run iozone on top of that:
~/iozone3_487/src/current/iozone -+m ~/iozone_nodesfile -t 50 -r 64k -i 0 -s 1G

After 3-4 runs, this produces the error from the pwrite call.

From the BeeGFS server's point of view: a buffer is allocated at the start of the program. Right before the pwrite, the buffer is filled with data from the network interface; the server parses the header correctly and then hands a pointer into the buffer to pwrite. From our point of view, there is no reason to believe the pointer actually points to an invalid address.
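
To make the access pattern concrete, here is a minimal, hypothetical sketch of that write path (simplified, not actual BeeGFS code; the buffer sizes, socket handling and helper names are made up for illustration):

/*
 * Hypothetical, simplified sketch of the write path described above
 * (not actual BeeGFS code): a long-lived buffer is filled from a
 * socket, and a pointer into that buffer is handed to pwrite().
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE   (4 * 1024 * 1024)
#define HDR_SIZE   64
#define CHUNK_SIZE (128 * 1024)      /* matches the 128k writes seen here */

static char *recv_buf;               /* allocated once at program start */

static int handle_write_request(int sockfd, int datafd, off_t file_offset)
{
    /* Fill the buffer with header + payload from the network. */
    ssize_t got = recv(sockfd, recv_buf, HDR_SIZE + CHUNK_SIZE, MSG_WAITALL);
    if (got != HDR_SIZE + CHUNK_SIZE)
        return -1;

    /* The header parses fine; the payload pointer is inside the same buffer. */
    char *payload = recv_buf + HDR_SIZE;

    /* Under high load on a multi-disk zpool, this call returned EFAULT. */
    ssize_t written = pwrite(datafd, payload, CHUNK_SIZE, file_offset);
    if (written < 0) {
        perror("pwrite");            /* prints "Bad address" */
        return -1;
    }
    return 0;
}

int main(void)
{
    recv_buf = malloc(BUF_SIZE);
    if (recv_buf == NULL)
        return 1;
    (void)handle_write_request;      /* request loop omitted in this sketch */
    return 0;
}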

Include any warning/errors/backtraces from the system logs

There is nothing in the syslog related to the issue.

Any advice or hints on what we could do to isolate the issue would be highly appreciated.

For example:

  • are there any other configuration options that we could try?
  • what other kinds of setups make sense for testing? (e.g. should we test with 0.8.x?)
  • should we try to isolate the test (meaning decouple it from BeeGFS, which is some work), or would an strace log be enough?

@peter-roesch peter-roesch commented Apr 23, 2019

In the meantime, we have established a workaround in BeeGFS for this issue:
exactly the same pwrite call is executed once again, and the retry then succeeds without any issues.
In our opinion, this is a strong hint that there is an issue (e.g. a race condition) in ZFS. Otherwise, we would expect the same call to result in the same error (Bad address); it seems very unlikely that a genuinely bad address would heal itself just because the function is called a second time ;-)
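
For illustration, here is a minimal sketch of such a retry workaround (hypothetical, not the actual BeeGFS change; the helper name and the /dev/null usage example are made up):

/*
 * Hypothetical sketch of the workaround described above: retry the
 * identical pwrite() once when the first attempt fails with EFAULT.
 */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t pwrite_with_efault_retry(int fd, const void *buf,
                                        size_t count, off_t offset)
{
    ssize_t ret = pwrite(fd, buf, count, offset);
    if (ret < 0 && errno == EFAULT) {
        /* Same arguments, second attempt: in practice this succeeded. */
        ret = pwrite(fd, buf, count, offset);
    }
    return ret;
}

int main(void)
{
    /* Trivial usage example against /dev/null. */
    char buf[4096] = { 0 };
    int fd = open("/dev/null", O_WRONLY);

    if (fd < 0)
        return 1;
    return pwrite_with_efault_retry(fd, buf, sizeof (buf), 0) < 0;
}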

Another observation is that the issue occurs only if ZFS is set up with more than one disk; it does not occur with a single disk.

@behlendorf behlendorf commented Apr 23, 2019

are there any other configuration options that we could try?

Could you please try the 0.7.12 release, or try reverting this specific commit, 98bb45e, which was first included in the 0.7.13 release. My best guess is that for some reason the pages received from the network aren't yet available to be prefaulted (but soon will be).

what other kind of setups make sense for testing ? (e.g. should we test with 0.8.x ?)

If you're able to test with 0.8.0-rc4 that would be great. Though if this issue is caused by 98bb45e as suspected, the error should still be present.

should we try to isolate the test (means decouple it from BeeGFS, which is some work ?), or would strace log be enough ?

Getting a kernel stack trace from the offending pwrite call should be enough to confirm we're on the right track. Being able to reproduce this without BeeGFS would be helpful, but I suspect the critical part is filling the buffer with data from the network interface right before the pwrite.

Let's see if we can get @wgqimut's thoughts on this since he's familiar with this area of the code.

@peter-roesch peter-roesch commented Apr 24, 2019

We tested this with v0.7.13, v0.7.12, v0.7.11 and v0.6.5.11, and the error occurred with all of these versions.
Providing a kernel stack trace should be easy, so we will try that.
Another idea was to protect the memory region with an mlock call and to test if this changes the behavior. We will let you know if this approach fixes the issue as well.

@peter-roesch peter-roesch commented Apr 26, 2019

A couple more notes: we have tested with 0.8.0-rc4 as well and see the same effect, and we tried the mlock approach, but that didn't make any difference.

So now we want to follow up on generating a kernel stack trace.
Can you please give us some advice / hints on how to do that, mainly where we should set the breakpoint? (For that test, we would go back to the most recent stable release, 0.7.13.)

@jthiltges jthiltges commented Apr 29, 2019

I stumbled across this issue as well, with a very similar environment: CentOS 7, BeeGFS v7.1.2 and ZFS 0.7.13. We've been able to replicate it with fio (directly on ZFS volumes, no BeeGFS) as well.

Out of six servers with an identical software stack, the issue appears on two 8-core servers, but does not appear on the other four 4-core servers.

Type                  Version/Name
Distribution Name     CentOS Linux
Distribution Version  7.6.1810
Linux Kernel          3.10.0-957.10.1.el7.x86_64
Architecture          x86_64
ZFS Version           0.7.13-1
SPL Version           0.7.13-1
BeeGFS Version        7.1.2

The servers are configured with 6 pools, each with 8 drives in a raidz2 configuration. No cache or log devices are currently configured.

We first noticed the issue when running the BeeGFS storage benchmark:

The following example starts a write benchmark on all targets of all BeeGFS storage servers with an IO blocksize of 512 KB, using 6 threads (i.e. simulated client streams) per target, each of which will write 200 GB of data to its own file.

beegfs-ctl --storagebench --alltargets --write --blocksize=512K --size=200G --threads=6

After a few minutes, the following error appeared in beegfs-storage.log:

(1) Apr29 13:47:07 StorageBenchSlave [Storage Benchmark (run)] >> Benchmark started...
(0) Apr29 13:52:09 Worker8-1 [Storage Benchmark (run)] >> Error: I/O failure. SysErr: Bad address

The following fio script approximates the BeeGFS benchmark and causes the error to show up within a minute or two on our test system. Enabling threading with thread=1 seems to be significant, as well as running on an 8-core server (rather than 4-core).

[global]
rw=write
bs=128k
size=200g
numjobs=6
fallocate=none
ioengine=sync
thread=1

[job1]
filename_format=/data/storage01/benchmark/fio.$jobnum
[job2]
filename_format=/data/storage02/benchmark/fio.$jobnum
[job3]
filename_format=/data/storage03/benchmark/fio.$jobnum
[job4]
filename_format=/data/storage04/benchmark/fio.$jobnum
[job5]
filename_format=/data/storage05/benchmark/fio.$jobnum
[job6]
filename_format=/data/storage06/benchmark/fio.$jobnum

fio errors:

fio: io_u error on file /data/storage06/benchmark/fio.1: Bad address: write offset=2494824448, buflen=131072
fio: io_u error on file /data/storage02/benchmark/fio.4: Bad address: write offset=2065170432, buflen=131072
fio: io_u error on file /data/storage05/benchmark/fio.5: Bad address: write offset=1935802368, buflen=131072

@jthiltges jthiltges commented Apr 30, 2019

Prodding with systemtap, the EFAULT in zfs_write() seems to be coming from here:
https://github.com/zfsonlinux/zfs/blob/zfs-0.7.13/module/zcommon/zfs_uio.c#L90

After downgrading from 0.7.13 to 0.7.12, I can no longer reproduce the issue with fio or the BeeGFS benchmark. It looks likely that 98bb45e is the cause for us.

@behlendorf behlendorf commented Apr 30, 2019

@jthiltges that would make sense. Then it sounds like we need to determine what copy_from_user() is additionally doing to handle the fault that the change in 98bb45e isn't. @peter-roesch I know you already tested this, but I think it would be a good idea to verify that you are able to reproduce this with 0.7.12. When performing your testing, you can run cat /sys/module/zfs/version to be absolutely sure which version is loaded.

@peter-roesch peter-roesch commented May 6, 2019

Yep, indeed, obviously my downgrade of ZFS was not done correctly; the kernel module was still loaded ("sticky") and was a newer version. I have now managed to move back to 0.7.12 and could not reproduce the issue with BeeGFS any more. Just to be sure, I upgraded ZFS again to 0.7.13 and could then reproduce it again. So all of the above is confirmed from my side.

behlendorf added a commit to behlendorf/zfs that referenced this issue May 6, 2019
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux#8640
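
As a small userspace illustration of the bug class this commit message describes (not the actual ZFS code; the struct and field names below are invented stand-ins for uio_t and uio_fault_disable): a struct populated field by field leaves a newly added flag holding stack garbage, whereas a fully initialized struct defaults it to false.

/*
 * Userspace illustration of the bug class in the commit message above;
 * NOT the actual ZFS code.  The struct and field names are invented
 * stand-ins for uio_t and uio_fault_disable.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct fake_uio {
    const void *base;
    size_t      resid;
    bool        fault_disable;       /* newly added field, easy to forget */
};

static void copy_path(const struct fake_uio *uio)
{
    if (uio->fault_disable)
        puts("atomic copy: returns EFAULT if the page is not resident");
    else
        puts("copy_from_user: faults the page in and the copy succeeds");
}

int main(void)
{
    struct fake_uio buggy;           /* field-by-field: flag left as stack garbage */
    buggy.base  = NULL;
    buggy.resid = 0;

    struct fake_uio fixed = { 0 };   /* fully initialized: flag defaults to false */
    fixed.base  = NULL;
    fixed.resid = 0;

    copy_path(&buggy);               /* behavior here is unpredictable */
    copy_path(&fixed);
    return 0;
}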
behlendorf added a commit to behlendorf/zfs that referenced this issue May 6, 2019
@behlendorf behlendorf mentioned this issue May 6, 2019

@behlendorf behlendorf commented May 6, 2019

I've opened PR #8719 with the fix for this issue. Thank you for providing the fio test case; with it I was able to easily reproduce the failure and get to the root cause. Once the fix is applied to master, we'll get it backported for 0.7.14.

behlendorf added a commit to behlendorf/zfs that referenced this issue May 7, 2019
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().

Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the  structure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux#8640
@behlendorf behlendorf closed this in 515ddf6 May 8, 2019

@jthiltges jthiltges commented May 8, 2019

Thank you Peter for opening the issue with the great description of the problem. And thanks so much for your help with the fix, Brian.

@gdevenyi gdevenyi commented May 25, 2019

This issue has not been added to the 0.7.14 tracking.

@behlendorf behlendorf added this to To do in 0.7.14 May 25, 2019
allanjude added a commit to allanjude/zfs that referenced this issue Jun 7, 2019
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().

Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the  structure.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux#8640 
Closes zfsonlinux#8719
allanjude added a commit to allanjude/zfs that referenced this issue Jun 15, 2019