
ZFS Stability Issue...and then long unacceptable recovery time #4199

Closed · rsvancara opened this issue Jan 10, 2016 · 8 comments

@rsvancara

Extra Information:
We are experiencing serious stability problems with ZFS on Linux. We are using server-class hardware with ECC memory. What I am seeing is that when the filesystem is used heavily, it crashes.

When the system does crash, it takes almost 8 hours to recover. This is a production system, and waiting that long is completely unacceptable.

CPUs:            4
Memory:          128GB
VM/Hypervisor:   no
ECC mem:         yes
Distribution:    CentOS Linux release 7.2.1511 (Core)
Kernel version:  3.10.0-327.4.4.el7.x86_64
SPL/ZFS source:  zfs-0.6.5.3-1.el7.centos.x86_64
SPL/ZFS version: [    2.155042] SPL: Loaded module v0.6.5.3-1
                 [    2.204981] ZFS: Loaded module v0.6.5.3-1, ZFS pool version 5000, ZFS filesystem version 5
                 [  185.221977] SPL: using hostid 0x00000000
System services: ZFS Production
Short description: ZFS has not been stable in our environment
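
For reference, the SPL/ZFS versions above were taken from the kernel log at module load time; on a running system they can also be confirmed with, e.g.:

    dmesg | grep -E 'SPL:|ZFS:'     # module load messages, as quoted above
    cat /sys/module/zfs/version     # version of the currently loaded zfs module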

zpool status:

pool: data
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0 in 0h17m with 0 errors on Tue Jan 5 11:31:46 2016
config:

NAME                                             STATE     READ WRITE CKSUM
data                                             ONLINE       0     0     0
  mirror-0                                       ONLINE       0     0     0
    scsi-35000c500884e4b9b                       ONLINE       0     0     0
    scsi-35000c500884e4bab                       ONLINE       0     0     0
  mirror-1                                       ONLINE       0     0     0
    scsi-35000c500884e4d3b                       ONLINE       0     0     0
    scsi-35000c500884e5b9b                       ONLINE       0     0     0
  mirror-2                                       ONLINE       0     0     0
    scsi-35000c500884e5d03                       ONLINE       0     0     0
    scsi-35000c500884e605b                       ONLINE       0     0     0
  mirror-3                                       ONLINE       0     0     0
    scsi-35000c500884e60ff                       ONLINE       0     0     0
    scsi-35000c500884e645f                       ONLINE       0     0     0
  mirror-4                                       ONLINE       0     0     0
    scsi-35000c500884e664b                       ONLINE       0     0     0
    scsi-35000c500884e789f                       ONLINE       0     0     0
  mirror-5                                       ONLINE       0     0     0
    scsi-35000c500884e7f7f                       ONLINE       0     0     0
    scsi-35000c500884e83e3                       ONLINE       0     0     0
  mirror-6                                       ONLINE       0     0     0
    scsi-35000c500884e8d3b                       ONLINE       0     0     0
    scsi-35000c500884eb427                       ONLINE       0     0     0
  mirror-7                                       ONLINE       0     0     0
    scsi-35000c500884f59b3                       ONLINE       0     0     0
    scsi-35000c500884f5b93                       ONLINE       0     0     0
  mirror-8                                       ONLINE       0     0     0
    scsi-35000c500884f62d3                       ONLINE       0     0     0
    scsi-35000c500884f68cb                       ONLINE       0     0     0
  mirror-9                                       ONLINE       0     0     0
    scsi-35000c500884f6957                       ONLINE       0     0     0
    scsi-35000c500884f801f                       ONLINE       0     0     0
  mirror-11                                      ONLINE       0     0     0
    scsi-35000c50088212013                       ONLINE       0     0     0
    scsi-35000c50088213e9f                       ONLINE       0     0     0
logs
  ata-Samsung_SSD_840_EVO_120GB_S1D5NSBF403792M  ONLINE       0     0     0
cache
  ata-Samsung_SSD_840_EVO_120GB_S1D5NSBF403796P  ONLINE       0     0     0

The Kernel Logs are here:

https://gist.github.com/rsvancara/a3bef531f25b197deab7

zfs get all
NAME PROPERTY VALUE SOURCE
data/fastscratch type filesystem -
data/fastscratch creation Mon Jun 29 19:57 2015 -
data/fastscratch used 691G -
data/fastscratch available 10.9T -
data/fastscratch referenced 691G -
data/fastscratch compressratio 2.53x -
data/fastscratch mounted no -
data/fastscratch quota none default
data/fastscratch reservation none default
data/fastscratch recordsize 64K local
data/fastscratch mountpoint /data/fastscratch default
data/fastscratch sharenfs on local
data/fastscratch checksum on default
data/fastscratch compression on local
data/fastscratch atime off local
data/fastscratch devices on default
data/fastscratch exec on default
data/fastscratch setuid on default
data/fastscratch readonly off default
data/fastscratch zoned off default
data/fastscratch snapdir hidden default
data/fastscratch aclinherit restricted default
data/fastscratch canmount on default
data/fastscratch xattr on default
data/fastscratch copies 1 default
data/fastscratch version 5 -
data/fastscratch utf8only off -
data/fastscratch normalization none -
data/fastscratch casesensitivity sensitive -
data/fastscratch vscan off default
data/fastscratch nbmand off default
data/fastscratch sharesmb off default
data/fastscratch refquota none default
data/fastscratch refreservation none default
data/fastscratch primarycache all local
data/fastscratch secondarycache all default
data/fastscratch usedbysnapshots 0 -
data/fastscratch usedbydataset 691G -
data/fastscratch usedbychildren 0 -
data/fastscratch usedbyrefreservation 0 -
data/fastscratch logbias throughput local
data/fastscratch dedup off default
data/fastscratch mlslabel none default
data/fastscratch sync standard default
data/fastscratch refcompressratio 2.53x -
data/fastscratch written 691G -
data/fastscratch logicalused 1.70T -
data/fastscratch logicalreferenced 1.70T -
data/fastscratch filesystem_limit none default
data/fastscratch snapshot_limit none default
data/fastscratch filesystem_count none default
data/fastscratch snapshot_count none default
data/fastscratch snapdev hidden default
data/fastscratch acltype off default
data/fastscratch context none default
data/fastscratch fscontext none default
data/fastscratch defcontext none default
data/fastscratch rootcontext none default
data/fastscratch relatime off default
data/fastscratch redundant_metadata all default
data/fastscratch overlay off default
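
As an aside, the handful of locally set properties in the listing above (recordsize, compression, atime, sharenfs, logbias, primarycache) can be pulled out directly with:

    zfs get -s local all data/fastscratch   # show only properties whose source is "local"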

@rsvancara changed the title from "ZFS Stability Issue...and then long unacceptable recover time" to "ZFS Stability Issue...and then long unacceptable recovery time" on Jan 10, 2016
@behlendorf
Contributor

@rsvancara I'd strongly encourage you to upgrade your system to the 0.6.5.4 release, which was made available just a few days ago. The CentOS repositories have already been updated, so you just need to update the system. It contains fixes for the majority of the most commonly reported deadlocks and stability issues. We're working on the remaining issues, but since those patches are still under review and testing we didn't want them to hold up this release.

https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5.4
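
For reference, on CentOS 7 with the ZFS on Linux repository already configured, the update would typically look something like the sketch below; exact package names depend on whether the DKMS or kABI-tracking kmod packages are in use:

    yum clean metadata   # refresh cached repository metadata
    yum update zfs       # updates zfs along with its spl/kernel-module dependencies
    reboot               # or unload and reload the spl/zfs modules to pick up the new version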

As for the worst-case recovery (more like cleanup) time, this is definitely something we want to tackle but haven't had a chance to yet. In principle, there's no reason that recovery can't safely happen in the background after the mount completes. It's just going to take a little care to get it right and tested.

@rsvancara
Author

Thanks, I will give 0.6.5.4 a try. Anything is better than what we have now.


@behlendorf
Contributor

@rsvancara it would be very helpful if you could let us know what issues (if any) you're still seeing after the update. That would help us focus our efforts on the most critical remaining issues.

@rsvancara
Author

Oh, you don't have to worry about that. I have this volume that is 10TB and another that is around 400TB, and both have issues from time to time. To be honest, I am between a rock and a hard place in terms of making these stable: angry customers on one hand, my reputation on the other....


@rsvancara
Author

I have installed 0.6.5.4. Going to test it now....


@behlendorf
Contributor

@rsvancara silence is good, I trust?

@rsvancara
Author

I am having good luck so far. I have run some tests using multiple concurrent rsync sessions and the filesystem did not fail, so I take that as a good sign.

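For illustration, a concurrent rsync stress test along the lines described above might look like the following; the src*/dst* directories are hypothetical paths on the pool:

    # launch four parallel rsync sessions against hypothetical directories on the pool
    for i in 1 2 3 4; do
        rsync -a /data/fastscratch/src$i/ /data/fastscratch/dst$i/ &
    done
    wait   # block until every rsync session has finished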

@gmelikov
Member

Looks like the issue is closed, feel free to reopen it if it's not.
