Reproducible CKSUM errors after 2 drives taken OFFLINE on raidz2 #5806

Closed
thegreatgazoo opened this issue Feb 18, 2017 · 8 comments · Fixed by #6103

thegreatgazoo commented Feb 18, 2017

I can reproduce this 100% of the time on current vanilla master (spl at 9704820, zfs at 100790a). The steps are simple:

# zpool create -f -o ashift=12 tank raidz2 sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq
# zpool offline tank sde
# zpool offline tank sdn
# cp -a spl zfs /tank
# zpool online tank sde
# zpool status
  scan: resilvered **39.0M** in 0h0m with 0 errors on Sat Feb 18 04:47:19 2017
# zpool scrub tank
# zpool status
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Feb 18 04:47:50 2017
# zpool online tank sdn
# zpool status
  scan: resilvered **12K** in 0h0m with 0 errors on Sat Feb 18 04:48:26 2017
# zpool scrub tank
# zpool status
  scan: scrub repaired **38.7M** in 0h0m with 0 errors on Sat Feb 18 04:49:02 2017
config:
        NAME        STATE     READ WRITE CKSUM
            sdn     ONLINE       0     0 5.04K

The two drives were taken offline right after pool creation, so the amount of data/parity missing on each should be nearly the same. Looking at the bold text above, the sde resilver fixed 39.0M but the sdn resilver fixed only 12K, so it looks like the 2nd resilver missed quite a few blocks. The 2nd scrub fixed 38.7M; adding the 12K fixed by the 2nd resilver gets fairly close to the 39.0M fixed by the 1st resilver. So it looks like the 2nd scrub was actually fixing the blocks missed by the 2nd resilver.
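
The arithmetic behind that comparison can be checked with a few lines of Python. This is only a sanity check on the numbers quoted from `zpool status`; it assumes the K and M suffixes are binary multiples (1K = 1024 bytes), and `parse_size` is a hypothetical helper written for this illustration.

```python
# Assumption: zpool prints sizes in binary units (1K = 1024 bytes).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a zpool-style size string such as '38.7M' or '12K' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


first_resilver = parse_size("39.0M")   # sde, resilvered right after coming online
second_resilver = parse_size("12K")    # sdn, resilvered after the 1st scrub
second_scrub = parse_size("38.7M")     # repaired on sdn by the 2nd scrub

repaired_for_sdn = second_resilver + second_scrub
shortfall = first_resilver - repaired_for_sdn

print(f"sdn total repaired:  {repaired_for_sdn / UNITS['M']:.2f}M")
print(f"difference from sde: {shortfall / first_resilver:.2%}")
```

The two totals differ by less than 1%, which is consistent with the 2nd scrub doing the work the 2nd resilver skipped.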

Since the difference between resilver and scrub is that a resilver looks at DTL_PARTIAL to decide whether to check a block, I guess something messed up the DTLs before the 2nd resilver, which makes the 1st scrub look very fishy. So I did the same thing again, except without the scrub between the 2 resilvers:

# zpool create -f -o ashift=12 tank raidz2 sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq
# zpool offline tank sde
# zpool offline tank sdn
# cp -a spl zfs /tank
# zpool online tank sde
# zpool status
  scan: resilvered 38.8M in 0h0m with 0 errors on Sat Feb 18 05:02:24 2017
# zpool online tank sdn
# zpool status
  scan: resilvered 38.8M in 0h0m with 0 errors on Sat Feb 18 05:03:00 2017
# zpool scrub tank
# zpool status
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Feb 18 05:03:30 2017

This time there are 0 errors, and the 2 resilvers fixed about the same amount of data/parity. Everything above is 100% repeatable, which seems to verify my guess that the scrub between resilvers messed up the DTL somehow.

The resilver/scrub and raidz code really hasn't changed much (I also set zfs_vdev_raidz_impl="original" to disable the new fancy parity routines), so I suspect this affects older ZFS versions as well, maybe even ZFS on other OSes. This is probably something we'd want to fix before the next release. I realize the 14-drive raidz2 I used in the tests is not a common configuration, but it's not crazy either.

behlendorf added this to the 0.7.0 milestone Feb 21, 2017
@behlendorf
Contributor

> Which seemed to verify my guess that the scrub between resilvers messed the DTL somehow.

It definitely appears that way. It looks as if dsl_scan_done()->vdev_dtl_reassess()->vdev_dtl_should_excise() can update the vdev's DTL in memory even when the leaf vdev is offline. If it's not updated properly when the leaf vdev is re-opened, that could explain what's going on.

I agree we should address this before the next tag.

@pcd1193182
Contributor

I can confirm that this reproduces on Illumos.

@ahrens
Member

ahrens commented Feb 24, 2017

@grwilson may also be interested in this.

@loli10K
Contributor

loli10K commented Apr 25, 2017

Is anyone already working on this? I think this is also reproducible on 2+ disks raidz1, which seems quite troubling.

@thegreatgazoo
Author

@loli10K Do you have a way to reproduce it on raidz1? I had to take 2 drives offline and do IO to reproduce on raidz2, but raidz1 can't do any IO if two drives have been taken offline.

@loli10K
Contributor

loli10K commented Apr 26, 2017

@thegreatgazoo I've never used raidz-n before (all my pools are mirrors), so I may be doing something wrong. That said, the reproducer is here: https://gist.github.com/loli10K/cc5b56612aa74871397066c2f6ac75d8.

@gamanakis
Contributor

I can reproduce this (raidz2) on FreeBSD 11.0, too.

@ahrens
Member

ahrens commented Apr 28, 2017

@loli10K Thanks for the great script. I was able to reproduce it and I understand what's causing the problem. @grwilson and I are discussing what the best fix will be. The basic problem is:

  • scrub while one leaf of RAIDZ vdev is offline
  • vdev_dtl_reassess called, scrub_txg nonzero, scrub_done=1
  • vdev_dtl_should_excise returns TRUE because vdev_resilver_txg=0 (this was not a resilver)
  • vdev_dtl[DTL_SCRUB] is empty
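
The sequence in the list above can be mimicked with a toy model. This is only a sketch under loud assumptions: `Leaf`, `write`, `scrub`, `online_and_resilver`, and `run` are hypothetical names, a DTL is reduced to a plain set of txgs, every leaf is treated as holding a piece of every write, and there is no parity math. None of it is the ZFS code; it only shows why clearing an offline leaf's DTL at the end of a scrub makes the later resilver skip work.

```python
class Leaf:
    """One leaf device in a toy raidz group (hypothetical model, not ZFS code)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = set()  # txgs whose data this leaf actually holds
        self.dtl = set()     # txgs this leaf is known to have missed


def write(leaves, txg):
    """Online leaves store the write; offline leaves log the txg in their DTL."""
    for leaf in leaves:
        if leaf.online:
            leaf.blocks.add(txg)
        else:
            leaf.dtl.add(txg)


def scrub(leaves, written):
    """Check every block, repair online leaves, then clear every DTL.

    Clearing the DTL of an offline leaf models the behavior described
    above: the scrub could not have repaired that leaf.
    Returns the number of blocks repaired.
    """
    repaired = 0
    for leaf in leaves:
        if leaf.online:
            missing = written - leaf.blocks
            leaf.blocks |= missing
            repaired += len(missing)
        leaf.dtl.clear()
    return repaired


def online_and_resilver(leaf):
    """Bring a leaf online and repair only the txgs listed in its DTL."""
    leaf.online = True
    repaired = len(leaf.dtl - leaf.blocks)
    leaf.blocks |= leaf.dtl
    leaf.dtl.clear()
    return repaired


def run(scrub_between_resilvers):
    """Replay the reproducer; return (1st resilver, 2nd resilver, final scrub)."""
    leaves = [Leaf(name) for name in ("sdd", "sde", "sdn")]
    sde, sdn = leaves[1], leaves[2]
    sde.online = sdn.online = False
    written = set(range(1, 101))  # 100 writes while both leaves are offline
    for txg in sorted(written):
        write(leaves, txg)
    first = online_and_resilver(sde)
    if scrub_between_resilvers:
        scrub(leaves, written)
    second = online_and_resilver(sdn)
    final_scrub = scrub(leaves, written)
    return first, second, final_scrub


print(run(scrub_between_resilvers=True))   # (100, 0, 100)
print(run(scrub_between_resilvers=False))  # (100, 100, 0)
```

With the intermediate scrub, the second resilver repairs nothing and the final scrub has to repair everything, which matches the shape of the 39.0M / 12K / 38.7M figures reported at the top of the thread. Without it, both resilvers repair the same amount and the final scrub finds nothing.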

ahrens added a commit to ahrens/illumos that referenced this issue May 5, 2017
Reviewed by: George Wilson george.wilson@delphix.com

If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged.  When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started.  The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.

The fix is to never clear the DTL of offline devices.  Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.

The problem can be worked around by running "zpool scrub" after
"zpool online".

See also openzfs/zfs#5806
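
The rule stated in the commit message above ("never clear the DTL of offline devices") can be illustrated with a small sketch. The function `dtls_after_scrub` and its `keep_offline_dtls` flag are hypothetical, invented for this illustration, and a DTL is again simplified to a set of txgs; the real change lives in the vdev DTL reassessment code and is not reproduced here.

```python
def dtls_after_scrub(dtls, offline, keep_offline_dtls):
    """Return the per-leaf DTLs once a scrub completes without errors.

    dtls:              dict mapping leaf name -> set of txgs that leaf missed
    offline:           names of leaves that were offline during the scrub
    keep_offline_dtls: True models the fix, False models the old behavior
    """
    result = {}
    for name, missed in dtls.items():
        if name in offline and keep_offline_dtls:
            result[name] = set(missed)  # still damaged: keep the record
        else:
            result[name] = set()        # treated as repaired: clear it
    return result


# sde has already been resilvered; sdn is still offline and missed txgs 1-3.
before = {"sde": set(), "sdn": {1, 2, 3}}

old = dtls_after_scrub(before, offline={"sdn"}, keep_offline_dtls=False)
new = dtls_after_scrub(before, offline={"sdn"}, keep_offline_dtls=True)

print("old behavior:", old)
print("with the fix:", new)
```

Under the old behavior the record of what sdn missed is gone before sdn comes back, so only a full scrub after "zpool online" (the workaround mentioned above) can find the stale blocks. With the rule applied, the record survives and the resilver can use it.
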
ahrens added a commit to ahrens/zfs that referenced this issue May 5, 2017
Reviewed by: George Wilson george.wilson@delphix.com

Closes openzfs#5806
behlendorf pushed a commit that referenced this issue May 10, 2017
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8166
OpenZFS-commit: openzfs/openzfs#372
Closes #5806 
Closes #6103
ahrens added a commit to ahrens/illumos that referenced this issue May 10, 2017
Reviewed by: George Wilson george.wilson@delphix.com

See also openzfs/zfs#5806
prakashsurya pushed a commit to openzfs/openzfs that referenced this issue May 22, 2017
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>

See also openzfs/zfs#5806

Closes #372
uqs pushed a commit to freebsd/freebsd-src that referenced this issue May 26, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


git-svn-id: svn+ssh://svn.freebsd.org/base/head@318943 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
uqs pushed a commit to freebsd/freebsd-src that referenced this issue May 26, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue May 26, 2017
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8166
OpenZFS-commit: openzfs/openzfs#372
Closes openzfs#5806
Closes openzfs#6103
mat813 pushed a commit to mat813/freebsd that referenced this issue May 29, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


git-svn-id: https://svn.freebsd.org/base/vendor-sys/illumos/dist@318942 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
mat813 pushed a commit to mat813/freebsd that referenced this issue May 29, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


git-svn-id: https://svn.freebsd.org/base/head@318943 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
bdrewery pushed a commit to bdrewery/freebsd that referenced this issue Jun 2, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


git-svn-id: svn+ssh://svn.freebsd.org/base/head@318943 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
lundman pushed a commit to openzfsonosx/zfs that referenced this issue Jun 5, 2017
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>

See also openzfs/zfs#5806
uqs pushed a commit to freebsd/freebsd-src that referenced this issue Jun 6, 2017
 MFV r318942: 8166 zpool scrub thinks it repaired offline device

 https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

PR:		219537
Sponsored by:	The FreeBSD Foundation
uqs pushed a commit to freebsd/freebsd-src that referenced this issue Jun 6, 2017
 MFV r318942: 8166 zpool scrub thinks it repaired offline device

 https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

PR:		219537
Approved by:	re (kib)
Sponsored by:	The FreeBSD Foundation
mat813 pushed a commit to mat813/freebsd that referenced this issue Jun 7, 2017
 MFV r318942: 8166 zpool scrub thinks it repaired offline device

 https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

PR:		219537
Sponsored by:	The FreeBSD Foundation


git-svn-id: https://svn.freebsd.org/base/stable/10@319625 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
mat813 pushed a commit to mat813/freebsd that referenced this issue Jun 7, 2017
 MFV r318942: 8166 zpool scrub thinks it repaired offline device

 https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

PR:		219537
Approved by:	re (kib)
Sponsored by:	The FreeBSD Foundation


git-svn-id: https://svn.freebsd.org/base/stable/11@319624 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
tonyhutter pushed a commit that referenced this issue Jun 9, 2017
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8166
OpenZFS-commit: openzfs/openzfs#372
Closes #5806
Closes #6103
uqs pushed a commit to freebsd/freebsd-src that referenced this issue Jun 20, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
brooksdavis pushed a commit to CTSRD-CHERI/cheribsd that referenced this issue Jun 24, 2017
illumos/illumos-gate@2d2f193

https://www.illumos.org/issues/8166
  See also openzfs/zfs#5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
lundman pushed a commit to openzfsonwindows/ZFSin that referenced this issue Nov 1, 2017
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>

See also openzfs/zfs#5806