Skip spurious resilver IO on raidz vdev #5316

Merged: 1 commit into openzfs:master, May 13, 2017

Conversation

@thegreatgazoo:

On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, the current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data.

For example, in the graph below, if disk C fails, the two blocks that start at 3D and 8E will not need to be resilvered:
[figure: raidz sector layout showing that the blocks starting at 3D and 8E do not touch disk C]
This patch eliminates such spurious IOs.

I've tested the patch on an 8+1 raidz1 vdev, by offlining/onlining/replacing a drive to trigger resilvers, then scrubbing to make sure every block was correctly resilvered. All worked fine. The pool was populated with about 3G of data, mostly small files (e.g. the zfs/spl source trees), which is ideal for testing this patch. The following histogram shows the ZIOs skipped with this patch:
[figure: histogram of skipped ZIO sizes]

The ashift was 12 on the 9-drive raidz1, so the largest possible block that can be skipped is 28KB of psize (excluding parity). Not surprisingly, most skipped blocks were no more than 4KB.
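To spell out that bound (simple arithmetic from the numbers above): with ashift=12 a sector is 4KB, and a block can only be skipped if its data sectors plus parity do not cover all 9 children, i.e. s + nparity < dcols. So:

    max skippable psize = (dcols - nparity - 1) * 2^ashift = (9 - 1 - 1) * 4KB = 28KB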

In fact, most of the blocks were skipped, and the resilver was about 30% faster than the same resilver without skipping spurious ZIOs.

Signed-off-by: Isaac Huang he.huang@intel.com

@thegreatgazoo (Author) commented Oct 21, 2016

Here's a better graph showing two histograms of actual resilver ZIO sizes - one from a stock resilver, the other from the same resilver patched with this PR, everything else being identical:
[figure: overlaid histograms of resilver ZIO sizes, stock vs. patched]
The vertical dashed line is at 28K:

  • To its right, the two histograms are identical and completely overlap, because no spurious IO is possible above a 28K block size in a 9-drive raidz1.
  • To its left, the patched resilver clearly issued far fewer ZIOs everywhere, because many ZIOs were spurious and thus skipped.

@behlendorf (Contributor):

Very clever, and a much smaller change than I would have guessed. This is a really nice optimization. @ironMann what are your thoughts?

@ironMann (Contributor):

Very nice find. Anything that will decrease replace time is always welcome. We'll have to prepare for questions like: why does running a scrub immediately after a resilver find cksum errors? :)

...but also adds the risk of overwriting good data.

@thegreatgazoo What do you mean by this? I'll add some questions inline

if (!vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1))
return (B_FALSE);

if (vd->vdev_ops == &vdev_raidz_ops)
@ironMann (Contributor), Oct 24, 2016:

I wonder if this can be checked before the DTL check on the whole vdev? It would be more precise for raidz (more opportunities to skip resilver zios), but then vdev_dtl_reassess() would probably have to know to delete that DTL.

Author:

By doing vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) first, I was able to implement several optimizations in vdev_raidz_need_resilver(), e.g. at the beginning:

if (s + nparity >= dcols) /* spans all child vdevs */
    return (B_TRUE);

The whole function is essentially skipped for big blocks. This is correct only under the assumption that vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) is true.

uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Contributor:

Too bad vdev_raidz_map_alloc() is allocation heavy, so this raidz geometry has to be calculated separately...

Author:

I tried, but there seemed to be no clean way to use vdev_raidz_map_alloc() without adding complexity to an already quite complicated function. By duplicating some of the calculation here, I was able to simplify it a lot, because here: a) we don't care about skip sectors, and b) we only care about blocks that span less than a complete row.
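For reference, a minimal sketch of that simplified calculation (variable names follow vdev_raidz_map_alloc(); this is an illustration of the approach, not necessarily the exact merged code):

boolean_t
vdev_raidz_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
	uint64_t dcols = vd->vdev_children;
	uint64_t nparity = vd->vdev_nparity;
	uint64_t unit_shift = vd->vdev_top->vdev_ashift;
	/* The starting sector of the block on the raidz vdev. */
	uint64_t b = offset >> unit_shift;
	/* The number of data sectors, skip sectors ignored. */
	uint64_t s = ((psize - 1) >> unit_shift) + 1;
	/* The first column for this stripe. */
	uint64_t f = b % dcols;
	uint64_t c;

	if (s + nparity >= dcols)	/* spans all child vdevs */
		return (B_TRUE);

	for (c = 0; c < s + nparity; c++) {
		vdev_t *cvd = vd->vdev_child[(f + c) % dcols];

		/*
		 * A dirty DTL on a child this block touches means the
		 * block may be missing data, so resilver it.
		 */
		if (!vdev_dtl_empty(cvd, DTL_PARTIAL))
			return (B_TRUE);
	}

	return (B_FALSE);
}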

@thegreatgazoo (Author) left a comment:

@ironMann Thanks for the comments.

If a scrub finds cksum errors after a patched resilver, then it's a bug in this patch. My own testing hasn't hit such a case, but it's far from extensive. If you find such a test case, please let me know.

As to overwriting good data, it may be possible, e.g.:

  1. A BP has 2 DVAs, one on a degraded raidz, the other on another healthy vdev.
  2. The resilver algorithm reads the 2nd DVA from the healthy vdev, and uses that to overwrite the 1st DVA on the degraded raidz.
  3. But the DVA on the degraded raidz is actually healthy, because it doesn't use the child vdev that is bad or offline.

With this patch, both the 1st read and the 2nd write are eliminated.

@behlendorf added the Type: Performance label on Oct 25, 2016
@GeLiXin (Contributor) commented Oct 30, 2016

That's wonderful and makes the raidz algorithm better 👍

BTW, should we go a step further in vdev_mirror_io_start(), which may lead to different resilver behaviour? Take the same example as you mentioned:

  1. A BP has 2 DVAs, one on a degraded raidz, the other on another healthy vdev.
  2. We can now make a precise decision about whether IO is needed via your dsl_scan_need_resilver(); suppose the result is B_TRUE.
  3. vdev_mirror_io_start() makes a (more or less) random decision about which vdev to read first, but the choice has different performance impacts.
  4. If we choose the healthy vdev to read first, we can write the good data we get directly to the degraded one without raidz reconstruction, but then the resilver affects all the top-level vdevs in the pool.
  5. If we choose the degraded vdev to read first, it can repair itself without impacting the healthy vdev, but the resilver may take more time.

Should we make this decision deliberately instead of leaving it to chance?

@thegreatgazoo (Author):

@GeLiXin There could be some possible optimizations in vdev_mirror_io_start(). The example you outlined is not really spurious IO, but rather a possible optimization of a legitimate resilver IO. I don't have time to look into that, but I'd be happy to review patches if you want to work on it.

@GeLiXin (Contributor) commented Nov 1, 2016

@thegreatgazoo Yes, it should be done in another patch, and it may take more time in performance testing than in coding. I will put it on my schedule, and I would really appreciate it if you could review it then.

@behlendorf (Contributor):

@thegreatgazoo you can resolve the lint warnings by rebasing on master.

@behlendorf added the Status: Work in Progress label on Nov 4, 2016
@prometheanfire (Contributor):

Is there any reason not to test / use this? It seems fairly straightforward.

@behlendorf (Contributor):

@prometheanfire this is ready for additional testing.

@prometheanfire (Contributor):

ya, needs a rebase before I can though...

@@ -1833,6 +1833,36 @@ dsl_scan_scrub_done(zio_t *zio)
mutex_exit(&spa->spa_scrub_lock);
}

static boolean_t
dsl_scan_need_resilver(spa_t *spa, const dva_t *dva,
size_t size, uint64_t phys_birth)
Member:

indentation here doesn't match style (should be 4 spaces)

}
if (!needs_io)
needs_io = dsl_scan_need_resilver(spa, dva,
size, phys_birth);
Member:

indentation should be 4 spaces more than previous line.


offset = DVA_GET_OFFSET(dva);
if (vd->vdev_ops == &vdev_raidz_ops)
return (vdev_raidz_need_resilver(vd, offset, size));
Member:

The more object-oriented approach (which I think would be preferable) would be to add a function to the vdev ops which tells us if resilver is needed. This would be trivial for all but raidz vdev types.
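Sketching what that could look like (hypothetical names and wiring, offered as an illustration rather than the merged code): declare a callback type, add a pointer to it in vdev_ops_t, and let vdev types without anything clever to do leave it NULL.

/* New callback type (hypothetical declaration): */
typedef boolean_t vdev_need_resilver_func_t(vdev_t *vd, uint64_t offset,
    size_t psize);

/*
 * vdev_ops_t gains a "vdev_need_resilver_func_t *vdev_op_need_resilver"
 * member alongside the existing io_start/io_done ops.  raidz would point
 * it at its geometry-aware check; vdev types that always need the
 * resilver IO can leave it NULL and the generic code simply returns
 * B_TRUE.
 */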

@@ -1440,6 +1440,38 @@ vdev_raidz_child_done(zio_t *zio)
rc->rc_skipped = 0;
}

boolean_t
vdev_raidz_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
Member:

FYI, we did something similar for zpool initialize. It's doing a similar calculation. Once that's upstreamed, this could be implemented by calling vdev_op_xlate and checking if the result's rs_start == rs_end. The caller would have to loop over the child vdevs and call vdev_op_xlate on each of them. Not sure if that would actually be better, but I thought you'd find it interesting. It's also an example of adding a new vdev op, which I think should be done here.
delphix/delphix-os@f08e5bc#diff-826708a55165ad963711ccae15edaad6R2407

Member:

I would suggest adding a comment explaining what this function is supposed to do, and its assumptions. For example, it assumes that at least one of the children has a dirty DTL.
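Something along these lines could serve as that comment (my wording, offered as an example rather than the merged text):

/*
 * Determine whether the block at [offset, offset + psize) on this raidz
 * vdev needs to be resilvered, i.e. whether it touches a child vdev with
 * a dirty DTL.  Assumes the caller has already checked, via
 * vdev_dtl_contains(), that at least one child has a dirty DTL covering
 * the block's birth txg.  psize (not asize) is passed in so that skip
 * sectors are ignored.
 */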

uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Member:

Using descriptive variable names here would aid in comprehension. (rather than b, s, f, c). From what I can tell:
b is the offset in sectors
s is the number of sectors
f is the child that this block starts on
c is the sector count (maybe i would be better here since it's a generic iterator)
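A concrete instance may help (numbers assumed purely for illustration): on a 9-child raidz1 with ashift=12, a block at offset 53248 with psize 8192 gives

    b = 53248 >> 12            = 13   /* starting sector on the raidz vdev */
    s = ((8192 - 1) >> 12) + 1 =  2   /* data sectors */
    f = 13 % 9                 =  4   /* first column touched */

so the block occupies 3 of the 9 children (columns 4-6: two data sectors plus one parity), c simply iterates over those columns, and the block needs resilvering only if one of those three children has a dirty DTL.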

* vdev_dtl_contains(). So here just check cvd with
* vdev_dtl_empty(), cheaper and a good approximation.
*/
devidx = (f + c) % dcols;
Member:

you could declare devidx inside the for loop

@behlendorf (Contributor):

@thegreatgazoo I hope you don't mind, but I've rebased and updated this PR to address @ahrens review comments.

  • A vdev_op_need_resilver op was added to the vdev_t. The vdev_dtl_need_resilver() function was added to call the new op which only makes sense for top-level vdevs.
  • Additional comments added.
  • I decided against changing the variable names in vdev_raidz_need_resilver() since they mirror the names used in vdev_raidz_map_alloc(). This way anyone already familiar with the RAIDZ implementation should be familiar with the names. However, I did shamelessly steal the comments describing these variables from vdev_raidz_map_alloc().

@ahrens (Member) commented May 10, 2017

@grwilson may also be interested in looking at this.

* When the offset does not intersect with a dirty leaf DTL
* then it may be possible to skip the resilver IO.
*/
if (!vdev_dtl_need_resilver(vd, DVA_GET_OFFSET(dva), size))
Member:

Let's rename size to psize, and add a comment explaining why we pass around psize rather than asize. It would be a more obvious interface to use asize, but I guess that psize lets the RAIDZ code more easily ignore the pad sectors.

Contributor:

Will do.

Author:

Yes, I used psize exactly because I wanted to ignore skip sectors.
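For context, the relationship between the two sizes on raidz, paraphrasing the logic of the existing vdev_raidz_asize() (treat the details as approximate):

	/* data sectors implied by psize */
	asize = ((psize - 1) >> ashift) + 1;
	/* one parity sector per (dcols - nparity) data sectors, rounded up */
	asize += nparity * ((asize + dcols - nparity - 1) / (dcols - nparity));
	/* pad/skip sectors round the allocation to a multiple of nparity + 1 */
	asize = roundup(asize, nparity + 1) << ashift;

The parity and padding terms are exactly what the skip check doesn't care about, which is why passing psize keeps vdev_raidz_need_resilver() simple.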

@@ -606,13 +606,20 @@ vdev_mirror_state_change(vdev_t *vd, int faulted, int degraded)
vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
}

static boolean_t
vdev_mirror_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
Member:

can't this use a NULL func like vdev_disk?

Contributor:

Sure, we can set vd->vdev_op_need_resilver = NULL for mirrors if you prefer. At the time it felt clearer to register something for both mirrors and raidz. I'll remove it.

{
uint64_t dcols = vd->vdev_children;
uint64_t nparity = vd->vdev_nparity;
uint64_t unit_shift = vd->vdev_top->vdev_ashift;
Member:

rename to ashift?

Author:

I adopted this name from vdev_raidz_map_alloc()'s unit_shift; I wanted to be consistent with that.

Contributor:

If we're going to change this let's do it in both places.

}
}
if (!needs_io)
needs_io = dsl_scan_need_resilver(spa, dva, size,
Member:

consider renaming size -> psize.

Contributor:

Will do.

@behlendorf (Contributor):

Refreshed with the suggested changes.

On a raidz vdev, a block that does not span all child vdev's, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.

This patch eliminates such spurious IOs.

Signed-off-by: Isaac Huang <he.huang@intel.com>
@behlendorf (Contributor):

Refreshed. Ztest does occasionally call vdev_dtl_need_resilver() with a leaf vdev. The ASSERT was changed to a normal conditional which returns B_TRUE in this case.

@behlendorf merged commit 3d6da72 into openzfs:master on May 13, 2017
@thegreatgazoo (Author):

@behlendorf Thanks for updating this. Sorry I was overwhelmed by my other projects.

@thegreatgazoo deleted the new_spurious_resilver branch on May 13, 2017 at 21:17
boolean_t
vdev_dtl_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
ASSERT(vd != vd->vdev_spa->spa_root_vdev);
Author:

I don't understand this assertion. Shouldn't it be ASSERT3P(vd, ==, vd->vdev_top) instead? It's checking a DVA, so the function only makes sense for a top-level vdev.

Contributor:

ztest constructs pools with leaves directly attached to the root, so the patch only explicitly excludes the root vdev.
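Putting those two points together, the dispatcher could end up looking roughly like this (a sketch consistent with the descriptions above, not a verbatim copy of the merged code):

boolean_t
vdev_dtl_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
	ASSERT(vd != vd->vdev_spa->spa_root_vdev);

	/*
	 * ztest can hand us a leaf attached directly to the root, and
	 * vdev types without the op have no geometry to exploit; in
	 * either case fall back to resilvering the block.
	 */
	if (vd->vdev_ops->vdev_op_leaf ||
	    vd->vdev_ops->vdev_op_need_resilver == NULL)
		return (B_TRUE);

	return (vd->vdev_ops->vdev_op_need_resilver(vd, offset, psize));
}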
