Skip spurious resilver IO on raidz vdev #5316

Merged: 1 commit into openzfs:master, May 13, 2017

Conversation

@thegreatgazoo:

On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, the current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data.

For example, in the graph below, if disk C fails, the two blocks that start at 3D and 8E will not need to be resilvered:
[figure: raidz sector layout showing that the blocks starting at 3D and 8E do not touch disk C]
This patch eliminates such spurious IOs.

I've tested the patch on an 8+1 raidz1 vdev, by offlining/onlining/replacing a drive to trigger resilvers, then scrubbing to make sure every block was correctly resilvered. All worked fine. The pool was populated with about 3G of data, mostly small files (e.g. the zfs/spl source trees), which is ideal for testing this patch. The following histogram shows the ZIOs skipped with this patch:
[figure: histogram of skipped ZIO sizes]

The ashift was 12 on the 9-drive raidz1, so the largest possible block that can be skipped is 28KB of psize (excluding parity). Not surprisingly, most skipped blocks were no more than 4KB.
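To spell out that bound (simple arithmetic from the numbers above): with ashift=12 a sector is 4KB, and a block can only be skipped if its data sectors plus parity do not cover all 9 children, i.e. s + nparity < dcols. So:

    max skippable psize = (dcols - nparity - 1) * 2^ashift = (9 - 1 - 1) * 4KB = 28KB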

In fact, most of the blocks were skipped, and the resilver was about 30% faster than the same resilver without skipping spurious ZIOs.

Signed-off-by: Isaac Huang he.huang@intel.com

@thegreatgazoo (Author) commented Oct 21, 2016

Here's a better graph showing two histograms of actual resilver ZIO sizes - one from a stock resilver, the other from the same resilver patched with this PR, everything else being identical:
[figure: overlaid histograms of resilver ZIO sizes, stock vs. patched]
The vertical dashed line is at 28K:

  • To its right, the two histograms are identical and completely overlap, because no spurious IO is possible above a 28K block size in a 9-drive raidz1.
  • To its left, the patched resilver clearly issued far fewer ZIOs everywhere, because many ZIOs were spurious and thus skipped.

@behlendorf (Contributor):

Very clever, and a much smaller change than I would have guessed. This is a really nice optimization. @ironMann what are your thoughts?

@ironMann (Contributor):

Very nice find. Anything that will decrease replace time is always welcome. We'll have to prepare for questions like: why does running a scrub immediately after a resilver find cksum errors? :)

...but also adds the risk of overwriting good data.

@thegreatgazoo What do you mean by this? I'll add some questions inline

if (!vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1))
return (B_FALSE);

if (vd->vdev_ops == &vdev_raidz_ops)
@ironMann (Contributor), Oct 24, 2016:

I wonder if this can be checked before the DTL check on the whole vdev? It would be more precise for raidz (more opportunities to skip resilver zios), but then vdev_dtl_reassess() would probably have to know to delete that DTL.

Author:

By doing vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) first, I was able to implement several optimizations in vdev_raidz_need_resilver(), e.g. at the beginning:

if (s + nparity >= dcols) /* spans all child vdevs */
    return (B_TRUE);

The whole function is essentially skipped for big blocks. This is correct only under the assumption that vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) is true.

uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Contributor:

Too bad vdev_raidz_map_alloc() is allocation heavy, so this raidz geometry has to be calculated separately...

Author:

I tried, but there seemed to be no clean way to use vdev_raidz_map_alloc() without adding complexity to an already quite complicated function. By duplicating some of the calculation here, I was able to simplify it a lot, because here: a) we don't care about skip sectors, and b) we only care about blocks that span less than a complete row.
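For reference, a minimal sketch of that simplified calculation (variable names follow vdev_raidz_map_alloc(); this is an illustration of the approach, not necessarily the exact merged code):

boolean_t
vdev_raidz_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
	uint64_t dcols = vd->vdev_children;
	uint64_t nparity = vd->vdev_nparity;
	uint64_t unit_shift = vd->vdev_top->vdev_ashift;
	/* The starting sector of the block on the raidz vdev. */
	uint64_t b = offset >> unit_shift;
	/* The number of data sectors, skip sectors ignored. */
	uint64_t s = ((psize - 1) >> unit_shift) + 1;
	/* The first column for this stripe. */
	uint64_t f = b % dcols;
	uint64_t c;

	if (s + nparity >= dcols)	/* spans all child vdevs */
		return (B_TRUE);

	for (c = 0; c < s + nparity; c++) {
		vdev_t *cvd = vd->vdev_child[(f + c) % dcols];

		/*
		 * A dirty DTL on a child this block touches means the
		 * block may be missing data, so resilver it.
		 */
		if (!vdev_dtl_empty(cvd, DTL_PARTIAL))
			return (B_TRUE);
	}

	return (B_FALSE);
}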

@thegreatgazoo (Author) left a comment:

@ironMann Thanks for the comments.

If a scrub finds cksum errors after a patched resilver, then it's a bug in this patch. My own testing hasn't hit such a case, but it's far from extensive. If you find such a test case, please let me know.

As to overwriting good data, it may be possible, e.g.:

  1. A BP has 2 DVAs, one on a degraded raidz, the other on another healthy vdev.
  2. The resilver algorithm reads the 2nd DVA from the healthy vdev, and uses that to overwrite the 1st DVA on the degraded raidz.
  3. But the DVA on the degraded raidz is actually healthy, because it doesn't use the child vdev that is bad or offline.

With this patch, both the 1st read and the 2nd write are eliminated.

@behlendorf added the Type: Performance label on Oct 25, 2016
@GeLiXin (Contributor) commented Oct 30, 2016

That's wonderful and makes the raidz algorithm better 👍

BTW, should we go a step further in vdev_mirror_io_start(), which may lead to different resilver behaviour? Take the same example as you mentioned:

  1. A BP has 2 DVAs, one on a degraded raidz, the other on another healthy vdev.
  2. We can now make a precise decision about whether IO is needed via your dsl_scan_need_resilver(); suppose the result is B_TRUE.
  3. vdev_mirror_io_start() makes a (more or less) random decision about which vdev to read first, but the choice has different performance impacts.
  4. If we choose the healthy vdev to read first, we can write the good data we get directly to the degraded one without raidz reconstruction, but then the resilver affects all the top-level vdevs in the pool.
  5. If we choose the degraded vdev to read first, it can repair itself without impacting the healthy vdev, but the resilver may take more time.

Should we make this decision deliberately instead of leaving it to chance?

@thegreatgazoo (Author):

@GeLiXin There could be some possible optimizations in vdev_mirror_io_start(). The example you outlined is not really spurious IO, but rather a possible optimization of a legitimate resilver IO. I don't have time to look into that, but I'd be happy to review patches if you want to work on it.

@GeLiXin (Contributor) commented Nov 1, 2016

@thegreatgazoo Yes, it should be done in another patch, and it may take more time in performance testing than in coding. I will put it on my schedule, and I would really appreciate it if you could review it then.

@behlendorf (Contributor):

@thegreatgazoo you can resolve the lint warnings by rebasing on master.

@behlendorf added the Status: Work in Progress label on Nov 4, 2016
@prometheanfire (Contributor):

Is there any reason not to test / use this? It seems fairly straightforward.

@behlendorf (Contributor):

@prometheanfire this is ready for additional testing.

@prometheanfire (Contributor):

ya, needs a rebase before I can though...

@@ -1833,6 +1833,36 @@ dsl_scan_scrub_done(zio_t *zio)
mutex_exit(&spa->spa_scrub_lock);
}

static boolean_t
dsl_scan_need_resilver(spa_t *spa, const dva_t *dva,
size_t size, uint64_t phys_birth)
Member:

indentation here doesn't match style (should be 4 spaces)

}
if (!needs_io)
needs_io = dsl_scan_need_resilver(spa, dva,
size, phys_birth);
Member:

indentation should be 4 spaces more than previous line.


offset = DVA_GET_OFFSET(dva);
if (vd->vdev_ops == &vdev_raidz_ops)
return (vdev_raidz_need_resilver(vd, offset, size));
Member:

The more object-oriented approach (which I think would be preferable) would be to add a function to the vdev ops which tells us if resilver is needed. This would be trivial for all but raidz vdev types.
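Sketching what that could look like (hypothetical names and wiring, offered as an illustration rather than the merged code): declare a callback type, add a pointer to it in vdev_ops_t, and let vdev types without anything clever to do leave it NULL.

/* New callback type (hypothetical declaration): */
typedef boolean_t vdev_need_resilver_func_t(vdev_t *vd, uint64_t offset,
    size_t psize);

/*
 * vdev_ops_t gains a "vdev_need_resilver_func_t *vdev_op_need_resilver"
 * member alongside the existing io_start/io_done ops.  raidz would point
 * it at its geometry-aware check; vdev types that always need the
 * resilver IO can leave it NULL and the generic code simply returns
 * B_TRUE.
 */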

@@ -1440,6 +1440,38 @@ vdev_raidz_child_done(zio_t *zio)
rc->rc_skipped = 0;
}

boolean_t
vdev_raidz_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
Member:

FYI, we did something similar for zpool initialize. It's doing a similar calculation. Once that's upstreamed, this could be implemented by calling vdev_op_xlate and checking if the result's rs_start == rs_end. The caller would have to loop over the child vdevs and call vdev_op_xlate on each of them. Not sure if that would actually be better, but I thought you'd find it interesting. It's also an example of adding a new vdev op, which I think should be done here.
delphix/delphix-os@f08e5bc#diff-826708a55165ad963711ccae15edaad6R2407

Member:

I would suggest adding a comment explaining what this function is supposed to do, and its assumptions. For example, it assumes that at least one of the children has a dirty DTL.
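Something along these lines could serve as that comment (my wording, offered as an example rather than the merged text):

/*
 * Determine whether the block at [offset, offset + psize) on this raidz
 * vdev needs to be resilvered, i.e. whether it touches a child vdev with
 * a dirty DTL.  Assumes the caller has already checked, via
 * vdev_dtl_contains(), that at least one child has a dirty DTL covering
 * the block's birth txg.  psize (not asize) is passed in so that skip
 * sectors are ignored.
 */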

uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Member:

Using descriptive variable names here would aid in comprehension. (rather than b, s, f, c). From what I can tell:
b is the offset in sectors
s is the number of sectors
f is the child that this block starts on
c is the sector count (maybe i would be better here since it's a generic iterator)
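A concrete instance may help (numbers assumed purely for illustration): on a 9-child raidz1 with ashift=12, a block at offset 53248 with psize 8192 gives

    b = 53248 >> 12            = 13   /* starting sector on the raidz vdev */
    s = ((8192 - 1) >> 12) + 1 =  2   /* data sectors */
    f = 13 % 9                 =  4   /* first column touched */

so the block occupies 3 of the 9 children (columns 4-6: two data sectors plus one parity), c simply iterates over those columns, and the block needs resilvering only if one of those three children has a dirty DTL.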

* vdev_dtl_contains(). So here just check cvd with
* vdev_dtl_empty(), cheaper and a good approximation.
*/
devidx = (f + c) % dcols;
Member:

you could declare devidx inside the for loop

@behlendorf (Contributor):

@thegreatgazoo I hope you don't mind, but I've rebased and updated this PR to address @ahrens review comments.

  • A vdev_op_need_resilver op was added to the vdev_t. The vdev_dtl_need_resilver() function was added to call the new op which only makes sense for top-level vdevs.
  • Additional comments added.
  • I decided against changing the variable names in vdev_raidz_need_resilver() since they mirror the names used in vdev_raidz_map_alloc(). This way anyone already familiar with the RAIDZ implementation should be familiar with the names. However, I did shamelessly steal the comments describing these variables from vdev_raidz_map_alloc().

@ahrens (Member) commented May 10, 2017

@grwilson may also be interested in looking at this.

* When the offset does not intersect with a dirty leaf DTL
* then it may be possible to skip the resilver IO.
*/
if (!vdev_dtl_need_resilver(vd, DVA_GET_OFFSET(dva), size))
Member:

Let's rename size to psize, and add a comment explaining why we pass around psize rather than asize. It would be a more obvious interface to use asize, but I guess that psize lets the RAIDZ code more easily ignore the pad sectors.

Contributor:

Will do.

Author:

Yes, I used psize exactly because I wanted to ignore skip sectors.
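For context, the relationship between the two sizes on raidz, paraphrasing the logic of the existing vdev_raidz_asize() (treat the details as approximate):

	/* data sectors implied by psize */
	asize = ((psize - 1) >> ashift) + 1;
	/* one parity sector per (dcols - nparity) data sectors, rounded up */
	asize += nparity * ((asize + dcols - nparity - 1) / (dcols - nparity));
	/* pad/skip sectors round the allocation to a multiple of nparity + 1 */
	asize = roundup(asize, nparity + 1) << ashift;

The parity and padding terms are exactly what the skip check doesn't care about, which is why passing psize keeps vdev_raidz_need_resilver() simple.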

@@ -606,13 +606,20 @@ vdev_mirror_state_change(vdev_t *vd, int faulted, int degraded)
vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
}

static boolean_t
vdev_mirror_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
Member:

can't this use a NULL func like vdev_disk?

Contributor:

Sure, we can set vd->vdev_op_need_resilver = NULL for mirrors if you prefer. At the time it felt clearer to register something for both mirrors and raidz. I'll remove it.

{
uint64_t dcols = vd->vdev_children;
uint64_t nparity = vd->vdev_nparity;
uint64_t unit_shift = vd->vdev_top->vdev_ashift;
Member:

rename to ashift?

Author:

I adopted this name from vdev_raidz_map_alloc()'s unit_shift; I wanted to be consistent with that.

Contributor:

If we're going to change this let's do it in both places.

}
}
if (!needs_io)
needs_io = dsl_scan_need_resilver(spa, dva, size,
Member:

consider renaming size -> psize.

Contributor:

Will do.

@behlendorf (Contributor):

Refreshed with the suggested changes.

On a raidz vdev, a block that does not span all child vdev's, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.

This patch eliminates such spurious IOs.

Signed-off-by: Isaac Huang <he.huang@intel.com>
@behlendorf (Contributor):

Refreshed. Ztest does occasionally call vdev_dtl_need_resilver() with a leaf vdev. The ASSERT was changed to a normal conditional which returns B_TRUE in this case.

@behlendorf merged commit 3d6da72 into openzfs:master on May 13, 2017
@thegreatgazoo (Author):

@behlendorf Thanks for updating this. Sorry I was overwhelmed by my other projects.

@thegreatgazoo deleted the new_spurious_resilver branch on May 13, 2017 at 21:17
boolean_t
vdev_dtl_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
ASSERT(vd != vd->vdev_spa->spa_root_vdev);
Author:

I don't understand this assertion. Shouldn't it be ASSERT3P(vd, ==, vd->vdev_top) instead? It's checking a DVA, so the function only makes sense for a top-level vdev.

Contributor:

ztest constructs pools with leaves directly attached to the root, so the patch only explicitly excludes the root vdev.
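Putting those two points together, the dispatcher could end up looking roughly like this (a sketch consistent with the descriptions above, not a verbatim copy of the merged code):

boolean_t
vdev_dtl_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
	ASSERT(vd != vd->vdev_spa->spa_root_vdev);

	/*
	 * ztest can hand us a leaf attached directly to the root, and
	 * vdev types without the op have no geometry to exploit; in
	 * either case fall back to resilvering the block.
	 */
	if (vd->vdev_ops->vdev_op_leaf ||
	    vd->vdev_ops->vdev_op_need_resilver == NULL)
		return (B_TRUE);

	return (vd->vdev_ops->vdev_op_need_resilver(vd, offset, psize));
}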
