Skip spurious resilver IO on raidz vdev #5316
Conversation
Very clever, and a much smaller change than I would have guessed. This is a really nice optimization. @ironMann what are your thoughts?
Very nice find. Anything that will decrease replace time is always welcome. We'll have to prepare for questions like: why does running a scrub immediately after a resilver find cksum errors? :)
@thegreatgazoo What do you mean by this? I'll add some questions inline
module/zfs/dsl_scan.c
Outdated
if (!vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1))
        return (B_FALSE);

if (vd->vdev_ops == &vdev_raidz_ops)
I wonder if this can be checked before the DTL on the whole vdev? It should be more precise for raidz (more opportunities to skip resilver zio), but then vdev_dtl_reassess() would probably have to know to delete that DTL.
By doing vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) first, I was able to implement several optimizations in vdev_raidz_need_resilver(), e.g. at the beginning:

if (s + nparity >= dcols) /* spans all child vdevs */
        return (B_TRUE);

The whole function is essentially skipped for big blocks. This is correct only under the assumption that vdev_dtl_contains(vd, DTL_PARTIAL, phys_birth, 1) is true.
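To illustrate the early-exit reasoning in isolation: a block of s data sectors plus nparity parity sectors occupies s + nparity columns, and if that reaches dcols it touches every child vdev, so no child failure can leave it unaffected. This is a hedged sketch, not the patch's code; spans_all_children() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch (not the actual ZFS code): a block whose data
 * sectors plus parity sectors cover at least dcols columns touches
 * every child vdev, so a failure of any child affects it and the
 * resilver IO cannot be skipped.
 */
static bool
spans_all_children(uint64_t psize, uint64_t ashift,
    uint64_t nparity, uint64_t dcols)
{
	/* Number of data sectors, rounded up to whole sectors. */
	uint64_t s = ((psize - 1) >> ashift) + 1;

	return (s + nparity >= dcols);
}
```

For example, on a 9-wide raidz1 with ashift=12, a 32KB block occupies 8 data columns plus 1 parity column and so spans all 9 children, while a 28KB block does not.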
module/zfs/vdev_raidz.c
Outdated
uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Too bad vdev_raidz_map_alloc() is allocation heavy, so this raidz geometry has to be calculated separately...
I tried, but there seemed to be no clean way to use vdev_raidz_map_alloc() without adding complexity to an already quite complicated function. By duplicating some of the calculation here, I was able to simplify it a lot, because here: a) we don't care about skip sectors, and b) we only care about blocks that span less than a complete row.
@ironMann Thanks for the comments.
If a scrub finds cksum errors after a patched resilver, then it's a bug in this patch. My own tests haven't hit such a case, but they're far from exhaustive. If you find such a test case, please let me know.
As to overwriting good data, it may be possible, e.g.:
- A BP has 2 DVAs, one on a degraded raidz, the other on another healthy vdev.
- The resilver algorithm reads the 2nd DVA from the healthy vdev, and uses it to overwrite the 1st DVA on the degraded raidz.
- But the DVA on the degraded raidz is actually healthy, because it doesn't use the child vdev that is bad or offline.
With this patch, both the read and the subsequent write are eliminated.
That's wonderful and makes the raidz algorithm better 👍 BTW, shall we go a step further in
Shall we make decisions instead of chaos?
@GeLiXin There could be some possible optimization in
Force-pushed from 9f438bf to 0029cdd
@thegreatgazoo Yes, it should be done in another patch, and it may take more time in performance testing than in coding. I will put it on my schedule, and I would really appreciate it if you could review it then.
Force-pushed from 0029cdd to 6ab7617
@thegreatgazoo you can resolve the lint warnings by rebasing on master.
Force-pushed from 6ab7617 to f8ebfcf
Is there any reason not to test / use this? It seems fairly straightforward.
@prometheanfire this is ready for additional testing.
ya, needs a rebase before I can though...
module/zfs/dsl_scan.c
Outdated
@@ -1833,6 +1833,36 @@ dsl_scan_scrub_done(zio_t *zio)
        mutex_exit(&spa->spa_scrub_lock);
}

static boolean_t
dsl_scan_need_resilver(spa_t *spa, const dva_t *dva,
    size_t size, uint64_t phys_birth)
indentation here doesn't match style (should be 4 spaces)
module/zfs/dsl_scan.c
Outdated
}
if (!needs_io)
        needs_io = dsl_scan_need_resilver(spa, dva,
            size, phys_birth);
indentation should be 4 spaces more than previous line.
module/zfs/dsl_scan.c
Outdated
offset = DVA_GET_OFFSET(dva);
if (vd->vdev_ops == &vdev_raidz_ops)
        return (vdev_raidz_need_resilver(vd, offset, size));
The more object-oriented approach (which I think would be preferable) would be to add a function to the vdev ops which tells us if resilver is needed. This would be trivial for all but raidz vdev types.
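The suggestion could look roughly like the sketch below. Everything here is hypothetical and simplified for illustration: the `my_vdev*` names and the trivial `raidz_need_resilver()` body are invented, and only loosely follow the real `vdev_ops_t` layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified sketch of the per-vdev-type op approach.
 * A NULL op means "no type-specific check; resilver is needed".
 */
struct my_vdev;

typedef struct my_vdev_ops {
	bool (*vdev_op_need_resilver)(struct my_vdev *vd,
	    uint64_t offset, size_t psize);
} my_vdev_ops_t;

typedef struct my_vdev {
	const my_vdev_ops_t *vdev_ops;
} my_vdev_t;

static bool
raidz_need_resilver(my_vdev_t *vd, uint64_t offset, size_t psize)
{
	(void) vd; (void) offset; (void) psize;
	return (false);	/* the raidz geometry check would go here */
}

static const my_vdev_ops_t raidz_ops = { raidz_need_resilver };
static const my_vdev_ops_t mirror_ops = { NULL };

/* Trivial for all but raidz: absent an op, assume resilver is needed. */
static bool
need_resilver(my_vdev_t *vd, uint64_t offset, size_t psize)
{
	if (vd->vdev_ops->vdev_op_need_resilver != NULL)
		return (vd->vdev_ops->vdev_op_need_resilver(vd,
		    offset, psize));
	return (true);
}
```

This keeps the type dispatch out of dsl_scan.c: the caller never tests `vdev_ops == &vdev_raidz_ops` directly.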
module/zfs/vdev_raidz.c
Outdated
@@ -1440,6 +1440,38 @@ vdev_raidz_child_done(zio_t *zio)
        rc->rc_skipped = 0;
}

boolean_t
vdev_raidz_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
FYI, we did something similar for zpool initialize; it's doing a similar calculation. Once that's upstreamed, this could be implemented by calling vdev_op_xlate and checking if the result's rs_start == rs_end. The caller would have to loop over the child vdevs and call vdev_op_xlate on each of them. Not sure if that would actually be better, but thought you'd find it interesting. It's also an example of adding a new vdev op, which I think should be done here.
delphix/delphix-os@f08e5bc#diff-826708a55165ad963711ccae15edaad6R2407
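The xlate idea described above could look roughly like this sketch. Everything here is hypothetical: `range_seg_t` mimics only the `rs_start`/`rs_end` fields mentioned in the comment, and `toy_xlate()` is an invented stand-in translation, not the actual RAID-Z mapping.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the vdev_op_xlate idea: translate the top-level logical
 * range into each child's physical range; a child whose translated
 * range is empty (rs_start == rs_end) is not touched by the block.
 */
typedef struct range_seg {
	uint64_t rs_start;
	uint64_t rs_end;
} range_seg_t;

typedef void (*xlate_fn_t)(uint64_t child, const range_seg_t *logical,
    range_seg_t *physical);

static bool
child_touched_by_range(uint64_t child, const range_seg_t *logical,
    xlate_fn_t xlate)
{
	range_seg_t physical;

	xlate(child, logical, &physical);
	return (physical.rs_start != physical.rs_end);
}

/* Toy translation: child c owns every c-th 4K sector of a 4-wide stripe. */
#define	TOY_DCOLS	4
#define	TOY_SECT	4096
static void
toy_xlate(uint64_t child, const range_seg_t *logical, range_seg_t *physical)
{
	uint64_t first = logical->rs_start / TOY_SECT;
	uint64_t last = (logical->rs_end - 1) / TOY_SECT;	/* inclusive */
	uint64_t touched = 0;

	for (uint64_t s = first; s <= last; s++) {
		if (s % TOY_DCOLS == child)
			touched++;
	}
	physical->rs_start = 0;
	physical->rs_end = touched * TOY_SECT;
}
```

With a real per-vdev xlate op, the resilver check would just loop over the children and skip the IO when no dirty child is touched.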
I would suggest adding a comment explaining what this function is supposed to do, and its assumptions. For example, it assumes that at least one of the children has a dirty DTL.
module/zfs/vdev_raidz.c
Outdated
uint64_t s = ((psize - 1) >> unit_shift) + 1;
/* The first column for this stripe. */
uint64_t f = b % dcols;
uint64_t c, devidx;
Using descriptive variable names here would aid comprehension (rather than b, s, f, c). From what I can tell:
- b is the offset in sectors
- s is the number of sectors
- f is the child that this block starts on
- c is the sector count (maybe i would be better here, since it's a generic iterator)
module/zfs/vdev_raidz.c
Outdated
 * vdev_dtl_contains(). So here just check cvd with
 * vdev_dtl_empty(), cheaper and a good approximation.
 */
devidx = (f + c) % dcols;
you could declare devidx inside the for loop
Force-pushed from f8ebfcf to 5f28635
@thegreatgazoo I hope you don't mind, but I've rebased and updated this PR to address @ahrens review comments.
@grwilson may also be interested in looking at this.
module/zfs/dsl_scan.c
Outdated
 * When the offset does not intersect with a dirty leaf DTL
 * then it may be possible to skip the resilver IO.
 */
if (!vdev_dtl_need_resilver(vd, DVA_GET_OFFSET(dva), size))
Let's rename size to psize, and add a comment explaining why we pass around psize rather than asize. It would be a more obvious interface to use asize, but I guess that psize lets the RAIDZ code more easily ignore the pad sectors.
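To make the psize/asize distinction concrete, here is a sketch of RAID-Z allocated-size arithmetic, modeled loosely on vdev_raidz_asize(); treat it as illustrative rather than a faithful copy. asize adds parity sectors and rounds the total up to a multiple of nparity+1 sectors, and that padding (the skip sectors) is exactly what the geometry check wants to ignore.

```c
#include <stdint.h>

#define	ROUNDUP(x, y)	((((x) + (y) - 1) / (y)) * (y))

/*
 * Illustrative sketch of raidz allocated size: data sectors, plus one
 * parity sector per (partial) row, padded to a multiple of nparity+1.
 */
static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t nparity,
    uint64_t dcols)
{
	uint64_t nsectors = ((psize - 1) >> ashift) + 1;
	uint64_t ndata = dcols - nparity;

	/* One set of parity sectors per (partial) row of data sectors. */
	nsectors += nparity * ((nsectors + ndata - 1) / ndata);
	/* Pad to a multiple of nparity+1 sectors (skip sectors). */
	nsectors = ROUNDUP(nsectors, nparity + 1);

	return (nsectors << ashift);
}
```

For example, on a 9-wide raidz1 with ashift=12, a 4KB psize block allocates 8KB (1 data + 1 parity sector), and a 32KB block allocates 40KB (8 data + 1 parity, padded to 10 sectors).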
Will do.
Yes, I used psize exactly because I wanted to ignore skip sectors.
module/zfs/vdev_mirror.c
Outdated
@@ -606,13 +606,20 @@ vdev_mirror_state_change(vdev_t *vd, int faulted, int degraded)
        vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
}

static boolean_t
vdev_mirror_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
can't this use a NULL func like vdev_disk?
Sure, we can set vd->vdev_op_need_resilver = NULL for mirrors if you prefer. At the time it felt clearer to register something for both mirrors and raidz. I'll remove it.
module/zfs/vdev_raidz.c
Outdated
{
        uint64_t dcols = vd->vdev_children;
        uint64_t nparity = vd->vdev_nparity;
        uint64_t unit_shift = vd->vdev_top->vdev_ashift;
Rename to ashift?
I adopted this name from vdev_raidz_map_alloc(unit_shift), wanted to be consistent with that.
If we're going to change this let's do it in both places.
module/zfs/dsl_scan.c
Outdated
}
}
if (!needs_io)
        needs_io = dsl_scan_need_resilver(spa, dva, size,
consider renaming size -> psize.
Will do.
Force-pushed from 5f28635 to dd6b9e4
Refreshed with the suggested changes.
Force-pushed from dd6b9e4 to 688174d
On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, the current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs. Signed-off-by: Isaac Huang <he.huang@intel.com>
Refreshed. Ztest does occasionally call vdev_dtl_need_resilver() with a leaf vdev. The ASSERT was changed to a normal conditional which returns B_TRUE in this case.
@behlendorf Thanks for updating this. Sorry, I was overwhelmed by my other projects.
boolean_t
vdev_dtl_need_resilver(vdev_t *vd, uint64_t offset, size_t psize)
{
        ASSERT(vd != vd->vdev_spa->spa_root_vdev);
I don't understand this assertion. Shouldn't it be ASSERT3P(vd, ==, vd->vdev_top) instead? It's checking a DVA, so the function only makes sense for a top-level vdev.
ztest constructs pools with leaves directly attached to the root, so the patch only explicitly excludes the root vdev.
On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, the current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data.
For example, if disk C fails, the two blocks that start at 3D and 8E would not need to be resilvered (the illustrating diagram is not reproduced here).
This patch eliminates such spurious IOs.
I've tested the patch on an 8+1 raidz1 vdev, by offlining/onlining/replacing a drive to trigger a resilver, then scrubbing to make sure every block was correctly resilvered. All worked fine. The pool was populated with about 3G of data, mostly small files (e.g. the zfs/spl source trees), which is ideal for testing this patch. A histogram of the ZIOs skipped with this patch (not reproduced here) showed the following:
The ashift was 12 on the 9-drive raidz1, so the largest possible block to skip is 28KB (psize, excluding parity). Not surprisingly, most skipped blocks were no more than 4KB.
In fact, most of the blocks were skipped; the resilver was about 30% faster than the same resilver without skipping spurious ZIOs.
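The 28KB bound follows directly from the geometry described earlier; here is a small check of the arithmetic (assuming 4KB sectors from ashift=12, 9 columns, single parity; the helper name is made up for illustration):

```c
#include <stdint.h>

/*
 * Largest psize that a raidz block can have while still spanning
 * fewer than dcols columns (data sectors + parity < dcols), i.e.
 * while it remains a candidate for skipping resilver IO.
 */
static uint64_t
max_skippable_psize(uint64_t ashift, uint64_t nparity, uint64_t dcols)
{
	/* Largest data sector count s with s + nparity < dcols. */
	uint64_t max_sectors = dcols - nparity - 1;

	return (max_sectors << ashift);
}
```

With ashift=12, nparity=1, and dcols=9 this gives 7 sectors of 4KB, i.e. 28KB, matching the figure above.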
Signed-off-by: Isaac Huang <he.huang@intel.com>