
Implement sequential (two-phase) resilvering #3625

Closed
Deewiant opened this Issue Jul 24, 2015 · 23 comments

@Deewiant

Deewiant commented Jul 24, 2015

https://blogs.oracle.com/roch/entry/sequential_resilvering describes a two-phase resilvering process which avoids random I/O and can dramatically speed up resilvering, especially on HDDs.

As far as I know, ZoL doesn't do anything like this, so I created this issue to keep track of the situation.
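
In case it helps frame the discussion, here is a minimal sketch of the shape of the idea as I understand it from the post (hypothetical types and names, not ZFS code): a cheap metadata pass collects the extents that need repair, then they are sorted by physical offset and repaired sequentially.

```c
/* Minimal sketch of the two-phase idea; hypothetical types, not ZFS code. */
#include <stdint.h>
#include <stdlib.h>

typedef struct extent {
	uint64_t offset;	/* physical offset on the degraded vdev */
	uint64_t size;
} extent_t;

static int
extent_cmp(const void *a, const void *b)
{
	const extent_t *ea = a, *eb = b;

	if (ea->offset < eb->offset)
		return (-1);
	return (ea->offset > eb->offset);
}

/*
 * Phase 1 (not shown): walk the block-pointer tree in logical order and
 * record which physical extents need repair -- metadata-only work.
 * Phase 2: sort those extents by physical offset and repair them in
 * ascending order, so the disks see mostly sequential I/O instead of the
 * random reads a purely logical-order traversal produces.
 */
void
two_phase_resilver(extent_t *extents, size_t n,
    void (*repair)(const extent_t *))
{
	qsort(extents, n, sizeof (extent_t), extent_cmp);
	for (size_t i = 0; i < n; i++)
		repair(&extents[i]);
}
```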

@behlendorf

Member

behlendorf commented Jul 24, 2015

@Deewiant thanks for filing this. Yes, this is something we've talked about implementing for some time. I think it would be great to implement when someone has the time.

@adilger

adilger commented Dec 3, 2015

Resilvering will also benefit greatly from the metadata allocation class of issue #3779, which separates metadata onto a dedicated SSD device.

Another option that was discussed in the past for mirror devices (I'm not sure if it was ever implemented) is to do a full linear "dd"-style copy of the working device to the replacement, and then follow up with a scrub to verify the data was written correctly. That gets the data redundancy back quickly, using nice large streaming I/O requests to the disks, and the scrub can then be done at a lower priority. The source device may still have latent sector errors, so the failing drive shouldn't be taken offline until after the scrub, so that it can, if possible, be used to read any blocks with bad checksums.
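
Roughly, the clone step would look something like the following (purely illustrative, with made-up names; plain file descriptors stand in for the mirror children, and checksum verification is left to the subsequent scrub):

```c
/* Illustrative only: restore mirror redundancy with one big linear copy. */
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK	(16UL * 1024 * 1024)	/* large streaming I/O requests */

static int
linear_clone(int src_fd, int dst_fd, uint64_t dev_size)
{
	char *buf = malloc(CHUNK);

	if (buf == NULL)
		return (-1);

	for (uint64_t off = 0; off < dev_size; off += CHUNK) {
		size_t len = (dev_size - off < CHUNK) ?
		    (size_t)(dev_size - off) : CHUNK;

		if (pread(src_fd, buf, len, (off_t)off) != (ssize_t)len ||
		    pwrite(dst_fd, buf, len, (off_t)off) != (ssize_t)len) {
			free(buf);
			return (-1);
		}
	}
	free(buf);
	return (0);	/* redundancy is back; checksums still to be scrubbed */
}
```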

@jumbi77

jumbi77 commented Feb 26, 2016

I guess another approach to speeding up resilvering is the parity-declustered RAIDz/mirror of #3497?

Another idea which has not been mentioned yet is the "RAID-Z/mirror hybrid allocator" from Oracle. I'm not sure whether it also accelerates resilvering, but I guess it is a nice performance boost in general. As far as I understand, metadata is then mirrored within the raidz. Is it planned to implement this, or is it obsolete because of #3779?

@thegreatgazoo

Member

thegreatgazoo commented Mar 2, 2016

@jumbi77 Parity declustered RAID is a new type of VDEV, called dRAID, which will offer scalable rebuild performance. It will not affect how RAIDz resilver works.

@jumbi77

jumbi77 commented May 15, 2016

Is anybody working on this feature, or planning to implement it in the future? Just curious.

@nwf

Contributor

nwf commented Jul 6, 2016

I'd like to suggest that there be a RAM-only queue mode for sequential resilvering, along the lines of rsync's asynchronous recursor. Rather than traverse all the metadata blocks at once and sort all the data blocks, which is likely to be an enormous collection which must itself be serialized to disk, it might be nice for the system to use a standard in-RAM producer/consumer queue with the producer (metadata recursor) stalling if the queue fills. The queue would of course be sorted by address on device (with multiple VDEVs intermixed) so that it acted as an enormous elevator queue. While no longer strictly sequential -- the recursor would find blocks out of order and have a limited ability to sort while blocks were in queue -- the collection of data block pointers no longer needs to be persisted to disk and there should be plenty of opportunities for streaming reads.

I suppose the other downside to such a thing is that it seems difficult to persist enough data to allow scrubs to pick up where they left off across exports or reboots, but I am not convinced that that is all that useful?
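
In userspace terms I'm picturing something like this toy bounded queue (invented names; a kernel version would presumably use the existing AVL-tree and condition-variable primitives rather than a linked list and pthreads):

```c
/* Toy sketch: a bounded queue of repair extents kept sorted by on-disk
 * offset, so the consumer always pops the lowest offset, elevator-style. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node {
	uint64_t	offset;
	uint64_t	size;
	struct node	*next;
} node_t;

typedef struct equeue {	/* init mutex/conds and a nonzero limit before use */
	pthread_mutex_t	lock;
	pthread_cond_t	not_full, not_empty;
	node_t		*head;		/* singly linked, ascending offset */
	size_t		count, limit;
} equeue_t;

/* Producer (metadata recursor): insert in sorted position, stall when full. */
void
equeue_put(equeue_t *q, uint64_t offset, uint64_t size)
{
	node_t *n = malloc(sizeof (*n)), **pp;

	if (n == NULL)
		abort();	/* toy sketch: no real error handling */
	n->offset = offset;
	n->size = size;

	pthread_mutex_lock(&q->lock);
	while (q->count >= q->limit)
		pthread_cond_wait(&q->not_full, &q->lock);
	for (pp = &q->head; *pp != NULL && (*pp)->offset < offset; )
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
	q->count++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Consumer (I/O issuer): always take the lowest remaining offset. */
node_t *
equeue_get(equeue_t *q)
{
	node_t *n;

	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	n = q->head;
	q->head = n->next;
	q->count--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return (n);	/* caller issues the read/repair, then frees the node */
}
```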

@ironMann

Member

ironMann commented Jul 6, 2016

I've just started looking into this. Initially I had the same idea of a RAM-only solution (and it might be my first prototype), but I don't think it will be enough for larger pools. As the design document in #1277 suggests, there are a few benefits to persisting the resilver log. I'm thinking about a solution in line with async_destroy, but I still have a lot to learn about ZFS internals.

If somebody would like to provide mentorship for this project, feel free to contact me.

@nwf

Contributor

nwf commented Jul 6, 2016

I think a strictly-read-only scrub might be worthwhile, too. Maybe make persistence optional (treat it as a queue without bound so that it never blocks)?

Alternatively, doing multiple metadata scans and selecting the next consecutive chunk of DVAs, again without persistence, might be simpler. In this design, one would walk the metadata in full and collect the lowest e.g. 16M data DVAs (in sorted order, so that it can be traversed with big streaming reads). By remembering what the 16Mth DVA was, the next walk of the metadata could collect the next 16M DVAs. This is an easily resumable (just remember which bin of DVAs was being scrubbed) and bounded-memory algorithm that should be easy to implement. (Credit, I think, is due to HAMMER2.)
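
A sketch of that multi-pass scheme, with hypothetical walk_metadata() and issue_read() helpers standing in for the real traversal and I/O code; the only state that needs to survive a restart is the cursor:

```c
/*
 * Hypothetical sketch: each pass walks all metadata and keeps only the
 * BATCH lowest DVA offsets strictly greater than the cursor, then issues
 * them in sorted order and advances the cursor.  Bounded memory, and
 * resuming after export/reboot only requires remembering the cursor.
 */
#include <stdint.h>
#include <stdlib.h>

#define BATCH	16384	/* DVAs kept in memory per pass (illustrative) */

extern void walk_metadata(void (*cb)(uint64_t dva, void *arg), void *arg);
extern void issue_read(uint64_t dva);	/* hypothetical I/O helper */

typedef struct pass {
	uint64_t cursor;	/* highest DVA offset already handled */
	uint64_t batch[BATCH];
	size_t   n;
} pass_t;

static void
collect_cb(uint64_t dva, void *arg)
{
	pass_t *p = arg;
	size_t maxi = 0;

	if (dva <= p->cursor)
		return;			/* handled in an earlier pass */
	if (p->n < BATCH) {
		p->batch[p->n++] = dva;
		return;
	}
	/* Full: keep the BATCH smallest (linear max search, for brevity). */
	for (size_t i = 1; i < BATCH; i++)
		if (p->batch[i] > p->batch[maxi])
			maxi = i;
	if (dva < p->batch[maxi])
		p->batch[maxi] = dva;
}

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return ((x < y) ? -1 : (x > y));
}

void
scrub_in_passes(void)
{
	static pass_t p;	/* cursor starts at 0 */

	for (;;) {
		p.n = 0;
		walk_metadata(collect_cb, &p);		/* full metadata walk */
		if (p.n == 0)
			break;				/* nothing left */
		qsort(p.batch, p.n, sizeof (uint64_t), cmp_u64);
		for (size_t i = 0; i < p.n; i++)
			issue_read(p.batch[i]);		/* mostly sequential */
		p.cursor = p.batch[p.n - 1];		/* persist to resume */
	}
}
```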

@thewacokid

thewacokid commented Jul 18, 2016

Perhaps this is not the correct way to push this, but SMR drives would absolutely love even a slightly sequential workload for resilvers. The current code degrades to 1-5 IOPS with SMR drives over time, which makes rebuilds take an eternity, especially with the queue depth stuck at 1 on the drive being replaced (is that a bug, or expected? I haven't had time to dig into the source).

Just to clarify, this is the SMR drive being resilvered to after a few hours (filtering out idle drives):
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 1.00 0.00 0.00 99.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdgt 0.00 0.00 0.00 6.00 0.00 568.00 189.33 1.00 165.67 166.67 100.00

@ironMann

Member

ironMann commented Jul 19, 2016

@thewacokid try boosting scrub I/O with the parameters suggested in #4825 (comment).
I've started a discussion on the openzfs-developer mailing list about this feature, and it seems work on it has already started.

@thewacokid

thewacokid commented Jul 19, 2016

@ironMann Those parameters help massively with normal drives; however, SMR drives eventually (within an hour or so) degrade to a handful of IOPS as they shuffle data out to the shingles. Perhaps higher queue depths would help, or async rebuild I/O, or something easier than a full sequential-resilver patch? I'm unsure why there's only ever one pending I/O to the target drive.

@scineram

scineram commented Aug 24, 2016

@thewacokid There will be a talk next month on this topic to watch out for, probably covering the work @ironMann mentioned.
http://open-zfs.org/wiki/Scrub/Resilver_Performance

@mailinglists35

mailinglists35 commented Sep 27, 2016

From the OpenZFS conference recording, I understand Nexenta may be able to do it. Can't wait to see this in ZoL!
http://livestream.com/accounts/15501788/events/6340478/videos/137014181
Scroll to minute 15.

@mailinglists35

mailinglists35 commented Mar 8, 2017

What is the status of this? The link I pasted above is no longer working.

@mailinglists35

mailinglists35 commented Mar 8, 2017

Hm, PR #5153 mentions a new PR, #5841, which intends to solve #3497 and appears to do faster resilvering.

@nwf

Contributor

nwf commented Mar 9, 2017

@mailinglists35: the dRAID stuff is different, though it happens to have similar effects.

@skiselkov has written all the code to do this; it's in review at skiselkov/illumos-gate#2 and https://github.com/skiselkov/illumos-gate/commits/better_resilver_illumos. I have ported the code over to ZoL and been testing it with delightful success (it was very straightforward, doubtless in part because ZoL strives to minimize divergence against upstream). I should assume a pull request to ZoL is forthcoming once the review is done and code gets put back to Illumos.

ETA: @skiselkov's implementation is purely in-RAM and achieves persistence by periodically draining the reorder buffer, thereby bringing the metadata recursor's state and the set of blocks actually scrubbed into sync. This is a really neat design and keeps the on-disk persistence structure fully backwards compatible with the existing records. He deserves immense praise for the work. :)
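
To make that concrete, here is my (outsider's) reading of the drain-to-checkpoint idea as a rough sketch; every helper name below is hypothetical, not the actual illumos/ZoL code:

```c
/*
 * Rough sketch (hypothetical names): blocks found by the metadata recursor
 * are buffered and sorted in RAM; periodically the buffer is drained in
 * offset order, and only then is the recursor's bookmark written out using
 * the existing on-disk scan record.  Because the persisted bookmark never
 * runs ahead of what has actually been repaired, resuming after an export
 * or reboot cannot skip anything that was still sitting in RAM.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct bookmark {
	uint64_t objset, object, level, blkid;
} bookmark_t;

extern bool	recursor_next(bookmark_t *bm, uint64_t *offset, uint64_t *size);
extern void	sorted_buf_add(uint64_t offset, uint64_t size);
extern void	sorted_buf_drain(void);	/* issue buffered I/O in offset order */
extern uint64_t	sorted_buf_bytes(void);
extern void	persist_bookmark(const bookmark_t *bm);	/* existing scan record */

#define	DRAIN_THRESHOLD	(512ULL * 1024 * 1024)	/* illustrative RAM cap */

void
scan_with_checkpoints(void)
{
	bookmark_t bm = { 0 };
	uint64_t offset, size;

	while (recursor_next(&bm, &offset, &size)) {
		sorted_buf_add(offset, size);
		if (sorted_buf_bytes() >= DRAIN_THRESHOLD) {
			sorted_buf_drain();	/* repaired frontier catches up */
			persist_bookmark(&bm);	/* safe checkpoint */
		}
	}
	sorted_buf_drain();
	persist_bookmark(&bm);
}
```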

@skiselkov

Contributor

skiselkov commented Mar 9, 2017

@nwf Just an FYI, the resilver work isn't quite complete yet. I have a number of changes queued up that build in some more suggestions from Matt Ahrens from the design/early review phase. Notably, a lot of the range_tree code is going to change, as well as some of the vdev queue taskq handling. Nothing too dramatic; I just don't want you to put in a lot of work on porting only to have it invalidated by those changes.

@nwf

Contributor

nwf commented Mar 9, 2017

@skiselkov: No worries! I'm happy to follow along and start over if needed. :)

@thegreatgazoo

Member

thegreatgazoo commented Mar 9, 2017

Just to clarify:

  • Resilver, and any optimization of it, works with any type of vdev, including the new dRAID vdev.
  • Rebuild, a completely new mechanism added by dRAID, works only with dRAID and mirror.

@mailinglists35

mailinglists35 commented Mar 9, 2017

Thank you all!
@behlendorf could you add a milestone for this?

behlendorf added this to the 0.8.0 milestone Mar 9, 2017

@jumbi77

jumbi77 commented Jun 26, 2017

Referencing #6256.

@interduo

interduo commented Nov 17, 2017

Thanks for this. This was a big problem for me in one location.

Beer!

Will this come in the 0.7.4 release?

@behlendorf

Member

behlendorf commented Nov 17, 2017

You're welcome; this feature will be part of 0.8.

Nasf-Fan added a commit to Nasf-Fan/zfs that referenced this issue Jan 29, 2018

Sequential scrub and resilvers
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes zfsonlinux#3625
Closes zfsonlinux#6256

Nasf-Fan added a commit to Nasf-Fan/zfs that referenced this issue Feb 13, 2018

Sequential scrub and resilvers