dRAID vdev driver #5841

Closed

Conversation

@thegreatgazoo commented Feb 27, 2017

The latest code has been moved to #7078

This patch implements the dRAID vdev driver and a new rebuild mechanism (#3497). Most features are complete, except:

  • Rebuild stop and resume, and persistent rebuild state.
  • Support for HW-accelerated parity routines: the dRAID code uses the raidz parity functions (generation and reconstruction) but needs a small change. Currently the change has been made only to the original raidz parity functions, i.e. not the new HW-accelerated ones.

However, this is still a work in progress: the user interface may change, and the on-disk format may change as well. There are also still some crufty hacks I'm going to clean up.

I've added a dRAID howto. It contains only the basics for now, but I'll continue to update the document.

Please report bugs to the dRAID project.

Comments, testing, fixes, and porting are greatly appreciated!

Code structure:

  • New code
    • module/zfs/vdev_draid.c: vdev driver for draid and draid spare
    • module/zfs/spa_scan.c: sequential rebuild, for both draid and mirror vdevs
    • cmd/draidcfg/*.[ch]: user space tools, mainly to create permutations
  • Changes to existing code
    • module/zfs/vdev_raidz.c: the parity functions need to include draid skip sectors for computation and reconstruction.
    • module/zfs/vdev_mirror.c: minor changes to support mirror_map_t allocated by draid vdev (for hybrid mirror support)
    • module/zfs/metaslab.c: to add support for draid hybrid mirror, also disallow block allocation during rebuild
    • Other changes:
      • Add knowledge about the new vdev types and the new rebuild mechanism
      • A draid spare pretends to be a leaf vdev but is actually not; some code needs to be aware of that, e.g. when handling labels on leaf vdevs.

The goal is to change existing code in such a way that when draid is not in use, there is effectively no change, though some cleanups are still needed.

@mention-bot

@thegreatgazoo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @ironMann to be potential reviewers.

@thegreatgazoo (Author)

@kpande draidcfg can be easily integrated into zpool create. There's a separate config file now because: a) for bug reporting, the vdev config can be easily uploaded; and b) I'm still improving the code, so I need a way to create the configuration without actually creating a pool. This is where I'd need feedback from the community, especially admins. If the community doesn't want a separate step, it can be removed. I'm fine either way.

As to the mirror rebuild, I chose rebuild by default simply because that's the easiest way for me to test. I want to change the default back to resilver, because in general I want this patch to have zero effective change unless draid is in use or the user has explicitly requested the new mechanism. For example, if rebuild is the default for mirror, it may break user scripts that grep zpool status for the string "resilver" or poll zpool events for a resilver-done event.

@ahrens (Member) commented Mar 10, 2017

@thegreatgazoo I think there should not be a separate utility; zpool create should be able to create a DRAID pool on its own.

@thegreatgazoo (Author)

@ahrens Alright, I'll integrate it into zpool create, and I'll keep draidcfg for a while as I need it to create test configurations without creating a pool.

@behlendorf (Contributor)

I think it's important that creating a draid pool feels the same as creating any other type of configuration. That means we need to somehow specify parity, spares, blocks/stripe, and vdevs as part of zpool create. How about something like this, which is in the spirit of the existing commands?

zpool create <draid[1|2|3]:spares:blocks/stripe> vdevs...

For example, to create a 17-device draid pool with single parity, two spares, and four blocks/stripe, you'd issue the following command.

zpool create draid:2:4 /dev/sd[a-q]  # 17 devices

If the specified pool configuration doesn't satisfy the '(vdevs - spares) % (parity + blocks/stripe) == 0' requirement, the command should fail with a useful error message describing the issue.
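
To make the constraint concrete, here is a minimal sketch in C of the validity check (illustrative only, not code from this patch; the function name and parameters are made up):

    #include <stdbool.h>

    /*
     * Illustrative check of the constraint above: after setting aside the
     * spares, the remaining drives must divide evenly into parity groups
     * of (parity + blocks/stripe) drives each.
     */
    static bool
    draid_config_valid(int ndrives, int nparity, int nspares, int nbps)
    {
        if (nparity < 1 || nparity > 3 || nspares < 1 || nbps < 1)
            return (false);
        return ((ndrives - nspares) % (nparity + nbps) == 0);
    }

For the 17-device example above, draid_config_valid(17, 1, 2, 4) holds because (17 - 2) % (1 + 4) == 0.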

Furthermore, what would be really slick from an admin perspective is if spares and blocks/stripe were optional. We could have spares default to 1 when not specified and then make a best effort to calculate a reasonable blocks/stripe which satisfies the constraint.

For example, the following command would create a 21-device draid pool with triple parity (draid3), a default single spare, and the smallest blocks/stripe value that works (in this case 1).

zpool create draid3 /dev/sd[a-u]  # 21 devices

Or if you wanted three spares, in which case the smallest workable blocks/stripe would be 3 (since 21 - 3 = 18 must be divisible by 3 + blocks/stripe).

zpool create draid3:3 /dev/sd[a-u]  # 21 devices

What I like about this is that it's consistent with the existing syntax everyone is familiar with, which I suspect means it should be reasonably straightforward to fit into the existing parser. Plus it allows knowledgeable users to specify the exact configuration they want, and less savvy admins to quickly get a functional, but perhaps non-optimal, pool.

@thegreatgazoo would something like this be workable? Did I overlook something? Are there some alternate suggestions?

create test configurations without creating a pool.

A potentially nice way to handle this would be to make the existing zpool create dry-run option (-n) draid-aware.

           -n  Displays the configuration that would be used without  actually
               creating  the pool. The actual pool creation can still fail due
               to insufficient privileges or device sharing.

@ahrens (Member) commented Mar 11, 2017

@behlendorf I generally agree with what you're suggesting. However, a few details:

As I understand it, "blocks/stripe" (BPS) is like the number of "data disks" in each parity group. Another name for this might be "data per parity". It determines the minimum allocation size, which is BPS * DiskSectorSize. Like you said, the total number of disks in the DRAID has to be N * (BPS + Nparity) + Nspare, where N is a positive integer.

Increasing BPS (decreasing N) gives better space efficiency, because it increases the ratio of data to parity. Decreasing BPS (increasing N) gives better random-IOPS performance, because the random IOPS of the whole DRAID is approximately N * IOPSofOneDisk, and a better compression ratio, because the minimum allocation size is smaller.
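
As a rough illustration of these relations (a hypothetical sketch, not ZFS code; all names are invented):

    /* Hypothetical helpers illustrating the relations described above. */
    struct draid_cfg {
        int n;          /* number of parity groups (N) */
        int bps;        /* blocks/stripe: data disks per parity group */
        int nparity;    /* parity disks per parity group */
        int nspare;     /* distributed spares */
    };

    /* Total disks: N * (BPS + Nparity) + Nspare. */
    static int
    draid_total_disks(const struct draid_cfg *c)
    {
        return (c->n * (c->bps + c->nparity) + c->nspare);
    }

    /* Minimum allocation size: BPS * DiskSectorSize. */
    static int
    draid_min_alloc(const struct draid_cfg *c, int sector_size)
    {
        return (c->bps * sector_size);
    }

For example, a configuration with N=2, BPS=4, Nparity=2, and Nspare=1 needs 2 * (4 + 2) + 1 = 13 disks, has a minimum allocation size of 4 sectors, and delivers roughly twice the random IOPS of a single disk.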

I agree that Nspare should default to 1. Nspare=0 doesn't make sense because it would eliminate the benefit of DRAID (improved rebuild time due to distributed spare space).

I think that the default should be to maximize BPS (i.e. N=1). This makes DRAID similar to a single RAID-Z group (or a single RAID-5 group), but with improved rebuild time due to distributed spare space. This way, if you don't specify anything, the behavior is as familiar as possible, and as appropriate as possible for smaller installations.

Minimizing BPS (BPS=1) causes DRAID to devolve into mirroring, which I imagine is rarely what the user wants. But if it is, we could have zpool create draidN behave as I suggested, and zpool create dmirrorN make a BPS=1 DRAID.

We may want to have the syntax be draid<1,2,3>[:NumSpares[:N]], i.e. to specify N rather than BPS. Then the default would be the same as draid<1,2,3>:1:1. And we can describe N as "we divide the disks into N parity groups, getting the approximate IOPS of N disks; the total number of data+parity disks must be divisible evenly by N (i.e. (Ndisks - Nspare) % N == 0)".
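
A sketch of how that syntax might parse, with the defaults just described (illustrative only; the real zpool parser would look different):

    #include <stdio.h>

    /*
     * Parse "draid<1|2|3>[:NumSpares[:N]]", e.g. "draid2" or "draid2:1:3".
     * Omitted fields keep their defaults (Nspare = 1, N = 1). Lenient
     * about trailing text; a real parser would reject it.
     */
    static int
    parse_draid(const char *s, int *nparity, int *nspare, int *n)
    {
        *nspare = 1;
        *n = 1;
        if (sscanf(s, "draid%d:%d:%d", nparity, nspare, n) < 1 ||
            *nparity < 1 || *nparity > 3 || *nspare < 1 || *n < 1)
            return (-1);
        return (0);
    }

Under this scheme "draid3" parses identically to "draid3:1:1", matching the proposed default.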

I could be convinced that we should do something more complicated when given enough disks that N>1 will be nearly as space efficient as N=1. However, it's likely that successfully taking advantage of such "smart" auto-configuration would require as much expertise as specifying all the parameters manually, if not more.

For example, maybe DRAID1 could allow BPS as low as 10, DRAID2 as low as 20, and DRAID3 as low as 30. With this BPS, parity is only 10% of the total space, so perhaps further improvement in space efficiency is not worth it. But I would still hesitate to do this, for two reasons:

  1. Our idea of "space efficient enough" may not match the customer's. This is highly dependent on workload and hardware.
  2. Given the requirement that every parity group must have the same BPS, and a constant Nspare, the configuration (and thus performance and space usage) will vary wildly with small changes in the total number of disks (e.g. DRAID1 with 25 disks would result in BPS=11 (N=2), but DRAID1 with 26 disks would result in BPS=24 (N=1); see the sketch below). I think we'd have to relax these requirements (e.g. allow parity groups with different BPS) for any "smart" auto-configuration to be reasonable.
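
To make the jumpiness in point 2 easy to see, here is a small illustrative helper (not part of the patch) that enumerates the workable (N, BPS) pairs for a given disk count:

    #include <stdio.h>

    /*
     * List every (N, BPS) with BPS >= min_bps that satisfies
     * Ndisks == N * (BPS + Nparity) + Nspare.
     */
    static void
    list_configs(int ndisks, int nspare, int nparity, int min_bps)
    {
        int dp = ndisks - nspare;   /* data + parity disks */

        for (int n = 1; n <= dp; n++) {
            if (dp % n != 0)
                continue;
            int bps = dp / n - nparity;
            if (bps >= min_bps)
                printf("N=%d BPS=%d\n", n, bps);
        }
    }

With Nparity=1, Nspare=1, and min_bps=10, 25 disks yields N=2 BPS=11 and N=1 BPS=23, while 26 disks yields only N=1 BPS=24.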

@thegreatgazoo (Author)

I agree draidcfg should be removed, but I'd tend not to make smart choices for the admins. There are many factors to trade off against each other, and the optimal configuration highly depends on the workload. Some thoughts:

  • Default Nspare: I think if we have to choose a default here, the preferred choice would be Nparity, because then there'd be enough spare space to handle the maximum possible number of concurrent failures. But the bigger Nspare is, the less available space. Currently draidcfg doesn't allow Nspare to be 0. Would it then make any sense to have Nspare > Nparity? I think it depends on what the user/admin wants. If Ndrives is small, then Nspare should likely be small to maximize available space. But a large Nspare would enable the vdev to survive Nspare drive failures (though not concurrent ones) and still have full redundancy without adding any new drive. For example, a 10x(8D+2P)+5S vdev, with less than 5% of its space reserved for spares, would still be able to handle IO after 7 drive failures (at most 2 of them concurrent) without adding any replacement drive, because the 5S would restore full redundancy after 5 failed drives and the 2P would enable the vdev to survive another 2 drive failures. But would users want to do something like that? I don't know.
  • N and BPS: While a bigger BPS increases space efficiency by reducing the parity/data ratio, draid may add up to BPS - 1 skip sectors for each block (to enable rebuild); see the sketch after this list. The actual balance depends on the actual sizes of the blocks. If data compression is off, the psize of most blocks is a power of 2, so it'd make sense to use a power-of-2 BPS to eliminate draid skip sectors altogether. A smaller N and bigger BPS would also reduce IO sizes: to read any block of size s, the IO size on each drive is s/BPS. One key strength of draid is the ability to use all drives evenly during rebuild even when N is bigger than 1. Also, for any single-drive failure, only about 1/N of all blocks would need repair: with N == 1 all blocks must be repaired, but with N == 2 only half would need repair. The current rebuild algorithm is able to skip regions not affected by drive failures.
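
A minimal sketch of the skip-sector and IO-size arithmetic mentioned above (illustrative only, not code from this patch):

    /*
     * A block's psize (in sectors) is padded up to a multiple of BPS;
     * the padding is the skip sectors, at most BPS - 1 of them.
     */
    static int
    draid_skip_sectors(int psize_sectors, int bps)
    {
        return ((bps - psize_sectors % bps) % bps);
    }

    /* Per-drive IO size when reading a block of size s. */
    static int
    draid_io_per_drive(int s, int bps)
    {
        return (s / bps);
    }

For example, a 32-sector block needs no skip sectors with BPS=4 (32 % 4 == 0) but one skip sector with BPS=3; any power-of-2 psize with a power-of-2 BPS (and psize >= BPS) needs none.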

I think a first step would be to agree on the syntax to specify all parameters and require all to be mandatory. Then we can decide whether to allow any to be optional, and if so how to come up with a sensible default.

  • Parity is already part of the vdev name, e.g. draid2, so there's no need for additional specification.
  • Ndrives equals the total number of drives given, so there's no need to specify it explicitly.

So we'd only need to specify BPS and Nspare. I'd tend to prefer having explicit names for the options, e.g. zpool create draid1 -s 2 -d 4 drives.... That way it's clear which number is what, and it'd be easy to allow any of them to be optional without ambiguity.
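
Named options would indeed be trivial to parse unambiguously; a hypothetical getopt sketch (the -s/-d flags are just the ones suggested above, not an actual zpool interface):

    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical parsing of "draid1 -s 2 -d 4": -s spares, -d BPS. */
    static void
    parse_draid_opts(int argc, char **argv, int *nspare, int *bps)
    {
        int c;

        *nspare = -1;   /* -1 means "not specified, apply default" */
        *bps = -1;
        while ((c = getopt(argc, argv, "s:d:")) != -1) {
            if (c == 's')
                *nspare = atoi(optarg);
            else if (c == 'd')
                *bps = atoi(optarg);
        }
    }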

@ahrens (Member) commented Mar 14, 2017

I think a first step would be to agree on the syntax to specify all parameters and require all to be mandatory. Then we can decide whether to allow any to be optional, and if so how to come up with a sensible default.

Sounds good to me.

So we'd only need to specify BPS and Nspare

Or specify N and Nspare.

Up to and including b6ca80d723b4b9139d3133114a17d019149eb1a4

Signed-off-by: Isaac Huang <he.huang@intel.com>
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label May 2, 2017
@behlendorf behlendorf changed the title [WIP] dRAID vdev driver dRAID vdev driver May 2, 2017
@gmelikov (Member)

@thegreatgazoo do you mind if I rebase it on master? I'll use it at work soon.

@thegreatgazoo (Author)

@gmelikov The code in the current PR is fairly outdated. The latest code depends on the metadata allocation class PR from @don-brady to manage dRAID mirror metaslabs. We're working on getting both PRs refreshed, but it'd be more work than a simple rebase. Stay tuned!

@gmelikov (Member)

@thegreatgazoo thanks, it would be great to test it!

@gmelikov (Member) commented Jul 2, 2017

@thegreatgazoo can I help with tests? I want to add some to ZTS (my main thought: change the default pool to draid). If you already have any, can I ask for them?

Thank you!

@behlendorf (Contributor)

Now I want to add some to ZTS (main thought - change default pool to draid)

A great place to start would be to update the test cases that verify existing functionality on striped, mirrored, and raidz pools by adding draid to the list.

@gmelikov (Member) commented Jul 5, 2017

@behlendorf Is it a good idea to add a variable like VDEV_TYPE to run the main part of the tests on the needed vdev type?

@behlendorf (Contributor)

@gmelikov that's not a bad idea, but I don't think it's necessary. We can just update the needed test cases directly; we're not going to be adding new top-level vdev types very often.

@thegreatgazoo (Author)

@gmelikov and @behlendorf I'm waiting for #5182 to merge before I can update this PR. The latest dRAID code depends on #5182 to manage its internal mirror metaslabs. As to testing, we've modified ztest to support dRAID pools, and I believe our test team modified ZTS as well. When I refresh this PR, I'll include the test-related changes too.

@gmelikov (Member)

@thegreatgazoo thanks, waiting for it!

I'll have some tests for draid; I'll make a PR after this one is merged if there are any differences.

@gmelikov mentioned this pull request Dec 1, 2017
@thegreatgazoo (Author)

The latest code has been moved to #7078
