dRAID vdev driver #5841

Closed

Conversation

@thegreatgazoo commented Feb 27, 2017

The latest code has been moved to #7078

This patch implements the dRAID vdev driver and a new rebuild mechanism (#3497). Most features are complete, except:

  • Rebuild stop and resume, and persistent rebuild state.
  • Support for HW-accelerated parity routines: the dRAID code uses the raidz parity functions (generation and reconstruction) but needs a small change. Currently the change has been made only to the original raidz parity functions, i.e. not the new HW-accelerated ones.

However, this is still a work in progress: the user interface may change, and the on-disk format may change as well. There are also still some crufty hacks I'm going to clean up.

I've added a dRAID howto. It contains only the basics for now, but I'll continue to update the document.

Please report bugs to the dRAID project.

Comments, testing, fixes, and porting are greatly appreciated!

Code structure:

  • New code
    • module/zfs/vdev_draid.c: vdev driver for draid and draid spare
    • module/zfs/spa_scan.c: sequential rebuild, for both draid and mirror vdevs
    • cmd/draidcfg/*.[ch]: user space tools, mainly to create permutations
  • Changes to existing code
    • module/zfs/vdev_raidz.c: the parity functions need to include draid skip sectors for computation and reconstruction.
    • module/zfs/vdev_mirror.c: minor changes to support mirror_map_t allocated by draid vdev (for hybrid mirror support)
    • module/zfs/metaslab.c: to add support for draid hybrid mirror, also disallow block allocation during rebuild
    • Other changes:
      • Add knowledge about the new vdev types and the new rebuild mechanism
      • A draid spare pretends to be a leaf vdev but is actually not; some code needs to be aware of that, e.g. when handling labels on leaf vdevs.

The goal is to change existing code in such a way that when draid is not in use, there is effectively no change, though some cleanups are still needed.

@mention-bot

@thegreatgazoo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @ironMann to be potential reviewers.

@thegreatgazoo (Author)

@kpande draidcfg can be easily integrated into zpool create. There's a separate config file now because: a) for bug reporting, the vdev config can be easily uploaded; and b) I'm still improving the code, so I need a way to create the configuration without actually creating a pool. This is where I'd need feedback from the community, especially admins. If the community doesn't want a separate step, it can be removed. I'm fine either way.

As to the mirror rebuild, I chose rebuild by default simply because that's the easiest way for me to test. I want to change the default back to resilver, because in general I want this patch to have zero effective change unless draid is in use or the user has explicitly requested the new mechanism. For example, if rebuild is the default for mirror, it may break user scripts that grep zpool status for the string "resilver" or poll zpool events for a resilver-done event.

@ahrens (Member) commented Mar 10, 2017

@thegreatgazoo I think there should not be a separate utility; zpool create should be able to create a DRAID pool on its own.

@thegreatgazoo (Author)

@ahrens Alright, I'll integrate it into zpool create, and I'll keep draidcfg for a while as I need it to create test configurations without creating a pool.

@behlendorf (Contributor)

I think it's important that creating a draid pool feels the same as creating any other type of configuration. That means we need to somehow specify parity, spares, blocks/stripe, and vdevs as part of zpool create. How about something like this, which is in the spirit of the existing commands?

zpool create <draid[1|2|3]:spares:blocks/stripe> vdevs...

For example, to create a 17-device draid pool with single parity, two spares, and four blocks/stripe, you'd issue the following command.

zpool create draid:2:4 /dev/sd[a-q]  # 17 devices

If the specified pool configuration doesn't satisfy the '(vdevs - spares) % (parity + blocks/stripe) == 0' requirement, the command should fail with a useful error message describing the issue.
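
To make the constraint concrete, here is a minimal sketch in C of the validity check (illustrative only, not code from this patch; the function name and parameters are made up):

    #include <stdbool.h>

    /*
     * Illustrative check of the constraint above: after setting aside the
     * spares, the remaining drives must divide evenly into parity groups
     * of (parity + blocks/stripe) drives each.
     */
    static bool
    draid_config_valid(int ndrives, int nparity, int nspares, int nbps)
    {
        if (nparity < 1 || nparity > 3 || nspares < 1 || nbps < 1)
            return (false);
        return ((ndrives - nspares) % (nparity + nbps) == 0);
    }

For the 17-device example above, draid_config_valid(17, 1, 2, 4) holds because (17 - 2) % (1 + 4) == 0.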

Furthermore, what would be really slick from an admin perspective is if spares and blocks/stripe were optional. We could have spares default to 1 when not specified and then make a best effort to calculate a reasonable blocks/stripe which satisfies the constraint.

For example, the following command would create a 21-device draid pool with triple parity (draid3), a default single spare, and the smallest blocks/stripe value that works (in this case 1).

zpool create draid3 /dev/sd[a-u]  # 21 devices

Or if you wanted three spares, in which case the smallest workable blocks/stripe would be 3 (since 21 - 3 = 18 must be divisible by 3 + blocks/stripe).

zpool create draid3:3 /dev/sd[a-u]  # 21 devices

What I like about this is that it's consistent with the existing syntax everyone is familiar with, which I suspect means it should be reasonably straightforward to fit into the existing parser. Plus it allows knowledgeable users to specify the exact configuration they want, and less savvy admins to quickly get a functional, but perhaps non-optimal, pool.

@thegreatgazoo would something like this be workable? Did I overlook something? Are there some alternate suggestions?

create test configurations without creating a pool.

A potentially nice way to handle this would be to make the existing zpool create dry-run option (-n) draid-aware.

           -n  Displays the configuration that would be used without  actually
               creating  the pool. The actual pool creation can still fail due
               to insufficient privileges or device sharing.

@ahrens (Member) commented Mar 11, 2017

@behlendorf I generally agree with what you're suggesting. However, a few details:

As I understand it, "blocks/stripe" (BPS) is like the number of "data disks" in each parity group. Another name for this might be "data per parity". It determines the minimum allocation size, which is BPS * DiskSectorSize. Like you said, the total number of disks in the DRAID has to be N * (BPS + Nparity) + Nspare, where N is a positive integer.

Increasing BPS (decreasing N) gives better space efficiency, because it increases the ratio of data to parity. Decreasing BPS (increasing N) gives better random-IOPS performance, because the random IOPS of the whole DRAID is approximately N * IOPSofOneDisk, and a better compression ratio, because the minimum allocation size is smaller.
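
As a rough illustration of these relations (a hypothetical sketch, not ZFS code; all names are invented):

    /* Hypothetical helpers illustrating the relations described above. */
    struct draid_cfg {
        int n;          /* number of parity groups (N) */
        int bps;        /* blocks/stripe: data disks per parity group */
        int nparity;    /* parity disks per parity group */
        int nspare;     /* distributed spares */
    };

    /* Total disks: N * (BPS + Nparity) + Nspare. */
    static int
    draid_total_disks(const struct draid_cfg *c)
    {
        return (c->n * (c->bps + c->nparity) + c->nspare);
    }

    /* Minimum allocation size: BPS * DiskSectorSize. */
    static int
    draid_min_alloc(const struct draid_cfg *c, int sector_size)
    {
        return (c->bps * sector_size);
    }

For example, a configuration with N=2, BPS=4, Nparity=2, and Nspare=1 needs 2 * (4 + 2) + 1 = 13 disks, has a minimum allocation size of 4 sectors, and delivers roughly twice the random IOPS of a single disk.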

I agree that Nspare should default to 1. Nspare=0 doesn't make sense because it would eliminate the benefit of DRAID (improved rebuild time due to distributed spare space).

I think that the default should be to maximize BPS (i.e. N=1). This makes DRAID similar to a single RAID-Z group (or a single RAID-5 group), but with improved rebuild time due to distributed spare space. This way, if you don't specify anything, the behavior is as familiar as possible, and as appropriate as possible for smaller installations.

Minimizing BPS (BPS=1) causes DRAID to devolve into mirroring, which I imagine is rarely what the user wants. But if it is, we could have zpool create draidN behave as I suggested, and zpool create dmirrorN make a BPS=1 DRAID.

We may want to have the syntax be draid<1,2,3>[:NumSpares[:N]], i.e. to specify N rather than BPS. Then the default would be the same as draid<1,2,3>:1:1. And we can describe N as "we divide the disks into N parity groups, getting the approximate IOPS of N disks; the total number of data+parity disks must be divisible evenly by N (i.e. (Ndisks - Nspare) % N == 0)".
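
A sketch of how that syntax might parse, with the defaults just described (illustrative only; the real zpool parser would look different):

    #include <stdio.h>

    /*
     * Parse "draid<1|2|3>[:NumSpares[:N]]", e.g. "draid2" or "draid2:1:3".
     * Omitted fields keep their defaults (Nspare = 1, N = 1). Lenient
     * about trailing text; a real parser would reject it.
     */
    static int
    parse_draid(const char *s, int *nparity, int *nspare, int *n)
    {
        *nspare = 1;
        *n = 1;
        if (sscanf(s, "draid%d:%d:%d", nparity, nspare, n) < 1 ||
            *nparity < 1 || *nparity > 3 || *nspare < 1 || *n < 1)
            return (-1);
        return (0);
    }

Under this scheme "draid3" parses identically to "draid3:1:1", matching the proposed default.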

I could be convinced that we should do something more complicated when given enough disks that N>1 will be nearly as space efficient as N=1. However, it's likely that successfully taking advantage of such "smart" auto-configuration would require as much expertise as specifying all the parameters manually, if not more.

For example, maybe DRAID1 could allow BPS as low as 10, DRAID2 as low as 20, and DRAID3 as low as 30. With this BPS, parity is only 10% of the total space, so perhaps further improvement in space efficiency is not worth it. But I would still hesitate to do this, for two reasons:

  1. Our idea of "space efficient enough" may not match the customer's. This is highly dependent on workload and hardware.
  2. Given the requirement that every parity group must have the same BPS, and a constant Nspare, the configuration (and thus performance and space usage) will vary wildly with small changes in the total number of disks (e.g. DRAID1 with 25 disks would result in BPS=11 (N=2), but DRAID1 with 26 disks would result in BPS=24 (N=1); see the sketch below). I think we'd have to relax these requirements (e.g. allow parity groups with different BPS) for any "smart" auto-configuration to be reasonable.
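
To make the jumpiness in point 2 easy to see, here is a small illustrative helper (not part of the patch) that enumerates the workable (N, BPS) pairs for a given disk count:

    #include <stdio.h>

    /*
     * List every (N, BPS) with BPS >= min_bps that satisfies
     * Ndisks == N * (BPS + Nparity) + Nspare.
     */
    static void
    list_configs(int ndisks, int nspare, int nparity, int min_bps)
    {
        int dp = ndisks - nspare;   /* data + parity disks */

        for (int n = 1; n <= dp; n++) {
            if (dp % n != 0)
                continue;
            int bps = dp / n - nparity;
            if (bps >= min_bps)
                printf("N=%d BPS=%d\n", n, bps);
        }
    }

With Nparity=1, Nspare=1, and min_bps=10, 25 disks yields N=2 BPS=11 and N=1 BPS=23, while 26 disks yields only N=1 BPS=24.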

@thegreatgazoo (Author)

I agree draidcfg should be removed, but I'd tend not to make smart choices for the admins. There are many factors to trade off against each other, and the optimal configuration highly depends on the workload. Some thoughts:

  • Default Nspare: I think if we have to choose a default here, the preferred choice would be Nparity, because then there'd be enough spare space to handle the maximum possible number of concurrent failures. But the bigger Nspare is, the less available space. Currently draidcfg doesn't allow Nspare to be 0. Would it then make any sense to have Nspare > Nparity? I think it depends on what the user/admin wants. If Ndrives is small, then Nspare should likely be small to maximize available space. But a large Nspare would enable the vdev to survive Nspare drive failures (though not concurrent ones) and still have full redundancy without adding any new drive. For example, a 10x(8D+2P)+5S vdev, with less than 5% of its space reserved for spares, would still be able to handle IO after 7 drive failures (at most 2 of them concurrent) without adding any replacement drive, because the 5S would restore full redundancy after 5 failed drives and the 2P would enable the vdev to survive another 2 drive failures. But would users want to do something like that? I don't know.
  • N and BPS: While a bigger BPS increases space efficiency by reducing the parity/data ratio, draid may add up to BPS - 1 skip sectors for each block (to enable rebuild); see the sketch after this list. The actual balance depends on the actual sizes of the blocks. If data compression is off, the psize of most blocks is a power of 2, so it'd make sense to use a power-of-2 BPS to eliminate draid skip sectors altogether. A smaller N and bigger BPS would also reduce IO sizes: to read any block of size s, the IO size on each drive is s/BPS. One key strength of draid is the ability to use all drives evenly during rebuild even when N is bigger than 1. Also, for any single-drive failure, only about 1/N of all blocks would need repair: with N == 1 all blocks must be repaired, but with N == 2 only half would need repair. The current rebuild algorithm is able to skip regions not affected by drive failures.
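
A minimal sketch of the skip-sector and IO-size arithmetic mentioned above (illustrative only, not code from this patch):

    /*
     * A block's psize (in sectors) is padded up to a multiple of BPS;
     * the padding is the skip sectors, at most BPS - 1 of them.
     */
    static int
    draid_skip_sectors(int psize_sectors, int bps)
    {
        return ((bps - psize_sectors % bps) % bps);
    }

    /* Per-drive IO size when reading a block of size s. */
    static int
    draid_io_per_drive(int s, int bps)
    {
        return (s / bps);
    }

For example, a 32-sector block needs no skip sectors with BPS=4 (32 % 4 == 0) but one skip sector with BPS=3; any power-of-2 psize with a power-of-2 BPS (and psize >= BPS) needs none.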

I think a first step would be to agree on the syntax to specify all parameters and require all to be mandatory. Then we can decide whether to allow any to be optional, and if so how to come up with a sensible default.

  • Parity is already part of the vdev name, e.g. draid2, so there's no need for additional specification.
  • Ndrives equals the total number of drives given, so there's no need to specify it explicitly.

So we'd only need to specify BPS and Nspare. I'd tend to prefer having explicit names for the options, e.g. zpool create draid1 -s 2 -d 4 drives.... That way it's clear which number is what, and it'd be easy to allow any of them to be optional without ambiguity.
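
Named options would indeed be trivial to parse unambiguously; a hypothetical getopt sketch (the -s/-d flags are just the ones suggested above, not an actual zpool interface):

    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical parsing of "draid1 -s 2 -d 4": -s spares, -d BPS. */
    static void
    parse_draid_opts(int argc, char **argv, int *nspare, int *bps)
    {
        int c;

        *nspare = -1;   /* -1 means "not specified, apply default" */
        *bps = -1;
        while ((c = getopt(argc, argv, "s:d:")) != -1) {
            if (c == 's')
                *nspare = atoi(optarg);
            else if (c == 'd')
                *bps = atoi(optarg);
        }
    }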

@ahrens (Member) commented Mar 14, 2017

I think a first step would be to agree on the syntax to specify all parameters and require all to be mandatory. Then we can decide whether to allow any to be optional, and if so how to come up with a sensible default.

Sounds good to me.

So we'd only need to specify BPS and Nspare

Or specify N and Nspare.

Up to and including b6ca80d723b4b9139d3133114a17d019149eb1a4

Signed-off-by: Isaac Huang <he.huang@intel.com>
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label May 2, 2017
@behlendorf behlendorf changed the title [WIP] dRAID vdev driver dRAID vdev driver May 2, 2017
@gmelikov (Member)

@thegreatgazoo do you mind if I rebase it on master? I'll use it at work soon.

@thegreatgazoo (Author)

@gmelikov The code in the current PR is fairly outdated. The latest code depends on the metadata allocation class PR from @don-brady to manage dRAID mirror metaslabs. We're working on getting both PRs refreshed, but it'd be more work than a simple rebase. Stay tuned!

@gmelikov (Member)

@thegreatgazoo thanks, it would be great to test it!

@gmelikov (Member) commented Jul 2, 2017

@thegreatgazoo can I help with tests? I want to add some to ZTS (my main thought: change the default pool to draid). If you already have any, can I ask for them?

Thank you!

@behlendorf (Contributor)

Now I want to add some to ZTS (main thought - change default pool to draid)

A great place to start would be to update the test cases that verify existing functionality on striped, mirrored, and raidz pools by adding draid to the list.

@gmelikov (Member) commented Jul 5, 2017

@behlendorf Is it a good idea to add a variable like VDEV_TYPE to run the main part of the tests on the needed vdev type?

@behlendorf (Contributor)

@gmelikov that's not a bad idea, but I don't think it's necessary. We can just update the needed test cases directly; we're not going to be adding new top-level vdev types very often.

@thegreatgazoo (Author)

@gmelikov and @behlendorf I'm waiting for #5182 to merge before I can update this PR. The latest dRAID code depends on #5182 to manage its internal mirror metaslabs. As to testing, we've modified ztest to support dRAID pools, and I believe our test team modified ZTS as well. When I refresh this PR, I'll include the test-related changes too.

@gmelikov (Member)

@thegreatgazoo thanks, waiting for it!

I'll have some tests for draid; I'll make a PR after this one is merged if there are any differences.

@gmelikov mentioned this pull request Dec 1, 2017
@thegreatgazoo (Author)

The latest code has been moved to #7078
