New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata Allocation Classes #5182

Merged
merged 1 commit into from Sep 6, 2018

Conversation

@don-brady
Copy link
Member

don-brady commented Sep 27, 2016

ZFS Allocation Classes [WIP]

Allocation Classes

Allocation classes can be thought of as allocation tiers that are dedicated to specific block categories. Now in addition to general data (normal class), and Intent Log data (log class), we introduce a special class (metadata, DDT, and optionally small file blocks) and a dedup class (dedup data only).

Feature Flag Encapsulation

The feature@allocation_classes becomes active when a special or dedup allocation class is instantiated for a VDEV. Activating this feature makes the pool read-only on builds that don't support allocation classes.

Allocation Class Granularity

The class allocation bias occurs at a top-level VDEV granularity. Each top-level VDEV now has an allocation bias.

Special VDEVs

All metaslabs in the VDEV are dedicated to the special allocation class category. A pool must always have at least one general (non-specified) VDEV when using special VDEVs. Opt-in using the special VDEV class designation when creating the VDEV.

Dedup VDEVs

All metaslabs in the VDEV are dedicated to the dedup allocation class category. A pool must always have at least one general (non-specified) VDEV when using dedup VDEVs. Opt-in using the dedup VDEV class designation when creating the VDEV.

Example Syntax:
zpool create demo raidz <disks> special mirror <disks>
zpool add demo dedup <disk>

New Ditto Block Policy

When there is only one VDEV available and more than one DVA is required (ditto copies > 1), the traditional ditto placement policy was to place the allocation a distance of 1/8th of total vdev allocation space away from the other DVAs. This policy has been simplified to a simple guarantee that the other DVAs land in a different metaslab. To validate that the new policy is honored, a zdb(8) block audit will also report any DVAs that landed in the same metaslab (expecting none). If there is a policy failure it will be manifest as follows in zdb output:
Dittoed blocks in same metaslab: 21

VDEV Changes

Classes show up in run-time VDEV instance as an allocation bias:

typedef enum vdev_alloc_bias {
    VDEV_BIAS_NONE,
    VDEV_BIAS_LOG,    /* dedicated to ZIL data (SLOG) */
    VDEV_BIAS_SPECIAL, /* dedicated to metadata and small blocks */
    VDEV_BIAS_DEDUP /* dedicated to dedup data */
	} vdev_alloc_bias_t;

This vdev allocation class bias is stored in the per-vdev zap object as a string value:

/* vdev metaslab allocation bias */
#define VDEV_ALLOC_BIAS_LOG.           "log"
#define VDEV_ALLOC_BIAS_SPECIAL        "special"
#define VDEV_ALLOC_BIAS_DEDUP           "dedup"

The bias is also passed internally in the pool config nvlist during a pool create and any internal pool config query. This is used by functions in the zpool(8) command.

Observing Allocation Classes

There are several ways to observe aspects of allocation classes using the zpool(8), zdb(8) and kstat

ZPOOL

$ sudo cmd/zpool/zpool status smoke                                             
  pool: smoke                                                                   
 state: ONLINE                                                                  
  scan: none requested                                                          
config:                                                                         
                                                                                
NAME                                            STATE     READ WRITE CKSUM      
smoke                                           ONLINE       0     0     0      
  ata-ST2000VN004-2E4164_Z523F7B5               ONLINE       0     0     0      
  ata-ST2000VN004-2E4164_Z523F4SX               ONLINE       0     0     0      
  ata-ST2000VN004-2E4164_Z523G89V               ONLINE       0     0     0      
dedup                                                                           
  ata-ST3500312CS_5VVSH444                      ONLINE       0     0     0      
special                                                                         
  nvme-KXG50ZNV256G_Z7EF700IFQ9S                ONLINE       0     0     0      
logs                                                                            
  ata-ST2000VN004-2E4164_Z523F4VP                            0     0     0      
                                                                                
errors: No known data errors

ZDB

ZDB has been adapted to accommodate allocation class information:

  • '-m' Shows the bias assigned to a VDEV
  • '-b' Shows the allocation class size summary

zdb -b <pool> -- class summary and ditto policy (shows any same-vdev ditto blocks that landed in the same metaslab).

	        No leaks (block sum matches space maps exactly)
	
	        bp count:                602424
	        ganged count:                 0
	        bp logical:         18845081600      avg:  31282
	        bp physical:        18475993600      avg:  30669     compression:   1.02
	        bp allocated:       19405235712      avg:  32211     compression:   0.97
	        bp deduped:                   0    ref>1:      0   deduplication:   1.00
	        Normal class:       18499772416     used:  4.63%
	        Special Class:        905463296     used: 87.05%
	
	        additional, non-pointer bps of type 0:        642
	        Dittoed blocks on same vdev: 27127

ZTEST Coverage

  • Occasionally add a special dedicated vdev
  • Occasionally disable small blocks in special class
  • a new '-C' option allows control over special vdev creation (see ztest(8) -?)

Caveats and TBD

  • At this point there are no custom allocation policies and all classes use the default DNF allocator.
@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Sep 28, 2016

mirror required? I tried
zpool create -f lcl sdb3 metadata sdd3
but if fails reporting invalid vdev specification: 'ddt,dmu,mos' class expects mirror
this however works:
zpool create -f lcl sdb3 metadata mirror sdd3 sdc3


if (vd->vdev_mg->mg_class[SPA_CLASS_DDT] != NULL) {
(void) strncat(bufptr, VDEV_CLASS_DDT",", buflen);
buflen -= sizeof (VDEV_CLASS_DDT);

This comment has been minimized.

@rlaager

rlaager Sep 30, 2016

Contributor

This is decrementing correctly because sizeof(VDEV_CLASS_DDT) includes the trailing NUL, which makes up for the comma that you are not counting, right?

if (vd->vdev_mg->mg_class[SPA_CLASS_DDT] != NULL)
(void) strcat(class, VDEV_CLASS_DDT",");

if (strlen(class) == 1)

This comment has been minimized.

@rlaager

rlaager Sep 30, 2016

Contributor

Maybe this, for consistency with later code (and avoiding calling strlen() twice)? if (class[1] != '\0')

@@ -745,6 +745,9 @@ dump_metaslab_stats(metaslab_t *msp)
dump_histogram(rt->rt_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);

This comment has been minimized.

@rlaager

rlaager Sep 30, 2016

Contributor

If you are going to add a copyright line in other places, maybe add one to zdb.c as well?

{"OI COMSTAR ", 8192},
{"SUN COMSTAR ", 8192},
{"NETAPP LUN ", 8192},
{"OI COMSTAR ", 8192},

This comment has been minimized.

@rlaager

rlaager Sep 30, 2016

Contributor

This looks wrong. If the GitHub reviewer is showing this correctly, you're improperly replacing spaces with tabs here.

@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Oct 3, 2016

Capacity regression? When creating a pool, even without any allocation classes, about 4 % of available capacity is lost:

Plain ZFS (commit df7c405).

# cat /proc/partitions | grep sdb3
   8       19   10485760 sdb3
# cmd/zpool/zpool create -f lcl sdb3
# cmd/zpool/zpool iostat -v 1 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
lcl          236K  9,94G      3     24   214K   792K
  sdb3       236K  9,94G      3     24   214K   792K
----------  -----  -----  -----  -----  -----  -----
# cmd/zfs/zfs create lcl/t1
# dd if=/dev/zero of=/lcl/t1/zeros1 bs=10M
dd: fel vid skrivning av ”/lcl/t1/zeros1”: Enheten är full
986+0 poster in
985+0 poster ut
10329915392 byte (10 GB) kopierade, 110,859 s, 93,2 MB/s

This (commit fbcb38b)

# cat /proc/partitions | grep sdb3
   8       19   10485760 sdb3
# cmd/zpool/zpool create -f lcl sdb3
# cmd/zpool/zpool iostat -v 1 1
                capacity     operations     bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
lcl            248K  9,50G      1     14   112K   414K
  sdb3 {any}   248K  9,50G      1     14   112K   414K
------------  -----  -----  -----  -----  -----  -----
# cmd/zfs/zfs create lcl/t1
# dd if=/dev/zero of=/lcl/t1/zeros1 bs=10M
dd: fel vid skrivning av ”/lcl/t1/zeros1”: Enheten är full
942+0 poster in
941+0 poster ut
9876144128 byte (9,9 GB) kopierade, 132,981 s, 74,3 MB/s
@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Oct 6, 2016

Free space accounting. As I suppose there is some issue - an idea that perhaps could work. But I've not tested it on this patch. (And it is late night here...)

Approach would be a minimalistic change from plain ZFS accounting:

  • The metadata classes introduces the ability for each vdev to be part of multiple allocation classes.
    So lets account all changes to a vdev (alloc, defer, space, dspace) to all classes it belongs to.

    Then there would be no changes needed for the allocation or freeing paths except for the loop over classes.

  • What matters is what is reported to outside the metaslabs in terms of used and free space. With one modification (below), just report values for the {any} allocation class. There must in all cases be an {any} class, and it is reasonably also the largest capacity class. This way, there is no double accounting reported to the outside.

  • In order to handle the case when the vdevs for other classes fill up: before returning a value for dspace, calculate the fill ratio of all allocation classes. If another class than {any} has a higher fill ratio, modify the value returned for dspace to be as if {any} has that higher fill ratio (given the alloc value it actually has).

    This modification is perhaps not strictly necessary, but prevents a user from filling up the metadata devices and then unknowingly starting to store metadata on the slow {any} devices. So perhaps as a tunable? It does however not prevent ZFS from using the {any} device as fallback to get out of a hairy situation when disks are (almost) full and even a remove requires lots of new data blocks.

One drawback might be the slight user confusion that the capacity reported (by e.g. df) does not change at all when metadata devices are added. But then, to know how much metadata capacity is available, some ZFS tools must be used anyhow.

@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Oct 8, 2016

I have looked over the changes, and one thing worries me rather deeply:

If I understand things correctly, the amount of space used for the different kinds of data (reg, log, mos, ddt, dmu) is recorded permanently in the space maps?

While being nice for monitoring purposes, it means that the mapping of any kind of block would forever be tied to the now given kind of class? Otherwise, the accounting when freeing items will be done in the wrong class. In particular, would it not also give problems when adding metadata classes to an existing pool (that has not had it before)?

I can see that it is nice to have some way of knowing how much space data of each kind is eating on each vdev. One way would be that e.g. scrub accumulates that information. Would not be exact, and generally a bit out of date, but give a good enough hint. Perhaps even a reduced scrub that does not check all file data, just walks metadata structures. Would be fast, if all that on solid-state media. This could be an later improvement then?

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Oct 9, 2016

@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Nov 17, 2016

@don-brady I am thinking of testing some things with your patch, do you perhaps already have an update of the patch that solves some of the merge conflicts?

@leelists

This comment has been minimized.

Copy link

leelists commented Dec 1, 2016

It's time to get this rebased over abd :-)

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 8, 2016

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 10, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 10, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 10, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 10, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 11, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 11, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 13, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

inkdot7 added a commit to inkdot7/zfs that referenced this pull request Dec 13, 2016

Capability to give different allocation thresholds for data and metad…
…ata.

Also include blocks with level > 0 i metadata category (from zfsonlinux#5182).

Example:

zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB  [this metadata already taken by ssd]
others (hdd) takes remainder

@adilger adilger referenced this pull request Feb 15, 2017

Closed

Metadata Allocation Class #3779

@don-brady don-brady force-pushed the don-brady:metadata_classes_wip branch from fbcb38b to dac90f3 Feb 22, 2017

@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Mar 5, 2017

Thanks @don-brady for the update!

What is the reason for requiring dedicated VDEVs to be mirrors?
I do not find any explanation in the code. Things also seems to work without the requirement.

@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Mar 7, 2017

mac_wip_mar5

Attached are graphs of measurements for some different configurations:
(plain ZFS is commit ebd9aa8, Pool Allocation Classes 96e4723)

(green) PAC patched ZFS, 10 GB HDD + 1 GB SSD dedicated metadata
(red) PAC patched ZFS, 10 GB HDD + 1 GB SSD dedicated metadata and small blocks
(blue) PAC patched ZFS, 10.1 GB HDD partition, segregated metadata
(black) PAC patched ZFS, 10.1 GB HDD partition, segregated metadata and small blocks
(cyan dashed) PAC patched ZFS, 10.1 GB SSD partition (should be like plain ZFS)
(magenta dashed) PAC patched ZFS, 10.1 GB HDD partition (should be like plain ZFS)
(cyan) plain ZFS, 10.1 GB SSD partition
(magenta) plain ZFS, 10.1 GB HDD partition

The test program fills the filesystem on the pool in steps where it first deletes ~1 GB of files, and then creates 1 GB (but trying to add 2 GB). Thus it after a few iterations run the pool full (vertical dashed line), and then 'ages' it by trying to cause more and more fragmentation. The files are of random size, with an exponential distribution (many small ones), averaging 200 kB. On average 20 files per directory in the varying tree, also favouring few. The pool is exported and imported between each particular test of each iteration.

The segregated tests reach the filled-state situation a bit earlier as the available space is reduced by what is set aside for segregated blocks.

Findings in these tests:

  • Segregating metadata improves both file writing and 'find'-like operations (listing files).
  • Segregating also small blocks does not improve further. Here it made find and scrub slightly worse.
  • A dedicated SSD for metadata improves file read access (find+md5sum), and scrub times.
  • Using dedicated space also for small blocks gives a further improvement for file read access.
@inkdot7

This comment has been minimized.

Copy link
Contributor

inkdot7 commented Mar 15, 2017

@don-brady I have looked through the code, and have some questions and comments:

  • It seems a vdev can only have one allocation bias?

    I also tested, but were not able to get it to store e.g. both metadata and dedup tables on one vdev; if doing

    cmd/zpool/zpool create -f lcl sdc3 metadata dedup mirror sde3 sdd3

    but not enabling dedup on the filesystem, there was nothing stored on the sde3/sdd3 mirror. Likewise, reversing order and using

    cmd/zpool/zpool create -f lcl sdc3 dedup metadata mirror sde3 sdd3

    Whether enabling dedup for the filesystem or not, the amount of space used on the sde3/sdd3 mirror was the same, i.e. I presume no dedup data.

    I guess it should be a common case to use a single SSD vdev for both metadata and dedup tables.

  • The space accounting still bothers me.
    Especially how it makes changing particulars of the assignments of various kinds of blocks difficult.

    As far as I understand, it is mainly used to tell the user how much of each category has been allocated.

    Although not exactly the same, what matters would be how much space is used/available on each dedicated vdev. This would be available without any accounting in this patch (simple summing would be enough).

    The other use I found is in a soft-limit of the metaslab class selector. But that ought to also be able to work with the actual amount of free space in the dedicated vdevs instead of amount of used space of particular kinds.

    Is there something I am missing?

  • Small blocks - it would make sense with more categories of this:
    Suppose one wants to store really small blocks on NVME-devices, and then medium ones on cheaper SSDs, before going to HDDs for the big ones.

    Also, the thresholds for these should be adjustable. This would be an important knob to actually be able to influence how much dedicated space is used (by future allocations).

@don-brady don-brady force-pushed the don-brady:metadata_classes_wip branch from f0d657c to d8aa87e Sep 1, 2018

Stale

@GregorKopka

This comment has been minimized.

Copy link
Contributor

GregorKopka commented Sep 4, 2018

I'm a little late on this, sorry for that.

This feature, in the end, boils down to just giving the allocator a hint about on which vdev(s) it should examine the metaslabs to get the free space needed for a new block of a certain data class.

The current inability to later add/remove the dedication to a special allocation class after zpool create/vdev addition is IMHO a huge problem and quite a limitation given that we can neither shrink nor remove a vdev. Having the setting fixed (and eternal) carved into stone at zpool/vdev creation would make this feature a nightmare to use (at least in deployment scenarios with changing requirements).

Thus these hints to the allocator need IMHO to be runtime configureable, preferably through vdev properties that would enable something like zpool set allocation=<comma separated list of [data,metadata,ddt,smallblocks,...]> <pool>/<vdev> (posibly order denoting in what priority which information should be stored there), which would give the administrator flexibility (so there is no need to either scrap the pool should requirements/load change, or to overspec it massively in the first place to cover eventualities). Such an implementation (vdev properties) could also be the base to implement other interesting features (see #3810) that would need to be configured per vdev (-member).

So should this, for the start, be only set-able at zpool create/add: please build it in a way that it can be transparently upgraded later (with vdev properties, or however they'll be called) to be configureable, as flexible as possible, at runtime. I would prefer if it would be flexible from the start.

This feature should also not make the pool read-only on zfs versions without it.
As there is the fallback to use any pool-wide available space in case a specialized vdev runs out (at least I understand it that far) there should be no need for this limitation - at least for the proposed functionality which effectively only is a modified allocation strategy. This would be different in case it would also eg. rid us of the need to have dedicated vdevs for CACHE and SLOG (by turning these into special allocation classes that can coexist with other data classes as normal metadata or ddt).

@don-brady

This comment has been minimized.

Copy link
Member Author

don-brady commented Sep 4, 2018

@GregorKopka Thanks for your feedback. Let me try and address some of your concerns.

With PR-6900 (zfs device evacuation/removal) you can now remove a top-level VDEV like a special class mirror. So if you deploy a special class VDEV and decide later that you don't need it or want to reuse that device for something else, removal is certainly possible. [EDIT] won't work yet if normal class is raidz. Looking into alternatives...

The VDEV bias (aka its dedication), as you noted is essentially an allocation policy for that device. There is nothing in this design that would preclude a future change that would allow the user to add/change the allocation bias associated with a given VDEV. A follow up PR could certainly accommodate a zpool set … to change the bias and/or disable the feature entirely. There is additional future work, like that suggested by @adilger to accommodate the primarycache property for the special class. I see this current PR as a foundation and anticipate more refinements over time.

I'm not sure I agree that this feature should not make a pool read-only for implementations that are allocation class unaware. For example, consider a pool provisioned to have a large generic raidz vdev and a much smaller special SDD to hold the metadata for the pool. Importing said pool as writeable under an allocation class unaware implementation would potentially fill up the SSD with file data and in essence undo the intended isolation of the metadata in that device.

@GregorKopka
Copy link
Contributor

GregorKopka left a comment

A feature should not break existing functionality,
this implementation will disable zpool split.

Show resolved Hide resolved module/zfs/spa.c Outdated
@@ -25,6 +25,7 @@
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
* Copyright 2013 Saso Kiselkov. All rights reserved.
* Copyright (c) 2017 Datto Inc.
* Copyright (c) 2017, Intel Corporation.

This comment has been minimized.

@GregorKopka

GregorKopka Sep 4, 2018

Contributor

The copyright either contains a superfluous comma or aims to extend to the end of time.

This comment has been minimized.

@don-brady

don-brady Sep 5, 2018

Author Member

All previous Intel commits used this format as well as other companies. I cannot say that it's right or wrong, but it has been used in past commits to our repo.

@@ -140,6 +141,14 @@ struct vdev_queue {
kmutex_t vq_lock;
};

typedef enum vdev_alloc_bias {

This comment has been minimized.

@GregorKopka

GregorKopka Sep 4, 2018

Contributor

Having this as an enum allows only one option.

It would make more sense to implement this as a reasonably futureproof sized bitfield, allowing to directly configure which multiple kinds of metadata are to be allocated from that special vdev.

This would completely remove the need to determine if ddt is metadata or not through a module parameter.

Would also ease later expansion of the feature to eg. differentiate between different kinds of data blocks with a finer granularity. Just one example: allocation only of the 1st copy of a metadata block on a fast special device (for speed) while any additional copies allocate either from other special vdev that are slower but optimized for small blocks or from the normal vdevs (for extremely cheap redundancy).

This comment has been minimized.

@don-brady

don-brady Sep 4, 2018

Author Member

The vdev bias is stored on-disk as a string. The enum is just for runtime

@GregorKopka

This comment has been minimized.

Copy link
Contributor

GregorKopka commented Sep 4, 2018

@don-brady Thanks for taking the time to answer - and for working on this in the first place.

Regading the read-only I have always seen the feature flags as a way to protect the on-disk state against implementations that would potentially destroy data as of not understanding the new structure, as a solution to a technical problem. In case an implementation just tacks additional data to existing structures that will safely be ignored by non-aware implementations there should IMHO be no reason to apply this in case it isn't actually needed.

You have a point with importing an allocation class aware pool in an unaware system leading to data being spread to places it wouldn't end up in otherwise. But I think that point is moot as executing a vdev removal would do exactly that: spread the contents of the to-be removed vdev over the pool to whatever places where big spans of space are currently free. At least in the current incarnation where just the device to be removed can be specified (with the rest of the pool being the target to search for free room), regardless of the problem with raidz vdevs.

And regardless of device removal being available at the moment, executing it once on a special device would be orders of magnitude worse than mounting a pool on an allocation class unaware system (eg. to make a snapshot and send stuff away, something that would most likely only happen should there be problems with the pool or the system it was originally running on) as it would move at rest data to new places while the mounting would only affect newly written data.

Bottom line on this: Please don't close a potential disaster recovery path should a technical reason for this not exist.

Regarding the expandability I suggest to structure the on-disk control data in a way that makes extension easy. With that I mean to make it open by allowing multiple classes to be specified per vdev in the first place. See the comment in my review regarding the enum.

This would solve the issues I have seen (between the lines of some comments in this discussion) where people have different thoughts about what actual classes of data should be counted as 'metadata' (or combine with each other).

Show resolved Hide resolved module/zfs/spa_misc.c Outdated
Pool allocation classes
Allocation Classes adds the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Signed-off-by: Don Brady <don.brady@delphix.com>

@don-brady don-brady force-pushed the don-brady:metadata_classes_wip branch from d8aa87e to f8f93ce Sep 5, 2018

@behlendorf

This comment has been minimized.

Copy link
Member

behlendorf commented Sep 6, 2018

@don-brady thanks for refreshing this and incorporating all the feedback. This PR is ready to go. I put it through a final round of comprehensive testing and wasn't able to uncover any problems.

@GregorKopka thanks for taking an interest and making the time to review and comment. Your core concerns have been addressed in the finalized version. See the comment above, but to summarize.

  • The on-disk format change is stored flexibly as a string and can be extended in the future as needed
  • Pools with dedup|special devices may be split using zpool split.
  • Pools can be mounted read-only by ZFS implementations which do not support allocation classes.

Future work:

  • Allow the feature to be manually disabled with zfs set feature@allocation_classes=disabled.
  • Allow special|dedup devices to be removed when the primary pool is configured as raidz.
  • Distribute multi-mount writes evenly.
  • Make secondarycache property aware of allocation classes, #5182 (comment)

@behlendorf behlendorf merged commit cc99f27 into zfsonlinux:master Sep 6, 2018

20 of 22 checks passed

buildbot/Debian 9 x86_64 (TEST) Build done.
Details
buildbot/Ubuntu 17.04 x86_64 Coverage (TEST) Build started.
Details
buildbot/Amazon 2 x86_64 (BUILD) Build done.
Details
buildbot/Amazon 2 x86_64 Release (TEST) Build done.
Details
buildbot/CentOS 6 x86_64 (BUILD) Build done.
Details
buildbot/CentOS 6 x86_64 (TEST) Build done.
Details
buildbot/CentOS 7 x86_64 (BUILD) Build done.
Details
buildbot/CentOS 7 x86_64 (TEST) Build done.
Details
buildbot/Debian 8 arm (BUILD) Build done.
Details
buildbot/Debian 8 ppc (BUILD) Build done.
Details
buildbot/Debian 8 ppc64 (BUILD) Build done.
Details
buildbot/Debian 9 x86_64 (BUILD) Build done.
Details
buildbot/Fedora 28 x86_64 (BUILD) Build done.
Details
buildbot/Fedora 28 x86_64 (TEST) Build done.
Details
buildbot/Kernel.org Built-in x86_64 (BUILD) Build done.
Details
buildbot/Ubuntu 16.04 aarch64 (BUILD) Build done.
Details
buildbot/Ubuntu 16.04 i386 (BUILD) Build done.
Details
buildbot/Ubuntu 16.04 x86_64 (BUILD) Build done.
Details
buildbot/Ubuntu 16.04 x86_64 (TEST) Build done.
Details
buildbot/Ubuntu 17.10 x86_64 (STYLE) Build done.
Details
buildbot/Ubuntu 18.04 x86_64 (BUILD) Build done.
Details
buildbot/Ubuntu 18.04 x86_64 (TEST) Build done.
Details
@codecov

This comment has been minimized.

Copy link

codecov bot commented Sep 6, 2018

Codecov Report

Merging #5182 into master will increase coverage by 0.06%.
The diff coverage is 90.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5182      +/-   ##
==========================================
+ Coverage   78.51%   78.57%   +0.06%     
==========================================
  Files         376      376              
  Lines      113561   114001     +440     
==========================================
+ Hits        89159    89577     +418     
- Misses      24402    24424      +22
Flag Coverage Δ
#kernel 78.86% <88.4%> (+0.04%) ⬆️
#user 67.67% <90.16%> (-0.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27ca030...f8f93ce. Read the comment docs.

@bgly

This comment has been minimized.

Copy link

bgly commented Nov 28, 2018

If you have a special allocation class for metadata, are the L1/L2/...etc indirect blocks stored in the separate device separated for metadata?

Dataset vpool8/DedupTest1 [ZPL], ID 1283, cr_txg 334, 13.4T, 406 objects, rootbp DVA[0]=<5:c42d6dc00:200> DVA[1]=<5:961130c00:200> [L0 DMU objset] sha256 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=127462L/127462P fill=406 cksum=e37e6ab7686d7f69:2178d6222c905eda:12934be0edc6776e:14f59b255bd1a249

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
      1794    3   128K   128K  34.4G     512    50G  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                               176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 409599
        path    /Dedupfile99.0.0
        uid     0
        gid     0
        atime   Tue Nov 20 18:50:56 2018
        mtime   Tue Nov 20 19:04:44 2018
        ctime   Tue Nov 20 19:04:44 2018
        crtime  Tue Nov 20 18:50:56 2018
        gen     15452
        mode    100644
        size    53687091200
        parent  34
        links   1
        pflags  840800000004
Indirect blocks:
               0 L2   5:62c27a000:4e00 5:3dcaf7a00:4e00 20000L/4e00P F=409600 B=15589/15589
               0  L1  5:90487a400:bc00 5:450455400:bc00 20000L/bc00P F=1024 B=15452/15452
               0   L0 1:14000a9b000:21000 20000L/15a00P F=1 B=15452/354
           20000   L0 0:20000d50000:21000 20000L/15200P F=1 B=15452/354
           40000   L0 0:20000d2f000:21000 20000L/15200P F=1 B=15452/354
           60000   L0 0:20000d92000:21000 20000L/15600P F=1 B=15452/354
@DeHackEd

This comment has been minimized.

Copy link
Contributor

DeHackEd commented Nov 28, 2018

Yes. From the output above I can see that you have a pool with a geometry roughly like:

vpool
  raidz-0
    [disks]
  raidz-1
    [disks]
  raidz-2
    [disks]
  raidz-3
    [disks]
  raidz-4
    [disks]
special
  mirror-5
    [disks]

(Obviously I'm guessing raidz, but I'm pretty confident about the mirror)

You can tell because the L2 and L1 blocks, and the dataset on the top row, begin the DVAs with a prefix of 5: very consistently even though ZFS is biased to try to split the duplicate blocks onto different disks. That's the metadata vdev. Which is how I'm able to guess your pool geometry.

@bgly

This comment has been minimized.

Copy link

bgly commented Nov 28, 2018

Yes you are pretty close, I have:

  pool: vpool8
 state: ONLINE
  scan: none requested
config:

	NAME                                      STATE     READ WRITE CKSUM
	vpool8                                    ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730c9168  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730ced0c  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730df5c8  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730cd5cc  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27327f990  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27325e598  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27325edac  ONLINE       0     0     0
	  raidz2-1                                ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca26add6ae4  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27319b86c  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273294f00  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273223b88  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27322cd58  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2732664b4  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730e17f8  ONLINE       0     0     0
	  raidz2-2                                ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730e3480  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2731291dc  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273279300  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27323f5a0  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273291c60  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2732f5c88  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273366a8c  ONLINE       0     0     0
	  raidz2-3                                ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273126404  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730d00f4  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730fdafc  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2732220cc  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730c94f0  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730dfa74  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27334edf8  ONLINE       0     0     0
	  raidz2-4                                ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730ecab0  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273080ccc  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2730d9b88  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273128c58  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca2731b0478  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca27312826c  ONLINE       0     0     0
	    vsencryptdisk-scsi-35000cca273277a98  ONLINE       0     0     0
	dedup
	  md0p3                                   ONLINE       0     0     0
	special
	  md0p2                                   ONLINE       0     0     0
	cache
	  md0p5                                   ONLINE       0     0     0
	spares
	  scsi-35000cca26b4826ec                  AVAIL

So the prefix will basically tell me which disk the data sits in? When you guessed 5 was "special" would dedup be 6? etc? How do you get more precise information? @DeHackEd

@DeHackEd

This comment has been minimized.

Copy link
Contributor

DeHackEd commented Nov 28, 2018

This is going off topic into mailing list territory. I'm going to answer this specific question and then stop.

# zdb -l /dev/md0p5
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'vpool8'
    state: 0
    ...
    vdev_children: 7           <-- total top-level devices, non-spare and non-cache
    vdev_tree:
        type: 'disk'
        id: 5                  <-- This disk's position (starting from 0)
        ...
@bgly

This comment has been minimized.

Copy link

bgly commented Jan 16, 2019

If i use the metadata allocation classes, and run a zpool upgrade will my old pools migrate to this new metadata allocation class or do i need to remake all my pools to take advantage of this new feature?

@DeHackEd

This comment has been minimized.

Copy link
Contributor

DeHackEd commented Jan 16, 2019

The usual. Either zpool upgrade or zpool set feature@allocation_classes=enabled $POOLNAME, then you can do zpool add [-n] $POOLNAME special mirror $DISK1 $DISK2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment