
Rotor vector allocation (small records favour SSD) #4365

Closed
inkdot7 wants to merge 14 commits from the rotor_vector branch

Conversation

@inkdot7 (Contributor) commented on Feb 24, 2016

Proof-of-concept! This is not a pull request as such (update: it is now operational), but is posted for review, intended to spur discussion and to show some possible performance increases.

These commits implement a simplistic strategy to make small records (e.g. metadata, but also file data) end up on faster non-rotating storage when a pool consists of two vdevs with different characteristics, e.g. an SSD (mirror) and an HDD (raidzn).

For a test case (see commit log) with many small files, times for 'find' were cut by a factor of 8, and scrub times by a factor of 3. More details in the commit log.

Note: Although I have tried to search the web and read various articles etc., I have not really understood all the intricacies of space allocation, so I presume there are unhandled cases, and possibly other issues with the approach. Feel free to point such out.

The purpose is the same as #3779.

@inkdot7 force-pushed the rotor_vector branch 2 times, most recently from 1555df6 to 73095a8 on February 25, 2016
@richardelling (Contributor) commented:

I believe the metadata allocation classes (#3779) are a more comprehensive approach.
If you would like to contribute, we'd greatly appreciate the help.

@inkdot7 (Contributor, Author) commented on Feb 25, 2016

@richardelling I agree that metadata allocation classes are more comprehensive than this 'hack'. However, I was not able to find any pointers to code in the issue ticket or the slides?

Another question would be whether the metadata allocation classes can also be told to store small files, as that would be very useful when e.g. running some grep command over a source code tree.

@inkdot7 force-pushed the rotor_vector branch 2 times, most recently from 40b1210 to 3d9ad70 on September 1, 2016
@inkdot7 (Contributor, Author) commented on Sep 1, 2016

Updated rotor vector implementation:

  • Handle free space accounting so that it is limited by the most filled rotor category.
  • Slight change of allocation behaviour: only try slower categories after the selected category has completely failed. (This makes it behave as usual when only one category is present.)

(Since there were no comments on the earlier code as such, I replaced it with this new version, to avoid cluttering the history.)

The earlier patch did not behave sensibly when the pool got full. This is now fixed by doing the mc_{alloc,deferred,space,dspace} accounting in metaslab_class per rotor vector category. When their values are requested from elsewhere, the first three behave as before and are simply summed up. When several categories exist, the total dspace value returned is adjusted such that when the pool is approaching full (> 75%), the value mimics the behaviour as if all categories had the same fill ratio as the most full category. This makes zfs report disk full to user applications in a controlled fashion, in the normal manner. When the pool is < 25% full, the full capacity is reported, and in between a sliding average is used. This actually also makes dd(1) return sensible numbers. :)

I believe this is the correct way to go: if one has e.g. one small SSD for small allocations and large HDDs for the rest, one does not want the allocations to spill over onto the other kind of device as the pool becomes full. After the user has freed some space, the misdirected allocations would stay where they were written, hampering performance for the entire future of the pool. It is better, then, to report the pool as full a bit earlier and retain performance.
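As a rough illustration, a minimal sketch of the dspace adjustment described above (not the patch's actual code; the helper name, and exactly which fill ratio is compared against the 25%/75% thresholds, are my assumptions):

    /*
     * Sketch only: scale the reported dspace so the pool looks full
     * when the most filled rotor category is full.
     */
    #include <stdint.h>

    uint64_t
    rotor_adjusted_dspace(const uint64_t *cat_space,   /* allocated per category */
        const uint64_t *cat_dspace,                    /* deflated capacity per category */
        int ncats)
    {
        uint64_t total_space = 0, total_dspace = 0;
        double worst_fill = 0.0;

        for (int i = 0; i < ncats; i++) {
            total_space += cat_space[i];
            total_dspace += cat_dspace[i];
            if (cat_dspace[i] != 0) {
                double fill = (double)cat_space[i] / cat_dspace[i];
                if (fill > worst_fill)
                    worst_fill = fill;
            }
        }
        if (ncats <= 1 || worst_fill == 0.0)
            return (total_dspace);

        /* dspace as if every category were as full as the most full one */
        uint64_t scaled = (uint64_t)(total_space / worst_fill);
        double total_fill = (double)total_space / total_dspace;

        if (total_fill <= 0.25)     /* far from full: report full capacity */
            return (total_dspace);
        if (total_fill >= 0.75)     /* near full: mimic the most full category */
            return (scaled);

        /* in between: slide linearly between the two values */
        double w = (total_fill - 0.25) / 0.5;
        return ((uint64_t)((1.0 - w) * total_dspace + w * scaled));
    }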

Measurements:

[Attached graph: rotor_vector_sep1]
Attached are some graphs of measurements for six different configurations:

(green) patched ZFS, 10 GB HDD + 1 GB SSD partitions
(red) patched ZFS, 10.1 GB HDD partition
(blue) patched ZFS, 10.1 GB SSD partition
(black) plain ZFS (commit 9907cc1), 10 GB HDD + 1 GB SSD partitions
(magenta) plain ZFS (commit 9907cc1), 10.1 GB HDD partition
(cyan) plain ZFS (commit 9907cc1), 10.1 GB SSD partition

(10.1 GB when testing a single device type alone, since the full-pool tests with the rotor vector in action used ~70 MB on the SSD in addition to 9.7 GB on the HDD.)

The test program fills the filesystem on the pool in steps: it first deletes ~1 GB of files, and then creates 1 GB (while trying to add 2 GB). Thus, after a few iterations, it runs the pool full (vertical dashed line) and then 'ages' it by trying to cause more and more fragmentation. The files are of random size with an exponential distribution (many small ones), averaging 200 kB. There are on average 20 files per directory in the varying tree, again favouring small counts. The pool is exported and imported between each particular test of each iteration.

In all the measured operations, the patch makes the addition of a small SSD to an HDD (array) perform much better than the HDD alone. Some measurements have several curves as I did several test runs.

[Attached graph: rotor_vector_sep1_zoom]
Also attached are zoomed graphs for the pure HDD case. The patch should not affect the performance of this at all. But it does look like the red curve (with patch) is a bit lower than the magenta (plain ZFS). This might be a fluke of the measurements; I will have to try to improve them a bit to investigate.

Generalisation of technique:

It would be a quite simple expansion of the patch to include more categories of vdevs, from faster to cheaper:

pure SSD mirror
SSD+HDD mirror (gives SSD style read performance by #4334)
pure HDD mirror
pure HDD raidz

Thus a user can choose a configuration that gives the best value with available resources, and for larger systems even a performance-optimised multi-tiered storage within the same pool becomes possible.

The above categories can be auto-identified by zfs without any hints. It would still be a user option (module parameters) to control the record sizes above which cheaper vdevs are chosen.

Todo:

  • Improve the measurements?
    (Also measure alternative implementations?)

  • Populating the rotor categories according to a pool attribute instead of statically?

    What would be a suitable form? A string like: "ssd:16000:ssd+hdd:100000:hdd-mirror,hdd-raidz" ?

    Here the numbers give cut-off values, and several categories may be put into the same rotor category with a comma?

    (I have found that saying ssd and hdd is much easier to keep track of, even if the code internally uses the term rotational/non-rotational).

    One question would be: are such attributes available before metaslab_group_create() is called? Otherwise, the groups have to be recategorised after the content of the attribute is available.

    Perhaps also a way to specify particular vdevs?

    E.g. since the pure SSD mirror category is not able to separate these two classes:

    • mirror of 1 super-performant (PCIe) SSD + 'normal' SSDs
    • mirror of 'normal' SSDs

    Or similarly within the pure HDD mirror category.

  • Pass which rotor category each vdev belongs to on to zpool(8), so that the user can verify a setup.

  • Pass a hint to the allocator if an allocation is for metadata. Then one could make the selection of rotor category depend on that too.

    (Apart from that, it is intentional in this patch that small files can end up on the low-latency storage; this gives more performance than optimising only the location of metadata.)

  • (more) ztest testing.

For the performance increase it makes possible, I think the patch is really small, so I am hoping to spur some feedback :-)

@inkdot7 force-pushed the rotor_vector branch 4 times, most recently from 7055c82 to 73fd7f8 on September 12, 2016
@behlendorf added the "Type: Performance" label on Sep 12, 2016
@inkdot7 force-pushed the rotor_vector branch 5 times, most recently from 7344532 to 717adea on September 26, 2016
@inkdot7 mentioned this pull request on Sep 27, 2016
@inkdot7 (Contributor, Author) commented on Sep 28, 2016

Updates:

  • Rotor category assignment based on size can be done independently for metadata blocks and other (normal) data blocks. (The selection of which blocks are which needs more work though...)

    Different kinds of blocks can be stored on the same vdev. It is e.g. possible to store all metadata on fast storage, together with other blocks < n kB.

  • Rotor categories are now set by a pool property, called rotorvector for the time being.

    With this, one can select both which kinds of disks (ssd, ssd-raidz, mixed, hdd, hdd-raidz) shall be part of a particular category, and forcefully assign vdevs by their guid.

    The property also tells how large the blocks intended for each category may be.

  • There are 5 rotors available, to be able to construct flexible tiered storage arrangements.

Note: the assignment of vdevs to rotor categories is made on pool import. Except for the configuration pool property, there is no change of the storage format. The rotor categories can thus be easily changed, and take effect for any new allocations after an export/import cycle.

@inkdot7 (Contributor, Author) commented on Sep 28, 2016

More measurements. Same procedure and devices used as above.
[Attached graph: rotor_vector_sep28]
Nine configurations:

curve           configuration
(green)         SSD(all metadata) + HDD(rest)
(red)           SSD(all metadata + data <= 4 kB) + HDD(rest)
(red dashed)    SSD(metadata <= 16 kB) + mixed(HDD+SSD mirror)(metadata + data <= 16 kB) + HDD(rest)
(black)         SSD(any <= 16 kB) + HDD(rest)
(black dashed)  SSD(any <= 4 kB) + HDD(rest)
(black dotted)  SSD(any <= 2 kB) + HDD(rest)
(blue)          WIP Metadata Allocation Classes #5182, SSD(metadata) + HDD(rest)
(magenta)       HDD
(cyan)          SSD

The plain HDD and SSD configurations are for comparison / setting the scale of the figures.

It is nice to note that the green and blue curves, which are two different approaches that should divide the blocks between the devices in the same way, indeed show very similar behaviour.

Since SSDs are expensive and the purpose of the method is to use them 'wisely', some variations on the block assignment scheme are tried. Except for the (delete)+write graphs, the results are normalised to the total amount of data stored. Still, it matters how much SSD is used per HDD; in these tests (numbers in GB, from zpool iostat):

configuration           SSD     MIX     HDD
rv S(m)                 0.034    -      9.63
rv S(m,d<4)             0.048    -      9.63
rv S(m<16)+M(m,d<16)    0.035   0.095   9.63
rv S(<16)               0.071    -      9.63
rv S(<4)                0.047    -      9.63
rv S(<2)                0.027    -      9.63
metadata_classes_wip    0.083    -      9.12
plain HDD                 ?
plain SSD                 ?

The numbers for metadata_classes_wip are a bit odd. It reports full for a smaller amount of data stored. Actually, the capacity of both the SSD and HDD partitions is reported ~5% smaller after pool creation than normal. It also seems to use more space for the metadata. The latter issue may have something to do with the pool creation not accepting a single metadata disk but requiring a mirror, from which I then detached a placeholder partition to get a comparable setup. I am aware that it is a very fresh commit, so I will try to update this post when the issue is resolved/explained. The slightly smaller available space possibly also explains the earlier drop of the blue curve in the find graph.

With a few exceptions, most curves are very similar. Outliers:

  • Black dotted (SSD only storing < 2kB blocks), generally worse. Makes sense, as less SSD is used than in the other cases.
  • Black solid and dashed, lower scrub speeds. The likely explanation is that these do not store complete metadata on an SSD. Except for scrub, they show similar or even improved performance.
  • Red (and black) solid and dashed, improved find+md5sum (i.e. general read) performance.
    With a considerable improvement: plain HDD goes from ~12 MB/s to 23 MB/s with metadata on SSD. By also storing small files on SSD, 31-33 MB/s is reached, i.e. another 35%. The additional storage space can even be provided by a less costly SSD+HDD mirror configuration (red dashed).

@behlendorf added the "Status: Work in Progress" label on Sep 30, 2016
@inkdot7 (Contributor, Author) commented on Oct 2, 2016

It is interesting to know the metadata overhead of files and directories in order to choose a suitable vdev (SSD) size. To get numbers for that, I created a number of dummy files, either directly in a zfs filesystem or within directories under that, and recorded the allocated capacity on the metadata vdev. The remaining allocated capacity after the filesystem was purged (rm -rf) was subtracted; this gives more consistent numbers.

The resulting metadata size (as used on an SSD, 512-byte blocks) is then assumed to be a sum of five parameters (they were solved for using linear least-squares fitting):

Parameter                                             Value (bytes)
Bytes/empty file (0 size)                              70.5
Additional bytes/non-empty file                        77.4
Bytes/directory                                       120.5
Additional bytes/long filename (>= 50 chars)           8885
Additional bytes/long directory name (>= 50 chars)     8867
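Put as a formula, this is my restatement of the fitted model above (the N's are simply the counts of each kind of object):

    metasz \approx 70.5\,N_{files} + 77.4\,N_{nonempty} + 120.5\,N_{dirs} + 8885\,N_{longname} + 8867\,N_{longdirname} \quad [\mathrm{bytes}]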

There seems to be a rather hefty penalty for long filenames. The measurements did not show any gradual change below 50 characters.

For each file of non-zero size, there was one block written to the vdev holding data. This started at a file size of 1 byte.

A test by copying my home filesystem to a meta+data pool gives 94.8M of metadata for 29.0G of data, for 263110 files and 29175 directories. Or 360 bytes/file.

In addition to the above comes the overhead of block pointers for each block for large files. This should be 128 bytes / block. With the default maximum of 128 kB/block, this is 1 ‰, or 1 GB per TB.
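Spelling out the arithmetic behind that estimate: 128 B / 128 KiB = 128 / 131072 ≈ 0.00098, i.e. just under 1 ‰, or roughly 1 GB of block-pointer overhead per TB of data.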

The measurement data is included below. The column residual tells how well the measured value matches the estimate using the parameters above.

#files  #files(>50 chars)  #dirs  #dirs(>50 chars)   #sz>0  metasz   residual
50000   0                  0      0                  50000   8.03    0.6
10000   0                  0      0                  0       0.82    0.1
50000   0                  0      0                  0       3.56    0.03
10000   0                  0      0                  10000   1.92    0.4
50000   0                  0      0                  50000   7.37   -0.02
10000   0                  100    0                  10000   1.29   -0.2
50000   0                  500    0                  50000   7.05   -0.4
50000   0                  500    0                  0       3.25   -0.4
50000   0                  50500  0                  0      10.56    0.9
50000   0                  100500 0                  0      15.26   -0.4
50000   50000              100500 0                  0     460       0.1
10000   0                  0      0                  0       0.91    0.2
10000   10000              20100  0                  0      91.6    -0.4
10000   10000              20100  0                  0      91.8    -0.2
10000   0                  20100  10000              0      91.8     0.0
10000   0                  20100  0                  0       3.11   -0.02
10000   0                  20100  0                  0       3.14    0.01
50000   0                  100500 0                  0      15.3    -0.3
50000   0                  500    0                  50000   6.98   -0.5
50000   0                  100500 0                  50000  19.74    0.2
10000   0 (20 chars)       20100  0                  0       3.14    0.01
10000   0 (40 chars)       20100  0                  0       3.12    0.0    

@behlendorf (Contributor) commented:

@inkdot7 thanks for the reminder! I've added @ahrens as a reviewer to get his thoughts on this from an OpenZFS perspective.

@ahrens (Member) commented on Jan 24, 2017

High-level questions (apologies if these are addressed already, feel free to point me elsewhere):

  • How is this behavior controlled?
  • Is user intervention required?
  • What is the "rotorvector" pool property? Is it intentionally not documented in the zpool.8 manpage?
  • How do we determine if a disk is SSD vs HDD?
  • How do we determine which allocations to place on SSD?
  • What happens if one class gets full?
  • How is all of this communicated to the user (e.g. how can they tell which devices are SSD vs HDD, and how can they tell how full each type is)?

Does this provide any additional functionality beyond the "vdev allocation classes" project (#3779) that @don-brady is working on?

Only mirrors are mixed.  If a pool consists of several mixed vdevs, it
is mixed if all vdevs are either mixed, or fully nonrotational.

Do not mark as mixed when fully nonrotational.
Preparation for selecting metaslab depending on allocation size.

Renaming of mc_rotor -> mc_rotorv and mc_aliquot -> mc_aliquotv just to
make sure all references are found.
Todo: The guid list should not have a fixed length but be dynamic.
Format:

[spec<=limit];spec

with spec being a comma-separated list of vdev-guids and generic type
specifiers: ssd, ssd-raidz, mixed, hdd, and hdd-raidz.

limit gives the maximum size in KiB of the allocations allowed within
that rotor.  The last rotor has no limit.  (And vdevs which are not
matched by the guids or the generic types are placed in the last rotor.)

Example:

ssd<=4;mixed<=64;123,hdd

Here, allocations less than 4 kbytes are allocated on ssd-only vdev(s)
(mirror or not). Allocations less than 64 kbytes end up on ssd/hdd mixed
(mirrors; raidz makes no sense there). Other allocations end up on the
remaining disks. 123 represents a vdev guid (placing an explicit
vdev-guid in the last rotor makes little sense though).

Possibly, the configuration should be split into multiple properties,
one per rotor.  And limits separate from types.  The compact format does
have advantages too...
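As a rough sketch (not the patch's code), such a property string could be parsed along these lines; the struct layout, names and buffer sizes below are assumptions only, and the later "meta:" extension is not handled:

    /*
     * Illustrative parser for a "spec<=limit;...;spec" string.
     * Names and sizes are assumptions, not the patch's code.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_ROTORS 5

    typedef struct rotor_spec {
        char          rs_spec[128];   /* comma-separated types/guids */
        unsigned long rs_limit_kib;   /* 0 = no limit (last rotor) */
    } rotor_spec_t;

    static int
    parse_rotorvector(const char *prop, rotor_spec_t *rv, int max)
    {
        char buf[512], *save = NULL;
        int n = 0;

        (void) snprintf(buf, sizeof (buf), "%s", prop);
        for (char *tok = strtok_r(buf, ";", &save);
            tok != NULL && n < max;
            tok = strtok_r(NULL, ";", &save), n++) {
            char *lim = strstr(tok, "<=");

            if (lim != NULL) {
                *lim = '\0';
                rv[n].rs_limit_kib = strtoul(lim + 2, NULL, 10);
            } else {
                rv[n].rs_limit_kib = 0;   /* last rotor: no size limit */
            }
            (void) snprintf(rv[n].rs_spec, sizeof (rv[n].rs_spec), "%s", tok);
        }
        return (n);   /* number of rotors parsed */
    }

For "ssd<=4;mixed<=64;123,hdd" this would give three rotors: ("ssd", 4 KiB), ("mixed", 64 KiB) and ("123,hdd", no limit).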
In a pool that consists of e.g. a small but fast SSD-based mirror and a
large but long-latency HDD-based RAIDZn, it is useful to have the
metadata, as well as very small files, stored on the SSD.  This is
handled in this patch by selecting the storage based on the size of the
allocation.

This is done by using a vector of rotors, each of which is associated
with metaslab groups of each kind of storage.  If the preferred group is
full, attempts are made to fill slower groups.  Better groups are not
attempted; the rationale is that an almost full filesystem shall not spill
large-size data into the expensive SSD vdev, since that will not be
reclaimable without deleting the files.  Better then to consider the
filesystem full when the large-size storage is full.

One can also have e.g. a 3-level storage: Mirror SSD for really small
records, mirror HDD for medium-size records and raidzn HDD for the bulk
of data.  Currently, 5 rotor levels can be set up.
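A rough sketch (not the patch's code) of the size-based rotor selection with fallback to slower categories described above; the types and the rotor_alloc() helper are assumptions for illustration, and a later revision additionally retries faster categories as a last resort:

    /* Sketch only: pick a rotor by allocation size, fall back to slower ones. */
    #include <stdint.h>

    #define MAX_ROTORS 5

    typedef struct rotor {
        uint64_t  r_size_limit;  /* max allocation size in bytes, 0 = no limit */
        void     *r_groups;      /* metaslab groups of this category */
    } rotor_t;

    /* Hypothetical helper: allocate from one category, 0 on success. */
    extern int rotor_alloc(rotor_t *r, uint64_t asize, uint64_t *offset);

    static int
    rotor_vector_alloc(rotor_t *rv, int nrotors, uint64_t asize, uint64_t *offset)
    {
        int start = nrotors - 1;

        /* Fastest category whose size limit admits this allocation. */
        for (int i = 0; i < nrotors; i++) {
            if (rv[i].r_size_limit == 0 || asize <= rv[i].r_size_limit) {
                start = i;
                break;
            }
        }

        /*
         * Try the selected category, then only slower (cheaper) ones,
         * so that large data does not spill into the small fast vdevs.
         */
        for (int i = start; i < nrotors; i++) {
            if (rotor_alloc(&rv[i], asize, offset) == 0)
                return (0);
        }
        return (-1);  /* caller treats this as out of space */
    }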

** The remainder of the commit message is for an earlier incarnation.
Numbers should be representative nonetheless.  See PR for more up-to-date
measurements. **

Some performance numbers:

Tested on three separate pools each consisting of a 20 GB SSD partition
and a 100 GB HDD partition, from the same disks.  (The HDD is 2 TB in
total.)  SSD raw reads: 350 MB/s, HDD raw reads 132 MB/s.

The filesystems were filled to ~60 % with a random directory tree, each
with random 0-6 subdirectories and 0-100 files, maximum depth 8. The
filesize was random 0-400 kB.  The fill script was run with 10 instances
in parallel, aborted at ~the same size.  The performance variations
below are much larger than the filesystem fill differences.

Setting 0 is the original 7a27ad0 commit.  Setting 8000 and 16000 is the
value for zfs_mixed_slowsize_threshold, i.e. below which size data is
stored using rotor[0] (nonrotating SSD), instead of rotor[1] (rotating
HDD).  ** Note: current patch does not use zfs_mixed_slowsize_threshold
or fixed rotor assignment per vdev type. **

-              Setting 8000  Setting 16000  Setting 0
-              ------------  -------------  ------------

Total # files  305666        304439         308962
Total size     75334 kB      75098 kB       75231 kB

As per 'zpool iostat -v':

Total alloc    71.8 G        71.6 G         71.7 G
SSD alloc      3.34 G        3.41 G         3.71 G
HDD alloc      68.5 G        68.2 G         68.0 G

Time for 'find' and 'zpool scrub' after fresh 'zpool import':

find           5.6 s         5.5 s          42 s
scrub          560 s         560 s          1510 s

Time for serial 'find | xargs -P 1 md5sum' and
parallel 'find | xargs -P 4 -n 10 md5sum'.
(Only first 10000 files each)

-P 1 md5sum    129 s         122 s          168 s
-P 4 md5sum    182 s         150 s          187 s
(size summed)  2443 MB       2499 MB        2423 MB

---

** Some reminders about squashed fixes: **

Must decide on rotor vector index earlier, in order to do space
accounting per rotor category.

Set metaslab group rotor category at end of vdev_open().
Then we do it before the group is activated.
Can then get rid of metaslab_group_rotor_insert/remove also.

Fixups:

Moving the metaslab_group_set_rotor_category() up.

ztest hit an (unreproducible) failure with a stack trace of
spa_tryimport() -> ... vdev_load() -> ... vdev_metaslab_init() ->
metaslab_init() -> ... metaslab_class_space_update() that failed its
ASSERT.  Inspection showed that vdev_metaslab_init() would soon call
metaslab_group_activate(), i.e. we need to assign mg_nrot.
(Hopefully, vdev_open was called earlier...?)

Assign mg_nrot no matter how vdev_open() fails.

In dealing with yet another spa_tryimport failure:
Instead of refusing to use better rotor categories than the selected one
at all, try them when all others have failed.

The implementation is not very pretty.  And the better ones should be
tried in reverse order.  At least it gets a bit further through
ztest.
Adjust the dspace reported from metaslab_class such that we look full
when one rotor vector category (e.g. SSD or HDD) becomes full.  This is
because large content meant for the HDD should not spill into the SSD
(quickly filling it up prematurely), and because we also do not want to
spill small content that should be on the SSD onto the HDD.
metadata.

Also include blocks with level > 0 in the metadata category (from PR 5182).

Example:

zpool set "rotorvector=123,ssd<=meta:4;mixed<=64;hdd" <poolname>

Pure ssd (and explicit vdev guid 123) drive takes metadata <= 4 kB.
Mixed (mirror ssd+hdd) takes data (or metadata) <= 64 kB.
Others (hdd) take the remainder.

Example II:

zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

Pure ssd (and explicit vdev guid 123) drive takes metadata <= 128 kB
and data <= 4 kB.
Mixed (mirror) takes data <= 64 kB (its metadata is already taken by the ssd).
Others (hdd) take the remainder.
The random assignment is a dirty hack.  The reason for this is that it
has to be set before the device open finishes, or the rotor vector index
will be assigned based on whatever initial nonrot value the device has.
…e or mixed.

Keep track of mixed nonrotational (ssd+hdd) devices.
Only mirrors are mixed.  If a pool consists of several mixed vdevs, it
is mixed if all vdevs are either mixed, or ssd (fully nonrotational).

Pass media type info to zpool cmd (mainly whether devices are solid state, or rotational).
Info is passed in ZPOOL_CONFIG_VDEV_STATS_EX -> ZPOOL_CONFIG_VDEV_MEDIA_TYPE.
For the time being, abusing the -M flag (of the previous commit).
@inkdot7 (Contributor, Author) commented on Jan 25, 2017

Thanks @ahrens for the quick, good questions!

Is it intentionally not documented in the zpool.8 manpage?

Thanks! Doing that sorted out some explanations; I added a commit.

I have answered the questions below here too anyhow; that might be easier to discuss. (I call a class a 'category' below, in an attempt not to confuse it with #3779/#5182 while discussing.)

What is the "rotorvector" pool property?

It sets both which vdevs belong to each class (or category) and which blocks are eligible for each category.

(I suppose it should eventually use a feature@ property?)

How is this behaviour controlled?

Completely through the "rotorvector" property. No other tunables. Removing/clearing this setting makes allocations behave as usual.

(Currently, it is necessary to perform an export-import cycle to make a new/cleared setting take effect, as I have not figured out how to repopulate the metaslab / what locking is required.)

Is user intervention required?

Yes, setting the property.

How do we determine if a disk is SSD vs HDD?

This information was already available from the kernel in zfsonlinux, so I used that. Explicit vdev guids can also be given in the rotorvector property.

If possible, it would be good if one could optionally set a category (rotor vector) index associated with each vdev, which would take precedence over any generic assignment by the pool property.

How do we determine which allocations to place on SSD?

By the kind of the allocation, and the size of it.

(It currently uses the compressed size, which I think is not very good. It may lead to vdev hopping for data of compressed files whose compression ratio makes the size straddle a cut-off point.)

What happens if one class gets full?

This I think is the most important question for any approach to this kind of feature.

With this approach, the pool is treated as full when one category is full (but not really exhausted, as usual). This is done by modifying the value reported by metaslab_class_get_dspace. See first part (before picture) of comment 1 Sep 2016. This is the only accounting change of this patchset.

I have not penetrated ZFS' accounting enough to figure out whether this causes any problems...? (Applying this did, however, bring ztest and zfstest failures down to what, as far as I can see, is the current 'random' background level for zfsonlinux.) For the user, reports from e.g. df make sense: as one category approaches full, the total (and available) space of a filesystem shrinks gradually.

Internally, all free space can be used until exhausted.

How is all of this communicated to the user (e.g. how can they tell which devices are SSD vs HDD,

An optional (currently flag -M) column showing the media type is added to zpool status and iostat. This has been submitted separately also (#5117), as I thought it could be useful on its own too.

For testing, this patch currently abuses this flag to also show which rotor vector index (i.e. category) each vdev has been assigned to in zpool iostat.

and how can they tell how full each type is)?

Not implemented yet. One idea would be to show, e.g. under the zpool iostat output, a separate list of the categories and the capacity of each: total, allocated and free. It would also be easy to mark which is currently the most used one, etc.

Once, say, 10% of pool capacity is used (so that one could assume that some usage pattern can be inferred), it would also make sense for the hints at the beginning of zpool status to give a warning if some non-last category is using a larger fraction of its space than the last (cheap) category.

With the power and flexibility given to the user to direct data to different categories also comes a need to provide mechanisms to easily monitor the multi-dimensional free space. More suggestions?

Does this provide any additional functionality beyond the "vdev allocation classes" project (#3779) that @don-brady is working on?

It gives the ability to also store (small) data blocks on the fast devices, and it has the flexibility to direct data (or metadata) to (multiple) different storage classes depending on block size. E.g. a hierarchy of vdevs: SSD, HDD mirror, HDD raidz...

(Currently it does not distinguish deduplication tables within the metadata kind of allocations, but adding a separate limit for that should be easy within the selection routine.)

@grizzlyfred commented on Feb 1, 2017

@inkdot7 Thanks for pointing me here, I just glanced at it. I never thought of mixing rotary and SSD storage in one pool except for l2arc, but the ideas are cool.

Also I think in the long run, rotary storage will disappear. For lack of an m.2 port and free PCI slots, I am quite happy with a 3 vdev pool of 3 ssds as primary bulk storage for the time being. But for enterprise applications it might be interesting...

Reading also across @don-brady's metadata issue: is that not like btrfs, where we can do e.g. -m raid1 -d raid0? To me it would be absolutely logical to have more control over metadata. I think for large setups there should be a way to define "vdev classes" and e.g. even specify that a given zfs filesystem, on creation or later, should prefer a certain "location", something btrfs somehow has with "rebalancing" and the ability to remove a disk from a pool. ZFS, as the (for me) more reliable and more performant system, would benefit from such features, until ultimately flash becomes as fast as RAM and as cheap as HDD, that is.
Also, "real" tiered storage would involve moving data around (as in "btier"). It was always said "ZFS is not a cluster fs", but as a logical consequence one should consider whether zfs couldn't be made into something similar to e.g. a docker swarm that can be distributed/replicated, including constraints on where a certain "container" == zfs-subsystem shall live and how often it shall be replicated etc. Or is it a waste to think it further, and one would rather see it as a local backbone for whatever cloud-based tiered fancy data-moving storage one would layer over it?

@ahrens (Member) commented on Feb 1, 2017

@inkdot7 Thanks for taking the time to answer my questions.

I think that @don-brady's project now includes support for separating out small user data blocks, so it includes a lot of the functionality of this project.

In terms of space accounting, I think we need to be able to spill over from one type of device to another. Otherwise we may have lots of free space (on the larger type), then write a little bit of data (which happens to go to the smaller type), and then be out of space.

@inkdot7 (Contributor, Author) commented on Feb 6, 2017

I think that @don-brady's project now includes support for separating out small user data blocks, so it includes a lot of the functionality of this project.

@don-brady can you confirm this?

In terms of space accounting, I think we need to be able to spill over from one type of device to another. Otherwise we may have lots of free space (on the larger type), then write a little bit of data (which happens to go to the smaller type), and then be out of space.

@ahrens I suppose you mean from a user-visible point of view, as to when the file-system should refuse further writes? (Internally, from the old release, I believe both approaches spill over as long as some category has free space.)

The question would then be how long ZFS should allow spilling. Until all space is used? But that would mean that the general performance for content written after the small/fast storage got full is that of the large/slow device types, in contradiction with the performance purpose of both approaches.

Personally, I would want ZFS to here refuse further user writes until the situation is resolved (either by deleting files or providing more small/fast capacity). However, I can see that other use cases may accept the performance degradation and prefer ZFS to spill indefinitely, i.e. until large/slow space is full.

Fortunately, both policies should be able to co-exist easily, by adjusting how dspace is reported: in the latter case as the space of the large/slow devices (alternatively all space, if spilling to faster devices is also allowed); in the former case, reporting it scaled with the fill factor of the relatively most used storage category (as in this project).
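As a tiny sketch of the two reporting policies (illustrative only; the names are not from either patch):

    /* Sketch: two possible ways of reporting dspace for the spill policy. */
    #include <stdint.h>

    typedef enum {
        SPILL_REFUSE_EARLY,     /* refuse writes when one category is full */
        SPILL_UNTIL_ALL_FULL    /* keep spilling until all space is used */
    } spill_policy_t;

    static uint64_t
    reported_dspace(spill_policy_t pol, uint64_t total_dspace,
        uint64_t dspace_scaled_by_most_used_category)
    {
        switch (pol) {
        case SPILL_REFUSE_EARLY:
            return (dspace_scaled_by_most_used_category);
        case SPILL_UNTIL_ALL_FULL:
        default:
            return (total_dspace);
        }
    }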

Naturally, the best case is if the admin/user notices the filling-up situation long before it is critical and takes action. The most likely cause of the small/fast storage filling up early is a too generous cut-off for small user data blocks going to it. Thus it is also important to be able to modify the policy and cut-off of eligible blocks for existing vdevs, and not only allow setting the policy on initial vdev creation.

@behlendorf (Contributor) commented:

@inkdot7 thanks for giving this problem so much thought.

As @ahrens mentioned, the goal is that @don-brady's metadata allocation classes work should comprehensively address many of the same issues you've uncovered in this PR. If there are specific use cases which you're concerned aren't handled, let's discuss them so we can settle on a single implementation which works for the expected use cases.

For example, as long as the code doesn't get too ugly I could see us adding a property to more finely control the spilling behavior on a per-dataset basis. Refusing writes when a class is fully consumed and allowing spilling both sound like reasonable policies? Are there other user configurable options we should be considering?

We definitely want to make sure we get things like the command line tools right. That means making them both easy to use and flexible enough to support any reasonable allocation class configuration. Both @ahrens and I also agree that this functionality needs to have good reporting features so it's possible to assess how your storage is actually getting used. We could even consider going as far as surfacing a warning in zpool status when things are really out of balance.

@don-brady since we have multiple people keenly interested in this functionality it would be great if you could update the #5182 PR with your latest code. It looks like the version posted is about 4 months old and I know you've improved things since then!

If you could add some example command output while you're at it I'm sure we'd all find it helpful. Updating the PR would make it possible for @inkdot7, @tuxoko and myself to provide constructive feedback. Feel free to open a new PR if that makes more sense and close the original one.

@inkdot7 let's close this PR and focus on getting the work in #5182 in to a form we're all happy with.

@don-brady (Contributor) commented:

@inkdot7 - Thanks for your effort here and the collected data. As noted by others we added support for small blocks as well as a host of other recommended changes and some simplifications to the allocation class feature.
@behlendorf - We are finishing up feedback, rebasing to latest master, and running tests. Per your request we'll post a WIP update soon. It might be quicker for the WIP update to document the CLI usage and observations in a simple markdown document than in the zpool.8 man page. Is there a convention for a document location or attachment?

@behlendorf (Contributor) commented:

@don-brady thanks, I think posting markdown in the comment section of the PR would be fine. Eventually, once the dust settles we'll of course need to update the man pages.

@behlendorf closed this on Feb 8, 2017
@inkdot7 (Contributor, Author) commented on Feb 13, 2017

@behlendorf Except for the spilling policy and the data-block-to-metadata-vdev size cut-off, I have no further ideas for user configuration options. But I would need to see the code and understand how it behaves first. (@don-brady To see the operating principles, a rebase to latest master would not be needed.)

Another use case occurred to me in the last few days:

Assume one wants to build a RAIDZ pool of hdds, and use some ssd to speed up at least metadata reads. As the ssd will be a separate vdev, it would have no redundancy, unless one invests in multiple ssds for a mirror. However, since ZFS can store blocks at multiple locations, it should be possible to store on the ssd only blocks that i) have multiple copies, and ii) then only the first copy. Redundancy is then provided by the RAIDZ vdev.

If it is ensured that any block stored on the ssd vdev also exists on another vdev, then the pool should be able to fully survive the complete loss of the ssd vdev? I have not tested how ZFS would currently handle such a non-critical loss of an entire vdev, but I could see myself liking this use case a lot, so I thought I'd mention it.
