
ZFS low throughput on rbd based vdev #3324

Open · alitvak69 opened this issue Apr 20, 2015 · 26 comments
Labels: Bot: Not Stale · Type: Documentation · Type: Performance

Comments

@alitvak69

Sorry for the double post. I know fixing 0.6.4 takes priority over everything else; however, I decided to post this question in the hope that one of the developers can give me a hint on where to start.

When testing our Ceph cluster I found a very strange problem. When I create a ZFS file system on top of the rbd device /dev/rbd0, no matter what tweaks I apply I cannot exceed 30 MB/s on a 1 Gbit pipe. Setting sync=disabled has no effect. When I use XFS on the same device I come close to saturating 1 Gbit, i.e. I am writing at 109 MB/s.

I don't have compression enabled on ZFS, so I can see the real throughput.

Can someone help explain this?

zfs get all rbdlog2/cephlogs
NAME PROPERTY VALUE SOURCE
rbdlog2/cephlogs type filesystem -
rbdlog2/cephlogs creation Sun Apr 19 9:46 2015 -
rbdlog2/cephlogs used 4.62G -
rbdlog2/cephlogs available 995G -
rbdlog2/cephlogs referenced 4.62G -
rbdlog2/cephlogs compressratio 1.00x -
rbdlog2/cephlogs mounted yes -
rbdlog2/cephlogs quota none default
rbdlog2/cephlogs reservation none default
rbdlog2/cephlogs recordsize 32K inherited from rbdlog2
rbdlog2/cephlogs mountpoint /cephlogs local
rbdlog2/cephlogs sharenfs off default
rbdlog2/cephlogs checksum fletcher4 inherited from rbdlog2
rbdlog2/cephlogs compression off default
rbdlog2/cephlogs atime off inherited from rbdlog2
rbdlog2/cephlogs devices on default
rbdlog2/cephlogs exec on default
rbdlog2/cephlogs setuid on default
rbdlog2/cephlogs readonly off default
rbdlog2/cephlogs zoned off default
rbdlog2/cephlogs snapdir hidden default
rbdlog2/cephlogs aclinherit restricted default
rbdlog2/cephlogs canmount on default
rbdlog2/cephlogs xattr sa inherited from rbdlog2
rbdlog2/cephlogs copies 1 default
rbdlog2/cephlogs version 5 -
rbdlog2/cephlogs utf8only off -
rbdlog2/cephlogs normalization none -
rbdlog2/cephlogs casesensitivity sensitive -
rbdlog2/cephlogs vscan off default
rbdlog2/cephlogs nbmand off default
rbdlog2/cephlogs sharesmb off default
rbdlog2/cephlogs refquota none default
rbdlog2/cephlogs refreservation none default
rbdlog2/cephlogs primarycache metadata local
rbdlog2/cephlogs secondarycache metadata inherited from rbdlog2
rbdlog2/cephlogs usedbysnapshots 0 -
rbdlog2/cephlogs usedbydataset 4.62G -
rbdlog2/cephlogs usedbychildren 0 -
rbdlog2/cephlogs usedbyrefreservation 0 -
rbdlog2/cephlogs logbias throughput local
rbdlog2/cephlogs dedup off default
rbdlog2/cephlogs mlslabel none default
rbdlog2/cephlogs sync disabled inherited from rbdlog2
rbdlog2/cephlogs refcompressratio 1.00x -
rbdlog2/cephlogs written 4.62G -
rbdlog2/cephlogs logicalused 4.62G -
rbdlog2/cephlogs logicalreferenced 4.62G -
rbdlog2/cephlogs snapdev hidden default
rbdlog2/cephlogs acltype off default
rbdlog2/cephlogs context none default
rbdlog2/cephlogs fscontext none default
rbdlog2/cephlogs defcontext none default
rbdlog2/cephlogs rootcontext none default
rbdlog2/cephlogs relatime off default
rbdlog2/cephlogs redundant_metadata all default
rbdlog2/cephlogs overlay off default

zpool get all
NAME PROPERTY VALUE SOURCE
rbdlog2 size 1016G -
rbdlog2 capacity 0% -
rbdlog2 altroot - default
rbdlog2 health ONLINE -
rbdlog2 guid 12884943537457662683 default
rbdlog2 version - default
rbdlog2 bootfs - default
rbdlog2 delegation on default
rbdlog2 autoreplace off default
rbdlog2 cachefile - default
rbdlog2 failmode wait default
rbdlog2 listsnapshots off default
rbdlog2 autoexpand off default
rbdlog2 dedupditto 0 default
rbdlog2 dedupratio 1.00x -
rbdlog2 free 1011G -
rbdlog2 allocated 4.63G -
rbdlog2 readonly off -
rbdlog2 ashift 13 local
rbdlog2 comment - default
rbdlog2 expandsize - -
rbdlog2 freeing 0 default
rbdlog2 fragmentation 0% -
rbdlog2 leaked 0 default
rbdlog2 feature@async_destroy enabled local
rbdlog2 feature@empty_bpobj active local
rbdlog2 feature@lz4_compress active local
rbdlog2 feature@spacemap_histogram active local
rbdlog2 feature@enabled_txg active local
rbdlog2 feature@hole_birth active local
rbdlog2 feature@extensible_dataset enabled local
rbdlog2 feature@embedded_data active local
rbdlog2 feature@bookmarks enabled local

Some settings are the result of my tweaking and can be changed back.

@behlendorf added the Type: Performance and Difficulty - Medium labels Apr 24, 2015
@GregorKopka (Contributor)

With primarycache=metadata you might suffer read/modify/write issues.
How do you write to the dataset (block size)?

Could you try with a fresh dataset using the defaults for:
primarycache (all)
recordsize (128K)
checksum (on)
logbias (latency)
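
For example, something like this (just a sketch; the dataset name is only an example, and the -o options force the defaults back explicitly, since rbdlog2 overrides several of them):

zfs create -o primarycache=all -o recordsize=128K -o checksum=on -o logbias=latency rbdlog2/defaults-test
zfs get primarycache,recordsize,checksum,logbias rbdlog2/defaults-test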

@alitvak69 (Author)

This was tried and didn't make a difference. However:

Our engineer spent 12 hours researching the topic.

Bumping the parameters below helped on 0.6.4. I think he increased them roughly 10x, but one needs to play with it, as mileage varies.

zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active

I hope it helps someone.
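
For anyone wanting to experiment, these can be changed at runtime through /sys/module/zfs/parameters before making them permanent in /etc/modprobe.d/zfs.conf (a sketch; the values are only illustrative, not the exact ones we settled on):

echo 10000 > /sys/module/zfs/parameters/zfs_vdev_max_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_sync_write_min_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 20    > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active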

@alitvak69 (Author)

Returning to the issue in the hope that someone will have time to respond. With 0.6.5.5, the settings below speed up writing, but reading is horribly slow over a 10 Gb network:

zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active

It looks like the drop in read speed corresponds to the rbd block device utilization going to 100%. I am only doing a copy or rsync to a local drive; nothing else is accessing the partition with ZFS on top of the rbd block device.

Does anyone have a clue on where to start looking?

@gmelikov (Member)

Closing as stale.

If it's still relevant, feel free to reopen.

@gmelikov (Member)

Looks like the problem persists (http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-February/030543.html), so I've reopened it.

@happycouak

It seems something went wrong during the previous benchmarks, as bumping the parameters below (to roughly 10x their default values) actually improves sequential workloads significantly.

zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_max_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active

The thing is, I don't have any visibility into the potential downsides of bumping those parameters, so YMMV.

@behlendorf (Contributor)

@dafresh the default values were experimentally determined to give good performance for pools constructed from HDD/SSD devices. It's entirely possible that these aren't the best values for RBD devices, which may have very different performance characteristics. It would be great if you could post what values do work well for you so we could document the recommended tunings. The default values are:

/*
 * The maximum number of i/os active to each device.  Ideally, this will be >=
 * the sum of each queue's max_active.  It must be at least the sum of each
 * queue's min_active.
 */
uint32_t zfs_vdev_max_active = 1000;

/*
 * Per-queue limits on the number of i/os active to each device.  If the
 * number of active i/os is < zfs_vdev_max_active, then the min_active comes
 * into play. We will send min_active from each queue, and then select from
 * queues in the order defined by zio_priority_t.
 *
 * In general, smaller max_active's will lead to lower latency of synchronous
 * operations.  Larger max_active's may lead to higher overall throughput,
 * depending on underlying storage.
 *
 * The ratio of the queues' max_actives determines the balance of performance
 * between reads, writes, and scrubs.  E.g., increasing
 * zfs_vdev_scrub_max_active will cause the scrub or resilver to complete
 * more quickly, but reads and writes to have higher latency and lower
 * throughput.
 */
uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 2;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

@tdb

tdb commented Feb 28, 2018

I've had some success with these values. At least, they were an improvement over the defaults. I haven't spent a lot of time tuning or testing.

# defaults given at the end of the line
options zfs zfs_max_recordsize=4194304          # 1048576
options zfs zfs_vdev_async_read_max_active=18   # 3
options zfs zfs_vdev_async_write_max_active=60  # 10
options zfs zfs_vdev_scrub_max_active=12        # 2
options zfs zfs_vdev_sync_read_max_active=60    # 10
options zfs zfs_vdev_sync_write_max_active=60   # 10

@richardelling (Contributor)

The above are (mostly) ZIO scheduler tunables. More likely you need to adjust the write throttle tunables, as documented in the zfs-module-parameters man page section ZFS TRANSACTION DELAY.

But first... check the latency distribution from zpool iostat -w and see if there are outliers in the high latency buckets. If you don't have a version with the -w option, then you might try an external tool for measuring I/O latency, such as iolatency https://github.com/brendangregg/perf-tools/blob/master/examples/iolatency_example.txt
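
For example (the pool name and interval are placeholders):

zpool iostat -w rbdlog2 5     # per-vdev latency histograms, refreshed every 5 seconds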

@happycouak

@behlendorf RBD volume latency behaves differently than that of traditional hard drives, so this is what I ended up thinking too. Please find below the parameters and values that give me much better performance:

options zfs zfs_vdev_max_active=10000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=100
options zfs zfs_vdev_async_write_min_active=10

I will try more benchmarks, also with @tdb's values, when I get some time. Note that I am not a big fan of diverging from the default settings, especially with ZFS, which has some internal magic, but it seems unavoidable in this context.

@richardelling reading the ZFS TRANSACTION DELAY section, if my understanding is correct, would "zfs_delay_scale" alone suffice to achieve the same behavior (except maybe for zfs_vdev_max_active) as the previous tuning?

@richardelling (Contributor)

To determine if the zfs_vdev_*_max_active values are appropriately set, use zpool iostat -q [interval] during your workload. If the number of "activ" I/Os is capped at *_max_active and "pend" > "activ" for extended periods of time, then increasing *_max_active allows more I/Os to be in flight to the device.

Ultimately, performance is determined by the latency of the I/Os. So if the bandwidth-delay-product of the network (to/from RBD) is low, then injecting more concurrent I/Os (increasing *_max_active) can make more efficient use of the network. But this typically has no effect on the internal latency of the RBD device. This is why we use zpool iostat -w to see the latency distribution first, then look at efficiency of the network with zpool iostat -q second.

For the write throttle, as described in ZFS TRANSACTION DELAY, a good starting point is zfs_dirty_data_max. If increasing this value doesn't improve write performance, then the throttle likely isn't the issue. If increasing it does improve performance, then the other related tunables come into play. The goal of the write throttle is to prevent performance from "falling off the cliff" when the device cannot quickly absorb the writes. For example, a device with a small write cache can perform very slowly once the write cache fills, so the write throttle tries to apply back pressure on the write workload generator to prevent that cliff.

So the basic approach is to determine if you need more or less concurrent I/O in the pipeline (*_max_active) vs more or less write throttling. Sometimes you need to tune both.
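
A rough way to test the write-throttle hypothesis at runtime (the 8 GiB figure below is only illustrative):

cat /sys/module/zfs/parameters/zfs_dirty_data_max                 # current ceiling, in bytes
echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max   # try 8 GiB, then re-run the workload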

@npdgm

npdgm commented Aug 17, 2018

@richardelling, on the same systems with RBD vdevs described by @dafresh, we're still getting disappointing performance for a 99% read workload, despite all attempts at tuning the zfs_vdev_*_active parameters. Your procedure for determining appropriate values using zpool iostat makes sense and I understand the logic of it, but looking at all the metrics it appears something else in ZFS is preventing concurrent I/Os.

zpool iostat -q reveals that we barely register any "pending" requests even with a sub-second interval. Increasing the interval even hides the active reads:

# zpool iostat -q 0.2
              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
pool1        357T  56.7T    297      4  33.5M  19.8K      0     24      0      0      0      0      0      0      0      0
pool1        357T  56.7T    421      4  49.7M  19.8K      0     64      0      0      0      0      0      0      0      0
pool1        357T  56.7T    426      4  50.4M  19.8K      0     34      0      0      0      0      0      0      0      0

# zpool iostat -q 2
              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
pool1        357T  56.7T    220     32  22.8M   838K      0      0      0      0      0      0      0      0      0      0
pool1        357T  56.7T    244     91  26.8M   735K      0      2      0      0      0      0      0      0      0      0
pool1        357T  56.7T    213      4  22.6M  20.0K      0      0      0      0      0      0      0      0      0      0

iostat is consistent with that lack of concurrency. As you can see, avgqu-sz is kept low by ZFS while it handles lots of concurrent random reads.

# iostat -xm 2 vdb vdc
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdb               0.00     0.00  102.00    1.50  9812.00     6.00   189.72     2.19   21.16   21.47    0.00   7.38  76.40
vdc               0.00     0.00  112.00    1.00 10882.00     4.00   192.67     2.35   20.99   21.18    0.00   7.45  84.20

You may find await and %util quite high here, but this Ceph backend can deliver good throughput with more in-flight requests. This was confirmed with fio benchmarks and other filesystems.
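
For reference, the kind of fio run used for that confirmation looked roughly like this (a sketch; the device, block size, and depth are illustrative, not the exact job file):

fio --name=randread --filename=/dev/vdb --direct=1 --rw=randread \
    --bs=128k --iodepth=32 --numjobs=4 --runtime=60 --group_reporting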

So because zfs_vdev_*_active didn't increase concurrency over the block device, I'm left with two leads:

  • Can we make ZFS rely more on the block device scheduler and keep its pipeline fed? I read this is prevented by design for I/O prioritization, but it hurts RBD performance. Has this behaviour been changed, or is there a related tunable?
  • As seen in zpool iostat -q, all I/O is synchronous. It's not caused by the workload (NFS async export). I've found b39c22b, and Fix synchronous behavior in __vdev_disk_physio() #3833 from @behlendorf (393ee23), which suggest READ_SYNC is enforced for non-rotational devices. RBD/virtio devices do have the queue/rotational flag set to 1, so I believe that's not affecting them. I could be wrong, though.

Do you have any insight into what can be changed to push as many read requests as possible to RBD?
It's all about read performance for files in the range of 0.5 to 10 times the recordsize of 128k.

Cheers

@npdgm

npdgm commented Aug 20, 2018

Adding a few printk calls confirms that vdev_nonrot=0 on all RBD devices.
Also, I didn't mention that sync=disabled does not change the behaviour. Copying a file issues sync reads only, although the writes from that copy are async.

@peterska

peterska commented Mar 30, 2019

I managed to get excellent sequential write performance by setting the following module parameters:
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304

For these to work, the large_blocks zpool feature must be enabled.
This improves sequential reads and writes on Ceph-backed block devices by coalescing reads and writes into 4 MB blocks, which matches the Ceph object size. This avoids a lot of read-modify-write ops on the Ceph side.
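
A minimal sketch of what that looks like in practice (the pool and dataset names are placeholders; the module options above must already be loaded):

zpool set feature@large_blocks=enabled tank
zfs set recordsize=4M tank/data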

@prghix

prghix commented Jul 2, 2019

Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values:

/etc/modprobe.d/zfs.conf

options zfs zfs_vdev_max_active=40000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=20000
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=20000
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304

@gmelikov reopened this Jul 2, 2019
@gmelikov added the Type: Documentation label Jul 2, 2019
@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label Aug 24, 2020
The stale bot closed this as completed Nov 26, 2020
@arthurd2

No news about this?

@behlendorf reopened this Nov 26, 2020
The stale bot removed the Status: Stale label Nov 26, 2020
@ozkangoksu

ozkangoksu commented Jun 5, 2021

(Quoting @prghix's /etc/modprobe.d/zfs.conf values from the comment above.)

These options doubled the performance. Pending operations and latency decreased.

TEST: 4GB tar ball write operation:
ZFS RAID 1 SATA SSD: 0m9.9s
TUNED ZFS RBD: 0m17.7s
NON-TUNED ZFS RBD: 0m34.5s

But these tunings messed up random R/W performance.
I'm still trying to improve it further.
Do you have any advice?

@prghix

prghix commented Jun 9, 2021

We're still using these values from back in 2019, without any changes :/

@stale

stale bot commented Jun 10, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label Jun 10, 2022
@behlendorf added the Bot: Not Stale label and removed the Status: Stale label Jun 14, 2022
@arthurd2

We have abandoned ZFS+RBD for now, changing our hosts to XFS.
But from time to time we come back to check for updates.

@ozkangoksu, do you have a script to run these tests?
I can replicate them here and compare.

@ozkangoksu

@arthurd2 Unfortunately it has been a long time and I don't keep the fio scripts; I write them for each use case. I don't remember whether I shared them on the mailing list along with the benchmark results. I will check when I get some free time.

We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and checksumming costs speed. RBD is also not that efficient; it's good for most cases, but its response time makes things even harder for ZFS.

@prghix

prghix commented Mar 1, 2023

We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and checksumming costs speed. RBD is also not that efficient; it's good for most cases, but its response time makes things even harder for ZFS.

I used to get ~20-30 MB/s throughput on Mimic.

Now we have several Pacific/Quincy clusters and I'm at about 1/10 of the former throughput.

3 MB/s is awful... I tried to tune everything according to the manual:

https://openzfs.org/wiki/ZFS_on_high_latency_devices

It looks like I have working aggregation, but throughput is painfully slow :/

@prghix

prghix commented Mar 1, 2023

btw: I'm talking about saving ZFS snapshots (=backups) on RBD devices...

@serjponomarev

@behlendorf

Hello. I've encountered the same issue as mentioned above and tried using the parameters discussed in this thread, but without success. Additionally, while monitoring zpool iostat -q 1, I noticed that the number of syncq_read requests during random reads never exceeds 1. In other words, whether testing with a depth of 1 or a depth of 32, I was getting around 1000-1500 IOPS, and in zpool iostat I observed a comparable latency of ~1 ms-500 us, corresponding to the depth-of-1 tests.

If using a depth of 1 and 32 threads, the number of requests matched the number of active synchronous read operations in zpool iostat -r, and it was 32. To address this issue, I experimented with a zvol, and it worked. When testing with a single thread and a depth of 32 on a raw zvol, I achieved similar performance to testing ZFS with 32 threads and a depth of 1 - approximately 10K IOPS.

Next, I tested creating a zpool over a zvol, and I encountered the same issue with a depth limitation of 1. Then I formatted the zvol with XFS, created and populated a test file, and ran random read tests. The results matched the previous test with 32 threads and a depth of 1, totaling around 10K IOPS.

In conclusion, using zvol resolves the problem for RBD because zvol facilitates request aggregation.
All tests were conducted with primarycache=metadata.
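
The zvol test setup was roughly the following (a sketch; the pool name, size, and volblocksize are placeholders rather than the exact values used):

zfs create -V 1T -o volblocksize=64K rbdpool/vol1
mkfs.xfs /dev/zvol/rbdpool/vol1
mount /dev/zvol/rbdpool/vol1 /mnt/vol1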

OS: Ubuntu 22.04
Ceph: Quincy, 17.2.6
ZFS: 2.1.5

How can this experience be applied to ZFS over RBD without using zvol?
If not possible, how safe is it to use large-sized zvols, for example, 10-20-30 TB?
Which file system is best for a large zvol?

@serjponomarev

serjponomarev commented Dec 18, 2023

(Quoting my previous comment above.)

In general, if anyone is having problems with poor ZFS-over-RBD performance, you most likely have one in-flight I/O per thread (regardless of the queue depth you set).

Performance on such a configuration scales with threads. To address this issue, you either need to use more threads or aggregate the depth of the requests. I found two solutions for aggregating depth:

  1. zvol: the depth is regulated by the number of zvol threads in /sys/module/zfs/parameters/zvol_threads (default is 32).
  2. nfs server: by default, the number of nfsd processes in most distributions is 8. When you increase the number of nfsd processes, IOPS increases with the number of nfsd threads. For local use, mount NFS via the loopback interface (see the sketch at the end of this comment).
    For most distributions, the parameters for the number of nfs server processes are located in /etc/default.
    For Ubuntu 22.04 and higher, read this manual.

With the zvol default (32 threads) and the NFS server at 32 processes,
these parameters gave me maximum performance:

options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_aggregation_limit=1048576
options zfs zfs_vdev_aggregation_limit_non_rotating=1048576
options zfs zfs_dirty_data_max=1342177280
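
For the nfsd thread count from point 2 above, a sketch for Debian/Ubuntu-style systems (the value 32 is just an example):

# /etc/default/nfs-kernel-server (older Debian/Ubuntu releases)
RPCNFSDCOUNT=32

# /etc/nfs.conf (newer nfs-utils, e.g. Ubuntu 22.04 and later)
[nfsd]
threads=32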

