ZFS low throughput on rbd based vdev #3324
Comments
With primarycache=metadata you might suffer read/modify/write issues. Could you try with a fresh dataset using the defaults for:
This was tried and didn't make a difference. However, our engineer spent 12 hours researching the topic. Bumping the parameters below helped on 0.6.4. I think he increased them 10x, but one needs to experiment, as mileage varies. zfs_vdev_max_active I hope it helps someone.
Returning to the issue in the hope someone will have time to respond. With 0.6.5.5, the settings below speed up writing, but reading is horribly slow over a 10 Gb network: zfs_vdev_max_active It looks like the drop in read speed corresponds to rbd block device utilization going to 100%. I am only doing a copy or rsync to a local drive; nothing else is accessing the partition with zfs on top of the rbd block device. Does anyone have a clue where to start looking?
Closing as stale. If it's still an issue, feel free to reopen.
Looks like the problem persists (http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-February/030543.html); reopened.
It seems something went wrong during the previous benchmarks, as bumping the parameters below (to 10x their default values) actually improves sequential workloads significantly. zfs_vdev_async_read_max_active The thing is, I don't have any visibility into the potential downsides of bumping those parameters, so YMMV.
@dafresh the default values were experimentally determined to give good performance for pools built from HDD/SSD devices. It's entirely possible that these aren't the best values for rbd devices, which may have very different performance characteristics. It would be great if you could post what values work well for you so we could document the recommended tuning. The default values are:
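The defaults themselves didn't survive the quote above. As a sketch, on Linux the live values of these tunables can be read from sysfs wherever the zfs module is loaded (parameter names are from zfs-module-parameters(5); the function prints "?" elsewhere):

```shell
# Print the current ZIO scheduler limits from /sys/module/zfs/parameters,
# falling back to "?" on machines where the zfs module is not loaded.
show_zio_limits() {
  for p in zfs_vdev_async_read_max_active zfs_vdev_async_write_max_active \
           zfs_vdev_sync_read_max_active zfs_vdev_sync_write_max_active \
           zfs_vdev_scrub_max_active zfs_vdev_max_active; do
    f="/sys/module/zfs/parameters/$p"
    if [ -r "$f" ]; then echo "$p=$(cat "$f")"; else echo "$p=?"; fi
  done
}
show_zio_limits
```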
I've had some success with these values. At least, they were an improvement over the defaults. I haven't spent a lot of time tuning or testing.

options zfs zfs_max_recordsize=4194304          # default 1048576
options zfs zfs_vdev_async_read_max_active=18   # default 3
options zfs zfs_vdev_async_write_max_active=60  # default 10
options zfs zfs_vdev_scrub_max_active=12        # default 2
options zfs zfs_vdev_sync_read_max_active=60    # default 10
options zfs zfs_vdev_sync_write_max_active=60   # default 10
The above are (mostly) ZIO scheduler tunables. More likely you need to adjust the write throttle tunables, as documented in the zfs-module-parameters man page under ZFS TRANSACTION DELAY. But first, check the latency distribution from:
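The specific tool referenced here was lost from the quote. One way to inspect per-vdev latency on current ZFS (an assumption on my part: it requires 0.8 or later, which added the -l and -w options to zpool iostat) is:

```shell
# A no-op sketch on machines without ZFS installed.
show_vdev_latency() {
  if command -v zpool >/dev/null 2>&1; then
    zpool iostat -l 1 1   # one sample of average latency per I/O class
    zpool iostat -w       # full latency histograms per vdev
  else
    echo "zpool not available; commands shown for illustration"
  fi
}
show_vdev_latency
```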
@behlendorf RBD volumes' latency behaves differently from that of traditional hard drives, so this is what I ended up thinking too. Please find below the parameters and values that give me much better performance:
I will try more benchmarks, also with @tdb's values, when I get some time. Note that I am not a big fan of diverging from default settings, especially with ZFS, which has some internal magic, but it seems unavoidable in this context. @richardelling: if my reading of the ZFS TRANSACTION DELAY section is correct, would "zfs_delay_scale" alone suffice to achieve the same behavior as the previous tuning (except maybe for zfs_max_active)?
To determine if zfs_vdev_*_max_active is appropriately set, use

Ultimately, performance is determined by the latency of the I/Os. So if the bandwidth-delay product of the network (to/from RBD) is low, then injecting more concurrent I/Os (increasing *_max_active) can make more efficient use of the network. But this typically has no effect on the internal latency of the RBD device. This is why we use

For the write throttle, as described in ZFS TRANSACTION DELAY, a good starting point is

So the basic approach is to determine whether you need more or less concurrent I/O in the pipeline (*_max_active) versus more or less write throttling. Sometimes you need to tune both.
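To make the throttle concrete: per the ZFS TRANSACTION DELAY section of zfs-module-parameters(5), once dirty data passes the zfs_delay_min_dirty_percent knee, each write is delayed by roughly zfs_delay_scale * (dirty - min) / (max - dirty). A small sketch of that curve (the numbers are arbitrary illustrative units, not real byte counts):

```shell
# delay_ns dirty min max scale -> approximate per-write delay following the
# curve in zfs-module-parameters(5); zero while dirty data stays below the
# zfs_delay_min_dirty_percent knee, rising sharply as dirty approaches max.
delay_ns() {
  awk -v d="$1" -v lo="$2" -v hi="$3" -v s="$4" \
    'BEGIN { if (d <= lo) print 0; else print int(s * (d - lo) / (hi - d)) }'
}
delay_ns 50 60 100 500000   # below the knee: no delay
delay_ns 80 60 100 500000   # past the knee: delay kicks in
delay_ns 90 60 100 500000   # close to the max: delay grows steeply
```

Raising zfs_delay_scale steepens the whole curve, which is why it is the first knob to try before moving the knee itself.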
@richardelling, on the same systems with RBD vdevs described by @dafresh, we're still getting disappointing performance for a 99% read workload, despite all attempts at tuning
You may find ... So because ...
Do you have any insight into what can be changed to push as many read requests as possible to RBD? Cheers
Adding a few
I managed to get excellent sequential write performance by setting the following module parameters. For these to work, the large_blocks zpool feature must be enabled.
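For reference, a sketch of the pool-side setup this implies (the pool/dataset names "tank" and "tank/data" are assumptions; large_blocks is what allows recordsize above 128K):

```shell
# Guarded so this is a no-op where the assumed pool does not exist.
enable_large_blocks() {
  if command -v zpool >/dev/null 2>&1 && zpool list tank >/dev/null 2>&1; then
    zpool set feature@large_blocks=enabled tank
    zfs set recordsize=1M tank/data
    zfs get -H -o value recordsize tank/data
  else
    echo "pool 'tank' not present; commands shown for illustration"
  fi
}
enable_large_blocks
```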
Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values in /etc/modprobe.d/zfs.conf:
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
No news about this? |
These options doubled the performance: pending operations and latency decreased.

TEST: 4 GB tarball write operation:

But these tunings messed up random RW performance.
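For anyone trying to reproduce the "random RW got worse" observation, a minimal fio job along these lines could be used (the directory, sizes, and mix are placeholders, not the poster's actual test):

```ini
; illustrative fio job file -- point directory= at the ZFS-on-RBD mount
[randrw]
directory=/tank/fiotest
rw=randrw
rwmixread=70
bs=16k
size=1g
numjobs=4
iodepth=8
ioengine=libaio
group_reporting
```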
We're still using these values from back in 2019 without any changes :/
We have abandoned ZFS+RBD for now, changing our hosts to XFS. @ozkangoksu, do you have a script to run these tests?
@arthurd2 Unfortunately it has been a long time and I don't keep fio scripts; I write them per use case. I don't remember whether I shared them with the mailing list along with the benchmark results. I will check when I get some free time. We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and checksumming costs speed. RBD is not efficient either; it's good for most cases, but its response time makes things even harder for ZFS.
I used to get ~20-30 MB/s throughput on Mimic. Now we have several Pacific/Quincy clusters and I'm at about 1/10 of the former throughput. 3 MB/s is awful. I tried to tune everything according to the manual (https://openzfs.org/wiki/ZFS_on_high_latency_devices); it looks like aggregation is working, but throughput is PAINFULLY slow :/
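That wiki page revolves around raising aggregation and queue-depth tunables for high-latency vdevs. An illustrative (untested here, values are examples only, not recommendations) modprobe fragment of that style:

```
# /etc/modprobe.d/zfs.conf -- example values for a high-latency vdev
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_write_max_active=30
```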
By the way, I'm talking about saving ZFS snapshots (i.e. backups) on RBD devices.
Hello. I've encountered the same issue as mentioned above and tried the parameters discussed in this thread, but without success. Additionally, while monitoring

If using a depth of 1 and 32 threads, the number of requests matched the number of active synchronous read operations in

Next, I tested creating a

In conclusion, using

OS: Ubuntu 22.04

How can this experience be applied to ZFS over RBD without using
In general, if anyone is having problems with poor performance of ZFS over RBD, you most likely have 1-thread/1-IOPS performance (regardless of the queue depth you set). Performance in such a configuration scales with threads. To address this, you either need to use more threads or aggregate the depth of the requests. I found two solutions for aggregating depth:
For zvol, the default is 32 threads; for the NFS server, 32 processes.
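The "more threads" observation above can be checked with a simple multi-threaded sequential-read fio job; a sketch (directory and sizes are placeholders):

```ini
; illustrative fio job file -- 32 threads at depth 1, matching the
; 1-thread/1-IOPS scaling behavior described above
[seqread-32]
directory=/tank/fiotest
rw=read
bs=1M
size=512m
numjobs=32
iodepth=1
ioengine=libaio
group_reporting
```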
Sorry for the double post. I know fixing 0.6.4 takes priority over everything else; however, I decided to post this question in the hope that one of the developers will give me a hint about where to start.
When testing our Ceph cluster I found a very strange problem. When I create a ZFS file system on top of the RBD device /dev/rbd0, no matter what tweaks I apply I cannot exceed 30 MB/s over a 1 Gbit pipe. Setting sync=disabled has no effect. When I use XFS on the same device I come close to saturating 1 Gbit, i.e. I am writing at 109 MB/s.
I don't have compression enabled on ZFS, so I can see the real throughput.
Can someone help explain this?
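A simple probe in the spirit of the measurement above (the target directory is a placeholder; on the reporter's system it would live on the ZFS-on-RBD mount, and the dd summary line gives the throughput):

```shell
# Sequential-write probe: writes 64 MiB with an fdatasync at the end and
# prints dd's summary line (requires GNU dd for conv=fdatasync).
seq_write_test() {
  dir="${1:-$(mktemp -d)}"
  dd if=/dev/zero of="$dir/ddtest" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
  rm -f "$dir/ddtest"
}
seq_write_test
```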
zfs get all rbdlog2/cephlogs
NAME PROPERTY VALUE SOURCE
rbdlog2/cephlogs type filesystem -
rbdlog2/cephlogs creation Sun Apr 19 9:46 2015 -
rbdlog2/cephlogs used 4.62G -
rbdlog2/cephlogs available 995G -
rbdlog2/cephlogs referenced 4.62G -
rbdlog2/cephlogs compressratio 1.00x -
rbdlog2/cephlogs mounted yes -
rbdlog2/cephlogs quota none default
rbdlog2/cephlogs reservation none default
rbdlog2/cephlogs recordsize 32K inherited from rbdlog2
rbdlog2/cephlogs mountpoint /cephlogs local
rbdlog2/cephlogs sharenfs off default
rbdlog2/cephlogs checksum fletcher4 inherited from rbdlog2
rbdlog2/cephlogs compression off default
rbdlog2/cephlogs atime off inherited from rbdlog2
rbdlog2/cephlogs devices on default
rbdlog2/cephlogs exec on default
rbdlog2/cephlogs setuid on default
rbdlog2/cephlogs readonly off default
rbdlog2/cephlogs zoned off default
rbdlog2/cephlogs snapdir hidden default
rbdlog2/cephlogs aclinherit restricted default
rbdlog2/cephlogs canmount on default
rbdlog2/cephlogs xattr sa inherited from rbdlog2
rbdlog2/cephlogs copies 1 default
rbdlog2/cephlogs version 5 -
rbdlog2/cephlogs utf8only off -
rbdlog2/cephlogs normalization none -
rbdlog2/cephlogs casesensitivity sensitive -
rbdlog2/cephlogs vscan off default
rbdlog2/cephlogs nbmand off default
rbdlog2/cephlogs sharesmb off default
rbdlog2/cephlogs refquota none default
rbdlog2/cephlogs refreservation none default
rbdlog2/cephlogs primarycache metadata local
rbdlog2/cephlogs secondarycache metadata inherited from rbdlog2
rbdlog2/cephlogs usedbysnapshots 0 -
rbdlog2/cephlogs usedbydataset 4.62G -
rbdlog2/cephlogs usedbychildren 0 -
rbdlog2/cephlogs usedbyrefreservation 0 -
rbdlog2/cephlogs logbias throughput local
rbdlog2/cephlogs dedup off default
rbdlog2/cephlogs mlslabel none default
rbdlog2/cephlogs sync disabled inherited from rbdlog2
rbdlog2/cephlogs refcompressratio 1.00x -
rbdlog2/cephlogs written 4.62G -
rbdlog2/cephlogs logicalused 4.62G -
rbdlog2/cephlogs logicalreferenced 4.62G -
rbdlog2/cephlogs snapdev hidden default
rbdlog2/cephlogs acltype off default
rbdlog2/cephlogs context none default
rbdlog2/cephlogs fscontext none default
rbdlog2/cephlogs defcontext none default
rbdlog2/cephlogs rootcontext none default
rbdlog2/cephlogs relatime off default
rbdlog2/cephlogs redundant_metadata all default
rbdlog2/cephlogs overlay off default
zpool get all
NAME PROPERTY VALUE SOURCE
rbdlog2 size 1016G -
rbdlog2 capacity 0% -
rbdlog2 altroot - default
rbdlog2 health ONLINE -
rbdlog2 guid 12884943537457662683 default
rbdlog2 version - default
rbdlog2 bootfs - default
rbdlog2 delegation on default
rbdlog2 autoreplace off default
rbdlog2 cachefile - default
rbdlog2 failmode wait default
rbdlog2 listsnapshots off default
rbdlog2 autoexpand off default
rbdlog2 dedupditto 0 default
rbdlog2 dedupratio 1.00x -
rbdlog2 free 1011G -
rbdlog2 allocated 4.63G -
rbdlog2 readonly off -
rbdlog2 ashift 13 local
rbdlog2 comment - default
rbdlog2 expandsize - -
rbdlog2 freeing 0 default
rbdlog2 fragmentation 0% -
rbdlog2 leaked 0 default
rbdlog2 feature@async_destroy enabled local
rbdlog2 feature@empty_bpobj active local
rbdlog2 feature@lz4_compress active local
rbdlog2 feature@spacemap_histogram active local
rbdlog2 feature@enabled_txg active local
rbdlog2 feature@hole_birth active local
rbdlog2 feature@extensible_dataset enabled local
rbdlog2 feature@embedded_data active local
rbdlog2 feature@bookmarks enabled local
Some settings are the result of my tweaking and can be changed back.