ARC throughput is halved by init_on_alloc #9910
Comments
Might help to have a bit more information, like the contents of
Or at least arc_max and the output from arcstat.py
Isn't this a known issue?
The Linux 5.3 kernel adds a new feature that allows pages to be zeroed when they are allocated or freed: init_on_alloc and init_on_free. init_on_alloc is enabled by default on the Ubuntu 18.04 HWE kernel. ZFS allocates and frees pages frequently (via the ABD structure), e.g. for every disk access, so the additional overhead of zeroing these pages is significant: I measured a ~40% regression in the performance of an uncached "zfs send ... >/dev/null". This new "feature" can be disabled by setting init_on_alloc=0 in the GRUB kernel boot parameters, which undoes the performance regression. Linux kernel commit: torvalds/linux@6471384
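A sketch of checking and applying the workaround described above; treat this as a config fragment, since exact file paths and GRUB tooling vary by distro (the `/etc/default/grub` path assumes a Debian/Ubuntu-style setup):

```shell
# Check whether the running kernel has the mitigation enabled. If the
# parameter is absent from the cmdline, the kernel's compiled-in default
# (CONFIG_INIT_ON_ALLOC_DEFAULT_ON) applies.
cat /proc/cmdline
dmesg | grep 'mem auto-init'

# Disable it at boot: append init_on_alloc=0 to the kernel command line.
# On a Debian/Ubuntu-style setup, edit /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... init_on_alloc=0"
sudo update-grub    # or grub2-mkconfig -o /boot/grub2/grub.cfg elsewhere
# then reboot for the change to take effect
```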
Would it be insane to recycle these pages rather than continually allocating and freeing them, at least while ABD requests are 'hot'? I could look into this if the idea isn't struck down immediately.
We could. The challenge would be to not reintroduce the problems that ABD solved (locking down excess memory). I think it would be possible, but obviously much easier to just turn off this new kernel feature - which I will suggest to the folks at Canonical.
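The recycling idea above could be sketched as a bounded freelist. This is illustrative only: the names (abd_page_alloc/abd_page_free) are hypothetical, not OpenZFS APIs, and a real kernel implementation would need locking (or per-CPU lists) plus a shrinker hook so the cache cannot lock down excess memory, which is exactly the problem ABD was designed to avoid.

```c
#include <stdlib.h>

#define ABD_PAGE_SIZE   4096
#define ABD_CACHE_DEPTH 64      /* cap on retained pages */

static void *abd_freelist[ABD_CACHE_DEPTH];
static int abd_freelist_len;

/* Allocate a page-sized buffer, preferring a recycled one. A recycled
 * page never goes back through the allocator, so (conceptually) it also
 * skips the init_on_alloc zeroing cost; malloc stands in for the kernel
 * page allocator here. */
void *abd_page_alloc(void)
{
	if (abd_freelist_len > 0)
		return abd_freelist[--abd_freelist_len]; /* reuse a hot page */
	return malloc(ABD_PAGE_SIZE); /* cold path: fresh allocation */
}

/* Retain the buffer for reuse while under the cap; otherwise return it
 * to the system so the cache stays bounded. */
void abd_page_free(void *p)
{
	if (abd_freelist_len < ABD_CACHE_DEPTH)
		abd_freelist[abd_freelist_len++] = p;
	else
		free(p);
}
```

The cap is the interesting design choice: it bounds how much memory the cache can pin while still short-circuiting the common hot-path alloc/free cycle.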
Sincerely sorry for stating the obvious, but there are quite substantial security benefits to this feature. IMHO, it would be helpful to devise a workaround (or perhaps guideline ARC configuration options, pending a code fix?) that would mitigate the performance impact for those who value both ZFS and security. Can anyone chip in? Also, I don't think the issue is Canonical-specific; recommendations (such as this) exist that suggest turning this feature on by default, for good reasons IMHO. I happen to use Ubuntu, but it would be good to hear from people using other distros about the performance impact on ZFS.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Is there still no workaround other than disabling it?
@behlendorf @amotin Replicating the OP's test on Linux 6.5 with current OpenZFS master branch, it appears that the difference for init_on_alloc=[0|1] is negligible. Suggest this be closed, provided there are no objections. script:
init_on_alloc=0:
init_on_alloc=1:
@ednadolski-ix It becomes ANYTHING BUT negligible as soon as memory bandwidth reaches saturation, i.e. when application traffic reaches 10-20% of it. The test you are using is just inadequate: a single write stream from /dev/urandom, or even an ARC read into /dev/null, is unable to saturate anything; both are limited by single-core speed at best. We've disabled it in TrueNAS SCALE (truenas/linux@d165d39), and we did notice performance improvements. IMO it is a paranoid patch against brain-dead programmers unable to properly initialize memory. The only question is whether ZFS can control it somehow for its own allocations, or whether it has to be done by sane distributions.
System information
Describe the problem you're observing
Bandwidth for reads from the ARC is approximately half of the bandwidth of reads from the native Linux cache when ABD scatter is enabled. ARC read speed is fairly comparable to the native Linux cache when ABD scatter is disabled (i.e. 'legacy' linear mode).
Describe how to reproduce the problem
On a 16GB system with ~12GB free:
$ head --bytes=8G < /dev/urandom > testfile
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time cp testfile /dev/null
$ time cp testfile /dev/null
$ time cp testfile /dev/null
...
Repeat for the scatter-on, scatter-off, and non-ZFS cases for comparison.
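The repeated `time cp` passes above can be wrapped in a small helper; the function name and defaults here are my own, not from the report, and `date +%s%N` assumes GNU date (i.e. Linux):

```shell
# time_passes FILE [REPEATS]: prints per-pass 'cp FILE /dev/null' times in ms.
# Drop caches first (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches') so pass 1
# measures cold reads and later passes measure the warm cache/ARC.
time_passes() {
    file=$1
    repeats=${2:-3}
    i=1
    while [ "$i" -le "$repeats" ]; do
        start=$(date +%s%N)                 # nanoseconds since epoch
        cp "$file" /dev/null
        end=$(date +%s%N)
        echo "pass $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

Usage: `time_passes testfile 3`, once per configuration (scatter on, scatter off, non-ZFS).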
On my system, 8GB from a linear-ABD warm ARC takes 1.1 seconds, scatter-ABD takes 2.1 seconds, and 'native' takes 0.9 seconds.
(ARC also sometimes takes an arbitrarily high number of 'cp' repeats to get warm again in the presence of a lot of [even rather cold] native cache memory in use, hence the drop_caches above. I can file that as a separate issue if desirable.)
Apologies in advance if I cause blood to boil with the slipshod nature of this benchmark; I know it's only measuring one dimension of ARC performance (no writes, no latency measurements, no ARC<->L2ARC interactions) but it's a metric I was interested in at the time.