LTO + PGO + Bolt #5048
Comments
Please forgive my ignorance, but aren't these tools optimizing based on profiling? With rsyslog, we have widely different workloads, and thus different frequently executed code paths. There is no "prototype scenario" which we could profile for. So is this really a tool for universal optimization - or more something an end user may run to fine-tune specific needs when building rsyslog?
Yes, PGO and BOLT are based on profiling. However, LTO does not require a profiling step - that is just a link-time optimization.
That is a good question. Even if rsyslog has different components, there are usually not that many "happy paths" that could be considered default paths. E.g. let's consider the Rust compiler (rustc). Even though it compiles very different programs in real life, it benefits a lot from the profile (or profile set) prepared by the Rust team, and PGO gives rustc a huge compile-speed boost. Another idea is suggested here. Since rsyslog has many components, it could be a good idea to prepare a "happy path" workload for each of the components. PGO tooling has a built-in ability to merge multiple profiles into one big profile and optimize the program according to multiple workloads. Even with multiple profiles, it could still be beneficial, since you optimize for real-life happy paths. Without profiling information, today's compilers cannot reliably tell "hot" branches from "cold" ones.
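To make the profile-merging idea a bit more concrete, here is a minimal Python sketch that only assembles the `llvm-profdata merge` command line for a set of per-component raw profiles (Clang-style `.profraw` files). The file names and the helper `merge_profiles_cmd` are hypothetical, and nothing here is part of rsyslog's build scripts. With GCC, profile data from several training runs is accumulated in the `.gcda` files instead.

```python
from typing import List


def merge_profiles_cmd(raw_profiles: List[str],
                       output: str = "merged.profdata") -> List[str]:
    """Build the llvm-profdata command that merges raw profiles into one."""
    if not raw_profiles:
        raise ValueError("at least one raw profile is required")
    # General shape: llvm-profdata merge -output=<file> <input>...
    return ["llvm-profdata", "merge", f"-output={output}", *raw_profiles]


# Hypothetical per-component "happy path" profiles (names are made up).
cmd = merge_profiles_cmd(["file_input.profraw", "tcp_forward.profraw"])
print(" ".join(cmd))
```

The merged file would then be handed back to Clang on the second build, for example via `-fprofile-use=merged.profdata`.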
There are multiple ways to use profiling information with PGO and BOLT. The first one (easier to implement, in my opinion) is to integrate PGO and BOLT into the existing rsyslog build scripts and provide guidelines in the repo, something like "How to build your own rsyslog with PGO and BOLT". In this case, it will be much easier for users and distro maintainers (if they wish) to prepare their own baseline profile scenarios and build their own version of rsyslog, optimized for their specific workloads. Unfortunately, the users will then need to prepare profile scenarios, execute multi-stage builds, etc. The second is for the project to prepare a profile scenario (or a set of them). In this case, the user will be able to execute a multi-stage build with the scenarios provided by rsyslog. Yes, it could be a little less beneficial from the performance perspective, but it could still give a performance improvement. If the user needs an even bigger performance boost, they can always use the first way and prepare their own profile scenario. This second way is more difficult, since you need to prepare profile scenarios in addition to the build script integration.
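The multi-stage build mentioned above could look roughly like the following sketch. It assumes GCC-style flags (`-fprofile-generate`, `-fprofile-use`, `-fprofile-correction`) passed through `CFLAGS` to `./configure`. The profile directory, the training command, the config file name and the helper `pgo_build_plan` are placeholders made up for illustration, not an existing rsyslog feature. The code only builds the list of commands and does not run them.

```python
from typing import Dict, List

Command = List[str]


def pgo_build_plan(profile_dir: str, workload: Command) -> Dict[str, List[Command]]:
    """Return the three PGO stages as lists of commands (nothing is executed)."""
    generate = f"CFLAGS=-O2 -fprofile-generate={profile_dir}"
    # -fprofile-correction lets GCC smooth over counters that became
    # inconsistent because several threads updated them concurrently.
    use = f"CFLAGS=-O2 -fprofile-use={profile_dir} -fprofile-correction"
    return {
        "instrument": [["./configure", generate], ["make"]],
        "train": [workload],
        "optimize": [["make", "clean"], ["./configure", use], ["make"]],
    }


# Hypothetical training run: rsyslogd in the foreground with a benchmark config.
plan = pgo_build_plan("/tmp/rsyslog-pgo", ["rsyslogd", "-n", "-f", "bench.conf"])
for stage, commands in plan.items():
    for command in commands:
        print(f"{stage}: {' '.join(command)}")
```

In the first approach the user supplies the training command; in the second one the project would ship one or more such scenarios.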
One comment about this (as someone outside the fish bowl):
I tested LTO on Crypto++ and OpenSSL. LTO did not provide performance gains. For Crypto++ benchmarks, performance dropped, and the linker actually generated bad code. The bad code resulted in crashes and failed self tests. See https://www.cryptopp.com/wiki/Link_Time_Optimization . Also, I've never seen a paper or presentation showing the benefits of using LTO. That is, I would expect to see someone present a selection of projects, including rsyslog, and show the actual before and after numbers to substantiate the performance claims. My takeaway was that LTO is a solution looking for a problem. We advise against using LTO for programs built using the Crypto++ library. Anyway, I don't have a dog in this fight. I just wanted to share my experience and thoughts with the team before they spin up a task on this.
There are multiple reports showing that LTO can improve performance: the Clangd report from JetBrains, the Phoronix benchmark on GCC 10 (note that this page also shows a performance decrease for some projects), and the rustc compiler. I can find more, but there is no need: LTO can improve performance. However, it's not guaranteed, so as usual it depends on the project, the chosen tooling, etc. Regarding the performance decrease in your case - did you report it upstream? What toolchain versions did you use? Are these results applicable to the latest toolchain versions? Regarding bugs: I agree that LTO sometimes causes bugs, e.g. check this repo. In my experience, LTO uncovers existing UB, which then lets you shoot yourself in the foot with new power. But that's not LTO's fault - that's the fault of a program with UB. Anyway, some quirks in LTO implementations can still be found in the tooling, so using a modern toolchain would be good advice here. About the state of LTO across OS distros - you can check here. My point is LTO is worth trying with
I would tend to say that this advice is best for distro package maintainers. While we build packages, the vast majority of folks use the distro-provided ones. I admit I am conservative on build tools. For example, we tried jemalloc in the past. Good performance, but we found a couple of definite jemalloc bugs, probably fixed now, and we decided to keep the old allocator in favor of robustness. Everyone is free to rebuild. Build toolchains are far from our core competency, the team is small, and as you can see there is a lot of work. While I appreciate the idea, I have to say that there are far more important things in front of it in the pipeline (think: a better algorithm always beats the optimizer).
As a side note, I would not object if @zamazan4ik rebuilds rsyslog with the tools. I am also open to a PR, if required, that makes it easier to integrate the tools. Just make that functionality optional by a ./configure switch, so that it can be turned on and off. :-)
I did some PGO benchmarks with
Test environment
Benchmark
In general, the benchmark methodology is the same as described in the corresponding Vector issue (link) and Fluent Bit issue (link). That's a near real-life log scenario from our production (just Elasticsearch in the end instead of
Configuration files
Sorry for the probably awful config file - I am not an expert in Rsyslog configs at all :) Rsyslog was started with a
Tested configurations
I have tested the following Rsyslog configurations (with corresponding
Results
Here are the results of running the benchmark on different configurations. All configurations are benchmarked on the same machine, with the same rsyslog configuration, multiple times, etc. The results show how much time rsyslog needs to process logs (the source log file is the same for all runs). I have rechecked - the results are consistent between runs.
At least in this scenario, rsyslog performs better (=faster) with PGO. Not bad for "just" a compiler option :) You can find more PGO results on real-life applications here. There you can also check positive LTO and PGO results for Fluent Bit and Vector. I haven't checked LTO for rsyslog yet.
@rgerhards You are probably interested in the results above (I don't know whether you receive notifications about updates in discussion threads, so I am politely pinging you explicitly here).
@zamazan4ik I don't have much to add to what I already said: feel free to craft a PR with changes, just make sure that the optimizations can be controlled via a configure switch. Keep them turned off by default. I will not do anything myself, as I think I can spend the time much better from an overall community PoV.
@rgerhards what do you think about writing a guide, "How to build Rsyslog with PGO", somewhere in the Rsyslog documentation? In this guide, we can show users how PGO could be useful for their scenarios (at least according to my tests above, PGO helps achieve a measurable performance boost for Rsyslog) and give step-by-step instructions on how to build Rsyslog with PGO. Here are examples of such guides in other projects:
If you are not against such a guide, I am ready to contribute it to the project. We just need to agree on the best place to put this guide in the Rsyslog documentation. Also, it would be helpful if you could tell me how to properly contribute documentation changes to the project.
We are never against additional documentation, the only concern is if we start
getting asked to support it. But as long as you remain active and respond to
questions, go for it.
In terms of changing documentation, there is a git repo of the docs, just submit
a PR with your changes.
David Lang
On Sat, 26 Aug 2023, Alexander Zaitsev wrote:
… @rgerhards what do you think about writing a guide, "How to build Rsyslog with PGO", somewhere in the Rsyslog documentation? In this guide, we can show users how PGO could be useful for their scenarios (at least according to my tests above, PGO helps achieve a measurable performance boost for Rsyslog) and give step-by-step instructions on how to build Rsyslog with PGO.
Here are examples of such guides in other projects:
* Vector: https://vector.dev/docs/administration/tuning/pgo/ (as an example of the logging-specific solution)
* GCC: Official [docs](https://gcc.gnu.org/install/build.html), section "Building with profile feedback" (even AutoFDO build is supported)
* Clang: https://llvm.org/docs/HowToBuildWithPGO.html
* ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
* Databend: https://databend.rs/doc/contributing/pgo
* Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
If you are not against such a guide, I am ready to contribute it to the project. We just need to agree on the best place to put this guide in the Rsyslog documentation. Also, it would be helpful if you could tell me how to properly contribute documentation changes to the project.
Hi!
rsyslog right now does not support building with more advanced optimization techniques like PGO and BOLT. This tooling is seeing increasing adoption in the community as a way to additionally optimize programs. With it, there is a good chance to gain even more performance "for free".
Here I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of them) and testing whether it improves the project's performance. If yes, it would be awesome to have prebuilt binaries with these more advanced optimizations out of the box. It would also help users to be able to tune their own binaries to their own workloads with functionality integrated into the build scripts.
Also, there are some caveats to consider like:
Links: