LTO + PGO + Bolt #5048

zamazan4ik · 2022-12-23T10:02:53Z

Hi!

rsyslog currently does not support building with more advanced optimization techniques such as PGO and BOLT. These tools are seeing increasing adoption in the community as a way to optimize programs further. With them, there is a good chance of gaining extra performance "for free".

Here I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of them) and testing whether it improves the project's performance. If it does, it would be awesome to have prebuilt binaries with these optimizations out of the box. It would also be helpful for users to be able to tune their own binaries for their own workloads using functionality integrated into the build scripts.

Also, there are some caveats to consider like:

Increased build times
BOLT may still be unstable (or even broken) on some architectures

Links:

Results for Vector (software similar to rsyslog): "PGO applicability to Vector", vectordotdev/vector#15631
Rust experience with LTO + PGO + BOLT: https://kobzol.github.io/rust/rustc/2022/10/27/speeding-rustc-without-changing-its-code.html
A good chance to reduce the project's build times with PGO too: "Compile times can be significantly reduced by optimizing the compiler", scylladb/scylladb#10985

rgerhards · 2022-12-23T14:34:21Z

Pls forgive my ignorance, but aren't these tools optimizing based on profiling?

With rsyslog, we have severely different workloads, and thus different frequently executed codepaths. There is no "prototype scenario" which we could profile for. So is this really a tool for universal optimization - or more something an end user may run to fine-tune specific needs when building rsyslog?

zamazan4ik · 2022-12-23T16:51:28Z

> Pls forgive my ignorance, but aren't these tools optimizing based on profiling?

Yes, PGO and BOLT are based on profiling. However, LTO does not require a profiling step - that is just a link-time optimization.

> With rsyslog, we have severely different workloads, and thus different frequently executed codepaths. There is no "prototype scenario" which we could profile for.

That is a good question. Even though rsyslog has different components, there are usually not that many "happy paths" that could be considered default paths. Consider the Rust compiler (rustc) as an example: even though it compiles very different real-life programs, it benefits a lot from the profile (or set of profiles) prepared by the Rust team, and PGO significantly reduces its compile times.

Another idea is suggested here. Since rsyslog has many components, it could be a good idea to prepare a "happy path" workload for each component. PGO tooling can then merge multiple profiles into one and optimize the program for all of those workloads. Even with multiple profiles it can still be beneficial, since you optimize for real-life happy paths. Without profiling information, today's compilers cannot reliably tell "hot" branches from "cold" ones.
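
As a rough illustration of merging per-component profiles, here is a minimal Python sketch that only assembles an llvm-profdata merge command line. It assumes the Clang/LLVM toolchain; the helper name, the .profraw file names and the weights are hypothetical and chosen purely for illustration. The -weighted-input option lets one workload count more than another in the merged profile.

```python
def weighted_merge_command(workloads: dict, output: str = "merged.profdata") -> list:
    """Build an `llvm-profdata merge` command from {raw_profile: weight}.

    Hypothetical helper: it only assembles the argument list, it does not
    run anything.
    """
    cmd = ["llvm-profdata", "merge", f"-output={output}"]
    for raw_profile, weight in workloads.items():
        if weight < 1:
            raise ValueError(f"weight must be >= 1, got {weight}")
        # -weighted-input=<weight>,<file> scales this profile's counters.
        cmd.append(f"-weighted-input={weight},{raw_profile}")
    return cmd


# Hypothetical workloads: one raw profile per "happy path" scenario.
workloads = {
    "imfile_to_omfile.profraw": 3,  # assumed to be the dominant scenario
    "imudp_to_omfwd.profraw": 1,
}
cmd = weighted_merge_command(workloads)
print(" ".join(cmd))
```

The resulting merged.profdata would then be passed to the compiler in the final, profile-using build.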

> So is this really a tool for universal optimization - or more something an end user may run to fine-tune specific needs when building rsyslog?

There are multiple ways to use profiling information with PGO and BOLT.

The first one (easier to implement, in my opinion) is to integrate PGO and BOLT into rsyslog's existing build scripts and provide some guidelines in the repo, something like "How to build your own rsyslog with PGO and BOLT". In this case, it will be much easier for users and distro maintainers (if they wish) to prepare their own baseline profile scenarios and build their own version of rsyslog, optimized for their specific workloads. The downside is that users will need to prepare profile scenarios, run multi-stage builds, etc. themselves.

The second is to prepare a profile scenario (or a set of them) within the project. In this case, the user can run a multi-stage build with the scenarios provided by rsyslog. Yes, it could be a little less beneficial from a performance perspective than a workload-specific profile, but it could still give a performance improvement. If users need an even bigger performance boost, they can always fall back to the first way and prepare their own profile scenario. This way is more difficult, since you need to prepare profile scenarios in addition to the build script integration.

noloader · 2023-06-28T14:21:46Z

@zamazan4ik,

One comment about this (as someone outside the fish bowl):

> I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of them) and testing whether it improves the project's performance.

I tested LTO on Crypto++ and OpenSSL. LTO did not provide performance gains. For Crypto++ benchmarks, performance dropped, and the linker actually generated bad code. The bad code resulted in crashes and failed self tests. See https://www.cryptopp.com/wiki/Link_Time_Optimization .

Also, I've never seen a paper or presentation showing the benefits of using LTO. That is, I would expect to see someone present a selection of projects, like including rsyslog, and show the actual before and after numbers to substantiate the performance claims.

My takeaway was, LTO was a solution looking for a problem. We advise against using LTO for programs built using the Crypto++ library.

Anyway, I don't have a dog in this fight. I just wanted to share my experience and thoughts with the team before they spun up a task on this.

zamazan4ik · 2023-06-28T15:22:43Z

> Also, I've never seen a paper or presentation showing the benefits of using LTO. That is, I would expect to see someone present a selection of projects, like including rsyslog, and show the actual before and after numbers to substantiate the performance claims.

There are multiple reports showing that LTO can improve performance: the Clangd report from JetBrains, the Phoronix benchmark on GCC 10 (note that this page also shows a performance decrease for some projects), and the rustc compiler. I can find more, but there is no need: LTO can improve performance. However, it's not guaranteed, so as usual it depends on the project, the chosen tooling, etc. Regarding the performance decrease in your case - did you report it upstream? What toolchain versions did you use? Are these results applicable to the latest toolchain versions?

Regarding bugs: I agree that LTO sometimes causes bugs, e.g. check this repo. In my experience, LTO exposes latent undefined behavior (UB), which then bites much harder than before. But that's not LTO's fault - that's the fault of a program with UB. That said, some quirks in LTO implementations can still be found in the tooling, so using a modern toolchain is good advice here. As for the state of LTO across OS distros, you can check here.

My point is that LTO is worth trying with rsyslog, but I expect most of the performance gains to come from PGO (since, at least for Vector (a log shipper), PGO has shown pretty good results).

rgerhards · 2023-06-28T15:37:45Z

I would tend to say that this advice is best suited for distro package maintainers. While we build packages, the vast majority of folks use the distro-provided ones. I admit I am conservative on build tools. For example, we tried jemalloc in the past. Good performance, but we found a couple of definite jemalloc bugs, probably fixed now, but we decided to keep the old allocator in favor of robustness.

Everyone is free to rebuild. Build toolchain is far from our core competency, the team is small and as you can see there is a lot of work. While I appreciate the idea, I have to say that there are far more important things in front of it in the pipeline (think: better algo always beats optimizer).

rgerhards · 2023-06-28T15:39:35Z

As a side note, I would not object if @zamazan4ik rebuilds rsyslog with the tools. I am also open to a PR, if required, that makes it easier to integrate the tools. Just make that functionality optional via a ./configure switch, so that it can be turned on and off. :-)

zamazan4ik · 2023-07-12T03:51:50Z

I did some PGO benchmarks with rsyslog and want to share my results.

Test environment

Fedora 38
Linux kernel 6.3.11
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Clang 16 (from the Fedora repositories). I use Clang just because I prefer LLVM-based tooling
Rsyslog version: from master branch (commit 7f4999f1087c7ca86c3ce49e677c3c461db42f88)

Benchmark

In general, the benchmark methodology is the same as described in the corresponding Vector issue (link) and Fluent Bit issue (link). That's a near real-life log scenario from our production (just with /dev/null at the end instead of Elasticsearch). As the source of logs, I used one file with logs in the required format, ~7.9 GiB in size.

Configuration files

rsyslog.conf:

# rsyslog configuration file

# Global directives
global(workDirectory="/home/zamazan4ik/open_source/bench_rsyslog/workdir")

# Module loading
module(load="imfile")     # Input module for reading logs from files

# Input configuration
input(type="imfile"
    File="/home/zamazan4ik/open_source/bench_rsyslog/logs/test.log"  # Path to your log file
    Tag="tag_for_log"                  # Optional: Set a tag for your logs
)

# Parsing rules
if re_match($msg, "/<(<level>[EWD]) (<thread>.+?) (<tag>[a-z.]+) (<datetime>[\\d.]+ [\\d:]*) (<function>[\\S]+) (<mess>.*) \\(from (<file>[\\S.]*) \\+(<line>\\d+)\\)/") then {
    action(type="omfile"
           File="/dev/null")

    stop
}

stop

Sorry for the probably awful config file - I am not an expert in Rsyslog configs at all :)

Rsyslog was started each time with the command taskset -c 1-2 rsyslogd -f rsyslog.conf -iNONE. Between runs, the processes were killed and the file state was reset.

Tested configurations

I have tested the following Rsyslog configurations (with corresponding CFLAGS):

Release: CC=clang CFLAGS="-O3" LDFLAGS="-O3" ./autogen.sh --disable-generate-man-pages --prefix=/home/zamazan4ik/open_source/install_rsyslog_optimized --enable-imfile
Release + PGO: CC=clang CFLAGS="-O3 -fprofile-instr-use=rsyslog.profdata" LDFLAGS="-O3 -fprofile-instr-use=rsyslog.profdata" ./autogen.sh --disable-generate-man-pages --prefix=/home/zamazan4ik/open_source/install_rsyslog_optimized --enable-imfile
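
The Release + PGO line above shows only the final, profile-using build. With Clang's instrumentation-based PGO, the usual flow has three stages: build with -fprofile-instr-generate, run the workload to collect .profraw files, merge them with llvm-profdata, then rebuild with -fprofile-instr-use. The Python sketch below only assembles the flags and command lines for those stages; the helper names and the .profraw file name are assumptions for illustration, and the exact commands used for the benchmark may have differed.

```python
import shlex


def pgo_flags(stage: str, profdata: str = "rsyslog.profdata") -> str:
    """Return the CFLAGS/LDFLAGS value for one Clang PGO stage (hypothetical helper)."""
    base = "-O3"
    if stage == "instrument":
        # Stage 1: instrumented build; the binary writes raw profiles on exit.
        return f"{base} -fprofile-instr-generate"
    if stage == "use":
        # Stage 3: optimized rebuild that consumes the merged profile.
        return f"{base} -fprofile-instr-use={profdata}"
    raise ValueError(f"unknown stage: {stage}")


def merge_command(raw_profiles, profdata: str = "rsyslog.profdata") -> list:
    # Stage 2: merge the raw profiles collected while running the workload.
    return ["llvm-profdata", "merge", f"-output={profdata}", *raw_profiles]


def configure_command(stage: str) -> str:
    # Mirrors the shape of the configure lines above, minus the local paths.
    flags = shlex.quote(pgo_flags(stage))
    return f"CC=clang CFLAGS={flags} LDFLAGS={flags} ./autogen.sh --enable-imfile"


print(configure_command("instrument"))
print(" ".join(merge_command(["default.profraw"])))  # assumed raw profile name
print(configure_command("use"))
```

Between the first and second stages, the instrumented rsyslogd has to run the benchmark workload and exit cleanly so that the raw profile is actually written.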

Results

Here are the results of running the benchmark on different configurations. All configurations are benchmarked on the same machine, with the same rsyslog configuration, multiple times, etc. The results show how much time rsyslog needs to process logs (the source log file is the same for all runs). I have rechecked - the results are consistent between runs.

Release: ~1m45s
Release + PGO: ~1m34s

At least in this scenario, rsyslog performs better (=faster) with PGO. Not bad for "just" a compiler option :)
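
To put the two timings in relative terms, here is a small calculation using the approximate wall-clock times reported above. The numbers are rounded ("~"), so the percentages are indicative only.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


release = to_seconds(1, 45)      # Release: ~1m45s
release_pgo = to_seconds(1, 34)  # Release + PGO: ~1m34s

# Share of wall-clock time saved by the PGO build.
time_saved_pct = (release - release_pgo) / release * 100
# Equivalent throughput ratio (same input file, less time).
speedup = release / release_pgo

print(f"time reduction: {time_saved_pct:.1f}%")   # time reduction: 10.5%
print(f"throughput speedup: {speedup:.2f}x")      # throughput speedup: 1.12x
```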

You can find more PGO results on real-life applications here. There you can also check the positive LTO and PGO results for Fluent Bit and Vector. I haven't checked LTO for rsyslog yet.

zamazan4ik · 2023-07-19T02:13:52Z

@rgerhards You are probably interested in the results above (I don't know whether you receive notifications about updates in discussion threads, so I am politely pinging you explicitly here).

rgerhards · 2023-07-19T06:49:02Z

@zamazan4ik I have not much to add to what I already said: feel free to craft a PR with changes, just make sure that the optimizations can be controlled via a configure switch. Keep them turned off by default. I will not do anything myself, as I think I can spend the time much better from an overall community PoV.

zamazan4ik · 2023-08-26T14:09:58Z

@rgerhards what do you think about writing a guide on "How to build Rsyslog with PGO" somewhere in the Rsyslog documentation? In this guide, we can show users how PGO could be useful for their scenarios (at least according to my tests above, PGO gives a measurable performance boost for Rsyslog) and provide step-by-step instructions for building Rsyslog with PGO.

Here are examples of such guides in other projects:

Vector: https://vector.dev/docs/administration/tuning/pgo/ (as an example of the logging-specific solution)
GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
Clang: https://llvm.org/docs/HowToBuildWithPGO.html
ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/

If you are not against such a guide, I am ready to contribute it to the project. We just need to agree on the best place to put this guide in the Rsyslog documentation. Also, it would be helpful if you could tell me how to properly contribute documentation changes to the project.

davidelang · 2023-08-27T00:42:40Z

We are never against additional documentation, the only concern is if we start getting asked to support it. But as long as you remain active and respond to questions, go for it. In terms of changing documentation, there is a git repo of the docs, just submit a PR with your changes.

David Lang
