Consider using LTO + PGO + Bolt #140

zamazan4ik · 2022-12-22T23:58:08Z

Hi!

YDB currently does not support building with more advanced optimization techniques such as PGO and BOLT. These tools are seeing increasing adoption in the community as a way to optimize programs further. With them, there is a good chance to gain even more performance "for free".

I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of them) and testing whether it improves the project's performance. If it does, it would be awesome to have prebuilt binaries with these optimizations applied out of the box. It would also help users to be able to tune their own binaries to their own workloads, with that functionality integrated into the build scripts.

There are also some caveats to consider:

Increased build times
BOLT may still be unstable (or even broken) on some architectures

Links:

ScyllaDB results: build: introduce LTO, PGO and BOLT to the build scylladb/scylladb#10808
Vector results: PGO applicability to Vector vectordotdev/vector#15631
Rust experience with LTO + PGO + BOLT: https://kobzol.github.io/rust/rustc/2022/10/27/speeding-rustc-without-changing-its-code.html
PGO is also a good chance to reduce the project's build times: Compile times can be significantly reduced by optimizing the compiler scylladb/scylladb#10985

zamazan4ik · 2023-03-25T23:11:01Z

I did some performance experiments on my local machine.

My setup:

OS: Fedora 37
Linux kernel: 6.2.7
Compiler: clang-15 from Fedora packages (clang 15.0.7); I've patched a few sources to support this compiler
Hardware: Ryzen 9 5900X, 32 GiB RAM, SSD

For benchmarking and profile generation, I used the KqpLoad actor (https://ydb.tech/en/docs/development/load-actors-kqp), which I ran multiple times for 300 seconds each (all other parameters left at their defaults). The YDB setup was local with RAM storage, as described here: https://ydb.tech/en/docs/getting_started/self_hosted/ydb_local but with my own ydbd binaries.

I did the following things:

Build the usual release build and benchmark it
Build the instrumented build, run the same benchmark on it, and then recompile with the generated profiles using Clang PGO
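
For readers unfamiliar with the instrumentation-based Clang PGO cycle mentioned in the second step, here is a minimal sketch of the three stages as command-line fragments. It is not the exact set of commands used in this experiment: `-fprofile-generate`, `-fprofile-use` and `llvm-profdata merge` are standard Clang/LLVM tooling, but the helper names and all paths are hypothetical placeholders, and YDB's real build goes through its own build scripts rather than a single compiler call.

```python
from __future__ import annotations


def instrumented_flags(profile_dir: str) -> list[str]:
    """Extra compile and link flags for the instrumented build."""
    # The instrumented binary writes *.profraw files into profile_dir on exit.
    return [f"-fprofile-generate={profile_dir}"]


def merge_command(raw_profiles: list[str], merged_profile: str) -> list[str]:
    """llvm-profdata invocation that merges raw profiles into one .profdata file."""
    return ["llvm-profdata", "merge", f"-output={merged_profile}", *raw_profiles]


def optimized_flags(merged_profile: str) -> list[str]:
    """Extra compile and link flags for the final PGO-optimized build."""
    return [f"-fprofile-use={merged_profile}"]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print("1. build with: ", " ".join(instrumented_flags("/tmp/ydbd-pgo")))
    print("2. run the benchmark against the instrumented binary, then:")
    print("  ", " ".join(merge_command(["/tmp/ydbd-pgo/run1.profraw"], "ydbd.profdata")))
    print("3. rebuild with:", " ".join(optimized_flags("ydbd.profdata")))
```

The same release flags are kept in steps 1 and 3; only the profile-related flag changes between the two builds.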

The results are the following:

Usual release build: 28k TPS
PGO-optimized build with the same release flags: 35k TPS

I also tried to apply BOLT, but perf2bolt consumed more than 32 GiB of RAM on the ydbd binary, so it was OOM-killed :(

Additional notes regarding PGO via instrumentation: while generating profiles with the instrumented ydbd binary via the KqpLoad actor, I hit a strange error, possibly caused by hardcoded deadlines (see here: https://github.com/ydb-platform/ydb/blob/main/ydb/core/load_test/kqp.cpp#L332). Since instrumented binaries are much slower, some deadlines need to be adjusted. For my local benchmarking I simply commented out these deadlines, and the profile was generated successfully. It would probably be better to be able to configure the timeout externally, without modifying the code.
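
To illustrate the kind of external configuration meant here, below is a language-agnostic sketch written in Python. It does not reflect how `kqp.cpp` is implemented: the environment variable `LOAD_DEADLINE_SCALE` and the function `effective_deadline` are hypothetical names invented for this example, and a real change would more likely be a field in the load actor's configuration.

```python
from __future__ import annotations

import os
from typing import Mapping


def effective_deadline(base_seconds: float, env: Mapping[str, str] | None = None) -> float:
    """Scale a hardcoded deadline by an externally supplied multiplier.

    LOAD_DEADLINE_SCALE is a hypothetical variable name; it defaults to 1,
    so regular builds keep their current deadlines.
    """
    env = os.environ if env is None else env
    scale = float(env.get("LOAD_DEADLINE_SCALE", "1"))
    if scale <= 0:
        raise ValueError("LOAD_DEADLINE_SCALE must be positive")
    return base_seconds * scale
```

With something like this, a run against a slow instrumented binary could raise the multiplier instead of patching the source.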

zamazan4ik · 2023-03-27T00:15:36Z

Well, I managed to run BOLT with some "magic" options (details are here: llvm/llvm-project#61711).
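
For context, the sampling-based BOLT pipeline described in the BOLT README has three stages: record a profile with `perf`, convert it with `perf2bolt`, and rewrite the binary with `llvm-bolt`. The sketch below only assembles those command lines. It does not include the extra options from llvm/llvm-project#61711 that were needed for ydbd, and the helper names and file names are hypothetical.

```python
from __future__ import annotations


def perf_record_command(binary: str, perf_data: str, args: list[str]) -> list[str]:
    """Sample user-space cycles together with branch records."""
    # "-j any,u" needs hardware branch sampling support (for example Intel LBR).
    return ["perf", "record", "-e", "cycles:u", "-j", "any,u",
            "-o", perf_data, "--", binary, *args]


def perf2bolt_command(binary: str, perf_data: str, fdata: str) -> list[str]:
    """Convert perf.data into the profile format that llvm-bolt reads."""
    return ["perf2bolt", "-p", perf_data, "-o", fdata, binary]


def llvm_bolt_command(binary: str, fdata: str, output: str) -> list[str]:
    """Rewrite the binary using the collected profile."""
    # For the best results the input binary should be linked with relocations
    # (the --emit-relocs linker flag), as the BOLT README recommends.
    return ["llvm-bolt", binary, "-o", output, f"-data={fdata}",
            "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
            "-split-functions", "-split-all-cold", "-split-eh", "-dyno-stats"]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in (perf_record_command("./ydbd", "perf.data", []),
                perf2bolt_command("./ydbd", "perf.data", "perf.fdata"),
                llvm_bolt_command("./ydbd", "perf.fdata", "./ydbd.bolt")):
        print(" ".join(cmd))
```

The `perf2bolt` stage is the one that ran out of memory in the earlier attempt.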

As expected, BOLT didn't provide a significant performance boost on top of PGO, but I still see measurable improvements:

PGO: 35k TPS
PGO + Bolt: 37k TPS
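
To put these numbers in relative terms, here is a small calculation using the rounded TPS figures reported in this thread (28k for the usual release build, 35k for PGO, 37k for PGO + BOLT). Since the inputs are rounded, the percentages are approximate.

```python
def relative_gain(baseline_tps: float, new_tps: float) -> float:
    """Return the throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


RELEASE_TPS = 28_000   # usual release build
PGO_TPS = 35_000       # PGO-optimized build
PGO_BOLT_TPS = 37_000  # PGO + BOLT build

print(f"PGO vs release:        {relative_gain(RELEASE_TPS, PGO_TPS):.1f}%")       # 25.0%
print(f"BOLT on top of PGO:    {relative_gain(PGO_TPS, PGO_BOLT_TPS):.1f}%")      # 5.7%
print(f"PGO + BOLT vs release: {relative_gain(RELEASE_TPS, PGO_BOLT_TPS):.1f}%")  # 32.1%
```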

I think Propeller (an alternative approach, similar to BOLT but from Google) could bring almost the same numbers. I tried to test YDB with Propeller... But Propeller requires the latest Clang compiler from the main branch, and YDB has a bunch of compilation errors with it, and right now I lack the motivation to fix them... Maybe one day I will test it too :)

eivanov89 · 2023-04-01T11:06:24Z

Hi Alexander Zaitsev, thank you very much for sharing this excellent idea and running the initial experiments. One of our engineers has confirmed your results and is working further on the integration details. We will be back soon, once we have collected more data and understand the best way to use it.

zamazan4ik · 2023-08-27T03:08:26Z

@eivanov89 do you have any updates regarding PGO? If you confirm the results and find them useful, I suggest adding a note about tuning YDB with PGO to the YDB documentation. Here are examples from other projects of what such documentation can look like:

GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
Clang:
- https://llvm.org/docs/HowToBuildWithPGO.html
- https://llvm.org/docs/AdvancedBuilds.html
ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Vector: https://vector.dev/docs/administration/tuning/pgo/
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/

Having this kind of information in the official documentation makes optimization opportunities more visible to the end users and maintainers.

eivanov89 · 2023-08-31T14:43:08Z

Hi @zamazan4ik, sorry for the delay. We have some issues with our internal tools and build. We hope to solve them soon, though. If that fails, we will consider applying this to the GitHub build only.

zamazan4ik · 2023-08-31T14:45:14Z

> If that fails, we will consider applying this to the GitHub build only.

Understood. If you confirm the results above, I suggest adding a note about PGO to the YDB documentation. So users who build YDB binaries on their own will be able to estimate the performance benefits of PGO on YDB and optimize their builds too.

eivanov89 · 2023-09-11T09:36:57Z

> So users who build YDB binaries on their own will be able to estimate the performance benefits of PGO on YDB and optimize their builds too.

The tests we have both used to evaluate PGO are too narrow, imho. We're going to try YCSB and TPC-C to check whether real benchmarks benefit in the same way as the microbenchmarks we have used so far.

alexv-smirnov added the area/dev label Feb 2, 2023

zamazan4ik mentioned this issue Mar 26, 2023

perf2bolt: huge RAM consumption llvm/llvm-project#61711

Open

zamazan4ik mentioned this issue Apr 1, 2023

Add support for more PGO modes Kobzol/cargo-pgo#33

Open

alexey-milovidov mentioned this issue Apr 1, 2023

PGO build ClickHouse/ClickHouse#44567

Closed

zamazan4ik mentioned this issue May 28, 2023

Consider using LTO + PGO + Bolt ytsaurus/ytsaurus#40

Open

zamazan4ik mentioned this issue Aug 24, 2023

docs: add PGO information vectordotdev/vector#18369

Merged

zamazan4ik mentioned this issue Aug 28, 2023

Should support profile-guided optimization apple/foundationdb#1334

Open

zamazan4ik mentioned this issue Dec 4, 2023

Evaluate using Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) for C++ part and LLVM clasp-developers/clasp#1526

Open

zamazan4ik mentioned this issue Jan 11, 2024

Evaluate using Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) shady-gang/shady#18

Closed
