This issue was moved to a discussion.

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #236

Closed
zamazan4ik opened this issue Oct 7, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@zamazan4ik

Hi!

I recently ran Profile-Guided Optimization (PGO) benchmarks on many projects; the results are available here. That is why I think it's worth trying PGO on jql. I have already run some benchmarks and want to share the results here.

Test environment

  • Fedora 38
  • Linux kernel 6.5.5
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler: rustc 1.73
  • jql version: the latest main branch at the time of testing, commit 7729cbafaea0367c2f86227234fb3f8e9a8fd905
  • Turbo Boost disabled

Benchmark setup

For benchmarking, I use the scenario from https://github.com/yamafaktory/jql/blob/main/performance.sh, slightly tweaked: the vanilla jq invocations are removed and multiple jql builds are added. The edited version is available here. The release build is done with cargo build --release; the PGO-optimized build is done with cargo-pgo.

I tested 3 configurations:

  • Default Release, binary called jql_z
  • Tweaked Release (uses opt-level = 3), binary called jql_opt_3
  • PGO-optimized, binary called jql_optimized

PGO profiles were collected from the same workload in performance.sh and merged via llvm-profdata during the PGO optimization phase.
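
The three builds described above can be sketched roughly as follows. This is a sketch under assumptions, not the exact commands used for these benchmarks: it assumes cargo-pgo and the llvm-tools-preview rustup component are installed, that the host triple is x86_64-unknown-linux-gnu, and that the produced binary is named jql. The function and variable names are hypothetical, and setting CARGO_PROFILE_RELEASE_OPT_LEVEL is just one possible way to get the opt-level = 3 build without editing Cargo.toml.

```shell
# Assumptions: host triple and binary name; adjust both for your setup.
HOST_TRIPLE="${HOST_TRIPLE:-x86_64-unknown-linux-gnu}"
JQL_INSTRUMENTED="target/$HOST_TRIPLE/release/jql"

# Default release build, copied to ./jql_z.
build_release() {
  cargo build --release && cp target/release/jql ./jql_z
}

# Release build with opt-level = 3, overridden through Cargo's
# environment-based profile configuration; copied to ./jql_opt_3.
build_release_opt3() {
  CARGO_PROFILE_RELEASE_OPT_LEVEL=3 cargo build --release \
    && cp target/release/jql ./jql_opt_3
}

# PGO step 1: build an instrumented binary.
build_instrumented() {
  cargo pgo build
}

# PGO step 2: run the instrumented binary on a representative workload
# (here: the small inputs from performance.sh) to write profile data.
collect_profiles() {
  echo '{ "foo": "bar" }' | "$JQL_INSTRUMENTED" '"foo"' > /dev/null
  echo '[1, 2, 3]' | "$JQL_INSTRUMENTED" '[0]' > /dev/null
  echo '[1, [2], [[3]]]' | "$JQL_INSTRUMENTED" '..' > /dev/null
}

# PGO step 3: merge the collected profiles and rebuild with them;
# the optimized binary is copied to ./jql_optimized.
build_optimized() {
  cargo pgo optimize && cp "$JQL_INSTRUMENTED" ./jql_optimized
}

# Nothing is built unless the script is called with --run.
if [ "${1:-}" = "--run" ]; then
  build_release && build_release_opt3 \
    && build_instrumented && collect_profiles && build_optimized
fi
```

Calling the script with --run from the repository root would perform all builds in order. The profile quality depends on how representative the workload in collect_profiles is, so the large github-repositories.json query would normally be added there as well.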

All benchmarks were run multiple times on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course).

Results

I got the following results from running performance.sh:

./performance.sh
Benchmark 1: echo '{ "foo": "bar" }' | ./jql_z '"foo"'
  Time (mean ± σ):       4.5 ms ±   0.2 ms    [User: 1.5 ms, System: 7.6 ms]
  Range (min … max):     4.0 ms …   6.1 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: The first benchmarking run for this command was significantly slower than the rest (6.1 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 2: echo '{ "foo": "bar" }' | ./jql_opt_3 '"foo"'
  Time (mean ± σ):       4.5 ms ±   0.4 ms    [User: 1.4 ms, System: 7.8 ms]
  Range (min … max):     4.1 ms …  14.4 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: The first benchmarking run for this command was significantly slower than the rest (14.4 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 3: echo '{ "foo": "bar" }' | ./jql_optimized '"foo"'
  Time (mean ± σ):       4.4 ms ±   0.2 ms    [User: 1.2 ms, System: 7.7 ms]
  Range (min … max):     4.1 ms …   6.4 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: The first benchmarking run for this command was significantly slower than the rest (6.4 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Summary
  echo '{ "foo": "bar" }' | ./jql_optimized '"foo"' ran
    1.00 ± 0.05 times faster than echo '{ "foo": "bar" }' | ./jql_z '"foo"'
    1.01 ± 0.09 times faster than echo '{ "foo": "bar" }' | ./jql_opt_3 '"foo"'
Benchmark 1: echo '[1, 2, 3]' | ./jql_z '[0]'
  Time (mean ± σ):       4.6 ms ±   0.2 ms    [User: 1.4 ms, System: 7.9 ms]
  Range (min … max):     4.2 ms …   5.9 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: echo '[1, 2, 3]' | ./jql_opt_3 '[0]'
  Time (mean ± σ):       4.6 ms ±   0.1 ms    [User: 1.3 ms, System: 8.1 ms]
  Range (min … max):     4.2 ms …   5.2 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 3: echo '[1, 2, 3]' | ./jql_optimized '[0]'
  Time (mean ± σ):       4.6 ms ±   0.1 ms    [User: 1.2 ms, System: 7.9 ms]
  Range (min … max):     4.2 ms …   5.6 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Summary
  echo '[1, 2, 3]' | ./jql_z '[0]' ran
    1.00 ± 0.05 times faster than echo '[1, 2, 3]' | ./jql_optimized '[0]'
    1.01 ± 0.05 times faster than echo '[1, 2, 3]' | ./jql_opt_3 '[0]'
Benchmark 1: echo '[1, [2], [[3]]]' | ./jql_z '..'
  Time (mean ± σ):       4.7 ms ±   0.2 ms    [User: 1.8 ms, System: 8.0 ms]
  Range (min … max):     4.2 ms …   6.2 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: echo '[1, [2], [[3]]]' | ./jql_opt_3 '..'
  Time (mean ± σ):       4.7 ms ±   0.2 ms    [User: 1.6 ms, System: 8.1 ms]
  Range (min … max):     4.3 ms …   5.7 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 3: echo '[1, [2], [[3]]]' | ./jql_optimized '..'
  Time (mean ± σ):       4.7 ms ±   0.1 ms    [User: 1.5 ms, System: 8.1 ms]
  Range (min … max):     4.3 ms …   5.4 ms    1000 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Summary
  echo '[1, [2], [[3]]]' | ./jql_optimized '..' ran
    1.00 ± 0.04 times faster than echo '[1, [2], [[3]]]' | ./jql_opt_3 '..'
    1.00 ± 0.04 times faster than echo '[1, [2], [[3]]]' | ./jql_z '..'
Benchmark 1: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_z '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
  Time (mean ± σ):      20.6 ms ±   1.3 ms    [User: 14.7 ms, System: 26.0 ms]
  Range (min … max):    18.6 ms …  31.5 ms    1000 runs

Benchmark 2: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_opt_3 '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
  Time (mean ± σ):      19.9 ms ±   1.2 ms    [User: 11.8 ms, System: 26.0 ms]
  Range (min … max):    17.8 ms …  26.3 ms    1000 runs

Benchmark 3: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_optimized '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
  Time (mean ± σ):      19.3 ms ±   1.3 ms    [User: 10.6 ms, System: 26.6 ms]
  Range (min … max):    17.1 ms …  26.2 ms    1000 runs

Summary
  cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_optimized '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null ran
    1.04 ± 0.09 times faster than cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_opt_3 '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
    1.07 ± 0.10 times faster than cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_z '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
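
The hyperfine warnings in the output above point at two possible refinements for the sub-5 ms cases: running without an intermediate shell (-N) and adding warmup runs (--warmup). A minimal sketch of such a run is shown here. It assumes a hyperfine version that supports the --input option, since a pipe cannot be used once the shell is disabled; the function name and the warmup count of 50 are arbitrary choices for illustration, not values used in this issue.

```shell
# Re-run the '"foo"' benchmark without a shell and with warmup runs.
# The JSON input is fed through --input instead of `echo ... |`.
bench_small_input() {
  bench_input_file="$(mktemp)"
  printf '%s\n' '{ "foo": "bar" }' > "$bench_input_file"

  hyperfine -N --warmup 50 --runs 1000 --input "$bench_input_file" \
    "./jql_z '\"foo\"'" \
    "./jql_opt_3 '\"foo\"'" \
    "./jql_optimized '\"foo\"'"
  bench_status=$?

  rm -f "$bench_input_file"
  return "$bench_status"
}
```

Running bench_small_input in the directory that contains the three binaries would then measure process startup plus query time only, without the shell and echo overhead that dominates the 4-5 ms figures above.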

The same results in performance.md format:

File: PERFORMANCE.md

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '[1, [2], [[3]]]' \| ./jql_z '..'` | 4.7 ± 0.2 | 4.2 | 6.2 | 1.00 ± 0.04 |
| `echo '[1, [2], [[3]]]' \| ./jql_opt_3 '..'` | 4.7 ± 0.2 | 4.3 | 5.7 | 1.00 ± 0.04 |
| `echo '[1, [2], [[3]]]' \| ./jql_optimized '..'` | 4.7 ± 0.1 | 4.3 | 5.4 | 1.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '[1, 2, 3]' \| ./jql_z '[0]'` | 4.6 ± 0.2 | 4.2 | 5.9 | 1.00 |
| `echo '[1, 2, 3]' \| ./jql_opt_3 '[0]'` | 4.6 ± 0.1 | 4.2 | 5.2 | 1.01 ± 0.05 |
| `echo '[1, 2, 3]' \| ./jql_optimized '[0]'` | 4.6 ± 0.1 | 4.2 | 5.6 | 1.00 ± 0.05 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '{ "foo": "bar" }' \| ./jql_z '"foo"'` | 4.5 ± 0.2 | 4.0 | 6.1 | 1.00 ± 0.05 |
| `echo '{ "foo": "bar" }' \| ./jql_opt_3 '"foo"'` | 4.5 ± 0.4 | 4.1 | 14.4 | 1.01 ± 0.09 |
| `echo '{ "foo": "bar" }' \| ./jql_optimized '"foo"'` | 4.4 ± 0.2 | 4.1 | 6.4 | 1.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_z '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 20.6 ± 1.3 | 18.6 | 31.5 | 1.07 ± 0.10 |
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_opt_3 '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 19.9 ± 1.2 | 17.8 | 26.3 | 1.04 ± 0.09 |
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_optimized '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 19.3 ± 1.3 | 17.1 | 26.2 | 1.00 |

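According to these tests, the three tiny-input benchmarks are within noise of each other, since process startup dominates at around 4.5 ms. On the larger github-repositories.json workload, PGO reduces user time from 14.7 ms (default release) and 11.8 ms (opt-level = 3) to 10.6 ms, and mean wall time from 20.6 ms to 19.3 ms.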
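
To put the user-time change in relative terms, the reduction can be computed directly from the figures hyperfine reported for the github-repositories.json benchmark. The helper below is a small illustrative sketch; pct_reduction is a hypothetical name, and the inputs are the user and mean times copied from the output above.

```shell
# Percentage reduction when going from a baseline time to an optimized time.
pct_reduction() {
  awk -v base="$1" -v opt="$2" \
    'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

# User time (ms): default release -> PGO
pct_reduction 14.7 10.6   # prints 27.9
# User time (ms): opt-level = 3 -> PGO
pct_reduction 11.8 10.6   # prints 10.2
# Mean wall time (ms): default release -> PGO
pct_reduction 20.6 19.3   # prints 6.3
```

The wall-time gain is much smaller than the user-time gain because system time (about 26 ms in all three builds) is unaffected by PGO.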

Further steps

I can suggest the following things to do:

  • Evaluate PGO's applicability to jql in more scenarios.
  • If PGO helps to achieve better performance - add a note to jql's documentation about that (probably somewhere in the README file). In this case, users and maintainers will be aware of another optimization opportunity for jql.
  • Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
  • Optimize prebuilt binaries with PGO.

Here are some examples of how PGO is already integrated into other projects' build scripts:

Once PGO is in place, I suggest evaluating LLVM BOLT as an additional optimization step.

@yamafaktory yamafaktory added the enhancement New feature or request label Oct 8, 2023
@yamafaktory
Owner

Hi @zamazan4ik,

Thanks for the effort and the adjusted tests. I need to evaluate whether this would be beneficial for the project.

@workingjubilee

BOLT has caused many problems when combined with other opts such as PGO in the upstream Rust toolchain. I recommend applying caution about layering PGO, BOLT, strip... we get away with it partly by keeping close in touch with a gaggle of assorted LLVM and compiler experts, not something I'd wish on a smaller project.

PGO should be fine to apply on its own.

@yamafaktory
Owner

Thanks a lot for the wonderful feedback/guidance @workingjubilee ❤️ !

@yamafaktory
Owner

Update: I've just refactored / updated the CI workflows (#240) and will try to find some time to tackle this issue.

@yamafaktory
Owner

After giving it some more thought, the main issue at the moment is on the cross-compilation side, since the project uses cross-rs to generate the binaries. I talked to one of the main contributors of that project on Matrix, and right now this might be hard:

basically, you can call profile-optimization yourself, however using the cargo-pgo binary would be hard today due to the lack of cross-rs/cross#716

however, there is a new feature on the main branch you could exploit, which would be using cross-util run --target <target> -- <command>, where <command> could here be cargo pgo

this will work if cargo-pgo was built for x86_64-unknown-linux-gnu and installed in $CARGO_HOME/bin

TL;DR: I'll keep this thread as a discussion but won't do anything yet.

Repository owner locked and limited conversation to collaborators Nov 25, 2023
@yamafaktory yamafaktory converted this issue into discussion #244 Nov 25, 2023
