Profile Guided Optimization (PGO) and LLVM Superoptimizer #1220

Open
6D65 opened this Issue Jul 21, 2015 · 14 comments

6D65 commented Jul 21, 2015

Hi,

I'm wondering if it's possible to do PGO with rustc. I searched and haven't found anything concrete aside from a few messages on the mailing list.

I guess it should be possible (if added to the compiler), as it's supported by LLVM (it also looks like Google might be interested in improving this: http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-February/082744.html).

Also, I've heard about the work on a superoptimizer for LLVM, and I'm curious what results it would give when run against code generated by rustc.

Both of these seem like low-hanging fruit for performance, though enabling them might mean a lot of work.

Thanks

@6D65 6D65 changed the title from Profile Guided Optimization (PGO) and LLVM SuperOptimizer to Profile Guided Optimization (PGO) and LLVM Superoptimizer Jul 21, 2015

Member

killercup commented Aug 2, 2015

Some slides about PGO in LLVM (from 2013): http://llvm.org/devmtg/2013-11/slides/Carruth-PGO.pdf

Contributor

Hywan commented Apr 15, 2016

6D65 commented Apr 16, 2016

@Hywan thank you. This looks like a good first step. A 15% performance gain in the best case sounds great, especially since it's free.

I'm not sure what to do with this issue. It's possible to do this outside of rustc, using the LLVM toolchain, but it's still not supported by rustc or cargo.

I wonder if this would work as a cargo extension: run the benchmark tests and optimize the code around those code paths. Though I'm not sure how well the benchmarks reflect real-life hot paths.

Contributor

Hywan commented Apr 17, 2016

ping @Geal, he is the author of the blog post and he could provide relevant answers.

Geal commented Apr 17, 2016

There are three ways this can happen practically:

  • generate profiling data from the benchmarks, and optimize the build directly. But benchmarks are hardly representative of real world usage
  • the code can ship with some pre-generated profiling data (the developer tested it extensively for that release, and provides the profiling data), and the Cargo.toml can refer to it
  • we make it more of a command line option for cargo, and the end user chooses to use it or not, and has to profile data once

There's also the issue that I mentioned in my post: that test only applies on one crate with no dependencies except libstd. What happens when you have multiple libraries as dependencies? Do you generate the profiling data for all of them? It would be nice to optimize the end program and the dependencies based on usage data of the end program.

As a first step, having it in rustc as an llvm-args option is enough to test it and see whether it should be automated a bit with cargo. I read somewhere else that adding options to cargo is frowned upon, since it would complicate it too much.

keean commented May 5, 2016

I do a lot of PGO in C++, mainly using GCC. I tend to have a specific function that runs the code kernels that need heavy optimisation many times with sample data. I would suggest something like the #[test] attribute used in testing, so I would have some functions marked with #[profile]. Then when I compile, it should clear the profile data, build once with instrumentation, run the marked 'profile' functions to generate the profile data, and then build a second time without instrumentation using the generated profile data. I would suggest this is all handled by cargo, perhaps with an optimisation level, so that level 4 optimisation does this whole profile build process when you run "cargo build".
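The build sequence proposed above can be sketched as a small Rust program. This is only an illustration of the ordering of the steps: `PgoStep`, `plan`, and the kernel names are hypothetical and do not correspond to any existing cargo or rustc interface, and nothing is actually compiled or run.

```rust
/// One step of the two-pass build described in the comment above.
/// All names are hypothetical; this is not an existing cargo or rustc API.
#[derive(Debug, Clone, PartialEq, Eq)]
enum PgoStep {
    ClearProfileData,
    BuildInstrumented,
    RunProfileFunction(String),
    BuildWithProfile,
}

/// Lay out the steps, in order, for the given `#[profile]`-style functions.
fn plan(profile_fns: &[&str]) -> Vec<PgoStep> {
    let mut steps = vec![PgoStep::ClearProfileData, PgoStep::BuildInstrumented];
    steps.extend(
        profile_fns
            .iter()
            .map(|name| PgoStep::RunProfileFunction(name.to_string())),
    );
    steps.push(PgoStep::BuildWithProfile);
    steps
}

fn main() {
    // Function names are made up for the example.
    for (i, step) in plan(&["decode_kernel", "encode_kernel"]).iter().enumerate() {
        println!("{}. {:?}", i + 1, step);
    }
}
```

The point of the ordering is that stale profile data is cleared before the instrumented build, and the final build happens only after every marked function has run.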

valarauca commented Oct 15, 2016

Instead of using an extension to Cargo.toml, wouldn't it be a more elegant solution to use something like #[profile] in a form that would mirror #[test] and #[bench]? Moreover, wouldn't #[bench] be perfect to pull profdata from?

Permutatrix commented Nov 13, 2016

@valarauca I think it's safe to say that a run through the benchmarks doesn't generally represent a typical use of the code in the real world. For instance, if you have a small function that's called in a tight loop within a larger function, in many cases it's reasonable to benchmark the former and not the latter. But if you give the optimizer a profile based on those benchmarks, it won't see how often the larger function calls the smaller one, so it might not inline that call—making the loop slower than it was without the profile! Benchmarks also tend to follow the exact same code path over and over, which seems rather nonsensical for profiling purposes, since you don't learn anything new after the first iteration.

I agree with you on the first part, though. I like @keean's suggestion a lot. I've never actually used PGO before, but it seems like maybe a #[profile] function could be treated as a program entry point and each one called a single time in its own process.
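To make the inlining concern concrete, here is a toy sketch in Rust. It assumes a hand-written `Profile` that counts call edges; real PGO instrumentation is inserted by the compiler and records more than this, so the type and the function names (`small`, `large`, `bench_small`) are purely illustrative.

```rust
use std::collections::HashMap;

/// Toy stand-in for a profile: how often each (caller, callee) edge ran.
/// Real instrumentation is compiler-generated; this counter is an illustration.
#[derive(Default)]
struct Profile {
    edges: HashMap<(&'static str, &'static str), u64>,
}

impl Profile {
    fn record(&mut self, caller: &'static str, callee: &'static str) {
        *self.edges.entry((caller, callee)).or_insert(0) += 1;
    }

    fn count(&self, caller: &'static str, callee: &'static str) -> u64 {
        self.edges.get(&(caller, callee)).copied().unwrap_or(0)
    }
}

/// The small function that gets its own benchmark.
fn small(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

/// The larger function that calls `small` in a tight loop.
fn large(n: u64, profile: &mut Profile) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        profile.record("large", "small");
        acc = acc.wrapping_add(small(i));
    }
    acc
}

/// A benchmark that exercises only `small`, never `large`.
fn bench_small(iterations: u64, profile: &mut Profile) -> u64 {
    let mut acc = 0u64;
    for i in 0..iterations {
        profile.record("bench_small", "small");
        acc = acc.wrapping_add(small(i));
    }
    acc
}

fn main() {
    let mut from_bench = Profile::default();
    bench_small(1_000, &mut from_bench);

    let mut from_real_run = Profile::default();
    large(1_000, &mut from_real_run);

    println!("bench profile, large -> small: {}", from_bench.count("large", "small"));
    println!("real profile,  large -> small: {}", from_real_run.count("large", "small"));
}
```

In the benchmark-derived profile the `large` to `small` edge has a count of zero, which is the situation where an optimizer guided by that profile has no evidence that the call inside the loop is hot.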

scottlamb commented Nov 14, 2016

fwiw, I'm looking forward to this feature, and a #[profile] makes sense to me as well. I think the things I run through in a #[profile] are likely to be most/all of ones I mark as #[bench]. But I could imagine having a benchmark of some case that's interesting but not representative of expected load and wanting the ability to exclude it.

fwiw, my setup at work is the second option @Geal described: use pre-generated profiling data. In particular, I use AutoFDO. A fancy pipeline gathers stats from my binary as it serves production traffic and saves a profile to version control. Subsequent builds use the latest available profile. For my servers, this is about a 15% performance improvement. (The paper says, more generally, that AutoFDO yields "improvements commonly in the 10-15% range and sometimes over 30%".) So pre-generated profiles are absolutely valuable to support; autofdo is great when you have a way to instrument your real binary under real, consistent load. But I think it'd be hard to get that real, consistent load for mobile/desktop apps. And autofdo requires Intel-specific hardware counters (on bare metal—they don't seem to work in VMware Fusion). It'd be a pain for the personal Rust project I'm working on now; I'd rather just write a #[profile].

valarauca commented Nov 14, 2016

@Permutatrix

> I think it's safe to say that a run through the benchmarks doesn't generally represent a typical use of the code in the real world.

Fair statement.

> But if you give the optimizer a profile based on those benchmarks, it won't see how often the larger function calls the smaller one, so it might not inline that call, making the loop slower than it was without the profile!

I understand that #[profile] isn't a full solution, it requires developer buy in, documentation, and cooperation with the compiler to get the full benefit. Just like #[test]. It doesn't guarantee all your code is correct unless.

There would still be some branch hinting optimizations out of the case you outlined. So there are still some, largely trivial, gains, much like with sparse #[test] usage. Overall I feel this is a good fit, because like #[test] one would have to use #[profile] at near 100% coverage to get the full benefit of it.

Lastly, if done correctly, #[profile] can become the vast majority of your #[test] suite, barring edge cases. This would likely involve pruning assert!()-like statements from the PGO output, or maybe separate runs with and without assertions, depending on cargo test vs cargo --release-profile, idk.

Vurich commented Jul 10, 2017

This would be great for Parity, since we have one major path that we care very highly about (the import speed) and that we profile for. We could just run it over and over again for the sake of PGO. There's probably a bunch of cold paths that aren't marked as such in the bitcode.

I think as far as design goes, I would have a Cargo.profile that stores profile information and gets checked in, like Cargo.lock. You could run a cargo profile command that compiles one or more #[profile] functions that act similarly to #[bench], and outputs to Cargo.profile. When you build release it just draws on the same profile information. This would be massively unstable, but it's a pretty niche feature anyway so I think that's OK.
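For comparison, this is roughly what marking a cold path by hand looks like in Rust today, which is the kind of annotation a profile could make unnecessary. `#[cold]` and `#[inline(never)]` are existing Rust attributes; the functions `import_blocks` and `report_bad_block` and the "zero means invalid" rule are made up for this sketch and are not taken from any real codebase.

```rust
/// Rarely taken error path, marked by hand.
/// The function itself is hypothetical; the attributes are real.
#[cold]
#[inline(never)]
fn report_bad_block(index: usize) -> Result<u64, String> {
    Err(format!("block {} failed validation", index))
}

/// Hot import loop: the common case stays on the straight-line path.
fn import_blocks(blocks: &[u64]) -> Result<u64, String> {
    let mut total = 0u64;
    for (index, &block) in blocks.iter().enumerate() {
        if block == 0 {
            // Unlikely branch, kept out of line by the attributes above.
            return report_bad_block(index);
        }
        total = total.wrapping_add(block);
    }
    Ok(total)
}

fn main() {
    println!("{:?}", import_blocks(&[1, 2, 3]));
    println!("{:?}", import_blocks(&[1, 0, 3]));
}
```

With profile data, the compiler could in principle infer that the error branch is rarely taken without the manual attributes.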

emilio commented Feb 12, 2018

It'd also be great for Firefox. We had to do a lot of manual tweaking to querySelector to make it comparable to the C++ version, which could presumably get some cleanup if we get PGO.

emilio commented Feb 19, 2018

I hacked on this this weekend, and I think I got something to work, will post a WIP PR for feedback soon :)

emilio commented Feb 19, 2018

(Also note that I didn't do the cargo integration on top of it, I only did the bits so that profile usage and generation can go through rustc instead of --emit llvm-bc + opt + clang)
