Create a benchmark suite for uncovering runtime regressions #31265
Comments
cc @nrc who was asking about this |
Nominating to discuss in today's compiler meeting and for triage. (I guess P-medium?) |
triage: P-medium |
The title says "compile time regressions" but the body says "runtime of generated code". Which, or both, do you plan to measure? (I hope both) |
The latter. The former is already measured (http://www.ncameron.org/perf-rustc/). |
I'd like to help here. |
Great idea, especially something to stress jemalloc ;) |
@nrc: What format are you using for storing the data for that perf-rustc site? (That site is pretty great, by the way!) My thought is that that could be a starting point for achieving the following: |
@dirk The data is stored here: https://github.com/nrc/rustc-timing/tree/master/processed JSON is the short answer. |
I know this may be too complex for an ongoing, reliable test, but it would be awesome to have something like a small-but-not-trivial web project (Iron? Nickel?). This would exercise lots of string-related operations, a little bit of IO (I assume the request and response should be very small in size), context switching, and linking. Unfortunately, this would require the benchmarks to be moved to a separate project. |
As an aside, some kind of "benchmark monitoring" solution is a common need (e.g., I know I want the same thing in Rayon). It'd be great if we could write a tool that is easily extensible beyond rustc -- it seems eminently doable. |
@nikomatsakis I'd be interested in helping out with this effort overall, but I also wanted to ask about a somewhat related project I started after reading your comment above. I did a little hacking on this in the last couple of days: https://github.com/dikaiosune/secondstring It's really crappy still and will need more work to properly do the monitoring you're talking about, but I'm curious if you think my approach so far would be a good fit for the "benchmark monitoring" in your comment above. EDIT: The incremental benchmarking seems to be mostly working now. Mostly. |
It might be interesting to see if polly has any effects on our testcases. |
For |
@nrc asked me to share some of my thoughts after doing the above analysis. I still think that the depth of the linked analysis is too shallow to be truly useful, but it did highlight a few questions for me. I don't think all of these need to be answered for a good benchmark suite, but some of them are tough nuts to crack, and I'd argue that all of them are important for having a really good automated regression detection protocol.

The core problem I see after messing around with these data: it's really tough to tell the difference between a regression and a noisy measurement (or a series of noisy measurements). I imagine that a good improvement to what I did would be a combination of repeated measurements across different environments and more detailed data collection than just runtimes in nanoseconds. @cmr (/u/cmrx64 on reddit, I think that's the same person) suggested a more in-depth list of metrics to track:
I come from Java and Python, so I'm not sure how to collect those metrics. I also don't know whether that's possible without debug information, which as I understand it does not currently get included when LLVM optimizations are on (relevant since one probably doesn't care about performance regressions in debug code). If it is possible, I would be very interested in learning how to do so for Rust executables so I can incorporate those measurements into secondstring. I'm also thinking that I should have run

Any method for automatically detecting regressions needs to be sensitive to the historical noise in previous benchmarks. If you look at the benchmarks linked above for byteorder vs. csv, 5% variations would be a big deal for the former, but not for the latter. Along the same lines, a good regression detection tool would account for long-term increases that happen very incrementally. If, for example, the clap benchmarks from November to mid-January represent an actual regression, each uptick in runtime would have been very difficult to tell apart from noise at the time those nightlies were being released.

Another challenge of treating the data "in context": what is the baseline you measure against? Does that baseline ever get updated? For example, looking at the chart for the

Wherever a likely regression is spotted, it'd be very useful to pinpoint the exact nightly/merge/version that introduced it. But looking (for example) at the chart for

The above math/statistics problems are AIUI areas of active research in time series analysis, but there are existing techniques for them. I'm planning to do some research on them, but I still think that any benchmark suite should focus on producing noise-free data to ease that analysis process.

There's another issue to tease out in terms of benchmarking the quality of code generation vs. the standard library. Most of these benchmarks would be seeing improvements and regressions not only from

I had thought about trying to automate benchmarking against nightlies on a VPS machine, but after seeing how much noise there is even in bare-metal microbenchmarks, I'm hesitant to try that now that it's clear that the microbenchmarks typically run by cargo bench can be subject to a lot of noise.

I don't think any of the benchmarks I ran were I/O dependent, and it'd be nice to track Rust's I/O performance. But I'm not sure how this would best be achieved, since there are so many variables that could affect the results of a test like that.

Since one of Rust's core advantages is parallelism, this is another important area to benchmark. But looking at the

I'm fairly happy with calculating a benchmark index by normalizing all of the runtimes against the first corresponding runtime in the series, and then taking the geometric mean of all runtimes at a given point in the series (a sketch of this calculation follows this comment). However, I don't know exactly how this behaves under different variations, so I'm not sure what good thresholds would be for warning about regressions, or even for drawing a line on the graph to say "we expect points above here to be 10% slower." I think that's partly dependent on how many points are included in each mean. So if a benchmark has 20 functions and they all move 3% (probably just noise for larger tests), that might show the same change as a benchmark with 5 functions that all move 12% (a big deal!).

There are two benchmarks in the basic analysis I did which give me confidence that detecting real perf regressions is possible: permutohedron and regex.
They both had visible spikes in runtime last fall, followed by returns to the previous baseline in late January. It doesn't look like they were caused in the same nightly (early September for permutohedron and late September for regex). It's also unlikely that both of those changes were caused by noise from the machine they were run on, since the benchmarks weren't run in parallel and it seems unlikely to me that something like a power-saving mode would have kicked on and off at such similar points in their relative executions.

This is very long now, apologies.

TL;DR: Accurately gauging what's a real performance regression is hard. Designing a benchmark suite that provides clean data for that analysis is also hard. Here are some challenges/questions that poking around some data raised for me:
I'm going to continue working on improving the data collection for secondstring, and I'll also be researching some of the time series stuff to see how much is feasible to implement in a given tool. |
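A minimal sketch of the benchmark-index calculation described in the comment above, assuming the per-benchmark runtimes have already been collected; the function name and data layout are illustrative only and are not part of secondstring's actual code:

```rust
/// Compute a benchmark index for one point in a time series:
/// normalize each benchmark's runtime against its first recorded
/// runtime, then take the geometric mean of those ratios.
/// `first` and `current` hold runtimes (e.g. in nanoseconds) for the
/// same set of benchmark functions, in the same order.
fn benchmark_index(first: &[f64], current: &[f64]) -> f64 {
    assert_eq!(first.len(), current.len());
    assert!(!first.is_empty());
    // Geometric mean computed via the mean of logs to avoid overflow.
    let log_sum: f64 = first
        .iter()
        .zip(current)
        .map(|(f, c)| (c / f).ln())
        .sum();
    (log_sum / first.len() as f64).exp()
}

fn main() {
    // Hypothetical runtimes for three benchmark functions at the first
    // nightly and a later one; an index of 1.0 means "no change overall".
    let first = [100.0, 2_000.0, 50.0];
    let later = [103.0, 2_050.0, 51.0];
    println!("index = {:.3}", benchmark_index(&first, &later));
}
```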
My 2 cents: don't measure wall-clock time in a multithreading OS. The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency. Another thing that @cmr actually hit was having the same binary, doing the same task, with several memory profiles (e.g. you could get 2 different graphs that it would keep jumping between on each run). I'm not sure if we were completely unaware of it back then, but ASLR can definitely cause such behavior, and not just in an allocator, but in Rust code itself: |
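A minimal Linux-only sketch of the pinning part of that suggestion, using the libc crate's sched_setaffinity. Reserving the core from other processes and locking its frequency still has to happen outside the program, and this is an illustration rather than the setup actually used in the measurements discussed here:

```rust
// Pin the current thread to a single core before timing anything,
// so the measurement is not perturbed by migrations between cores.
// Linux-only; requires `libc` as a dependency in Cargo.toml.
extern crate libc;

fn pin_to_core(core: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // A pid of 0 means "the calling thread".
        let rc = libc::sched_setaffinity(
            0,
            std::mem::size_of::<libc::cpu_set_t>(),
            &set,
        );
        assert_eq!(rc, 0, "sched_setaffinity failed");
    }
}

fn main() {
    pin_to_core(2); // hypothetical: assumes core 2 is otherwise idle
    // ... run the benchmark workload here ...
}
```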
@eddyb So you see the ideal solution as one that uses a bespoke benchmarking library for [insert OS here], as opposed to

I do think that limiting non-determinism can become a bit of a rabbit hole. If you keep at it indefinitely, you'll end up needing to design a fresh ISA. I'd argue that eventually there are diminishing returns beyond "sufficiently deterministic," whatever that actually is. |
In private communication, @BurntSushi suggested the bench_sherlock benchmark for regex. He also wrote: |
At the risk of spamming, I did try @eddyb's suggestion of using getrusage to measure time and wrote up a bit about how it went (TLDR: better than using cargo bench, but still not perfect): https://dikaiosune.github.io/rust-runtime-cargobench-vs-getrusage.html I also posted it to the subreddit to see if anyone not following this thread had some ideas about improving the method: https://www.reddit.com/r/rust/comments/47dohh/measuring_rust_runtime_performance_cargo_bench_vs/ |
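For readers who skip the linked write-up, a rough sketch of the technique it compares against cargo bench: sample user CPU time via getrusage before and after the workload instead of wall-clock time. The libc calls are real; the workload and the iteration count are placeholders:

```rust
// Measure user CPU time for a workload via getrusage(RUSAGE_SELF),
// which is less sensitive to scheduler noise than wall-clock time.
// Unix-only; requires `libc` as a dependency in Cargo.toml.
extern crate libc;

fn user_time_micros() -> u64 {
    unsafe {
        let mut usage: libc::rusage = std::mem::zeroed();
        libc::getrusage(libc::RUSAGE_SELF, &mut usage);
        usage.ru_utime.tv_sec as u64 * 1_000_000 + usage.ru_utime.tv_usec as u64
    }
}

fn main() {
    let before = user_time_micros();
    // Placeholder workload; a real harness would run the benchmark here.
    let mut sum = 0u64;
    for i in 0..10_000_000u64 {
        sum = sum.wrapping_add(i);
    }
    let after = user_time_micros();
    println!("sum = {}, user time = {} µs", sum, after - before);
}
```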
Some preliminary work in this repo: https://github.com/nikomatsakis/rust-runtime-benchmarks This is very simple. Basically just some directories where you can run

I will also open up some issues for adding more things. Right now it only has the "doom" and "raytracer" benchmarks. I'll do a bit of work on putting regex in there, along with something from LALRPOP, but hopefully can encourage others to add in more things. =) |
Is the criterion library helpful? And more generally, if we are building a tool to check for performance regressions, can we make it easy for the rest of the ecosystem to use? |
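For context, a minimal sketch of what a Criterion.rs benchmark looks like (placed in benches/, with criterion as a dev-dependency and harness = false for that bench target); the fibonacci workload and the benchmark name are made up for illustration:

```rust
// A minimal Criterion benchmark, for comparison with the built-in
// `cargo bench` harness. Put this in benches/example.rs.
#[macro_use]
extern crate criterion;

use criterion::Criterion;

// Placeholder workload; any function under test would go here.
fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn bench_fib(c: &mut Criterion) {
    // Criterion runs the closure repeatedly and reports statistics,
    // including comparisons against previously saved baselines.
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```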
@eddyb did you manage to run rustc with polly enabled and used by LLVM opt? I've filed an issue for this: #39884 Ideally the shipped LLVM would always be compiled with polly, and the frontend would have an option that loads it as a plugin and enables it in opt for you. And maybe another option to pass polly options to avoid writing |
@gnzlbg I never got it working, no. |
In case someone is subscribed here and didn't see the thread on irlo, I've been working on lolbench and discussing work over at https://internals.rust-lang.org/t/help-needed-corpus-for-measuring-runtime-performance-of-generated-code/6794/29. |
Hello! What are things looking like for this nowadays? Our plan for |
Hey! Yeah, I chose a very "get it out the door" approach for storing the data from benchmark runs that I think will need to be redone as the first step in reviving the project. I'd like to revive it soon for a bunch of reasons, but a few in particular:
I'm still not sure what a reasonable definition of "soon" is for me, though. Hopefully this year? If someone is motivated and wants to help get it back up and running in the meantime, let me know! I'll happily find time to write mentoring instructions. |
Discussed at today's T-compiler backlog bonanza. This is work that needs doing, but it should not be tracked here. We're shifting the conversation over to rust-lang/rustc-perf#69 instead, and closing this issue. |
We need a benchmark suite targeting the runtime of generated code. I've started gathering together code samples into a repository. Here is a list of the projects I plan to extract:

- `HashMap`, not just take from the libs

In addition to curating the benchmarks themselves, we need a good way to run them. There are some tasks associated with that:

- `cargo bench` will run the relevant tests (this part I expect to get started on --nmatsakis)
- `cargo bench` and extract the results into one data set
- `master` branch)

Eventually, I would want to integrate this into our regular benchmarking computer so that it can be put up on a website, but for now it'd be nice if you could at least run locally.
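To make the harness task concrete, here is roughly what one extracted benchmark looks like under the built-in harness that `cargo bench` already runs on nightly; the HashMap workload below is only a placeholder, not one of the projects planned for extraction:

```rust
// Roughly what a benchmark file (e.g. benches/hashmap.rs) looks like
// under the built-in harness that `cargo bench` runs. Requires a
// nightly compiler for the unstable `test` feature.
#![feature(test)]
extern crate test;

use std::collections::HashMap;
use test::Bencher;

#[bench]
fn bench_hashmap_insert_lookup(b: &mut Bencher) {
    b.iter(|| {
        let mut map = HashMap::new();
        for i in 0..1_000u32 {
            map.insert(i, i * 2);
        }
        // black_box keeps the optimizer from discarding the work.
        test::black_box(map.get(&500));
    });
}
```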