Create a benchmark suite for uncovering runtime regressions #31265
Comments
cc @nrc who was asking about this
nikomatsakis added P-medium, E-mentor, T-compiler, I-nominated and removed P-medium labels on Jan 28, 2016
Nominating to discuss in today's compiler meeting and for triage. (I guess P-medium?)
triage: P-medium
rust-highfive added P-medium and removed I-nominated labels on Jan 28, 2016
The title says "compile time regressions" but the body says "runtime of generated code". Which, or both, do you plan to measure? (I hope both)
nrc changed the title from "Create a benchmark suite for uncovering compile time regressions" to "Create a benchmark suite for uncovering runtime regressions" on Jan 28, 2016
The latter. The former is already measured (http://www.ncameron.org/perf-rustc/).
I'd like to help here.
MagaTailor commented Jan 29, 2016:
Great idea, especially something to stress jemalloc ;)
@nrc: What format are you using for storing the data for that perf-rustc site? (That site is pretty great, by the way!) My thought is that it could be a starting point for achieving the following: …
@dirk The data is stored here: https://github.com/nrc/rustc-timing/tree/master/processed. JSON is the short answer.
hexsel commented Feb 1, 2016:
I know this may be too complex for an ongoing, reliable test, but it would be awesome to have something like a small-but-not-trivial web project (Iron? Nickel?). This would exercise lots of string-related operations, a little bit of I/O (I assume the request and response would be very small), context switching, and linking. Unfortunately, this would require the benchmarks to live in a separate project.
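To make that concrete, here is a minimal sketch of the kind of Iron handler such a web benchmark might exercise. This is hypothetical illustration only (it is not part of any actual proposal) and assumes the `iron` crate:

```rust
extern crate iron;

use iron::prelude::*;
use iron::status;

// A tiny handler: small request, small response, mostly string work.
fn hello(_req: &mut Request) -> IronResult<Response> {
    Ok(Response::with((status::Ok, "Hello, world!")))
}

fn main() {
    // A benchmark driver would hammer this endpoint with many small requests.
    Iron::new(hello).http("localhost:3000").unwrap();
}
```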
As an aside, some kind of "benchmark monitoring" solution is a common need (e.g., I know I want the same thing in Rayon). It'd be great if we could write a tool that is easily extensible beyond rustc -- it seems eminently doable.
@nikomatsakis I'd be interested in helping out with this effort overall, but I also wanted to ask about a somewhat related project I started after reading your comment above. I did a little hacking on this in the last couple of days: https://github.com/dikaiosune/secondstring. It's still really rough and will need more work to properly do the monitoring you're talking about, but I'm curious whether you think my approach so far would be a good fit for the "benchmark monitoring" in your comment above. EDIT: The incremental benchmarking seems to be mostly working now. Mostly.
It might be interesting to see if polly has any effect on our test cases.
@nrc asked me to share some of my thoughts after doing the above analysis. I still think that the depth of the linked analysis is too shallow to be truly useful, but it did highlight a few questions for me. I don't think all of these need to be answered for a good benchmark suite, but some of them are tough nuts to crack, and I'd argue that all of them are important for having a really good automated regression detection protocol.

The core problem I see after messing around with these data: it's really tough to tell the difference between a regression and a noisy measurement (or a series of noisy measurements). I imagine that a good improvement to what I did would be a combination of repeated measurements across different environments and more detailed data collection than just runtimes in nanoseconds. @cmr (/u/cmrx64 on reddit, I think that's the same person) suggested a more in-depth list of metrics to track: …

I come from Java and Python, so I'm not sure how to collect those metrics. I also don't know if that's possible without debug information, which as I understand it does not currently get included when LLVM optimizations are on (relevant since one probably doesn't care about performance regressions in debug code). If it is possible, I would be very interested in learning how to do so for Rust executables so I can incorporate those measurements into secondstring. I'm also thinking that I should have run …

Any method for automatically detecting regressions needs to be sensitive to the historical noise in previous benchmarks. If you look at the benchmarks linked above for byteorder vs. csv, 5% variations would be a big deal for the former, but not for the latter. Along the same lines, a good regression detection tool would account for long-term increases that happen very incrementally. If, for example, the clap benchmarks from November to mid-January represent an actual regression, each uptick in runtime would have been very difficult to tell apart from noise at the time those nightlies were being released.

Another challenge of treating the data "in context": what is the baseline you measure against? Does that baseline ever get updated? For example, looking at the chart for the …

Wherever a likely regression is spotted, it'd be very useful to pinpoint the exact nightly/merge/version that introduced it. But looking (for example) at the chart for …

The above math/statistics problems are, AIUI, areas of active research in time series analysis, but there are existing techniques for them. I'm planning to do some research on them, but I still think that any benchmark suite should focus on producing noise-free data to ease that analysis process.

There's another issue to tease out in terms of benchmarking the quality of code generation vs. the standard library. Most of these benchmarks would be seeing improvements and regressions not only from …

I had thought about trying to automate benchmarking against nightlies on a VPS machine, but after seeing how much noise there is in bare-metal microbenchmarks, I'm hesitant to try that now that it's clear the microbenchmarks typically run by cargo bench could be subject to a lot of noise.

I don't think any of the benchmarks I ran were I/O dependent, and it'd be nice to track Rust's I/O performance. But I'm not sure how this would best be achieved, since there are so many variables that could affect the results of a test like that.

Since one of Rust's core advantages is parallelism, this is another important area to benchmark. But looking at the …

I'm fairly happy with calculating a benchmark index by normalizing all of the runtimes against the first corresponding runtime in the series, and then taking the geometric mean of all runtimes at a given point in the series. However, I don't know exactly how this behaves under different variations, so I'm not sure what good thresholds would be for warning about regressions, or even for drawing a line on the graph to say "we expect points above here to be 10% slower." I think that's partly dependent on how many points are included in each mean. So if a benchmark has 20 functions and they all move 3% (probably just noise for larger tests), that might show the same change as a benchmark with 5 functions that all move 12% (a big deal!).

There are two benchmarks in the basic analysis I did which give me confidence that detecting real perf regressions is possible: permutohedron and regex. They both had visible spikes in runtime last fall, followed by returns to the previous baseline in late January. It doesn't look like they were caused in the same nightly (early September for permutohedron and late September for regex). It's also unlikely that both of those changes were caused by noise from the machine they were run on, since the benchmarks weren't run in parallel, and it seems unlikely to me that something like a power-saving mode would have kicked on and off at such similar points in their relative executions.

This is very long now, apologies.

TL;DR: Accurately gauging what's a real performance regression is hard. Designing a benchmark suite that provides clean data for that analysis is also hard. Here are some challenges/questions that poking around some data raised for me: …

I'm going to continue working on improving the data collection for secondstring, and I'll also be researching some of the time series stuff to see how much is feasible to implement in a given tool.
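As a rough illustration of the normalized geometric-mean index described above, here is a minimal sketch in plain Rust. It is hypothetical (not secondstring's actual code) and assumes one runtime series per benchmark, measured at the same sequence of nightlies:

```rust
/// `series[bench][point]` = runtime in nanoseconds for one benchmark at one nightly.
/// Returns an index per point: 1.0 means "same as the first measurement".
fn benchmark_index(series: &[Vec<f64>]) -> Vec<f64> {
    let points = series.iter().map(|s| s.len()).min().unwrap_or(0);
    (0..points)
        .map(|i| {
            // Normalize each runtime against the first measurement of its own series.
            let log_sum: f64 = series.iter().map(|s| (s[i] / s[0]).ln()).sum();
            // Geometric mean = exp(mean of logs).
            (log_sum / series.len() as f64).exp()
        })
        .collect()
}

fn main() {
    // Two hypothetical benchmarks measured across four nightlies.
    let series = vec![
        vec![100.0, 102.0, 98.0, 130.0],
        vec![2000.0, 2010.0, 1995.0, 2600.0],
    ];
    // The last point comes out near 1.3, i.e. roughly 30% slower than baseline.
    println!("{:?}", benchmark_index(&series));
}
```

Working in log space keeps the mean well behaved even when the normalized runtimes of different benchmarks span very different magnitudes.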
My 2 cents: don't measure wall-clock time on a multithreaded OS. The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency. Another thing that @cmr actually hit was having the same binary, doing the same task, with several memory profiles (e.g. you could get two different graphs that it would keep jumping between on each run). I'm not sure if we were completely unaware of it back then, but ASLR can definitely cause such behavior, and not just in an allocator, but in Rust code itself: …
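As a concrete illustration of the core-pinning suggestion, here is a minimal Linux-only sketch using the `libc` crate (hypothetical code, not from this thread). Locking the CPU frequency and actually isolating the core from other threads would still have to happen outside the process (e.g. via cpufreq settings or kernel boot parameters):

```rust
extern crate libc;

use std::mem;

/// Pin the calling thread to a single core so the scheduler can't migrate it.
fn pin_to_core(core: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // pid 0 = the calling thread.
        let rc = libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), &set);
        assert_eq!(rc, 0, "sched_setaffinity failed");
    }
}

fn main() {
    pin_to_core(0);
    // ... run the benchmark workload here ...
}
```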
@eddyb So you see the ideal solution as one that uses a bespoke benchmarking library for *insert OS here*, as opposed to …?

I do think that limiting non-determinism can become a bit of a rabbit hole. If you keep at it indefinitely, you'll end up needing to design a fresh ISA. I'd argue that eventually there are diminishing returns beyond "sufficiently deterministic," whatever that actually is.
In private communication, @BurntSushi suggested the bench_sherlock benchmark for regex. He also wrote: …
At the risk of spamming, I did try @eddyb's suggestion of using getrusage to measure time and wrote up a bit about how it went (TLDR: better than using cargo bench, but still not perfect): https://dikaiosune.github.io/rust-runtime-cargobench-vs-getrusage.html

I also posted it to the subreddit to see if anyone not following this thread had some ideas about improving the method: https://www.reddit.com/r/rust/comments/47dohh/measuring_rust_runtime_performance_cargo_bench_vs/
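For readers who don't follow the links, here is a minimal sketch of the getrusage-based approach, measuring user CPU time instead of wall-clock time. It is a hypothetical stand-in, not the harness from the write-up, and assumes Linux/Unix plus the `libc` crate:

```rust
extern crate libc;

use std::mem;

/// User CPU time consumed by this process so far, in nanoseconds.
fn user_cpu_time_ns() -> u64 {
    unsafe {
        let mut usage: libc::rusage = mem::zeroed();
        let rc = libc::getrusage(libc::RUSAGE_SELF, &mut usage);
        assert_eq!(rc, 0, "getrusage failed");
        usage.ru_utime.tv_sec as u64 * 1_000_000_000 + usage.ru_utime.tv_usec as u64 * 1_000
    }
}

fn main() {
    let start = user_cpu_time_ns();
    // Stand-in workload; a real harness would run the benchmark under test here.
    let mut acc: u64 = 0;
    for i in 0..10_000_000u64 {
        acc = acc.wrapping_add(i * i);
    }
    let elapsed = user_cpu_time_ns() - start;
    println!("acc = {}, user cpu time: {} ns", acc, elapsed);
}
```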
Some preliminary work in this repo: https://github.com/nikomatsakis/rust-runtime-benchmarks

This is very simple: basically just some directories where you can run …. I will also open up some issues for adding more things. Right now it only has the "doom" and "raytracer" benchmarks. I'll do a bit of work on putting regex in there, along with something from LALRPOP, but hopefully I can encourage others to add in more things. =)
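For context, this is roughly the shape a benchmark runnable via `cargo bench` had at the time: the unstable libtest `#[bench]` attribute on nightly Rust. The example below is hypothetical, not code from the repo:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Stand-in workload; a real entry would call into the extracted project code.
fn workload(n: u64) -> u64 {
    (0..n).fold(0, |acc, x| acc ^ x.wrapping_mul(2654435761))
}

#[bench]
fn bench_workload(b: &mut Bencher) {
    // `iter` runs the closure repeatedly and reports ns/iter;
    // `black_box` keeps the optimizer from deleting the work.
    b.iter(|| black_box(workload(black_box(10_000))));
}
```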
Is the criterion library helpful? And more generally, if we are building a tool to check for performance regressions, can we make it easy for the rest of the ecosystem to use?
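For reference, a criterion benchmark looks roughly like the sketch below. It uses criterion's macro API (`criterion_group!`/`criterion_main!`), which may postdate this comment, and is only meant to illustrate the kind of statistics-aware harness being asked about:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    (0..n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn bench_fib(c: &mut Criterion) {
    // criterion handles warm-up, sampling, and outlier/regression statistics.
    c.bench_function("fib 30", |b| b.iter(|| fibonacci(black_box(30))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```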
@eddyb did you manage to run rustc with polly enabled and used by LLVM's opt? I've filed an issue for this: #39884. Ideally the shipped LLVM would always be compiled with polly, and the frontend would have an option that loads it as a plugin and enables it in opt for you. And maybe another option to pass polly options, to avoid writing …
@gnzlbg I never got it working, no.
nikomatsakis referenced this issue on Jun 12, 2017: tracking issue for monitoring compiler performance #42611 (open)
Mark-Simulacrum added the C-tracking-issue label on Jul 24, 2017
In case someone is subscribed here and didn't see the thread on irlo, I've been working on lolbench and discussing work over at https://internals.rust-lang.org/t/help-needed-corpus-for-measuring-runtime-performance-of-generated-code/6794/29.
nikomatsakis commented Jan 28, 2016 (edited)
We need a benchmark suite targeting the runtime of generated code. I've started gathering together code samples into a repository. Here is a list of the projects I plan to extract:

- …
- … `HashMap`, not just take from the libs

In addition to curating the benchmarks themselves, we need a good way to run them. There are some tasks associated with that (a sketch of the extraction step follows below):

- … `cargo bench` will run the relevant tests (this part I expect to get started on --nmatsakis)
- … `cargo bench` and extract the results into one data set
- … (… `master` branch)

Eventually, I would want to integrate this into our regular benchmarking computer so that it can be put up on a website, but for now it'd be nice if you could at least run it locally.
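A hypothetical sketch (not part of the issue) of the "run `cargo bench` and extract the results into one data set" step, assuming the classic libtest output line format `test NAME ... bench: N ns/iter (+/- M)`:

```rust
use std::process::Command;

fn main() {
    // Run `cargo bench` in the current benchmark directory and capture stdout.
    let output = Command::new("cargo")
        .arg("bench")
        .output()
        .expect("failed to run cargo bench");
    let stdout = String::from_utf8_lossy(&output.stdout);

    // Collect (benchmark name, ns/iter) pairs into one data set.
    let mut results: Vec<(String, u64)> = Vec::new();
    for line in stdout.lines() {
        let line = line.trim();
        if !line.starts_with("test ") || !line.contains("bench:") {
            continue;
        }
        let name = line.split_whitespace().nth(1).unwrap_or("").to_string();
        // The number right after "bench:" is the ns/iter figure, with commas.
        if let Some(rest) = line.split("bench:").nth(1) {
            let digits: String = rest
                .split("ns/iter")
                .next()
                .unwrap_or("")
                .chars()
                .filter(|c| c.is_ascii_digit())
                .collect();
            if let Ok(ns) = digits.parse::<u64>() {
                results.push((name, ns));
            }
        }
    }

    // A real tool would append these to a time series keyed by nightly/commit.
    for (name, ns) in &results {
        println!("{}\t{} ns/iter", name, ns);
    }
}
```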