
Create a benchmark suite for uncovering runtime regressions #31265

Closed · 1 of 11 tasks · nikomatsakis opened this issue Jan 28, 2016 · 28 comments
Labels
  • C-tracking-issue (Category: An issue tracking the progress of sth. like the implementation of an RFC)
  • E-mentor (Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.)
  • P-medium (Medium priority)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@nikomatsakis (Contributor) commented Jan 28, 2016

We need a benchmark suite targeting the runtime of generated code. I've started gathering together code samples into a repository. Here is a list of the projects I plan to extract:

In addition to curating the benchmarks themselves, we need a good way to run them. There are some tasks associated with that:

  • Put the projects in a repo with a Cargo setup such that cargo bench will run the relevant tests (this part I expect to get started on --nmatsakis)
  • Write a script that will execute cargo bench and extract the results into one data set (a rough sketch of what such a script might look like follows this list)
  • Provide some way to save that data set to disk and to compare it against other data sets (e.g., runs of the master branch)
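
As a rough illustration only (not the actual tooling), such a runner could shell out to cargo bench in each project directory and scrape the libtest output; the directory list and output handling below are placeholders:

```rust
// Hypothetical runner sketch: run `cargo bench` in each benchmark directory,
// scrape the libtest output, and collect one combined data set.
use std::collections::BTreeMap;
use std::process::Command;

fn main() -> std::io::Result<()> {
    let mut results: BTreeMap<String, u64> = BTreeMap::new();

    for dir in &["doom", "raytracer"] { // placeholder project list
        let output = Command::new("cargo")
            .arg("bench")
            .current_dir(dir)
            .output()?;
        let stdout = String::from_utf8_lossy(&output.stdout);

        // libtest prints lines like:
        // test bench_name ... bench:      12,345 ns/iter (+/- 678)
        for line in stdout.lines().filter(|l| l.contains("ns/iter")) {
            let name = line.split_whitespace().nth(1).unwrap_or("?");
            let nanos: String = line
                .split("bench:")
                .nth(1)
                .unwrap_or("")
                .chars()
                .take_while(|c| *c != 'n') // stop before "ns/iter"
                .filter(|c| c.is_ascii_digit())
                .collect();
            if let Ok(value) = nanos.parse::<u64>() {
                results.insert(format!("{}/{}", dir, name), value);
            }
        }
    }

    // A real runner would serialize this (e.g. as JSON) and diff it against a
    // saved run of the master branch; printing is enough for a sketch.
    for (bench, ns) in &results {
        println!("{}\t{} ns/iter", bench, ns);
    }
    Ok(())
}
```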

Eventually, I would want to integrate this into our regular benchmarking computer so that it can be put up on a website, but for now it'd be nice if you could at least run it locally.

@nikomatsakis (Contributor Author)

cc @nrc who was asking about this

@nikomatsakis nikomatsakis added P-medium Medium priority E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-nominated and removed P-medium Medium priority labels Jan 28, 2016
@nikomatsakis (Contributor Author)

Nominating to discuss in today's compiler meeting and for triage. (I guess P-medium?)

@nikomatsakis (Contributor Author)

triage: P-medium

@rust-highfive rust-highfive added P-medium Medium priority and removed I-nominated labels Jan 28, 2016
@durka (Contributor) commented Jan 28, 2016

The title says "compile time regressions" but the body says "runtime of generated code". Which, or both, do you plan to measure? (I hope both)

@nrc nrc changed the title Create a benchmark suite for uncovering compile time regressions Create a benchmark suite for uncovering runtime regressions Jan 28, 2016
@nrc (Member) commented Jan 28, 2016

The latter. The former is already measured (http://www.ncameron.org/perf-rustc/).

@palango (Contributor) commented Jan 29, 2016

I'd like to help here.
What do you think about pulling the hard and medium regexes from the link and running them on some given data? The data could be loaded from a file or generated by a constant-seed RNG. The former would probably be somewhat I/O dependent, while the latter would exercise the RNG as well.
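
As a sketch of the constant-seed idea (assuming the rand crate's StdRng; the seed and alphabet are arbitrary, and StdRng's algorithm is not guaranteed to be stable across rand versions):

```rust
// Sketch: reproducible benchmark input from a fixed-seed RNG, so every run
// of the regex benchmarks sees identical data without touching the disk.
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

fn haystack(len: usize) -> String {
    // Fixed seed => identical output on every run (for a given rand version).
    let mut rng = StdRng::seed_from_u64(0xDEAD_BEEF);
    (0..len).map(|_| rng.gen_range(b'a'..=b'z') as char).collect()
}
```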

@MagaTailor

Great idea, especially something to stress jemalloc ;)

@dirk (Contributor) commented Jan 30, 2016

@nrc: What format are you using for storing the data for that perf-rustc site? (That site is pretty great by the way!) My thought is that that could be a starting-point for achieving the following:

Write a script that will execute cargo bench and extract the results into one data set

@nrc (Member) commented Feb 1, 2016

@dirk JSON is the short answer; the data is stored at https://github.com/nrc/rustc-timing/tree/master/processed

@hexsel commented Feb 1, 2016

I know this may be too complex for an ongoing, reliable test, but it would be awesome to have something like a small-but-not-trivial web project (Iron? Nickel?). This would exercise lots of string-related operations, a little bit of IO (I assume the request and response should be very small in size), context switching, and linking.

Unfortunately, this would require the benchmarks to be moved to a separate project.

@nikomatsakis (Contributor Author)

As an aside, some kind of "benchmark monitoring" solution is a common need (e.g., I know I want the same thing in Rayon). It'd be great if we could write a tool that is easily extensible beyond rustc -- it seems eminently doable.

@anp (Member) commented Feb 11, 2016

@nikomatsakis I'd be interested in helping out with this effort overall, but I also wanted to ask about a somewhat related project I started after reading your comment above.

I did a little hacking on this in the last couple of days:

https://github.com/dikaiosune/secondstring

It's really crappy still and will need more work to properly do the monitoring you're talking about, but I'm curious if you think my approach so far would be a good fit for the "benchmark monitoring" in your comment above. Soon I'll be implementing saving the results to disk and diffing all benchmark runs against the previously saved results so that benchmarks can be run incrementally. Do you think it seems like it would be useful for this effort? What would be needed for a proper benchmark monitoring tool as it relates to this issue?

EDIT: The incremental benchmarking seems to be mostly working now. Mostly.

@eddyb (Member) commented Feb 15, 2016

It might be interesting to see whether Polly has any effect on our test cases.

@BurntSushi (Member)

For regex, you probably want benches/bench.rs and benches/bench_sherlock.rs. They can be run with cargo bench --bench dynamic. (The same benchmarks run on other configurations of the regex, but the dynamic bench is the primary one that benchmarks Regex::new.)

@anp (Member) commented Feb 17, 2016

@nrc asked me to share some of my thoughts after doing the above analysis. I still think that the depth of the linked analysis is too shallow to be truly useful, but it did highlight a few questions for me. I don't think all of these need to be answered for a good benchmark suite, but some of them are tough nuts to crack and I'd argue that all of them are important for having a really good automated regression detection protocol.

The core problem I see after messing around with these data: it's really tough to identify the difference between a regression and a noisy measurement (or a series of noisy measurements). I imagine that a good improvement to what I did would be a combination of repeated measurements across different environments and more detailed data collection than just runtimes in nanoseconds. @cmr (/u/cmrx64 on reddit, I think that's the same person) suggested a more in depth list of metrics to track:

code size, number of basic blocks and their sizes, branch mispredicts, cache misses, or instructions executed, and maybe where the most time is spent

I come from Java and Python, so I'm not sure how to collect those metrics. I also don't know if that's possible without debug information, which as I understand it does not currently get included when LLVM optimizations are on (relevant since one probably doesn't care about performance regressions in debug code). If it is possible I would be very interested in learning how to do so for Rust executables so I can incorporate those measurements into secondstring. I'm also thinking that I should have run cargo bench multiple times in a row for each benchmark, as well as used the +/- margin reported to create confidence intervals in the charts.

Any method for automatically detecting regressions needs to be sensitive to the historical noise in previous benchmarks. If you look at the benchmarks linked above for byteorder vs. csv, 5% variations would be a big deal for the former, but not for the latter. Along the same lines, a good regression detection tool would account for long-term increases that happen very incrementally. If, for example, the clap benchmarks from November to mid-January represent an actual regression, each uptick in runtime would have been very difficult to tell apart from noise at the time when those nightlies were being released.
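
One way to make a detector sensitive to that noise is to compare each new measurement against the spread of a trailing window rather than against a single previous value; a rough sketch (the window size and threshold k are illustrative, not recommendations):

```rust
// Sketch: flag a new measurement only if it exceeds the historical mean by
// more than `k` standard deviations of the trailing window.
fn is_possible_regression(history_ns: &[f64], new_ns: f64, k: f64) -> bool {
    if history_ns.len() < 2 {
        return false; // not enough history to estimate the noise
    }
    let n = history_ns.len() as f64;
    let mean = history_ns.iter().sum::<f64>() / n;
    let var = history_ns.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    new_ns > mean + k * var.sqrt()
}
```

This does nothing about slow incremental drift, though; catching that probably needs comparing against a fixed older baseline as well.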

Another challenge of treating the data "in context": what is the baseline you measure against? Does that baseline ever get updated? For example, looking at the chart for the suffix crate (assuming those runtimes are representative), should the beginning of January be the new baseline? Or should one only care about regressions above the point where measurements started? Further, if one decides to move the baseline based on newer results, how long does a streak of improved performance need to be for it to not be a fluke?

Wherever a likely regression is spotted, it'd be very useful to pinpoint the exact nightly/merge/version that introduced it. But looking (for example) at the chart for uuid, it's tough to say which of the points in late January should have been marked as a regression.

The above math/statistics problems are AIUI areas of active research in time series analysis, but there are existing techniques for tackling them. I'm planning to do some research on them, but I still think that any benchmark suite should focus on producing noise-free data to ease that analysis process.

There's another issue to tease out in terms of benchmarking the quality of code generation vs. the standard library. Most of these benchmarks would see improvements and regressions not only from rustc's behavior, but also from changes made to libstd. I think that a good benchmark suite would include a bunch of portions built with #![no_std].

I had thought about trying to automate benchmarking against nightlies on a VPS machine, but after seeing how much noise there is even in bare-metal microbenchmarks, I'm hesitant to try that; it's now clear that the microbenchmarks typically run by cargo bench can be subject to a lot of noise.

I don't think any of the benchmarks I ran were I/O dependent, and it'd be nice to track Rust's I/O performance. But I'm not sure how this would be best achieved since there are so many variables that could affect the results of a test like that.

Since one of Rust's core advantages is parallelism, this is another important area to benchmark. But looking at the rayon benches, they seem either a) subject to greater real performance fluctuations, or b) subject to greater noise generated through (I'm guessing) kernel scheduling. Perhaps this would be a good place to compare the serial vs. concurrent implementations that the rayon benches include, and use that as a metric for comparison? I'm still not sure that would do much for controlling for scheduler interference, though.

I'm fairly happy with calculating a benchmark index by normalizing all of the runtimes against the first corresponding runtime in the series, and then taking the geometric mean of all runtimes at a given point in the series. However I don't know exactly how this behaves under different variations, so I'm not sure what good thresholds would be for warning about regressions, or even drawing a line on the graph to say "we expect points above here to be 10% slower." I think that's partly dependent on how many points are included in each mean. So if a benchmark has 20 functions and they all move 3% (probably just noise for larger tests), that might show the same change as a benchmark with 5 functions that all move 12% (a big deal!).
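
For concreteness, the index described above (normalize each benchmark against its first recorded runtime, then take the geometric mean at each point in the series) boils down to something like this; it's my own sketch, not code from secondstring:

```rust
// Sketch: benchmark index at one point in the series. `first_ns[i]` is the
// first recorded runtime of benchmark i, `current_ns[i]` its runtime now.
// Geometric mean of the ratios = exp(arithmetic mean of ln(ratios)).
fn benchmark_index(first_ns: &[f64], current_ns: &[f64]) -> f64 {
    assert_eq!(first_ns.len(), current_ns.len());
    let log_sum: f64 = first_ns
        .iter()
        .zip(current_ns)
        .map(|(first, cur)| (cur / first).ln())
        .sum();
    (log_sum / first_ns.len() as f64).exp()
}
```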

There are two benchmarks in the basic analysis I did which give me confidence that detecting real perf regressions is possible: permutohedron and regex. They both had visible spikes in runtime last fall, followed by returns to the previous baseline in late January. It doesn't look like the spikes were introduced by the same nightly (early September for permutohedron and late September for regex). It's also unlikely that both of those changes were caused by noise from the machine they were run on, since the benchmarks weren't run in parallel and it seems unlikely to me that something like a power saving mode would have kicked on and off at such similar points in their relative executions.

This is very long now, apologies.

TLDR

Accurately gauging what's a real performance regression is hard. Designing a benchmark suite that provides clean data for that analysis is also hard. Here are some challenges/questions that poking around some data raised for me:

  1. Simple comparisons between two benchmark numbers are unlikely to be reliable. Time series analysis across a greater number of metrics than just simple runtime is needed to give any real insight into the performance behavior of Rust code. Even so, it will be difficult to put one's finger on the exact commit which introduced a regression.
  2. Any model for defining regressions should probably be back-tested against historical compiler versions.
  3. Any benchmark suite that's used as a performance canary should also probably be back-tested against historical compiler versions.
  4. Any benchmark suite that's used as a performance canary should have tests which will allow an observer to differentiate between causes from changes to rustc, libstd, llvm, etc.
  5. Correlating data from different machines and environments would be very useful.

I'm going to continue working on improving the data collection for secondstring, and I'll also be researching some of the time series stuff to see how much is feasible to implement in a given tool.

@eddyb (Member) commented Feb 17, 2016

My 2 cents: don't measure wall-clock time on a multi-threading OS.
On Linux, getrusage's ru_utime field will measure time spent in userspace, which helps to remove I/O and scheduling noise. It's not used by libtest's #[bench] because it's not portable (AFAIK).
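
For reference, reading ru_utime from Rust looks roughly like this via the libc crate (a Unix-only sketch, error handling mostly omitted):

```rust
// Sketch (Unix only): user-mode CPU time of the current process via
// getrusage(RUSAGE_SELF), as an alternative to wall-clock timing.
use std::mem::MaybeUninit;

fn user_time_secs() -> f64 {
    let mut usage = MaybeUninit::<libc::rusage>::uninit();
    let rc = unsafe { libc::getrusage(libc::RUSAGE_SELF, usage.as_mut_ptr()) };
    if rc != 0 {
        return f64::NAN; // getrusage failed; a real tool would inspect errno
    }
    let usage = unsafe { usage.assume_init() };
    usage.ru_utime.tv_sec as f64 + usage.ru_utime.tv_usec as f64 * 1e-6
}

fn main() {
    let before = user_time_secs();
    // ... benchmarked work goes here ...
    let after = user_time_secs();
    println!("user CPU time: {:.6} s", after - before);
}
```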

The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency.
But I haven't attempted any of that yet and it's probably hard, even impossible (cc @edef1c).

Another thing that @cmr actually hit was the same binary, doing the same task, exhibiting several distinct memory profiles (e.g. you could get 2 different graphs that it would keep jumping between on each run).
That kept happening (in rustc) even after pinning /dev/(u)random.

I'm not sure if we were completely unaware of it back then, but ASLR can definitely cause such behavior, and not just in an allocator, but in Rust code itself: rustc hashes pointers internally, and if those aren't deterministic, the fillrate of hashmaps will be different between (otherwise identical) runs.

@anp (Member) commented Feb 17, 2016

@eddyb So you see the ideal solution as one that uses a bespoke benchmarking library for <insert OS here>, as opposed to #[bench]? I guess I find the use of the standard toolchain's features attractive, but you're right that it limits the metrics that can be collected, as well as how reliable the results will be.

I do think that limiting non-determinism can become a bit of a rabbit-hole. If you keep at it indefinitely, you'll end up needing to design a fresh ISA. I'd argue that eventually there are diminishing returns beyond "sufficiently deterministic," whatever that actually is.

@nikomatsakis (Contributor Author)

In private communication, @BurntSushi suggested the bench_sherlock benchmark for regex. He also wrote:

There are a few different entry points to benchmarks in the regex crate, but the one you really want I think is cargo bench --bench dynamic, which benchmarks whatever Regex::new(...) does.

@anp (Member) commented Feb 24, 2016

At the risk of spamming, I did try @eddyb's suggestion of using getrusage to measure time and wrote up a bit about how it went (TLDR: better than using cargo bench, but still not perfect):

https://dikaiosune.github.io/rust-runtime-cargobench-vs-getrusage.html

I also posted it to the subreddit to see if anyone not following this thread had some ideas about improving the method:

https://www.reddit.com/r/rust/comments/47dohh/measuring_rust_runtime_performance_cargo_bench_vs/

@nikomatsakis (Contributor Author)

Some preliminary work in this repo: https://github.com/nikomatsakis/rust-runtime-benchmarks

This is very simple. Basically just some directories where you can run cargo bench. I plan to add a runner that will go and execute cargo bench in each case and then save the results, and allow you to easily compare the results of various runs. Imperfect, but useful.

I will also open up some issues for adding more things. Right now it only has the "doom" and "raytracer" benchmarks. I'll do a bit of work on putting regex in there, along with something from LALRPOP, but hopefully I can encourage others to add in more things. =)

@Eh2406 (Contributor) commented Jun 23, 2016

Is the criterion library helpful? And more generally, if we are building a tool to check for performance regressions, can we make it easy for the rest of the ecosystem to use?
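
For context, a criterion benchmark looks roughly like this (the fibonacci workload is just a placeholder):

```rust
// Sketch of a criterion benchmark; the workload is a stand-in.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn bench_fib(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```

Its statistics-driven approach (outlier detection, comparison against the previously saved run) overlaps with a lot of what's being discussed here.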

@gnzlbg (Contributor) commented Feb 16, 2017

@eddyb did you manage to run rustc with polly enabled and used by llvm opt?

I've filed an issue for this: #39884

Ideally the shipped LLVM would always be compiled with polly, and the frontend would have an option that loads it as a plugin and enables it in opt for you. And maybe another option to pass polly options to avoid writing -mllvm -polly-option_name=value over and over again.

@eddyb (Member) commented Feb 16, 2017

@gnzlbg I never got it working, no.

@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 24, 2017
@anp (Member) commented Apr 28, 2018

In case someone is subscribed here and didn't see the thread on irlo, I've been working on lolbench and discussing work over at https://internals.rust-lang.org/t/help-needed-corpus-for-measuring-runtime-performance-of-generated-code/6794/29.

@workingjubilee (Member)

Hello! What are things looking like for this nowadays?

Our plan for core::simd is likely to include some bench-testing on some reasonably realistic workloads, and it would be nice to integrate with something like this. It seems like lolbench got some mileage but then encountered hosting limits? anp/lolbench#68

@anp (Member) commented Nov 3, 2020

Hey! Yeah, I chose a very "get it out the door" approach for storing the data from benchmark runs that I think will need to be redone as the first step in reviving the project.

I'd like to revive it soon for a bunch of reasons, but a few in particular:

  • I'm finding myself hyped by @eddyb's work on measureme: Use measureme for instruction counts anp/lolbench#76.
  • @tmandry's 2021 post was also a good reminder that I've let it languish a while
  • I ostensibly have sponsored hardware from packet.net for the project (although it's probably been a year since I logged in and confirmed)
  • the Rust wasm ecosystem has matured significantly, which makes a Rust-native data explorer easier to imagine

I'm still not sure what a reasonable definition of "soon" is for me, though. Hopefully this year? If someone is motivated and wants to help get it back up and running in the meantime, let me know! I'll happily find time to write mentoring instructions.

@pnkfelix (Member) commented Mar 4, 2022

discussed at today's T-compiler backlog bonanza. This is work that needs doing, but it should not be tracked here.

We're shifting the conversation over to rust-lang/rustc-perf#69 instead, and closing this issue.
