Create a benchmark suite for uncovering runtime regressions #31265

Open
nikomatsakis opened this Issue Jan 28, 2016 · 25 comments

@nikomatsakis
Contributor

nikomatsakis commented Jan 28, 2016

We need a benchmark suite targeting the runtime of generated code. I've started gathering together code samples into a repository. Here is a list of the projects I plan to extract:

In addition to curating the benchmarks themselves, we need a good way to run them. There are some tasks associated with that:

  • Put the projects in a repo with cargo setup such that cargo bench will run the relevant tests (this part I expect to get started on --nmatsakis)
  • Write a script that will execute cargo bench and extract the results into one data set
  • Provide some way to save that data set to disk and to compare against other data sets (e.g., runs of the master branch)

Eventually, I would want to integrate this into our regular benchmarking computer so that it can be put up on a website, but for now it'd be nice if you could at least run locally.
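The second task above could start from something like the following sketch, which parses libtest's `#[bench]` output lines (`test NAME ... bench: N ns/iter (+/- M)`). The function name and structure are illustrative, not part of any existing tool:

```rust
// Parse one line of `cargo bench` output into (benchmark name, ns/iter).
// Returns None for lines that are not benchmark results.
fn parse_bench_line(line: &str) -> Option<(String, u64)> {
    let line = line.trim();
    if !line.starts_with("test ") || !line.contains("... bench:") {
        return None;
    }
    // The benchmark name is the first token after "test ".
    let name = line[5..].split_whitespace().next()?.to_string();
    // The timing is the digit run between "bench:" and "ns/iter";
    // strip the thousands separators libtest inserts.
    let after = line.split("bench:").nth(1)?;
    let ns: String = after
        .split("ns/iter")
        .next()?
        .chars()
        .filter(|c| c.is_ascii_digit())
        .collect();
    Some((name, ns.parse().ok()?))
}
```

Feeding every stdout line of a `cargo bench` run through this would yield the one data set the task asks for, ready to be serialized to disk for later comparison.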

@nikomatsakis

Contributor

nikomatsakis commented Jan 28, 2016

cc @nrc who was asking about this

@nikomatsakis

Contributor

nikomatsakis commented Jan 28, 2016

Nominating to discuss in today's compiler meeting and for triage. (I guess P-medium?)

@nikomatsakis

Contributor

nikomatsakis commented Jan 28, 2016

triage: P-medium

@rust-highfive rust-highfive added P-medium and removed I-nominated labels Jan 28, 2016

@durka

Contributor

durka commented Jan 28, 2016

The title says "compile time regressions" but the body says "runtime of generated code". Which, or both, do you plan to measure? (I hope both)

@nrc nrc changed the title from Create a benchmark suite for uncovering compile time regressions to Create a benchmark suite for uncovering runtime regressions Jan 28, 2016

@nrc

Member

nrc commented Jan 28, 2016

The latter. The former is already measured (http://www.ncameron.org/perf-rustc/).

@palango

Contributor

palango commented Jan 29, 2016

I'd like to help here.
What do you think about pulling the hard and medium regexes from the link and running them on some given data? This data could be loaded from a file or generated by a constant-seed RNG. The former would probably be somewhat I/O-dependent, while the latter would test the RNG as well.
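A constant-seed generator along the lines suggested here could look like the following sketch. The xorshift64 algorithm and the particular seed are illustrative assumptions, chosen only because they are trivially deterministic and dependency-free:

```rust
// Hypothetical deterministic input generator for reproducible benchmarks:
// a fixed-seed xorshift64, so every run sees byte-identical input data.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Produce `len` pseudo-random bytes from a hard-coded seed.
fn bench_input(len: usize) -> Vec<u8> {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15); // fixed, arbitrary seed
    (0..len).map(|_| (rng.next() & 0xFF) as u8).collect()
}
```

Because the seed never changes, two runs of the same benchmark (or runs on two different nightlies) exercise exactly the same data, which removes one source of run-to-run noise.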

@MagaTailor

MagaTailor commented Jan 29, 2016

Great idea, especially something to stress jemalloc ;)

@dirk

Contributor

dirk commented Jan 30, 2016

@nrc: What format are you using for storing the data for that perf-rustc site? (That site is pretty great by the way!) My thought is that that could be a starting-point for achieving the following:

Write a script that will execute cargo bench and extract the results into one data set

@nrc

Member

nrc commented Feb 1, 2016

@dirk JSON is the short answer. The data is stored here: https://github.com/nrc/rustc-timing/tree/master/processed

@hexsel

hexsel commented Feb 1, 2016

I know this may be too complex for an ongoing, reliable test, but it would be awesome to have something like a small-but-not-trivial web project (Iron? Nickel?). This would exercise lots of string-related operations, a little bit of IO (I assume the request and response should be very small in size), context switching, and linking.

Unfortunately, this would require the benchmarks to be moved to a separate project.

@nikomatsakis

Contributor

nikomatsakis commented Feb 2, 2016

As an aside, some kind of "benchmark monitoring" solution is a common need (e.g., I know I want the same thing in Rayon). It'd be great if we could write a tool that is easily extensible beyond rustc -- it seems eminently doable.

@anp

Contributor

anp commented Feb 11, 2016

@nikomatsakis I'd be interested in helping out with this effort overall, but I also wanted to ask about a somewhat related project I started after reading your comment above.

I did a little hacking on this in the last couple of days:

https://github.com/dikaiosune/secondstring

It's really crappy still and will need more work to properly do the monitoring you're talking about, but I'm curious if you think my approach so far would be a good fit for the "benchmark monitoring" in your comment above. Soon I'll be implementing saving the results to disk and diffing all benchmark runs against the previously saved results so that benchmarks can be run incrementally. Do you think it seems like it would be useful for this effort? What would be needed for a proper benchmark monitoring tool as it relates to this issue?

EDIT: The incremental benchmarking seems to be mostly working now. Mostly.

@eddyb

Member

eddyb commented Feb 15, 2016

It might be interesting to see if polly has any effects on our testcases.

@BurntSushi

Member

BurntSushi commented Feb 16, 2016

For regex, you probably want benches/bench.rs and benches/bench_sherlock.rs. They can be run with cargo bench --bench dynamic. (The same benchmarks run on other configurations of the regex, but the dynamic bench is the primary one that benchmarks Regex::new.)

@anp

Contributor

anp commented Feb 17, 2016

@nrc asked me to share some of my thoughts after doing the above analysis. I still think the depth of the linked analysis is too shallow for it to be truly useful, but it did highlight a few questions for me. I don't think all of these need to be answered for a good benchmark suite, but some of them are tough nuts to crack and I'd argue that all of them are important for having a really good automated regression detection protocol.

The core problem I see after messing around with these data: it's really tough to identify the difference between a regression and a noisy measurement (or a series of noisy measurements). I imagine that a good improvement to what I did would be a combination of repeated measurements across different environments and more detailed data collection than just runtimes in nanoseconds. @cmr (/u/cmrx64 on reddit, I think that's the same person) suggested a more in-depth list of metrics to track:

code size, number of basic blocks and their sizes, branch mispredicts, cache misses, or instructions executed, and maybe where the most time is spent

I come from Java and Python, so I'm not sure how to collect those metrics. I also don't know if that's possible without debug information, which as I understand it does not currently get included when LLVM optimizations are on (relevant since one probably doesn't care about performance regressions in debug code). If it is possible I would be very interested in learning how to do so for Rust executables so I can incorporate those measurements into secondstring. I'm also thinking that I should have run cargo bench multiple times in a row for each benchmark, as well as used the +/- margin reported to create confidence intervals in the charts.

Any method for automatically detecting regressions needs to be sensitive to the historical noise in previous benchmarks. If you look at the benchmarks linked above for byteorder vs. csv, 5% variations would be a big deal for the former, but not for the latter. Along the same lines, a good regression detection tool would account for long-term increases that happen very incrementally. If, for example, the clap benchmarks from November to mid-January represent an actual regression, each uptick in runtime would have been very difficult to tell apart from noise at the time when those nightlies were being released.

Another challenge of treating the data "in context": what is the baseline you measure against? Does that baseline ever get updated? For example, looking at the chart for the suffix crate (assuming those runtimes are representative), should the beginning of January be the new baseline? Or should one only care about regressions above the point where measurements started? Further, if one decides to move the baseline based on newer results, how long does a streak of improved performance need to be for it to not be a fluke?

Wherever a likely regression is spotted, it'd be very useful to pinpoint the exact nightly/merge/version that introduced it. But looking (for example) at the chart for uuid, it's tough to say which of the points in late January should have been marked as a regression.

The above math/statistics problems are AIUI areas of active research in time series analysis, but there are existing techniques for tackling them. I'm planning to do some research on them, but I still think that any benchmark suite should focus on producing noise-free data to ease that analysis process.

There's another issue to tease out in terms of benchmarking the quality of code generation vs. the standard library. Most of these benchmarks would be seeing improvements and regressions not only from rustc's behavior, but also from changes made to libstd. I think that a good benchmark suite would include a bunch of portions with #[no_std].

I had thought about trying to automate benchmarking against nightlies on a VPS machine, but after seeing how much noise there is in bare-metal microbenchmarks, I'm hesitant to try that; it's now clear that the microbenchmarks typically run by cargo bench can be subject to a lot of noise.

I don't think any of the benchmarks I ran were I/O dependent, and it'd be nice to track Rust's I/O performance. But I'm not sure how this would be best achieved since there are so many variables that could affect the results of a test like that.

Since one of Rust's core advantages is parallelism, this is another important area to benchmark. But looking at the rayon benches, they seem either a) subject to greater real performance fluctuations, or b) subject to greater noise generated through (I'm guessing) kernel scheduling. Perhaps this would be a good place to compare the serial vs. concurrent implementations that the rayon benches include, and use that as a metric for comparison? I'm still not sure that would do much for controlling for scheduler interference, though.

I'm fairly happy with calculating a benchmark index by normalizing all of the runtimes against the first corresponding runtime in the series, and then taking the geometric mean of all runtimes at a given point in the series. However I don't know exactly how this behaves under different variations, so I'm not sure what good thresholds would be for warning about regressions, or even drawing a line on the graph to say "we expect points above here to be 10% slower." I think that's partly dependent on how many points are included in each mean. So if a benchmark has 20 functions and they all move 3% (probably just noise for larger tests), that might show the same change as a benchmark with 5 functions that all move 12% (a big deal!).
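The index described in this paragraph can be made concrete with a short sketch (the function name is made up; the math is just the normalize-then-geometric-mean scheme described above, computed in log space for numerical stability):

```rust
// Benchmark index for one point in the series: each runtime is normalized
// against its first recorded (baseline) value, and the geometric mean of
// those ratios is returned. 1.0 means "same as baseline", 1.1 means ~10%
// slower on (geometric) average.
fn benchmark_index(baseline: &[f64], current: &[f64]) -> f64 {
    assert_eq!(baseline.len(), current.len(), "series must align");
    let log_sum: f64 = baseline
        .iter()
        .zip(current)
        .map(|(b, c)| (c / b).ln())
        .sum();
    (log_sum / current.len() as f64).exp()
}
```

The geometric mean is the usual choice here because it treats a 2x slowdown and a 2x speedup as offsetting, regardless of each benchmark's absolute runtime; an arithmetic mean of raw nanoseconds would let the longest benchmarks dominate the index.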

There are two benchmarks in the basic analysis I did which give me confidence that detecting real perf regressions is possible: permutohedron and regex. They both had visible spikes in runtime last fall, followed by returns to the previous baseline in late January. It doesn't look like they were caused in the same nightly (early September for permutohedron and late September for regex). It's also unlikely that both of those changes were caused by noise from the machine they were run on, since the benchmarks weren't run in parallel and it seems unlikely to me that something like a power saving mode would have kicked on and off at such similar points in their relative executions.

This is very long now, apologies.

TLDR

Accurately gauging what's a real performance regression is hard. Designing a benchmark suite that provides clean data for that analysis is also hard. Here are some challenges/questions that poking around some data raised for me:

  1. Simple comparisons between two benchmark numbers are unlikely to be reliable. Time series analysis across a greater number of metrics than just simple runtime is needed to give any real insight into the performance behavior of Rust code. Even so, it will be difficult to put one's finger on the exact commit which introduced a regression.
  2. Any model for defining regressions should probably be back-tested against historical compiler versions.
  3. Any benchmark suite that's used as a performance canary should also probably be back-tested against historical compiler versions.
  4. Any benchmark suite that's used as a performance canary should have tests which will allow an observer to differentiate between causes from changes to rustc, libstd, llvm, etc.
  5. Correlating data from different machines and environments would be very useful.

I'm going to continue working on improving the data collection for secondstring, and I'll also be researching some of the time series stuff to see how much is feasible to implement in a given tool.

@eddyb

Member

eddyb commented Feb 17, 2016

My 2 cents: don't measure wall-clock time on a multithreaded OS.
On Linux, getrusage's ru_utime field will measure time spent in userspace, which helps to remove I/O and scheduling noise. It's not used by libtest's #[bench] because it's not portable (AFAIK).

The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency.
But I haven't attempted any of that yet and it's probably hard, even impossible (cc @edef1c).

Another thing @cmr actually hit was the same binary, doing the same task, exhibiting several distinct memory profiles (e.g. two different graphs that it would keep jumping between on each run).
That kept happening (in rustc) even after pinning /dev/(u)random.

I'm not sure if we were completely unaware of it back then, but ASLR can definitely cause such behavior, and not just in an allocator, but in Rust code itself: rustc hashes pointers internally, and if those aren't deterministic, the fillrate of hashmaps will be different between (otherwise identical) runs.
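The `getrusage` approach mentioned here can be sketched directly against the C ABI. This is a Linux-only sketch and the struct layout below assumes x86_64 glibc (`ru_utime`/`ru_stime` followed by 14 `long` counters, per `<sys/resource.h>`); a real tool would use the `libc` crate instead of hand-declaring this:

```rust
use std::mem::MaybeUninit;

#[repr(C)]
struct Timeval {
    tv_sec: i64,
    tv_usec: i64,
}

#[repr(C)]
struct Rusage {
    ru_utime: Timeval, // user-mode CPU time
    ru_stime: Timeval, // kernel-mode CPU time
    _rest: [i64; 14],  // ru_maxrss .. ru_nivcsw, unused here
}

extern "C" {
    fn getrusage(who: i32, usage: *mut Rusage) -> i32;
}

const RUSAGE_SELF: i32 = 0;

/// User CPU time consumed by this process so far, in microseconds.
/// Unlike wall-clock time, this is unaffected by the process being
/// descheduled or blocked on I/O.
fn user_time_us() -> i64 {
    let mut ru = MaybeUninit::<Rusage>::uninit();
    let rc = unsafe { getrusage(RUSAGE_SELF, ru.as_mut_ptr()) };
    assert_eq!(rc, 0, "getrusage failed");
    let ru = unsafe { ru.assume_init() };
    ru.ru_utime.tv_sec * 1_000_000 + ru.ru_utime.tv_usec
}
```

A benchmark harness would sample `user_time_us()` before and after the code under test; the difference excludes scheduler wait and I/O time, which is exactly the noise this comment is about removing.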

@anp

Contributor

anp commented Feb 17, 2016

@eddyb So you see the ideal solution as one that uses a bespoke benchmarking library for [insert OS here], as opposed to #[bench]? I guess I find the use of the standard toolchain's features attractive, but you're right that it limits the metrics that can be collected, as well as how reliable the results will be.

I do think that limiting non-determinism can become a bit of a rabbit-hole. If you keep at it indefinitely, you'll end up needing to design a fresh ISA. I'd argue that eventually there are diminishing returns beyond "sufficiently deterministic," whatever that actually is.

@nikomatsakis

Contributor

nikomatsakis commented Feb 23, 2016

In private communication, @BurntSushi suggested the bench_sherlock benchmark for regex. He also wrote:

There are a few different entry points to benchmarks in the regex crate, but the one you really want I think is cargo bench --bench dynamic, which benchmarks whatever Regex::new(...) does.

@anp

Contributor

anp commented Feb 24, 2016

At the risk of spamming, I did try @eddyb's suggestion of using getrusage to measure time and wrote up a bit about how it went (TLDR: better than using cargo bench, but still not perfect):

https://dikaiosune.github.io/rust-runtime-cargobench-vs-getrusage.html

I also posted it to the subreddit to see if anyone not following this thread had some ideas about improving the method:

https://www.reddit.com/r/rust/comments/47dohh/measuring_rust_runtime_performance_cargo_bench_vs/

@anp anp referenced this issue in anp/rfcbot-rs Apr 13, 2016

Closed

Runtime benchmarks #3

@nikomatsakis

Contributor

nikomatsakis commented Jun 6, 2016

Some preliminary work in this repo: https://github.com/nikomatsakis/rust-runtime-benchmarks

This is very simple. Basically just some directories where you can run cargo bench. I plan to add a runner that will go and execute cargo bench in each case and then save the results, and allow you to easily compare the results of various runs. Imperfect, but useful.

I will also open up some issues for adding more things. Right now it only has the "doom" and "raytracer" benchmarks. I'll do a bit of work on putting regex in there, along with something from LALRPOP, but hopefully can encourage others to add in more things. =)
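The planned runner could be sketched as below. The directory layout, `run_label` idea, and file-naming scheme are assumptions for illustration, not the repo's actual design:

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

// Hypothetical naming scheme for saved runs: one output file per benchmark
// crate per run, keyed by a caller-supplied label (e.g. the nightly date),
// so two runs can later be diffed file-by-file.
fn results_path(out_dir: &Path, run_label: &str, krate: &str) -> PathBuf {
    out_dir.join(format!("{}--{}.bench.txt", run_label, krate))
}

// Execute `cargo bench` in one benchmark directory and capture its stdout,
// which would then be written to `results_path(...)`.
fn run_bench(bench_dir: &Path) -> std::io::Result<String> {
    let out = Command::new("cargo")
        .arg("bench")
        .current_dir(bench_dir)
        .output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

Comparing two runs then reduces to parsing the saved files and diffing per-benchmark numbers, which keeps the runner itself trivial.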

@Eh2406

Contributor

Eh2406 commented Jun 23, 2016

Is the criterion library helpful? And more generally, if we are building a tool to check for performance regressions, can we make it easy for the rest of the ecosystem to use?

@gnzlbg

Contributor

gnzlbg commented Feb 16, 2017

@eddyb did you manage to run rustc with polly enabled and used by llvm opt?

I've filed an issue for this: #39884

Ideally the shipped LLVM would always be compiled with polly, and the frontend would have an option that loads it as a plugin and enables it in opt for you. And maybe another option to pass polly options to avoid writing -mllvm -polly-option_name=value over and over again.

@eddyb

Member

eddyb commented Feb 16, 2017

@gnzlbg I never got it working, no.

@anp

Contributor

anp commented Apr 28, 2018

In case someone is subscribed here and didn't see the thread on irlo, I've been working on lolbench and discussing work over at https://internals.rust-lang.org/t/help-needed-corpus-for-measuring-runtime-performance-of-generated-code/6794/29.
