
Eliminate and outlaw microbenchmarks #67

Open
jlebar opened this Issue September 25, 2012 · 14 comments

6 participants

Justin Lebar, jkomoros, nnethercote, Paul Lewis, Blake Winton, John-David Dalton
Justin Lebar

Robohornet currently has a bunch of microbenchmarks.

It is well-known that microbenchmarks are a poor way to measure performance.

If you guys want us (in my case, Mozilla) to take robohornet seriously, I strongly recommend you write some macrobenchmarks and eliminate the microbenchmarks from your test suite.

Indeed, if the idea of this suite is that it "measures things web developers actually care about", then surely a benchmark which actually runs handlebars.js would be more sensible than a benchmark which calculates primes and posits that's related to the speed of handlebars.js.

I really like the idea of a community-driven benchmark. I hope that aspect of this project, rather than the microbenchmarks, becomes the hallmark of robohornet.

Fellow Mozilla hacker Nicholas Nethercote wrote a bit more about this on HN. Apparently one of the robohornet benchmarks does nothing at all? Ouch. http://news.ycombinator.com/item?id=4567796

jkomoros
Owner

Thanks for your comments! The reason we open sourced RoboHornet in this extremely early state is because we wanted to get feedback from folks on the direction and implementation. There have been a number of great discussions about some of the tests in the early preview (issue #54 is a great example, and directly relevant to the one you mentioned). This is exactly the kind of feedback we wanted to get so we can make RoboHornet as good as possible.

I just added an answer in the FAQ about "micro-benchmarks": https://github.com/robohornet/robohornet/wiki/FAQ . In summary, although these are small, succinct tests, our hope is that they don't have the failings normally associated with "microbenchmarks" because they are motivated by specific pain points and the tests can evolve with browsers to ensure that they continue testing what we want them to test. Of course, there are problems with the specific tests today, which we want to improve as quickly as possible now that we have good feedback on them.

One thing we struggled with was how to design a benchmark suite that is based on specific developer pain points but uses "macro" test cases, which can be extremely general. Do you have thoughts on how we could go about harnessing the community's pain points via macro test cases?

Justin Lebar

our hope is that they don't have the failings normally associated with "microbenchmarks" because they are
motivated by specific pain points and the tests can evolve with browsers to ensure that they continue testing what
we want them to test.

The thing is, how do you know that X is a pain point in my particular browser unless you have a macrobenchmark? If you have some larger application which is slow because of X, then great -- let's benchmark that. If you don't have a larger application, then on what basis can you claim that X is a problem?

Moreover, even if we know that X is a pain point in your larger application, there's no guarantee that a microbenchmark of X accurately reflects the speed of X when used in your larger application.

If microbenchmarking were feasible in the way you describe, I wouldn't think that Google's JS team would go through all the effort to introduce new macrobenchmarks to the V8 suite. They'd just test the existing "pain points" and be done with it.

I really encourage you to read some of the literature on perf testing (as Nicholas Nethercote pointed out, Hennessy and Patterson is a good starting point), or to study some benchmarks which are considered good and bad. It's really tempting to think that you're clever enough to design good microbenchmarks where others have failed. But if anything, the current state of RoboHornet shows just how hard this is.

Do you have any thoughts on how we could go about harnessing the community's pain points via macro test cases?

Sure. "handlebars.js is slow. On my site, the following handlebars.js script takes 500ms to run, but the handlebars.js guys say there's nothing they could do in their code to make it faster." Maybe that's the start of benchmark. Who cares if the critical path in handlebars is adding elements to a table by row, or generating primes, or whatever.

Now, your response might be "I profiled handlebars.js and saw that the perf problem was coming from the fact that it's generating primes and that's slow, so I'm going to write a benchmark which generates primes." But that is totally the wrong way to do it, for a few reasons:

  • You're not going to re-profile your entire test suite every time someone releases a new browser, to see if generating primes is still the slow part of handlebars.js.

  • Even supposing you were totally on top of changes in browser perf and modified the benchmark every time a given microbenchmark stopped being representative of the bigger macro picture, if your tests have to be changed frequently, your benchmark isn't particularly useful. In particular, a benchmark which changes frequently can always crown a "fastest browser", but that's not particularly useful to anyone. What browser vendors need to optimize their code and prevent regressions is to compare benchmark results across time. But if I have to go back and re-generate a year's worth of perf data every three months when the benchmark changes, that makes the benchmark much less useful to me.

  • Even supposing you changed the tests with just the right frequency -- not too often to be annoying, but not too infrequently such that the microbenchmarks stop being relevant -- there's no guarantee that a pain point in one browser is a pain point in another browser. So (say) IE and Chrome will compete for dominance in microbenchmark X when X is actually a perf problem only in Firefox, and IE and Chrome actually have a completely different critical path in your macrobenchmark. That's a waste of everyone's time.

  • As a browser developer, I'm going to be much less motivated to care about a perf issue if I have to take your word that it's a macro perf problem and not just a perf problem in a microbenchmark you wrote. If on the other hand you can prove to me that the benchmark demonstrates a real-world perf issue (by actually benchmarking real-world code), I'll be much more inclined to investigate.

nnethercote

Unsurprisingly, I agree 100% with Justin. Microbenchmarks are not a suitable foundation for high quality benchmarking. The hardware and C/C++/Fortran people learnt this in the 1980s and have been getting it right with the SPEC benchmarks (http://www.spec.org/) since the 1990s. That experience has yet to migrate over to the web world, unfortunately.

If I had to boil down the situation to a single principle it would be the following. If you care about the performance of X, then measure the performance of X. Do not measure the performance of Y, where you think that Y is representative of or correlates with X, because there's a very good chance that it doesn't. It sounds trivial when written down like that, but people get this wrong more often than not.

Paul Lewis
Collaborator

@nnethercote I agree that Y does not necessarily correlate to X, and that in simple terms to measure Y if you mean to measure X is incorrect.

Where I'm coming unstuck is this: if, say, I'm creating an HTML5 game and I can get my draw code happening inside 16ms, but adding code to clear the canvas slows things down significantly, then that's a specific pain point that needs fixing. (Forgive me for bringing this to a single API, I just find it easier to talk about an actual experience rather than generalised approaches.) If I understand your point correctly, and apply it in this example, you would want to know if drawing to the canvas (in general terms) is slower or faster in a specific browser.

I think this would give an overly generalised result, since the conclusion I'd draw is that a browser can offset a slow op by simply making everything else faster. The problem there is that clearing is a non-negotiable op for a developer in many cases, e.g. when they're compositing the result over other elements, whereas drawing can be tailored by the developer.

I'm disinclined to entirely discard all microbenchmarks because in certain situations a macrobenchmark can cover over an unduly slow API call. However, to your point, microbenchmarks do not show all that much by themselves, so I'd personally like to see a combination of both, i.e. a macrobenchmark with a detailed breakdown.

Blake Winton

I wonder if it would be possible to only have the macrobenchmarks, but show which functions were taking the most time within those (on a per-browser basis, naturally), so that all the browser makers would concentrate on making the whole thing fast but would have some idea of what they needed to target, and so that developers could see what the slow(-est) parts of the macrobenchmarks were, and attempt to avoid them if possible…

For your example, Paul, how would you feel about having two macrobenchmarks, one with the call to clear the canvas, and one without?

Justin Lebar

The conclusion I'd draw would be a browser can offset a slow op by simply making everything else faster.

It sounds like you're uncomfortable with this, but it is exactly true!

The goal when designing a browser (or a CPU, for that matter) is not to make all of the individual operations as fast as possible. Instead, the goal is to design a browser or CPU which executes programs quickly.

(As an example: A modern CPU will not execute quickly a program which is an unbroken series of indirect branches. But I'm not going to write a benchmark which times nothing but indirect branch performance -- even though indirect branch perf is in fact something I care about, because it affects the speed of virtual function calls -- because useful programs don't just make a million virtual function calls and then quit. If AMD speeds up indirect branch calls and Intel speeds up something else, well, that's nice, but I can't declare a winner until I run a macrobenchmark which is representative of the workload I care about.)

Designing a browser/CPU to execute programs quickly usually means optimizing for common operations, while allowing uncommon ops to be slow. We don't waste our time optimizing operations which are uncommon in the real world, because Amdahl's law shows that's not how we make code run fast.

Under what concrete, real-world situation do you care how long specifically clearing a canvas takes? If there's a situation where the amount of time that takes matters much more than the amount of time that drawing to the canvas takes, then let's benchmark that. But if as soon as you draw something non-trivial to the canvas the cost of drawing greatly exceeds the cost of clearing, then browser vendors have done the right thing by not optimizing our canvas-clearing code.
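
For what it's worth, if I wanted to look at clearing at all, I'd time it inside a full frame rather than in isolation -- a rough sketch (the scene is invented, and note that with a GPU-backed canvas these CPU-side timings can themselves be misleading, which is rather the point):

    // Measure clearing in context: how big is it relative to a non-trivial draw?
    var canvas = document.getElementById('game');
    var ctx = canvas.getContext('2d');

    function frame() {
      var t0 = performance.now();
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      var t1 = performance.now();

      // Stand-in for real draw code: a few thousand small rects.
      for (var i = 0; i < 5000; i++) {
        ctx.fillStyle = 'rgb(' + (i % 256) + ',0,0)';
        ctx.fillRect((i * 7) % canvas.width, (i * 13) % canvas.height, 4, 4);
      }
      var t2 = performance.now();

      console.log('clear: ' + (t1 - t0).toFixed(2) + ' ms, draw: ' + (t2 - t1).toFixed(2) + ' ms');
      requestAnimationFrame(frame);
    }
    requestAnimationFrame(frame);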

I wonder if it would be possible to only have the macrobenchmarks, but show which functions were taking the most
time within those (on a per-browser basis, naturally), so that all the browser makers would concentrate on making
the whole thing fast but would have some idea of what they needed to target

The way browser developers figure out what parts are slow is by profiling the browser. From a cursory search, there appear to be a number of JS profilers, which would give exactly the information that a web developer would want in this case.

Even better than profiling a benchmark, you can profile your own code and find out exactly why it's slow. But I don't disagree that providing some general guidance (X function is slow in browser Z, use Y function instead) to web developers could be useful.

nnethercote

jlebar is exactly correct. Going back to Paul's example: "I'm creating an HTML5 game and I can get my draw code happening inside 16ms, but adding code to clear the canvas slows things down significantly, then that's a specific pain point that needs fixing."

Let's assume that clearing the canvas slows it down from 16ms to 24ms. Let's assume that this slowdown is consistent across all browsers. Let's also assume you can create a microbenchmark that accurately measures canvas clearing (already we've seen this is far harder than it first appears). Let's assume you put that in RoboHornet, and that all the browser vendors respond by improving canvas clearing speed so that canvas clearing costs zero time in your game.

That's a lot of assumptions, but hey, this is a thought experiment. We've made progress! Your game is now faster. But if the entire game had been used as the benchmark, the browser vendors could investigate how to speed up other parts of the draw code. Perhaps that 16ms could have been reduced to 8ms. By pre-emptively deciding which parts are important to optimize, you've precluded the possibility of optimizing other parts.

Coming back to my previous advice: your game is X. The canvas clearing operation is Y. Measure X, don't measure Y.

I recommend reading section 1.8 (pages 36-44) of H&P's "Computer Architecture" -- you can do so at http://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X with the "Click to look inside" feature. What we've been calling "microbenchmarks" they call "kernels", which they describe as "small, key pieces of real applications". (The "calculate primes" test, in contrast, is an example of what they call a "toy program".) H&P describe both kernels and toy programs as "discredited" for benchmarking purposes.

The five new benchmarks added to Google's Octane benchmark suite are an example worth following -- they're all real, large, cutting edge apps. (Yes, they're JS-only, but they're an example worth emulating.) I wrote about this at https://blog.mozilla.org/nnethercote/2012/08/24/octane-minus-v8/.

Another problem with microbenchmarks that we haven't covered so far is that they are easy to game. For example, did you know that all of the top 5 browsers' JS engines have a cache that stores the offset for daylight savings time? This is completely useless in the real world, but it helps enormously with one (or is it two?) of the tests in SunSpider, which do bajillions of date manipulations. I think Firefox was the last browser to implement this; we resisted for a long time but ultimately we had to do it to be competitive on SunSpider, which the tech press obsesses over. Bad browser benchmarks hurt the web.

Another case: V8's "regexp" benchmark does lots of RegExp.exec() calls. It uses regexps culled from real-world sites, which is good. Unfortunately, it never looks at the result of any of these calls. Therefore, if you detect this you can avoid constructing the return value (which is an array, and thus expensive to build) and get a much better score. This is of little use in real-world code, however.
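
To make that failure mode concrete, here's a toy sketch (mine, not the actual V8 benchmark code) of the difference between a loop whose results are thrown away and one that consumes them:

    var re = /(\w+)@(\w+)\.com/;
    var input = 'user@example.com';

    // Gameable: the match array is never read, so an engine that notices this
    // can skip building it entirely.
    for (var i = 0; i < 1000000; i++) {
      re.exec(input);
    }

    // Harder to game: consume the result so the engine must actually produce it.
    var checksum = 0;
    for (var j = 0; j < 1000000; j++) {
      var m = re.exec(input);
      checksum += m[1].length + m[2].length;
    }
    console.log(checksum);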

Justin Lebar

For example, did you know that all of the top 5 browsers' JS engines have a cache that stores the offset for daylight savings time?

Actually, this is the source of a bug in Firefox: Once we populate this cache we never check if your system timezone has changed, so if the timezone changes while Firefox is running, I believe all your Date objects are incorrect (until you restart the browser).

We're fixing this for B2G and I hope we'll fix it on other operating systems after that, but it's sad that we had to implement an unsound optimization to compete on a useless microbenchmark.

https://bugzilla.mozilla.org/show_bug.cgi?id=790499

jkomoros
Owner

Sorry for the delayed response; I've been traveling. This kind of discussion is exactly the kind of thing we were hoping to encourage by open sourcing RoboHornet in an extremely early state, and ultimately will help us build the best benchmark we can that's community driven and centered around real-world pain points. I want to again thank you guys for having such an active discussion here and helping us improve.

You guys make a compelling argument for why micro-benchmarks are hard (and perhaps impossible) to do perfectly. Based on your argument, though, I think there might still be room for RoboHornet's current basic approach. It seems to me (and I might be entirely wrong) that your points are primarily addressed at areas where performance is already pretty good (like JavaScript), whereas RoboHornet aims to improve the parts of browser performance that web developers actually care most about nowadays, which in many cases is everything else: DOM APIs, layout, canvas drawing, etc. There's some low-hanging fruit on the platform that we'd love to see made fast. For the purposes of this discussion, I'd like to ignore the current state of the tests in the suite—they're rough, we know—and instead focus on the future goals.

We started RoboHornet because we noticed that many APIs in the web platform that we relied on were too slow. The existing benchmarks didn't do a good job of exercising them, and browser vendors historically haven't done a great job of listening to performance feedback from web developers unless there's a benchmark attached. So we decided the way to fix it was to come up with our own benchmark—a way for web developers to have an organized voice that browser vendors would actually listen to. The fact that we're having such an active discussion on this thread is a very positive sign. :-)

Let's take these in pieces...

Now, your response might be "I profiled handlebars.js and saw that the perf problem was coming from the fact that it's generating primes and that's slow, so I'm going to write a benchmark which generates primes." But that is totally the wrong way to do it, for a few reasons:

Your point makes sense for primes (that test will be removed/replaced very soon), but does that make sense for, say, the fact that localStorage (to take a random example) is slow? Multiple libraries and apps want to use localStorage—often in very different ways—and in all of them, localStorage is too slow today. Surely including an example of every single app that goes slow because of localStorage isn't feasible. Could it be reasonable to try to come up with a composite test that captures the commonalities of the cases where it's slow? At the beginning, when the APIs haven't even been trivially optimized, even simple tests will capture the pain well. Over time, as browsers get better and better, the tests may need to become far more complex (likely to the point where folks would call them "macro benchmarks"). At some point the performance will be good enough that web developers are happy because their real-world performance problems are fixed on that issue; at that point it's fine to stop iterating on the test. Again, we're mainly trying to focus the attention of browser vendors on low-hanging fruit they might otherwise ignore.

But if I have to go back and re-generate a year's worth of perf data every three months when the benchmark changes, that makes the benchmark much less useful to me.

Making the benchmark easy to run for browser engineers is definitely more important than most people think—it's one of the reasons we made it so you can easily run a single test from the harness (although there may be some problems with that implementation at the moment). But remember that web developers are dealing with the pain of slow parts of the platform every day as we write apps. A static benchmark wouldn't work because there are lots of things in the web platform that we want to be fast, and new ones coming every day. There should be a happy middle ground here, which we tried to hit with a rough four-times-a-year release schedule. Do you feel that hits the sweet spot, or should it be less often?

there's no guarantee that a pain point in one browser is a pain point in another browser

I don't think that matters; in fact we wrote that it doesn't matter in our guidelines. The point is that web developers have to write apps that run on multiple browsers, so if one browser is slow at something, it doesn't matter if all of the other ones are fast. The normalized scoring system was designed to help minimize the effect of benchmarks where most browsers are roughly the same in speed.

As a browser developer, I'm going to be much less motivated to care about a perf issue if I have to take your word that it's a macro perf problem and not just a perf problem in a microbenchmark you wrote. If on the other hand you can prove to me that the benchmark demonstrates a real-world perf issue (by actually benchmarking real-world code), I'll be much more inclined to investigate.

Totally fair point. We tried to mitigate this in a few ways:

  • The benchmark guidelines are very clear that performance pain points are only valid if they represent real, documented problems. Right now we have good justification for a number of the benchmarks but they're in email threads and documents. 1) The conversations around them will be shifting to a public GitHub forum, but more importantly 2) we will be adding detailed reasoning about each benchmark to the source of each file.
  • We tag issues in the suite with what apps or frameworks experience them to make it more clear that these are real world issues
  • We have the Technical Advisors program for browser engineers to give feedback on tests they don't think capture what they're supposed to. Of course, responding in issues like this is totally valid—we just wanted to make it easier for you. This way you don't have to be vigilantly watching the activity on the GitHub project; you can wait until we proactively give you a heads up about new tests early in the release planning cycle.

Another problem with microbenchmarks that we haven't covered so far is that they are easy to game. For example, did you know that all of the top 5 browsers' JS engines have a cache that stores the offset for daylight savings time?

We're well aware of that. That's another reason RoboHornet is designed to evolve, to combat those cheats. Some of those cheats will be obvious, but others might require the help of browser engineers like yourself to point out so we can fix them. We're trying to lessen our own pain as web developers, so we're incentivized to root out "cheaters" as much as competing browser engineers are.

I'm looking forward to your responses; I'm sure I'll learn from them. There are many folks in the web ecosystem, including both browser engineers who build the platform (thank you, by the way) and web developers who spend our days building apps on top of it. I think we can all agree that we'd like the web platform to be as fast and powerful as possible.

nnethercote

FWIW, there are some things I liked about RoboHornet: (a) it's not owned by a browser vendor, (b) it's not just JS (like Dromaeo and Peacekeeper), and (c) the voting/dynamic nature is intriguing -- I'll be interested to see how that works in practice.

Now, back to the death-to-microbenchmarks discussion. I appreciate the chicken/egg nature of some of these features. You mentioned local storage. ES5 getters/setters are a similar case: they're slow, so webdevs avoid them, so browser makers don't have much incentive to fix them. (Firefox's bug report for speeding them up is almost 2 years old.)

What I'd suggest is to take some significant web site/app that could use the new feature, and modify it to use that new feature in the way you would if that feature was fast. That's harder than writing a microbenchmark, but the result will be much better.

For every pain point you're interested in, ask "how can I benchmark this via a real (possibly modified) page/app?" rather than "how can I benchmark this via a microbenchmark?" You said "Surely including an example of every single app that goes slow because of localStorage isn't feasible." It's not feasible, but picking one app that goes slow is feasible -- do that.


On a different note, your scoring system looks rather "clever", in fact, overly so. The formula given is this:

finalIndex = [(b.baselineScore * b.benchmarkWeight) / b.benchmarkScore for b in benchmarks] * 100

What is the "score" for a benchmark? Is it a 1/time measurement, i.e. a higher score is better? If so, I don't understand why b.benchmarkScore is in the denominator. Also, that looks like a list comprehension but the end result is a number, so that's also odd.

Anyway, H&P talk about this stuff too. First of all, I'd avoid notions of "scores" and just use execution time as the fundamental measurement of each benchmark -- it's much clearer what it means. (Google's V8 benchmark gives a "score" and I've never worked out exactly what it means. In contrast, SunSpider and Kraken give a time so I know exactly what it means.)

Then, if you want a normalized measurement (vs. some reference machine) you should use a geometric mean, not an arithmetic mean (which is what you've currently got). But even a geometric mean has some non-intuitive behaviour, and it's hard to do weightings.
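
For concreteness, a normalized score built from per-benchmark times might look roughly like this sketch (benchmark names and numbers are invented). Note the property it gives you: a 2x win on one test and a 2x loss on another cancel out regardless of their absolute times, which an arithmetic mean of the ratios would not do:

    // Geometric mean of (time / referenceTime) ratios; lower is better.
    function normalizedScore(times, referenceTimes) {
      var names = Object.keys(times);
      var logSum = names.reduce(function (sum, name) {
        return sum + Math.log(times[name] / referenceTimes[name]);
      }, 0);
      return Math.exp(logSum / names.length);
    }

    // Made-up numbers (ms): 2x faster on dom, 2x slower on storage.
    var reference = { dom: 120, canvas: 80, storage: 200 };
    var browserA  = { dom: 60,  canvas: 80, storage: 400 };
    console.log(normalizedScore(browserA, reference));  // ~1.0: the win and the loss cancel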

I'd instead recommend keeping it dead simple, and doing what SunSpider does: just report the total time for the whole benchmark suite. This has the wonderful property that it's utterly intuitive what it means. (Another advantage of this over a geometric mean: it encourages people to optimize the slowest tests first, which is what you want them to do.)

As for the weighting, you can get rid of separate weighting coefficients and just modify the size/complexity of the inputs for that, which keeps things simple again.

John-David Dalton
Collaborator

@nnethercote To help keep this thread on track, could you post your comments/concerns/bugs for off-topic issues as separate issues? That way they each get their due attention.

John-David Dalton
Collaborator

jlebar: If you guys want us (in my case, Mozilla) to take robohornet seriously, I strongly recommend you write some macrobenchmarks and eliminate the microbenchmarks from your test suite.

@jlebar Yap, I'm for avoiding overly simplified tests that boil down to 1 or 2 lines in a hot loop. Could you tackle one of the micro-benchmarks you have an issue with and send a PR?

@jkomoros Realistically it's going to be a balance/mix. Some tests could be broader and cover several primary and secondary focuses. There could be test scenarios with less focus on specific methods. For example, changing a single-page app view involves adding/removing event handlers, innerHTML, and garbage collection.
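
Something like this rough sketch of a view-change scenario, rather than a one-liner in a hot loop (markup and counts are invented):

    var container = document.getElementById('app');

    function buildListView(itemCount) {
      var html = '<ul>';
      for (var i = 0; i < itemCount; i++) {
        html += '<li class="item" data-id="' + i + '">Item ' + i + '</li>';
      }
      return html + '</ul>';
    }

    function onClick(e) { /* handler work elided */ }

    var t0 = performance.now();
    for (var round = 0; round < 50; round++) {
      container.innerHTML = buildListView(500);       // build the new view
      var items = container.querySelectorAll('.item');
      for (var i = 0; i < items.length; i++) {
        items[i].addEventListener('click', onClick);  // wire up handlers
      }
      container.innerHTML = '';                       // tear down; old nodes/handlers become garbage
    }
    console.log('view swap x50: ' + (performance.now() - t0).toFixed(1) + ' ms');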

jkomoros
Owner

You said "Surely including an example of every single app that goes slow because of localStorage isn't feasible." It's not feasible, but picking one app that goes slow is feasible -- do that.

This sounds like a good thing to try. We'll look into this soon.

Justin Lebar

@jlebar Yap, I'm for avoiding overly simplified tests that boil down to 1 or 2 lines in a hot loop. Could you tackle one of the micro-benchmarks you have an issue with and send a PR?

As an OSS developer myself, I don't take any offence to being told "we accept patches." But until RoboHornet demonstrates a commitment to benchmarking practices that I feel are sound, I'm not inclined to contribute code. Sorry.

Your point makes sense for primes (that test will be removed/replaced very soon), but does that make sense for, say, the fact that localStorage (to take a random example) is slow? Multiple libraries and apps want to use localStorage—often in very different ways—and in all of them, localStorage is too slow today.

localStorage is an excellent example of how hard it is to write good benchmarks. Indeed, I claim that any automatic localStorage benchmark is trivially gameable.

Proof: An automatic benchmark must have some document which doesn't close itself. (Otherwise it can't open itself again without manual interaction.) Moreover, this document must be able to interact with (*) all the windows it uses for benchmarks, otherwise it can't retrieve the benchmark results from those windows.

localStorage is slow when it has to do synchronous disk reads. If we always served localStorage from RAM, it could be quite fast.

I could easily modify Firefox so that we keep all localStorage writes for a page in RAM until all pages which can interact with that page are closed. This is feasible because we place a limit on how much localStorage data an origin can store on disk, and it would be acceptable to store this small amount of data in RAM.

On the other hand, this change would do little to improve real-world localStorage performance, because pages don't care about how quickly they can retrieve data out of localStorage that they just inserted. Indeed, it makes the browser worse, by unnecessarily wasting RAM on data that's unlikely to be read in the real world. Thus, this trivial change, which works for any automatic benchmark, would constitute "gaming" it. //
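
Concretely, the kind of automatic benchmark I have in mind looks roughly like this (an invented sketch, not one of your tests). Everything it measures happens while the benchmark page is still open, so the RAM-buffering change above aces it without making any real page faster:

    localStorage.clear();
    var payload = new Array(101).join('x');  // ~100 bytes of filler

    var t0 = performance.now();
    for (var i = 0; i < 1000; i++) {
      localStorage.setItem('key' + i, JSON.stringify({ i: i, payload: payload }));
    }
    for (var j = 0; j < 1000; j++) {
      JSON.parse(localStorage.getItem('key' + j));
    }
    console.log('localStorage round-trip: ' + (performance.now() - t0).toFixed(1) + ' ms');

    // The case real pages care about -- a cold read on the next visit, after the
    // data has hit disk -- never shows up in an automatic run like this.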

I hope this illustrates how well-intentioned but poorly-considered benchmarks can actually be harmful to browsers.

There is in general exactly one way to prevent this kind of cheating, and it is to benchmark exactly the thing that you care about being fast, and not anything else.

(*) A page X can "interact with" a page Y if X and Y can set document.domain to the same value, or WLOG if X was opened by Y or X is an iframe inside Y.

There should be a happy middle ground here, which we tried to hit with a rough four-times-a-year release schedule. Do you feel that hits the sweet spot, or should it be less often?

You'd have to ask someone who does more perf work than I do; I'm not sure whether quarterly releases are reasonable or not.

But my uninformed guess would be, if you're frequently removing, modifying, or re-weighting constituent benchmarks, that's probably a sign that you're doing something wrong. On the other hand, if you're frequently adding new benchmarks, that's probably fine.

The benchmark guidelines are very clear that performance pain points are only valid if they represent real, documented problems. Right now we have good justification for a number of the benchmarks but they're in email threads and documents.

This is a bit frustrating -- on the one hand, you say that we shouldn't judge RoboHornet on the basis of its existing benchmarks, because you intend to change many of them soon. On the other hand, you're saying here that you stand by many of them.

That makes it pretty difficult to suggest that you guys are doing anything wrong, because as soon as someone points out a flaw in a benchmark, it gets put into the pool of bad benchmarks which you'll replace soon, but the pool of good benchmarks continues to contain every benchmark in your suite which Nick and I haven't looked at. It's kind of an unfair position to debate from.

Can you point me at two or three existing benchmarks which you think are a good reflection of your principles, and perhaps briefly justify why the benchmark as it exists is superior to a macrobenchmark which is more closely related to the real-world webpage where you first encountered the performance bottleneck?

Alternatively, we can table this discussion until you guys get the suite into a state which you are prepared to stand by.
