Eliminate and outlaw microbenchmarks #67
Robohornet currently has a bunch of microbenchmarks.
It is well-known that microbenchmarks are a poor way to measure performance.
If you guys want us (in my case, Mozilla) to take robohornet seriously, I strongly recommend you write some macrobenchmarks and eliminate the microbenchmarks from your test suite.
Indeed, if the idea of this suite is that it "measures things web developers actually care about", then surely a benchmark which actually runs handlebars.js would be more sensible than a benchmark which calculates primes and posits that's related to the speed of handlebars.js.
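To make the contrast concrete, a macrobenchmark harness in this spirit times a real library workload end to end rather than a synthetic kernel. The sketch below is purely illustrative: `render` is a hypothetical stand-in, and a real suite would substitute an actual compiled handlebars.js template taken from a production site.

```javascript
// Hypothetical macrobenchmark harness: time a realistic workload end to
// end. In a real suite, `render` would be an actual handlebars.js
// template compiled from a production site's markup, so the hot path
// matches production code rather than a guessed-at kernel.
function timeWorkload(render, iterations) {
  const context = {
    items: Array.from({ length: 100 }, (_, i) => ({ id: i, name: `item ${i}` })),
  };
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    render(context);
  }
  return Date.now() - start; // total milliseconds for the whole workload
}

// Stand-in for a compiled template (illustrative only).
function render(ctx) {
  return ctx.items
    .map((item) => `<li data-id="${item.id}">${item.name}</li>`)
    .join("");
}
```

The point is that the harness measures whatever the library's real critical path happens to be, instead of presupposing it.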
I really like the idea of a community-driven benchmark. I hope that aspect of this project, rather than the microbenchmarks, becomes the hallmark of robohornet.
Fellow Mozilla hacker Nicholas Nethercote wrote a bit more about this on HN. Apparently one of the robohornet benchmarks does nothing at all? Ouch. http://news.ycombinator.com/item?id=4567796
Thanks for your comments! The reason we open sourced RoboHornet in this extremely early state is because we wanted to get feedback from folks on the direction and implementation. There have been a number of great discussions about some of the tests in the early preview (issue #54 is a great example, and directly relevant to the one you mentioned). This is exactly the kind of feedback we wanted to get so we can make RoboHornet as good as possible.
I just added an answer in the FAQ about "micro-benchmarks": https://github.com/robohornet/robohornet/wiki/FAQ . In summary, although these are small, succinct tests, our hope is that they don't have the failings normally associated with "microbenchmarks" because they are motivated by specific pain points and the tests can evolve with browsers to ensure that they continue testing what we want them to test. Of course, there are problems with the specific tests today, which we want to improve as quickly as possible now that we have good feedback on them.
One thing we struggled with was designing a benchmark suite based on specific developer pain points that used "macro" test cases, which can be extremely general. Do you have thoughts on how we could go about harnessing the community's pain points via macro test cases?
The thing is, how do you know that X is a pain point in my particular browser unless you have a macrobenchmark? If you have some larger application which is slow because of X, then great -- let's benchmark that. If you don't have a larger application, then on what basis can you claim that X is a problem?
Moreover, even if we know that X is a pain point in your larger application, there's no guarantee that a microbenchmark of X accurately reflects the speed of X when used in your larger application.
If microbenchmarking were feasible in the way you describe, I wouldn't think that Google's JS team would go through all the effort to introduce new macrobenchmarks to the V8 suite. They'd just test the existing "pain points" and be done with it.
I really encourage you to read some of the literature on perf testing (as Nicholas Nethercote pointed out, Hennessy and Patterson is a good starting point), or to study some benchmarks which are considered good and bad. It's really tempting to think that you're clever enough to design good microbenchmarks where others have failed. But if anything, the current state of RoboHornet shows just how hard this is.
Sure. "handlebars.js is slow. On my site, the following handlebars.js script takes 500ms to run, but the handlebars.js guys say there's nothing they could do in their code to make it faster." Maybe that's the start of a benchmark. Who cares if the critical path in handlebars is adding elements to a table by row, or generating primes, or whatever.
Now, your response might be "I profiled handlebars.js and saw that the perf problem was coming from the fact that it's generating primes and that's slow, so I'm going to write a benchmark which generates primes." But that is totally the wrong way to do it, for a few reasons:
Unsurprisingly, I agree 100% with Justin. Microbenchmarks are not a suitable foundation for high quality benchmarking. The hardware and C/C++/Fortran people learnt this in the 1980s and have been getting it right with the SPEC benchmarks (http://www.spec.org/) since the 1990s. That experience has yet to migrate over to the web world, unfortunately.
If I had to boil down the situation to a single principle it would be the following. If you care about the performance of X, then measure the performance of X. Do not measure the performance of Y, where you think that Y is representative of or correlates with X, because there's a very good chance that it doesn't. It sounds trivial when written down like that, but people get this wrong more often than not.
@nnethercote I agree that Y does not necessarily correlate to X, and that in simple terms to measure Y if you mean to measure X is incorrect.
Where I'm coming unstuck is this: if, say, I'm creating an HTML5 game and I can get my draw code happening inside 16ms, but adding code to clear the canvas slows things down significantly, then that's a specific pain point that needs fixing. (Forgive me for bringing this to a single API, I just find it easier to talk about an actual experience rather than generalised approaches.) If I understand your point correctly, and apply it in this example, you would want to know if drawing to the canvas (in general terms) is slower or faster in a specific browser.
I think this would give an overly generalised result, as the conclusion I'd draw is that a browser can offset a slow op simply by making everything else faster. The problem there is that clearing is a non-negotiable op for a developer in many cases, i.e. in the situation where they're compositing the result over other elements, whereas drawing can be tailored by the developer.
I'm disinclined to entirely discard all microbenchmarks because in certain situations a macrobenchmark can cover over an unduly slow API call. However, to your point, microbenchmarks alone do not show all that much by themselves, so I'd personally like to see a combination of both, i.e. a macrobenchmark with a detailed breakdown.
I wonder if it would be possible to only have the macrobenchmarks, but show which functions were taking the most time within those (on a per-browser basis, naturally), so that all the browser makers would concentrate on making the whole thing fast but would have some idea of what they needed to target, and so that developers could see what the slow(-est) parts of the macrobenchmarks were, and attempt to avoid them if possible…
For your example, Paul, how would you feel about having two macrobenchmarks, one with the call to clear the canvas, and one without?
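A rough sketch of that two-variant idea, under the assumption that `clear` and `draw` are stand-ins: in a browser these would be real `ctx.clearRect` and draw calls against a canvas, and the comparison would be run over many frames.

```javascript
// Time the same frame loop with and without the clear step. The stand-in
// functions keep this sketch runnable anywhere; in a browser, `clear`
// would call ctx.clearRect(0, 0, width, height) and `draw` would be the
// game's actual draw code.
function timeFrames(steps, frames) {
  const start = Date.now();
  for (let i = 0; i < frames; i++) {
    for (const step of steps) step(i);
  }
  return Date.now() - start;
}

const clear = () => { /* stand-in for ctx.clearRect(...) */ };
const draw = (i) => { /* stand-in for the game's draw code */ };

const withClear = timeFrames([clear, draw], 60);
const withoutClear = timeFrames([draw], 60);
```

Comparing the two totals isolates the cost of clearing while still measuring it inside a realistic frame loop.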
It sounds like you're uncomfortable with this, but it is exactly true!
The goal when designing a browser (or a CPU, for that matter) is not to make all of the individual operations as fast as possible. Instead, the goal is to design a browser or CPU which executes programs quickly.
(As an example: A modern CPU will not execute quickly a program which is an unbroken series of indirect branches. But I'm not going to write a benchmark which times nothing but indirect branch performance -- even though indirect branch perf is in fact something I care about, because it affects the speed of virtual function calls -- because useful programs don't just make a million virtual function calls and then quit. If AMD speeds up indirect branch calls and Intel speeds up something else, well, that's nice, but I can't declare a winner until I run a macrobenchmark which is representative of the workload I care about.)
Designing a browser/CPU to execute programs quickly usually means optimizing for common operations, while allowing uncommon ops to be slow. We don't waste our time optimizing operations which are uncommon in the real world, because Amdahl's law shows that's not how we make code run fast.
Under what concrete, real-world situation do you care how long specifically clearing a canvas takes? If there's a situation where the amount of time that takes matters much more than the amount of time that drawing to the canvas takes, then let's benchmark that. But if as soon as you draw something non-trivial to the canvas the cost of drawing greatly exceeds the cost of clearing, then browser vendors have done the right thing by not optimizing our canvas-clearing code.
The way browser developers figure out what parts are slow is by profiling the browser. From a cursory search, there appear to be a number of JS profilers, which would give exactly the information that a web developer would want in this case.
Even better than profiling a benchmark, you can profile your own code and find out exactly why it's slow. But I don't disagree that providing some general guidance (X function is slow in browser Z, use Y function instead) to web developers could be useful.
jlebar is exactly correct. Going back to Paul's example: "I'm creating an HTML5 game and I can get my draw code happening inside 16ms, but adding code to clear the canvas slows things down significantly, then that's a specific pain point that needs fixing. "
Let's assume that clearing the canvas slows it down from 16ms to 24ms. Let's assume that this slowdown is consistent across all browsers. Let's also assume you can create a microbenchmark that accurately measures canvas clearing (already we've seen this is far harder than it first appears). Let's assume you put that in RoboHornet, and that all the browser vendors respond by improving canvas clearing speed so that canvas clearing costs zero time in your game.
That's a lot of assumptions, but hey, this is a thought experiment. We've made progress! Your game is now faster. But if the entire game had been used as the benchmark, the browser vendors could investigate how to speed up other parts of the draw code. Perhaps that 16ms could have been reduced to 8ms. By pre-emptively deciding which parts are important to optimize, you've precluded the possibility of optimizing other parts.
Coming back to my previous advice: your game is X. The canvas clearing operation is Y. Measure X, don't measure Y.
I recommend reading section 1.8 (pages 36--44) of H&P's "Computer Architecture" -- you can do so at http://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X with the "Click to look inside" feature. What we've been calling "microbenchmarks" they call "kernels", which they describe as "small, key pieces of real applications". (The "calculate primes" test, in contrast, is an example of what they call a "toy program".) H&P describe both kernels and toy programs as "discredited" for benchmarking purposes.
The five new benchmarks added to Google's Octane benchmark suite are an example worth following -- they're all real, large, cutting edge apps. (Yes, they're JS-only, but they're an example worth emulating.) I wrote about this at https://blog.mozilla.org/nnethercote/2012/08/24/octane-minus-v8/.
Another problem with microbenchmarks that we haven't covered so far is that they are easy to game. For example, did you know that all of the top 5 browsers' JS engines have a cache that stores the offset for daylight savings time? This is completely useless in the real world, but it helps enormously with one (or is it two?) of the tests in SunSpider which is a microbenchmark that does bajillions of date manipulations. I think Firefox was the last browser to implement this; we resisted for a long time but ultimately we had to do it to be competitive on SunSpider, which the tech press obsesses over. Bad browser benchmarks hurt the web.
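The pattern being described looks roughly like the loop below. This is a purely illustrative kernel, not the actual SunSpider code; the point is that its cost is dominated by timezone offset lookups, which is exactly what a DST offset cache short-circuits.

```javascript
// Illustrative SunSpider-style kernel: a tight loop of Date manipulations
// whose cost is dominated by timezone/DST offset computation. An engine
// with a DST offset cache aces this loop without being any faster on
// real-world pages.
function dateKernel(iterations) {
  let acc = 0;
  for (let i = 0; i < iterations; i++) {
    const d = new Date(2012, i % 12, (i % 28) + 1);
    acc += d.getTimezoneOffset(); // hits the timezone offset path
  }
  return acc;
}
```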
Another case: V8's "regexp" benchmark does lots of RegExp.exec() calls. It uses regexps culled from real-world sites, which is good. Unfortunately, it never looks at the result of any of these calls. Therefore, if you detect this you can avoid constructing the return value (which is an array, and thus expensive to build) and get a much better score. This is of little use in real-world code, however.
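A minimal illustration of that loophole, and one way a benchmark author can close it, using an illustrative regexp and inputs rather than V8's actual benchmark data:

```javascript
const re = /(\w+)@(\w+)\.com/;
const inputs = ["alice@example.com", "no match here", "bob@test.com"];

// Gameable: the return values are ignored, so an engine that detects
// this can skip constructing the match arrays entirely.
function gameable() {
  for (const s of inputs) re.exec(s);
}

// Better: fold the results into a value the benchmark reports, forcing
// the engine to actually materialize each match array.
function consuming() {
  let checksum = 0;
  for (const s of inputs) {
    const m = re.exec(s);
    if (m) checksum += m[1].length + m[2].length;
  }
  return checksum;
}
```

Reporting `checksum` (or otherwise observing the results) makes the "don't build the array" cheat change the benchmark's output, so it can't be applied silently.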
Actually, this is the source of a bug in Firefox: Once we populate this cache we never check if your system timezone has changed, so if the timezone changes while Firefox is running, I believe all your Date objects are incorrect (until you restart the browser).
We're fixing this for B2G and I hope we'll fix it on other operating systems after that, but it's sad that we had to implement an unsound optimization to compete on a useless microbenchmark.
Sorry for the delayed response; I've been traveling. This kind of discussion is exactly the kind of thing we were hoping to encourage by open sourcing RoboHornet in an extremely early state, and ultimately will help us build the best benchmark we can that's community driven and centered around real-world pain points. I want to again thank you guys for having such an active discussion here and helping us improve.
We started RoboHornet because we noticed that many APIs in the web platform that we relied on were too slow. The existing benchmarks didn't do a good job of exercising them and browser vendors historically don't do a great job of listening to performance feedback from web developers unless there's a benchmark attached. So we decided the way to fix it was to come up with our own benchmark—a way for web developers to have an organized voice that browser vendors would actually listen to. The fact that we're having such an active discussion on this thread is a very positive sign. :-)
Let's take these in pieces...
Your point makes sense for primes (that test will be removed/replaced very soon), but does that make sense for, say, the fact that localStorage (to take a random example) is slow? Multiple libraries and apps want to use localStorage—often in very different ways—and in all of them, localStorage is too slow today. Surely including an example of every single app that goes slow because of localStorage isn't feasible. Could it be reasonable to try to come up with a composite test that captures the commonalities of the cases where it's slow? At the beginning when the APIs haven't been even trivially optimized, even simple tests will capture the pain well. Over time as browsers get better and better the tests may need to become far more complex (likely to the point where folks would call them "macro benchmarks"). At some point the performance will be good enough that web developers are happy because their real world performance problems are fixed on that issue; at that point it's fine to not continue iterating on the test. Again, we're mainly trying to focus attention of browser vendors on low-hanging fruit they might otherwise ignore.
Making the benchmark easy to run for browser engineers is definitely more important than most people think—it's one of the reasons we made it so you can easily run a single test from the harness (although there may be some problems with that implementation at the moment). But remember that web developers are dealing with the pain of slow parts of the platform every day as we write apps. A static benchmark wouldn't work because there are lots of things in the web platform that we want to be fast, and new ones coming every day. There should be a happy middle ground here, which we tried to hit with a rough four-times-a-year release schedule. Do you feel that hits the sweet spot, or should it be less often?
I don't think that matters; in fact we wrote that it doesn't matter in our guidelines. The point is that web developers have to write apps that run on multiple browsers, so if one browser is slow at something, it doesn't matter if all of the other ones are fast. The normalized scoring system was designed to help minimize the effect of benchmarks where most browsers are roughly the same in speed.
Totally fair point. We tried to mitigate this in a few ways:
We're well aware of that. That's another reason RoboHornet is designed to evolve, to combat those cheats. Some of those cheats will be obvious, but others might require the help of browser engineers like yourself to point out so we can fix them. We're trying to lessen our own pain as web developers, so we're incentivized to root out "cheaters" as much as competing browser engineers are.
FWIW, there are some things I liked about RoboHornet: (a) it's not owned by a browser vendor, (b) it's not just JS (like Dromaeo and Peacekeeper), and (c) the voting/dynamic nature is intriguing -- I'll be interested to see how that works in practice.
Now, back to the death-to-microbenchmarks discussion. I appreciate the chicken/egg nature of some of these features. You mentioned local storage. ES5 getters/setters are a similar case: they're slow, so webdevs avoid them, so browser makers don't have much incentive to fix them. (Firefox's bug report for speeding them up is almost 2 years old: https://blog.mozilla.org/nnethercote/2012/08/24/octane-minus-v8/.)
What I'd suggest is to take some significant web site/app that could use the new feature, and modify it to use that new feature in the way you would if that feature was fast. That's harder than writing a microbenchmark, but the result will be much better.
For every pain point you're interested in, ask "how can I benchmark this via a real (possibly modified) page/app?" rather than "how can I benchmark this via a microbenchmark?" You said "Surely including an example of every single app that goes slow because of localStorage isn't feasible." It's not feasible, but picking one app that goes slow is feasible -- do that.
On a different note, your scoring system looks rather "clever", in fact, overly so. The formula given is this:
finalIndex = [(b.baselineScore * b.benchmarkWeight) / b.benchmarkScore for b in benchmarks] * 100
What is the "score" for a benchmark? Is it a 1/time measurement, i.e. a higher score is better? If so, I don't understand why b.benchmarkScore is in the denominator. Also, that looks like a list comprehension but the end result is a number, so that's also odd.
Anyway, H&P talk about this stuff too. First of all, I'd avoid notions of "scores" and just use execution time as the fundamental measurement of each benchmark -- it's much clearer what it means. (Google's V8 benchmark gives a "score" and I've never worked out exactly what it means. In contrast, SunSpider and Kraken give a time so I know exactly what it means.)
Then, if you want a normalized measurement (vs. some reference machine) you should use a geometric mean, not an arithmetic mean (which is what you've currently got). But even a geometric mean has some non-intuitive behaviour, and it's hard to do weightings.
I'd instead recommend keeping it dead simple, and doing what SunSpider does: just report the total time for the whole benchmark suite. This has the wonderful property that it's utterly intuitive what it means. (Another advantage of this over a geometric mean: it encourages people to optimize the slowest tests first, which is what you want them to do.)
As for the weighting, you can get rid of separate weighting coefficients and just modify the size/complexity of the inputs for that, which keeps things simple again.
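The alternatives being contrasted can be sketched as follows. The helper names and numbers are illustrative, not RoboHornet's actual scoring code: `times` are raw per-test execution times in milliseconds, and `baseline` are the corresponding reference-machine times.

```javascript
// Simplest option (SunSpider-style): report the total time for the whole
// suite. Utterly intuitive, and it rewards optimizing the slowest tests.
function totalTime(times) {
  return times.reduce((sum, t) => sum + t, 0);
}

// If normalization against a reference machine is wanted, combine the
// per-test ratios with a geometric mean (multiplicatively), not an
// arithmetic mean, so no single test's ratio dominates the result.
function geometricMeanRatio(times, baseline) {
  const logSum = times.reduce(
    (acc, t, i) => acc + Math.log(t / baseline[i]),
    0
  );
  return Math.exp(logSum / times.length);
}
```

With the geometric mean, one test being 2x slower and another being 2x faster than baseline cancel out exactly; with an arithmetic mean of ratios they would not.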
@jlebar Yep, I'm for avoiding overly simplified tests that boil down to 1 or 2 lines in a hot loop. Could you tackle one of the micro-benchmarks you have an issue with and send a PR?
@jkomoros Realistically it's going to be a balance/mix. Some tests could be more broad and cover several primary and secondary focuses. There could be test scenarios with less focus on the specific methods. For example changing a single page app view (involves adding/removing event handlers, innerHTML, and garbage collection).
This sounds like a good thing to try. We'll look into this soon.
As an OSS developer myself, I don't take any offence to being told "we accept patches." But until RoboHornet demonstrates a commitment to benchmarking practices that I feel are sound, I'm not inclined to contribute code. Sorry.
localStorage is an excellent example of how hard it is to write good benchmarks. Indeed, I claim that any automatic localStorage benchmark is trivially gameable.
Proof: An automatic benchmark must have some document which doesn't close itself. (Otherwise it can't open itself again without manual interaction.) Moreover, this document must be able to interact with (*) all the windows it uses for benchmarks, otherwise it can't retrieve the benchmark results from those windows.
localStorage is slow when it has to do synchronous disk reads. If we always served localStorage from RAM, it could be quite fast.
I could easily modify Firefox so that we keep all localStorage writes for a page in RAM until all pages which can interact with that page are closed. This is feasible because we place a limit on how much localStorage data an origin can store on disk, and it would be acceptable to store this small amount of data in RAM.
On the other hand, this change would do little to improve real-world localStorage performance, because pages don't care about how quickly they can retrieve data out of localStorage that they just inserted. Indeed, it makes the browser worse, by unnecessarily wasting RAM on data that's unlikely to be read in the real world. Thus, this trivial change, which works for any automatic benchmark, would constitute "gaming" it.
I hope this illustrates how well-intentioned but poorly-considered benchmarks can actually be harmful to browsers.
There is in general exactly one way to prevent this kind of cheating, and it is to benchmark exactly the thing that you care about being fast, and not anything else.
(*) A page X can "interact with" a page Y if X and Y can set document.domain to the same value, or WLOG if X was opened by Y or X is an iframe inside Y.
You'd have to ask someone who does more perf work than I; I'm not sure whether quarterly releases is reasonable or not.
But my uninformed guess would be, if you're frequently removing, modifying, or re-weighting constituent benchmarks, that's probably a sign that you're doing something wrong. On the other hand, if you're frequently adding new benchmarks, that's probably fine.
This is a bit frustrating -- on the one hand, you say that we shouldn't judge RoboHornet on the basis of its existing benchmarks, because you intend to change many of them soon. On the other hand, you're saying here that you stand by many of them.
That makes it pretty difficult to suggest that you guys are doing anything wrong, because as soon as someone points out a flaw in a benchmark, it gets put into the pool of bad benchmarks which you'll replace soon, but the pool of good benchmarks continues to contain every benchmark in your suite which Nick and I haven't looked at. It's kind of an unfair position to debate from.
Can you point me at two or three existing benchmarks which you think are a good reflection of your principles, and perhaps briefly justify why the benchmark as it exists is superior to a macrobenchmark which is more closely related to the real-world webpage where you first encountered the performance bottleneck?
Alternatively, we can table this discussion until you guys get the suite into a state which you are prepared to stand by.