Scoring Rationale


Rationale for RoboHornet's Scoring

The RoboHornet benchmark makes use of an interesting--and perhaps unique--scoring mechanism. This document lays out the reasoning behind it.

First, let's be very clear: the only permissible way to interpret the results of RoboHornet is relative to the results of other browsers on the same version of the suite running on the same OS and hardware. If this were just another benchmark, we'd come up with the simplest possible scoring method and be done, because the actual numbers would be irrelevant as long as you interpreted them correctly. But RoboHornet is not just another benchmark: it is first and foremost a social tool for web developers to exert pressure on browser vendors to fix the performance problems they care about. This goal leads directly and indirectly to many of the interesting properties of the scoring.

Some people will have the inclination and necessary background knowledge to invest the time in interpreting the results correctly. Others won't. RoboHornet's collective power comes from as many people as possible knowing and caring about the results. For RoboHornet to be effective, people's intuitive interpretation of the results needs to at least align with the correct interpretation, so that pressure is applied to browser vendors in the right way.

We need to balance the complexity of the scoring against the correctness of the intuitive understanding it produces. Baseline scores and weighting are the two main sources of complexity in the scoring. Together they help "normalize" the score (although perfect normalization is impossible for a host of reasons). Having a normalized score allows even a naive observer to glance at the score and come away with a generally correct understanding without consulting anything else. It is impossible to normalize for more than one specific piece of hardware, but for the hypothetical person using the reference hardware, 100 serves as a mental anchor point: above 100 is good; below 100 is bad. If that's all you understand, at least you're on the right track.
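
To make the mechanics concrete, here is a minimal sketch of a baseline-normalized, weighted index anchored at 100. The benchmark names, weights, and baseline numbers below are made up, and the combination rule (a weighted arithmetic mean of per-benchmark ratios) is an assumption for illustration only; the actual RoboHornet formula may differ in its details.

```typescript
// Illustrative only: benchmark names, weights, and baseline values are
// hypothetical, and the real RoboHornet formula may differ in detail.
interface BenchmarkResult {
  name: string;
  weight: number;             // relative importance; weights are normalized below
  baselineRunsPerSec: number; // raw result on the reference hardware
  measuredRunsPerSec: number; // raw result on the machine under test
}

// Compute an index anchored at 100: each benchmark contributes the ratio of
// its measured throughput to the reference-hardware baseline, scaled by its
// weight. A browser/machine that matches the baseline everywhere scores 100.
function roboHornetStyleIndex(results: BenchmarkResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weightedRatioSum = results.reduce(
    (sum, r) => sum + r.weight * (r.measuredRunsPerSec / r.baselineRunsPerSec),
    0
  );
  return 100 * (weightedRatioSum / totalWeight);
}

// Example: faster than baseline on one benchmark, slower on another.
const index = roboHornetStyleIndex([
  { name: "Table scrolling", weight: 2, baselineRunsPerSec: 40, measuredRunsPerSec: 60 },
  { name: "SVG resize",      weight: 1, baselineRunsPerSec: 25, measuredRunsPerSec: 20 },
]);
console.log(index.toFixed(1)); // 126.7 -- above 100, so faster than the reference overall
```

The point of the sketch is only that the two sources of complexity called out above -- baselines and weights -- both enter the calculation, and that 100 falls out naturally as the score of the reference configuration.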

There are a few nice properties of having a "normalized" score:

Scores go down for less competitive browsers

The point of RoboHornet is to encourage active competition among browser vendors to improve on important performance problems. Browser vendors should strive not just to improve, but to improve faster than their peers. Baselining means that if a browser is improving--but not as quickly as its peers--its index will decline. Numbers going down (as opposed to simply not increasing as quickly) is an easy signal to interpret and makes the competitive incentive stronger.
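
A hypothetical illustration (with made-up numbers, assuming baselines are re-captured against up-to-date browsers whenever the suite or reference hardware is refreshed): if a benchmark's baseline throughput rises from 40 to 60 runs per second between releases while a given browser only improves from 40 to 48, that browser's contribution to the index falls from 100 × 40/40 = 100 to 100 × 48/60 = 80, even though the browser itself got 20% faster.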

Allowing variable importance per benchmark

Some benchmarks are more important than others because they represent more felt pain (either more frequent or more severe), but we also want to keep the suite comprehensive. Weighting lets us easily direct attention to the most important problems while still including a slew of issues.

Makes evolving over time easier

RoboHornet is designed to evolve over time to reflect the will of web developers and the current browser landscape. Without normalization, users of the benchmark would have to mentally adjust for the "new normal" of each version. A combination of weighting and careful design of new benchmarks could help reduce the bump, but it would require active effort. Normalization helps automatically smooth the score bumps between versions of the benchmark suite, making trends easier to interpret.

Makes aspirational benchmarks easier to account for

Some benchmarks are known hard problems for browser vendors based on today's implementations--but that doesn't mean we shouldn't strive to do better. In the meantime, however, these aspirational benchmarks will see little competition, because most browsers will be stuck at the same local maximum. When this happens, the score on that benchmark will tend towards the anchor point (100), which in turn "squashes" the final index a small amount towards the anchor point as well. This squashing effect gets smaller as more benchmarks are included in the suite. Baselining thus minimizes the impact of the aspirational benchmarks on the index while they lie fallow, while still providing an incentive to improve.
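
As a made-up illustration, using the same assumed weighted-average arithmetic as the sketch above: if the rest of the suite alone would yield an index of 140 and an aspirational benchmark pinned at its baseline contributes a flat 100, then giving that benchmark 10% of the total weight only drags the index down to 0.9 × 140 + 0.1 × 100 = 136, and at 2% of the weight it lands around 139. The weights here are hypothetical; the point is that the pull towards 100 shrinks as the suite grows.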

Why we picked this reference hardware

No normalization can ever be perfect because it's not possible to control for more than one of the myriad possible hardware and OS variations. Our goal in picking reference hardware for the baselines is to minimize complexity while maximizing the usefulness of the final index.

The reference hardware is a January-2012 era MacBook Pro with the following specs:

  • 2.2GHz quad-core Intel Core i7 processor
  • 500GB 5400-rpm hard drive
  • AMD Radeon HD 6750M with 512MB GDDR5 memory
  • 4GB memory
  • 15.4-inch LED-backlit display (1440 by 900 pixels)
  • Dual booting into Windows 7 Service Pack 1 and Mac OS X 10.8.1

The exact machine profile will be upgraded over time, roughly once a year, to keep pace with the status quo.

The reasoning is as follows:

  • Having a single machine and baseline profile vastly simplifies the process
  • Having a single computer that can run both Windows and Mac allows us to control for hardware in the baseline numbers
  • The specs are slightly ahead of the curve, which both more accurately represents the tech aficionados who are likely to run the benchmark in practice and reflects the specs that the general population is trending towards over time
  • Laptops are increasingly popular compared to desktops

Note that browser vendors are strongly encouraged to ignore the index score and instead use the raw scores to avoid over-optimizing for the specific characteristics of the hardware that just happened to be used for the baselines.

Briefly, some of the other baselining schemes that we considered but ultimately discarded:

Multiple baseline profiles

One option we considered was to have multiple baseline profiles (e.g. Windows-High-End, Windows-Netbook, Mac-High-End) and allow the user to pick which profile to compare against. We decided against it for a number of reasons. First, it would complicate the baselining process considerably. Second, it would introduce extra complexity for the lay person running the benchmark, which goes against the goal of easy intuitive understanding. Third, it would encourage unrealistic expectations for the "normalization" of the index. The normalization will always be at best a very rough approximation; removing some of the error gives a false sense of accomplishment without addressing the core problem. In the end we decided it was best to go with a single, simple baseline and not pretend that it was perfect.

Baseline profiles based on real-world usage data

Another option we considered was coming up with hardware profiles based on real-world usage data. This proved impractical: key metrics (like CPU speed) are hard to gauge remotely, and there are myriad GPUs in the wild. Further, it was unlikely that we'd be able to pick a single machine capable of dual-booting both Mac OS X and Windows.

Using user-reported scores to baseline

Another idea was to use user-reported scores from running the benchmark to compute the baseline. Results from actual users would be overly biased towards the browsers and hardware that runners of the benchmark happened to use. Further, either the baseline would have to be finalized shortly after the release of each version of the benchmark (making it more likely that sampling problems would affect the score), or it would continue to drift over the life of the version (making scores inconsistent even for a single version of the benchmark on the same browser/OS/hardware combination). Note that although we do not plan to use these scores for the actual baseline, we do plan on collecting them and displaying aggregate details to help runners of the benchmark interpret the results.