Benchmark Scoring

jkomoros edited this page Sep 27, 2012 · 4 revisions


RoboHornet uses a unique approach to scoring that is based on the current performance of major browsers and how important the issue is to the community.

Because the scores are effectively normalized for every release of the benchmark, a score is only directly comparable to other scores in the same version of the benchmark run on the same hardware. That's why the overall score is referred to as the RoboHornet Index.

What the RoboHornet Index means, intuitively

Intuitively, the RoboHornet Index captures how far ahead of or behind the pack a browser is compared to the current crop of stable browsers on "average" hardware. An index of 100 means the browser is in the middle of the pack; below 100 means it trails the pack, and above 100 means it leads it. The index is recalibrated with each release of the benchmark based on the current performance of stable browsers. It is thus relative to other browsers, and only runs of the same benchmark version should be compared to one another, as the composition of the benchmark changes often. Remember that your hardware's performance profile is conflated with your browser's performance in the index, so you should only compare scores from the same version of the benchmark run on the same machine.

For more about the reasoning behind RoboHornet's scoring mechanism, see the scoring rationale page.

Calculating the RoboHornet Index

Given the weighting and baseline score for each benchmark, calculating the index for a given browser is straightforward. The weighting puts more focus on the most important tests, and is a function of community-perceived importance (using votes on the issue as a proxy) and the stewardship committee’s judgement. The baseline score is calculated from the performance of popular current-generation browsers. Both the weighting and the baseline scores are recalculated fresh for each release of the benchmark. For more detail about what the benchmark weight represents, see the section below. For more detail about calculating these scores, see the Updating Baseline Scores and Benchmark Weights section.

finalIndex = sum([(b.baselineScore * b.benchmarkWeight) / b.benchmarkScore * 100 for b in benchmarks])
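The formula above can be sketched as a small runnable program. The benchmark names, baseline scores, and weights below are illustrative placeholders, not real RoboHornet values; the only property carried over from the page is that the weights are normalized so a browser matching the baseline on every test indexes at exactly 100.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    baseline_score: float  # baseline result for this test (recalculated per release)
    weight: float          # normalized weight; all weights in the suite sum to 1.0
    score: float           # the browser-under-test's measured result

def robohornet_index(benchmarks):
    """Sum of weighted, baseline-normalized scores, scaled so that a
    browser matching the baseline on every test scores 100."""
    return sum(b.baseline_score * b.weight / b.score * 100 for b in benchmarks)

# Illustrative suite: a browser that exactly matches the baseline
# on every test indexes at 100 (the "middle of the pack").
suite = [
    Benchmark(baseline_score=250.0, weight=0.6, score=250.0),
    Benchmark(baseline_score=80.0, weight=0.4, score=80.0),
]
print(robohornet_index(suite))
```

Because each term divides the baseline by the measured score, a browser that runs every test twice as fast as the baseline would index at 200, and one twice as slow at 50.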

Weighting of each benchmark

Each benchmark in the suite has its own weight. This reflects the fact that some benchmarks are more important than others in terms of real pain felt by developers. The weight is a function of community-perceived importance (using voting on the issue as a proxy), and the stewardship committee’s expert judgement.

We use votes as a proxy for community-perceived importance of an issue. We use the log of the number of votes in the calculation to help minimize the "Slashdot effect". Thus, the number of votes is correlated with the weighting, but the relationship is weak enough that there isn’t a huge incentive to game the number of votes.
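To illustrate the damping effect described above: this page doesn't give the exact formula, so the sketch below assumes a simple `log1p` of the vote count (the `1 +` keeps zero votes well-defined). Whatever the real formula, the point is the same: a tenfold vote surge yields far less than a tenfold increase in influence.

```python
import math

def vote_factor(votes):
    # log1p(v) = ln(1 + v): zero votes maps to 0.0, and large vote
    # counts grow only logarithmically, damping a "Slashdot effect"
    return math.log1p(votes)

# Ten times the votes buys less than twice the influence:
print(vote_factor(10))   # ~2.40
print(vote_factor(100))  # ~4.62
```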

Stewardship committee oversight

The expert committee has the ability to “soften”--but not completely counteract--the will of the community to correct for cases where the voting system doesn’t work. These cases could include:

  • "Slashdot effect" for an issue
  • A serious issue being underrepresented in votes, despite being known to be a large problem for a large team or product.
  • Representing the interests of large apps and libraries with many millions of users, like Gmail or jQuery.

These cases do not include:

  • Removing weighting from issues that it’s “hard” for browser vendors to fix
  • Removing weighting from issues that make a particular browser or browser version look bad

Conceptually, the expert panel is given precisely half as many votes as the user community has given to the issues tracking the benchmarks in the current suite, to distribute as they see fit.
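The half-as-many-votes rule above can be sketched in a few lines. The benchmark names, community vote counts, and the committee's allocation are all made up for illustration; the only rule taken from the text is that the committee's total pool is half the community's total.

```python
# Hypothetical community vote counts on the issues tracking the
# current suite's benchmarks (placeholder names and numbers).
community_votes = {"benchmarkA": 120, "benchmarkB": 40, "benchmarkC": 40}

# The committee's pool is exactly half the community's total votes.
committee_pool = sum(community_votes.values()) // 2

# The committee distributes its pool as it sees fit -- e.g. shoring up
# an underrepresented but serious issue (arbitrary allocation here).
committee_votes = {"benchmarkA": 10, "benchmarkB": 20, "benchmarkC": 70}
assert sum(committee_votes.values()) == committee_pool

print(committee_pool)
```

Note that with half the community's voting power, the committee can soften an outcome but never fully outvote the community, matching the "soften, not counteract" constraint above.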

For more detail about the specific weighting process, see the Updating Baseline Scores and Benchmark Weights section.