Calculate and add p-value and stddev to summary table by eightbitraptor · Pull Request #479 · ruby/ruby-bench

eightbitraptor · 2026-02-18T15:31:50Z

This adds a way to determine whether the difference in results between one Ruby and the other could be explained by random noise, or whether the difference is statistically significant.

Using Welch's t-test for p-values because I don't want to assume that both versions of Ruby tested are going to have equal timing distributions.
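
For reference, a minimal sketch of the statistic behind that choice. This is not the code in this PR; the method names (`mean`, `sample_variance`, `welch_t`) are made up for illustration. Welch's test estimates each sample's variance separately instead of pooling them, and adjusts the degrees of freedom with the Welch-Satterthwaite formula. Turning `t` and `df` into a p-value additionally needs the Student's t CDF, which is left out here.

```ruby
# Arithmetic mean of a list of timings.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Unbiased (n - 1) sample variance.
def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

# Returns [t, df]: Welch's t statistic and the Welch-Satterthwaite
# degrees of freedom. Each sample contributes its own variance term,
# so no equal-variance assumption is made.
def welch_t(a, b)
  va = sample_variance(a) / a.size
  vb = sample_variance(b) / b.size
  t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  df = (va + vb)**2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
  [t, df]
end

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
puts format("t = %.3f, df = %.2f", t, df) # prints "t = -1.897, df = 5.88"
```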

Full p-value information is gated behind --pvalue to avoid bloating the width of the table.

Without --pvalue, asterisks will be displayed in the ratio column as follows:

Average of last 30, non-warmup iters: 224ms
Total time spent benchmarking: 20s

master: ruby 4.1.0dev (2026-02-17T20:46:23Z master 997bc709db) +YJIT +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-02-18T14:18:38Z mvh-introduce-48-b.. 302d5d3397) +YJIT +PRISM [x86_64-linux]

-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench        master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
knucleotide  233.3        0.4         224.2            0.3         1.048               1.040 (***)
-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
- ***: p < 0.001, **: p < 0.01, *: p < 0.05 (Welch's t-test)

and with --pvalue, full details are shown:

Average of last 30, non-warmup iters: 225ms
Total time spent benchmarking: 20s

master: ruby 4.1.0dev (2026-02-17T20:46:23Z master 997bc709db) +YJIT +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-02-18T14:18:38Z mvh-introduce-48-b.. 302d5d3397) +YJIT +PRISM [x86_64-linux]

-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------  -------  ---------
bench        master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment  p-value  sig
knucleotide  233.3        0.4         225.2            0.3         1.044               1.036 (***)        1.3e-36  p < 0.001
-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------  -------  ---------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
- ***: p < 0.001, **: p < 0.01, *: p < 0.05 (Welch's t-test)
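
The thresholds in the legend can be sketched as a small helper. This is illustrative only and not taken from the PR's diff; `significance_marker` and `format_ratio` are hypothetical names, and the blank marker for non-significant results is assumed from the review discussion.

```ruby
# Map a p-value to the asterisk marker described in the legend.
# Returns an empty string when p >= 0.05.
def significance_marker(p_value)
  if p_value < 0.001
    "***"
  elsif p_value < 0.01
    "**"
  elsif p_value < 0.05
    "*"
  else
    ""
  end
end

# Format the ratio cell, appending the marker only when there is one.
def format_ratio(ratio, p_value)
  marker = significance_marker(p_value)
  marker.empty? ? format("%.3f", ratio) : format("%.3f (%s)", ratio, marker)
end

# Values from the knucleotide row above.
puts format_ratio(1.036, 1.3e-36) # prints "1.036 (***)"
```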

k0kubun · 2026-02-18T16:04:18Z

Could we make it an optional feature like --rss, e.g. --p-value? The table seems so wide that it doesn't seem to fit in GitHub PR/issue descriptions/comments. When you're verifying results yourself, this would be nice to have, but when you're showing results to other people, you might want to make the table smaller to make it easier for them to find the most important numbers.

jhawthorn · 2026-02-18T16:41:52Z

I'd really like to have some indication of statistical significance every time this is reported. I think that is one of the most important numbers.

Maybe, if horizontal space is at a premium, we could just tag the last column with * if there's a difference (or maybe an emoji indicating faster vs slower?)

eightbitraptor · 2026-02-18T16:57:13Z

@k0kubun @jhawthorn what about using a single extra column showing an icon depending on whether the p-value is < 0.001, < 0.01, or < 0.05, and blank otherwise?

And then the exact values can be gated behind the --pvalue option?

Also bear in mind that the example I pasted here also has the --rss option on, so it's already 2 columns wider than the default.

k0kubun · 2026-02-18T17:04:38Z

sounds good to me 👍

actual p-value still gated behind --pvalue

eightbitraptor · 2026-02-18T17:18:36Z

Ok, fixed, and updated PR description with examples. Thanks both.

Calculate and add p-value and stddev to summary table

7bd2f2d

Using Welch's t-test for p-values because I don't want to assume that both versions of Ruby tested are going to have equal timing distributions.

eightbitraptor marked this pull request as draft February 18, 2026 15:34

eightbitraptor marked this pull request as ready for review February 18, 2026 15:34

Gate P-value calculations behind --pvalue

33ea8ce

Add asterisks to the comparison for p-value

bf3b7b3

actual p-value still gated behind --pvalue

eightbitraptor force-pushed the mvh-add-p-value branch from 10a2725 to bf3b7b3 Compare February 18, 2026 17:16

jhawthorn mentioned this pull request Feb 18, 2026

Combine measurement and stddev columns #480

Merged

k0kubun approved these changes Feb 18, 2026

eightbitraptor merged commit c64ac86 into main Feb 18, 2026
11 checks passed

eightbitraptor deleted the mvh-add-p-value branch February 18, 2026 17:33

eightbitraptor merged 3 commits into main from mvh-add-p-value
