Calculate and add p-value and stddev to summary table#479

Merged
eightbitraptor merged 3 commits into main from mvh-add-p-value
Feb 18, 2026

@eightbitraptor
Contributor

eightbitraptor commented Feb 18, 2026

This adds a way to determine whether the difference in results between one Ruby and the other could be explained by random noise, or whether the difference is statistically significant.

Using Welch's t-test for p-values because I don't want to assume that both versions of Ruby tested are going to have equal timing distributions.
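
For readers unfamiliar with the test, here is a minimal sketch of the two quantities Welch's t-test is built on: the t statistic and the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and the sample data are hypothetical and are not taken from this PR's code.

```ruby
# Hypothetical helper for illustration; not the implementation in this PR.
# Returns [t, df] for two independent samples without assuming equal variances.
def welch_t(a, b)
  mean = ->(xs) { xs.sum.to_f / xs.size }

  # Unbiased sample variance (divides by n - 1).
  variance = lambda do |xs|
    m = mean.call(xs)
    xs.sum { |x| (x - m)**2 } / (xs.size - 1)
  end

  # Squared standard error of each sample mean.
  se2_a = variance.call(a) / a.size
  se2_b = variance.call(b) / b.size

  t = (mean.call(a) - mean.call(b)) / Math.sqrt(se2_a + se2_b)

  # Welch-Satterthwaite approximation of the degrees of freedom.
  df = (se2_a + se2_b)**2 /
       (se2_a**2 / (a.size - 1) + se2_b**2 / (b.size - 1))

  [t, df]
end

# Made-up timings, purely to exercise the helper.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
puts "t = #{t.round(4)}, df = #{df.round(4)}"
```

Turning `t` and `df` into a p-value additionally requires the CDF of Student's t-distribution, which is left out of this sketch.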

Full p-value information is gated behind --pvalue to avoid widening the table.

Without --pvalue, asterisks are displayed in the ratio column as follows:

Average of last 30, non-warmup iters: 224ms
Total time spent benchmarking: 20s

master: ruby 4.1.0dev (2026-02-17T20:46:23Z master 997bc709db) +YJIT +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-02-18T14:18:38Z mvh-introduce-48-b.. 302d5d3397) +YJIT +PRISM [x86_64-linux]

-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench        master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
knucleotide  233.3        0.4         224.2            0.3         1.048               1.040 (***)
-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
- ***: p < 0.001, **: p < 0.01, *: p < 0.05 (Welch's t-test)

With --pvalue, full details are shown:

Average of last 30, non-warmup iters: 225ms
Total time spent benchmarking: 20s

master: ruby 4.1.0dev (2026-02-17T20:46:23Z master 997bc709db) +YJIT +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-02-18T14:18:38Z mvh-introduce-48-b.. 302d5d3397) +YJIT +PRISM [x86_64-linux]

-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------  -------  ---------
bench        master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment  p-value  sig
knucleotide  233.3        0.4         225.2            0.3         1.044               1.036 (***)        1.3e-36  p < 0.001
-----------  -----------  ----------  ---------------  ----------  ------------------  -----------------  -------  ---------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
- ***: p < 0.001, **: p < 0.01, *: p < 0.05 (Welch's t-test)
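
The thresholds in the legend can be sketched as a small mapping from a p-value to the marker appended to the ratio column. The method names `significance_stars` and `ratio_cell` are hypothetical, and the three-decimal formatting is an assumption based on the example tables above rather than on the merged code.

```ruby
# Hypothetical helpers for illustration; not the implementation in this PR.
# Maps a p-value to the asterisk marker described in the legend.
def significance_stars(p_value)
  if p_value < 0.001 then "***"
  elsif p_value < 0.01 then "**"
  elsif p_value < 0.05 then "*"
  else ""
  end
end

# Formats a ratio cell, appending the marker only when one applies.
def ratio_cell(ratio, p_value)
  stars = significance_stars(p_value)
  text = format("%.3f", ratio)
  stars.empty? ? text : "#{text} (#{stars})"
end

puts ratio_cell(1.036, 1.3e-36)
puts ratio_cell(1.000, 0.2)
```

With a p-value at or above 0.05 the cell carries no marker, which matches the "blank otherwise" behaviour proposed in the discussion.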

eightbitraptor marked this pull request as draft February 18, 2026 15:34
eightbitraptor marked this pull request as ready for review February 18, 2026 15:34
@k0kubun
Member

k0kubun commented Feb 18, 2026

Could we make it an optional feature like --rss, e.g. --p-value? The table seems so wide that it doesn't seem to fit in GitHub PR/issue descriptions/comments. When you're verifying results yourself, this would be nice to have, but when you're showing results to other people, you might want to make the table smaller to make it easier for them to find the most important numbers.

@jhawthorn
Member

I'd really like to have some indication of statistical significance every time this is reported. I think that is one of the most important numbers.

If horizontal space is at a premium, maybe we could just tag the last column with * when there's a significant difference (or maybe an emoji indicating faster vs. slower?)

@eightbitraptor
Contributor Author

@k0kubun @jhawthorn what about using a single extra column showing an icon depending on whether the p-value is < 0.001, < 0.01, or < 0.05, and blank otherwise?

And then the exact values can be gated behind the --pvalue option?

Also bear in mind that the example I pasted here also has the --rss option on, so it's already 2 columns wider than the default.

@k0kubun
Member

k0kubun commented Feb 18, 2026

sounds good to me 👍

actual p-value still gated behind --pvalue
@eightbitraptor
Contributor Author

Ok, fixed, and updated PR description with examples. Thanks both.

eightbitraptor merged commit c64ac86 into main Feb 18, 2026
11 checks passed
eightbitraptor deleted the mvh-add-p-value branch February 18, 2026 17:33