Our current benchmarking setup measures executed CPU instructions, which according to this blog post can reliably identify the speed difference between two versions of rustls:
> It correlates fairly well with wall-time. You’d be mad to use it to compare the speed of two different programs, but it is very useful for comparing two slightly different versions of the same program, which are likely to have a similar instruction mix.
IMO we can safely assume most PRs propose changes that are "slightly different" from the existing version of the code, meaning that the instruction count is a reliable metric to judge their performance impact.
What about bigger changes? Talking to @nnethercote (author of the quote above), he mentioned that the correlation between instruction counts and wall-time decreases when there are bigger changes, or when they significantly affect memory layout. It will probably take some time to develop an intuition of which changes fall under this category (maybe #1448 is an instance), but it is clear that we need a setup to obtain wall-time measurements in those cases.
I'm currently thinking of adding flags to the bench runner that allow:
- Measuring wall-time instead of icounts (running each benchmark multiple times);
- Comparing the resulting time distributions between two runs.
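To make the shape of this concrete, here is a minimal sketch of what wall-time sampling and a distribution comparison could look like. All names here are hypothetical, not the actual bench runner API, and a real comparison would use a proper statistical test rather than a median threshold:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: run a benchmark closure `runs` times and collect
// wall-time samples, instead of recording a single instruction count.
fn sample_wall_times<F: FnMut()>(mut bench: F, runs: usize) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        bench();
        samples.push(start.elapsed());
    }
    samples
}

// Median of a set of samples (sorts in place).
fn median(samples: &mut [Duration]) -> Duration {
    samples.sort();
    samples[samples.len() / 2]
}

// Naive comparison of two time distributions: flag a regression when the
// candidate's median exceeds the baseline's by more than `threshold`
// (e.g. 0.05 for 5%).
fn is_regression(
    baseline: &mut [Duration],
    candidate: &mut [Duration],
    threshold: f64,
) -> bool {
    let base = median(baseline).as_secs_f64();
    let cand = median(candidate).as_secs_f64();
    (cand - base) / base > threshold
}
```

Using the median rather than the mean makes the comparison somewhat robust to outlier runs, which matter more for wall-time than for instruction counts.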
The idea would be to manually trigger a wall-time bench run when a reviewer considers it necessary.
By the way, this all assumes we can run the wall-time benchmarks on dedicated, properly configured hardware. I'm currently arranging an OVH bare-metal machine sponsored by ISRG (let me know if you'd prefer something else).