Improve runtime benchmark CLI output #1463
Conversation
(force-pushed from 2051afc to f492961)
I'm concerned about our terminology here, i.e. the meaning of words like "result", "measurement", and "metric". It looks like this patch changes some of these in the code. We have a glossary; I'll need to look over it carefully to see whether the code matches. (And if some of the definitions in the glossary are bad, we can consider changing them.)
I agree that the terminology of the runtime benchmarks should be improved. The annoying thing is that the runtime benchmarks have a slightly different structure than the compile-time benchmarks. For example, there are currently individual benchmarks, but they are also grouped into benchmark suites (binaries), which doesn't really exist for compile time. Another difference is that we will most likely measure each runtime benchmark repeatedly (the code already does this), which again isn't done for compile-time benchmarks. So we may need to update some of the terminology. I'll start by aligning this PR with the existing terms.
I updated the code and documentation to hopefully align better with the previous meanings of the terms.
(force-pushed from f89e283 to b3e79ca)
I haven't mentioned every place that will need changing; e.g. there are mentions of "suite" in various comments that should be changed to "group" or "binary", and the benchmark_suite macro added in #1464 should be renamed to benchmark_group.
I think what I've suggested above makes sense. However, I am now wondering whether we really need the "benchmark" vs "benchmark group" distinction. I can see why some people might want, for example, to have multiple hashmap benchmarks: insertions, deletions, mixed, etc. But I also think that pushes us strongly in the direction of microbenchmarks, which I want to avoid. I'd prefer to have, for example, a single hashmap benchmark that replays a variety of insertions/deletions/operations that come from a real program. (Relatedly, for the compile-time benchmarks, at one point we had multiple …) What do you think?
I like the new terminology; suite -> group -> benchmark makes sense. I'm opposed to removing the groups, though. If we didn't have groups, we'd need either one benchmark per binary (lots of unnecessary code in the repo, additional overhead for building the benchmarks, additional overhead for running the benchmark suite) or all benchmarks in a single binary (no way to choose different dependency versions or compilation flags for different benchmarks, higher compilation latency when experimenting with the benchmarks).

It's true that a binary per benchmark would probably discourage micro-benchmarks somewhat, but it doesn't seem like a good way to enforce that; it seems to me that it would actually discourage adding new benchmarks altogether, if you always had to create a new crate rather than just add a new macro and a closure in some existing benchmark.

I also don't think it's that useful to discourage microbenchmarks in general. It's definitely great if we have snippets from real-world programs, but I also think that micro-benchmarking things that are crucial and ubiquitous (like sorting or hashmap ops) is worth it. With the runtime benchmarks we have a big advantage over the compile-time benchmarks: we can benchmark exactly the thing that interests us. With a compile-time benchmark, even if you compile hello world, you always execute an enormous amount of code that isn't really hello-world specific. It therefore might not make sense to have 5 CTFE benchmarks, since it will be hard to distinguish them anyway, as e.g. only 10 % of the execution will be benchmark specific. But with runtime benchmarks we can actually distinguish 5 different microbenchmarks, and if only one of them regresses, that can make it easier to investigate what happened. With a single hashmap benchmark that does 5 things, it will take more time to investigate what has regressed than with 5 microbenchmarks that each do only a single thing.

Of course, there are also disadvantages to microbenchmarks (more noise, not applicable to real-world usage), so we should avoid things like …
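For concreteness, here is a self-contained sketch of the suite -> group -> benchmark shape being discussed: one binary acts as a group and registers several hashmap benchmarks. The Group type and its register/run_all methods are invented for illustration and are not the actual benchlib API from this PR.

```rust
use std::collections::HashMap;
use std::hint::black_box;
use std::time::Instant;

// Invented stand-in for a benchmark group: one binary corresponds to one
// group, which holds several named benchmarks. Not the real benchlib API.
struct Group {
    name: &'static str,
    benchmarks: Vec<(&'static str, Box<dyn Fn()>)>,
}

impl Group {
    fn new(name: &'static str) -> Self {
        Group { name, benchmarks: Vec::new() }
    }

    fn register(&mut self, name: &'static str, f: impl Fn() + 'static) {
        self.benchmarks.push((name, Box::new(f)));
    }

    // Run every registered benchmark once and print its wall-clock time.
    fn run_all(&self) {
        for (name, f) in &self.benchmarks {
            let start = Instant::now();
            f();
            println!("{}/{}: {:?}", self.name, name, start.elapsed());
        }
    }
}

fn main() {
    let mut group = Group::new("hashmap");
    group.register("insert", || {
        let mut map = HashMap::new();
        for i in 0..100_000u64 {
            map.insert(i, i);
        }
        black_box(&map);
    });
    group.register("remove", || {
        let mut map: HashMap<u64, u64> = (0..100_000).map(|i| (i, i)).collect();
        for i in 0..100_000u64 {
            map.remove(&i);
        }
        black_box(&map);
    });
    group.run_all();
}
```

With this shape, adding a benchmark is just one more register call in an existing binary, while per-group binaries still allow different dependency versions or compilation flags per group.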
database -> suite, suite -> group
(force-pushed from b3e79ca to 1a633b0)
I'm happy to continue with groups for now, even though all the current groups only have one benchmark in them.
Quite a few suggestions for small changes, but I think we're very close.
I addressed the comments. I think that the only thing left is to decide on the name for …
Good enough, IMO. Ship it!
Benchmark results are now printed interactively, and basic statistics of the results are shown:
