
tracing: reorganize benchmarks for comparability #2178

Merged 2 commits into `master` on Jun 24, 2022

Conversation

@hawkw (Member) commented on Jun 24, 2022

Currently, `tracing`'s benchmark suite benchmarks the same behaviors
(e.g. creating a span, recording an event, etc.) across a handful of
cases: with no default dispatcher, with a global default, and with a
scoped (thread-local) default. We use Criterion's `benchmark_group` to
represent each kind of dispatcher, and `bench_function` for each
behavior being measured.
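
Roughly, the old layout looked like this (a minimal sketch, not the exact code in the suite; the `no_dispatch` group name and the benchmark bodies are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Old layout: one group per dispatcher kind...
fn bench_no_dispatch(c: &mut Criterion) {
    let mut group = c.benchmark_group("no_dispatch");
    // ...and one bench_function per behavior being measured.
    group.bench_function("enter_span", |b| {
        let span = tracing::info_span!("span");
        b.iter(|| {
            let _enter = span.enter();
        });
    });
    group.bench_function("event", |b| {
        b.iter(|| tracing::info!("hello world!"));
    });
    group.finish();
}

criterion_group!(benches, bench_no_dispatch);
criterion_main!(benches);
```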

This is actually kind of backwards relative to how Criterion is
_supposed_ to be used. `bench_function` is intended for comparing
different implementations of the *same* thing, with the
`benchmark_group` representing what's being compared. If we inverted the
structure of these benchmarks, Criterion would give us nicer plots that
would allow comparing the performance of each dispatch type on the same
task.

This PR reorganizes the benchmarks so that each behavior being tested
(such as entering a span or recording an event) is a `benchmark_group`,
and each dispatch type (none, global, or scoped) is a `bench_function`
within that group. Now, Criterion will generate plots for each group
comparing the performance of each dispatch type in that benchmark.
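
In the new layout, the same sketch inverts (again illustrative; `EnabledSubscriber` here is a stand-in for whatever shared test subscriber the benchmarks actually use):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use tracing::{span, Event, Metadata, Subscriber};

// A stand-in subscriber that enables everything and does no real work.
struct EnabledSubscriber;

impl Subscriber for EnabledSubscriber {
    fn enabled(&self, _: &Metadata<'_>) -> bool {
        true
    }
    fn new_span(&self, _: &span::Attributes<'_>) -> span::Id {
        span::Id::from_u64(1)
    }
    fn record(&self, _: &span::Id, _: &span::Record<'_>) {}
    fn record_follows_from(&self, _: &span::Id, _: &span::Id) {}
    fn event(&self, _: &Event<'_>) {}
    fn enter(&self, _: &span::Id) {}
    fn exit(&self, _: &span::Id) {}
}

// New layout: one group per behavior being measured...
fn bench_enter_span(c: &mut Criterion) {
    let mut group = c.benchmark_group("enter_span");
    // ...and one bench_function per dispatch type.
    group.bench_function("none", |b| {
        let span = tracing::info_span!("span");
        b.iter(|| {
            let _enter = span.enter();
        });
    });
    group.bench_function("scoped", |b| {
        tracing::subscriber::with_default(EnabledSubscriber, || {
            let span = tracing::info_span!("span");
            b.iter(|| {
                let _enter = span.enter();
            });
        });
    });
    group.finish();
}

criterion_group!(benches, bench_enter_span);
criterion_main!(benches);
```

The `global` case would look the same, except that it installs the subscriber with `tracing::subscriber::set_global_default` instead, which is what forces the per-file split described below.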

For example, we now get nice comparisons like this:
![image](https://user-images.githubusercontent.com/2796466/175659314-835664ac-a8cf-4b07-91ee-f8cfee77bfbb.png)

Unfortunately, this required splitting each benchmark type out into its
own file. This is because, once we set the global default dispatcher
within one benchmark group, it would remain set for the entire lifetime
of the process --- there would be no way to test something else with no
global default. But I think this is fine, even though it means we now
have a bunch of tiny files: it also allows us to run an individual
benchmark against every combination of dispatch types, without having to
run unrelated benches. This is potentially useful if (for example)
someone is changing only the code for recording events, and not spans.
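
The underlying constraint can be sketched like this (`NoSubscriber` is used purely as a placeholder; any subscriber would do). Since each Criterion bench file is compiled into its own binary, each benchmark gets a fresh process and therefore a fresh, unset global default:

```rust
use tracing::subscriber::{set_global_default, NoSubscriber};

fn main() {
    // The first call installs the global default for the rest of the process.
    set_global_default(NoSubscriber::default()).expect("no global default set yet");

    // It can never be un-set or replaced: a second call fails. So a
    // "no global default" benchmark can't run in the same process as a
    // "global default" one, hence one bench file (binary) per benchmark.
    assert!(set_global_default(NoSubscriber::default()).is_err());
}
```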

@hawkw requested review from davidbarsky and a team as code owners on June 24, 2022 20:00
@hawkw merged commit 7008df9 into master on Jun 24, 2022
@hawkw deleted the eliza/benchmark-consistency branch on June 24, 2022 20:30
hawkw added a commit that referenced this pull request Jun 24, 2022
kaffarell pushed a commit to kaffarell/tracing that referenced this pull request May 22, 2024