
tests: Add produce microbenchmark #16386

Merged
merged 1 commit into redpanda-data:dev on Feb 22, 2024

Conversation

mfleming
Contributor

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@mfleming
Contributor Author

Here's a flamegraph from a run of the benchmark.

[attachment: out (flamegraph image)]

@mfleming
Contributor Author

Also worth noting:

  • Writes to a single topic, single partition
  • It only uses RF=1
  • Doesn't use transactional / idempotent batches

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 31, 2024

std::vector<ss::future<>> dispatched;
std::vector<ss::future<kafka::produce_response::partition>> produced;

perf_tests::start_measuring_time();
Member

I am confused, this is now being run 10k times in different fibers for a single benchmark loop?

Contributor Author

Yeah good point, this isn't going to work correctly.

Since my original motivation for cranking up the concurrency was to get a useful flamegraph, we can just remove the concurrency and simplify things.
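For illustration, a minimal sketch of what the simplified, single-fiber benchmark body could look like once the concurrency is removed; make_produce_request() and produce_partition() are hypothetical stand-ins for the fixture helpers, not the merged code:

// One produce request per measured iteration, awaited to completion.
ss::future<> produce_one_batch() {
    auto request = make_produce_request();               // hypothetical helper
    perf_tests::start_measuring_time();
    auto stages = produce_partition(std::move(request)); // hypothetical helper
    co_await std::move(stages.dispatched);               // request accepted
    co_await std::move(stages.produced);                 // batch produced
    perf_tests::stop_measuring_time();
}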

@mfleming force-pushed the pr/15363 branch 2 times, most recently from f5c97ee to 1719c49, on February 5, 2024 12:13
@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 5, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/44691#018d795d-0172-4938-bcef-8d5ec0e678bf:

"rptest.tests.offset_for_leader_epoch_test.OffsetForLeaderEpochTest.test_offset_for_leader_epoch"

new failures in https://buildkite.com/redpanda/redpanda/builds/45193#018dcdd0-f5c3-4ceb-8c88-3000864ecbb1:

"rptest.tests.data_transforms_test.DataTransformsLoggingTest.test_logs_volume"

@StephanDollberg (Member) left a comment

Could you share a screenshot of the flamegraph with the produce bit zoomed in? Or share the whole file?

builder.add_raw_kv(iobuf{}, iobuf{});
}

auto batch = std::move(builder).build();
Member

So will this be 100 batches or one? I thought one request can only contain a single batch per partition?

Contributor Author

Sorry, this variable is poorly named -- it's one batch with 100 records.

@mfleming
Contributor Author

mfleming commented Feb 6, 2024

Flamegraph without concurrency.

[attachment: out.svg.gz (flamegraph)]

produce_partition_fixture() {
BOOST_TEST_CHECKPOINT("before leadership");

wait_for_controller_leadership().get0();
Member

nit: prefer get()
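i.e., the suggested form would simply be:

wait_for_controller_leadership().get();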


constexpr size_t num_records = 100;
for (size_t i = 0; i < num_records; ++i) {
builder.add_raw_kv(iobuf{}, iobuf{});
Member

Can we put some data in the messages? Seems like something easy to make the test parameterizable on: the size of the payload.
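For illustration, a hedged sketch of how the payload size could become a parameter; record_size and the payload buffer are assumptions for this sketch, not the merged code:

constexpr size_t num_records = 100;
constexpr size_t record_size = 1024; // hypothetical payload-size knob
const std::vector<char> payload(record_size, 'x');
for (size_t i = 0; i < num_records; ++i) {
    iobuf value;
    value.append(payload.data(), payload.size()); // copy the payload into the record value
    builder.add_raw_kv(iobuf{}, std::move(value));
}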

dispatched.push_back(std::move(stages.dispatched));
produced.push_back(std::move(stages.produced));

co_await ss::when_all_succeed(dispatched.begin(), dispatched.end())
Member

I don't understand the purpose of the dispatched vector: it only ever has one element in it? Removing it would simplify this a lot and would probably also allow a pure coroutine style.

Contributor Author

Good point. Same for produced. Removing.
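For illustration, the simplified shape once the single-element vectors go away could look roughly like this (the call that yields stages is a placeholder for whatever the fixture actually exposes):

auto stages = produce_partition(std::move(request)); // placeholder for the real call
co_await std::move(stages.dispatched);               // wait for the dispatch stage
auto resp = co_await std::move(stages.produced);     // wait for the partition response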

@@ -789,7 +789,7 @@ class redpanda_thread_fixture {
{ r.encode(writer, version) } -> std::same_as<void>;
}
kafka::request_context make_request_context(
-    RequestType request, kafka::request_header& header, conn_ptr conn = {}) {
+    RequestType& request, kafka::request_header& header, conn_ptr conn = {}) {
Member

Can you give some detail here? This doesn't strike me as a safe change without some additional information, since callers may now see the request they are passing in mutated, as opposed to having the copy mutated (leaving the caller none the wiser).

Contributor Author

Yeah, you're right that this is a risk. I checked the code paths and didn't see any modifications to the request, so I thought this would be safe (though admittedly this doesn't guard against future changes).

Changing to a reference was necessary because partition_produce_data has a deleted default copy constructor due to its records field, e.g. `std::optional<kafka::produce_request_record_data> records{}`, and unlike `kafka::fetch_request` it's not possible to copy a `kafka::produce_request` as-is. Would you prefer to see a user-defined copy constructor for `partition_produce_data`?

Member

If it is not modified, we could just make the parameter type const RequestType&?

However, I think the second part of your reply is suggesting it is modified: do we move out of request in order to avoid copying? "Moving out of" is a modification.

So I think if it is truly not modified, make it const. If it is modified, we should try to leave the signature alone and move it at the call site.

Contributor Author

Of course you're right, it is modified :)

Switched to doing a move at the call site.
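For illustration, a rough sketch of the agreed approach: keep the by-value signature and make the ownership transfer explicit at the call site (produce_req and header here are hypothetical locals, not code from the PR):

// Signature unchanged: still takes the request by value.
kafka::request_context make_request_context(
  RequestType request, kafka::request_header& header, conn_ptr conn = {});

// Caller moves the non-copyable produce request in explicitly:
auto ctx = make_request_context(std::move(produce_req), header);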

@travisdowns (Member) left a comment

See comments.

@travisdowns (Member) left a comment

Continued the thread about the parameter type change. It's the only outstanding issue as far as I'm aware.

This one is similar to the fetch plan benchmark.

Right now it only handles a single partition with RF=1 and doesn't use
transactional / idempotent batches.
@mfleming
Contributor Author

I'm getting a timeout for gtest_raft_rpunit, which @mmaslankaprv is looking at. But should I just merge this anyway, since it only adds a new benchmark?

@StephanDollberg
Member

Sounds fine to me. Will need actioning from @piyushredpanda.

@piyushredpanda merged commit 0b5bb0b into redpanda-data:dev on Feb 22, 2024
13 of 16 checks passed