write caching - raft - metrics #17179
new failures in https://buildkite.com/redpanda/redpanda/builds/46415#018e5449-9ae8-4d83-ab4d-e32efc692a3a:
new failures in https://buildkite.com/redpanda/redpanda/builds/46415#018e545a-543c-40be-a59c-79faa6c1fd15:
new failures in https://buildkite.com/redpanda/redpanda/builds/46900#018e80f6-6c3a-4ea1-bc08-90cc59b9502a:
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46424#018e5506-0524-4f39-8ca7-f0f56ab82c04
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46424#018e5506-0528-4cf4-9d21-cc90302576f1
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46424#018e5518-064c-4684-b8d7-297a9d821396
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46900#018e80f6-6c40-448d-9bbb-25bcbb2355ab
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46900#018e80f2-157f-4026-a2a5-ba11e73c21b2
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47092#018e8bee-a99a-4e13-8434-931e6727e865
failure: #16561 unrelated.
lgtm
src/v/raft/replicate_batcher.cc
Outdated
for (auto& b : batches) {
    total_batch_size += b.size_bytes();
It might be worth a note in the commit message or here about the difference between "batch" size and batching efficiency. I think we are measuring how the batcher accumulates things, but it's a little ambiguous since it is also counting batch size.
Tried to clarify a bit (renamed the metric too), lmk if that makes it better.
> but it's a little ambiguous since it is also counting batch-size.
I'm trying to measure how efficient batching is by measuring the size (in bytes) of a batch, since that translates to the size of a single append. Larger batches (in bytes) translate to larger writes; that was the thought process. Added this info to the commit message.
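A minimal sketch of the measurement described above, assuming one histogram sample per flush of the batcher. `fake_batch`, `size_histogram`, and `record_flush` are hypothetical names made up for illustration; they are not the types used in `replicate_batcher.cc`, and the power-of-two bucketing is an assumption rather than the bucketing the real histogram uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a record batch; only its byte size matters here.
struct fake_batch {
    std::size_t bytes;
    std::size_t size_bytes() const { return bytes; }
};

// Hypothetical histogram with power-of-two buckets (an assumption, not the
// histogram type used by the actual probe).
class size_histogram {
public:
    static constexpr std::size_t num_buckets = 32;

    // Bucket i holds values in [2^i, 2^(i+1)); 0 and 1 land in bucket 0 and
    // everything at or above 2^31 lands in the last bucket.
    static std::size_t bucket_for(std::uint64_t value) {
        std::size_t idx = 0;
        while (value > 1 && idx + 1 < num_buckets) {
            value >>= 1;
            ++idx;
        }
        return idx;
    }

    void record(std::uint64_t value) {
        ++_counts[bucket_for(value)];
        ++_total;
    }

    std::uint64_t count(std::size_t bucket) const {
        assert(bucket < num_buckets);
        return _counts[bucket];
    }

    std::uint64_t total() const { return _total; }

private:
    std::array<std::uint64_t, num_buckets> _counts{};
    std::uint64_t _total = 0;
};

// One flush contributes one sample: the summed byte size of everything the
// batcher accumulated, which approximates the size of the resulting append.
std::size_t record_flush(size_histogram& hist,
                         const std::vector<fake_batch>& batches) {
    std::size_t total_batch_size = 0;
    for (const auto& b : batches) {
        total_batch_size += b.size_bytes();
    }
    hist.record(total_batch_size);
    return total_batch_size;
}
```

In this sketch the sample is the byte total of a flush, not the number of batches in it, which is the distinction the review comment asks to make explicit.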
src/v/raft/probe.cc
Outdated
@@ -156,6 +162,14 @@ void probe::setup_metrics(const model::ntp& ntp) {
[this] { return _full_heartbeat_requests; },
sm::description("Number of full heartbeats sent by the leader"),
labels),
sm::make_histogram(
I remember there have been warnings about adding new per-ntp histograms because they put undue stress on the metrics-collecting infrastructure (even when aggregated across partitions). As this is more of a performance diagnostics metric, maybe we can get by with a single shard-local histogram without per-ntp granularity?
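To make the cardinality concern concrete, here is a rough sketch comparing the two layouts. All types (`bucket_histogram`, `per_ntp_metrics`, `shard_local_metrics`) are hypothetical and only count how many series each layout would expose; they say nothing about how Seastar or Redpanda actually register or aggregate metrics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical fixed-bucket histogram; each bucket is assumed to become one
// exported series.
struct bucket_histogram {
    explicit bucket_histogram(std::size_t buckets) : counts(buckets, 0) {}

    void record(std::size_t bucket) {
        assert(bucket < counts.size());
        ++counts[bucket];
    }

    std::vector<std::uint64_t> counts;
};

// Per-partition layout: one histogram per ntp, so the number of series grows
// linearly with the number of partitions.
struct per_ntp_metrics {
    explicit per_ntp_metrics(std::size_t buckets) : buckets(buckets) {}

    void record(const std::string& ntp, std::size_t bucket) {
        auto it = by_ntp.try_emplace(ntp, buckets).first;
        it->second.record(bucket);
    }

    std::size_t exported_series() const { return by_ntp.size() * buckets; }

    std::size_t buckets;
    std::map<std::string, bucket_histogram> by_ntp;
};

// Shard-local layout: every partition on the shard records into the same
// histogram, so the series count stays constant.
struct shard_local_metrics {
    explicit shard_local_metrics(std::size_t buckets) : hist(buckets) {}

    void record(const std::string& /*ntp*/, std::size_t bucket) {
        hist.record(bucket);
    }

    std::size_t exported_series() const { return hist.counts.size(); }

    bucket_histogram hist;
};
```

The shard-local variant gives up the ability to attribute a sample to a partition, which seems acceptable for a performance-diagnostics metric as suggested above.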
OK, removing this for now as discussed offline. Perhaps it would be nice to have a mode that a user can enable on the fly (temporarily) to collect more granular metrics.
/ci-repeat 1
Adds a couple of simple metrics for visibility into write caching and raft in general:
vectorized_raft_replicate_ack_all_requests_no_flush - number of quorum ack requests without flush (i.e. with write caching)
vectorized_raft_batch_size_bytes - a histogram of batch sizes as flushed by the replicate batcher; gives visibility into batching at the raft layer.
Both are internal metrics at partition scope.
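As an illustration of what the no-flush counter tracks, the following is a small sketch of a probe that counts quorum-ack requests and the subset acknowledged without waiting for a flush. `replicate_probe` and its member names are hypothetical and do not mirror the actual `raft::probe` interface; pairing the counter with a total in order to derive a ratio is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical probe; names are illustrative, not the real raft::probe API.
class replicate_probe {
public:
    // Called once per quorum-ack (acks=all) replicate request.
    // waits_for_flush is false when write caching lets the request be
    // acknowledged without waiting for a flush.
    void on_ack_all_request(bool waits_for_flush) {
        ++_ack_all_requests;
        if (!waits_for_flush) {
            ++_ack_all_requests_no_flush;
        }
    }

    std::uint64_t ack_all_requests() const { return _ack_all_requests; }

    std::uint64_t ack_all_requests_no_flush() const {
        return _ack_all_requests_no_flush;
    }

    // Fraction of quorum-ack requests served without a flush; 0 if no
    // requests have been seen yet.
    double no_flush_ratio() const {
        assert(_ack_all_requests_no_flush <= _ack_all_requests);
        if (_ack_all_requests == 0) {
            return 0.0;
        }
        return static_cast<double>(_ack_all_requests_no_flush)
               / static_cast<double>(_ack_all_requests);
    }

private:
    std::uint64_t _ack_all_requests = 0;
    std::uint64_t _ack_all_requests_no_flush = 0;
};
```

A ratio close to 1 in this sketch would indicate that most quorum-ack traffic on the partition goes through the write-caching path.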
Backports Required
Release Notes
Improvements