Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write caching - raft - metrics #17179

Merged
merged 3 commits into from
Mar 29, 2024
Merged

write caching - raft - metrics #17179

merged 3 commits into from
Mar 29, 2024

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Mar 19, 2024

Adds a couple of simple metrics for visibility into write caching and raft in general

vectorized_raft_replicate_ack_all_requests_no_flush - # of quorum ack requests without flush (aka with write caching)
vectorized_raft_batch_size_bytes - a histogram of batch sizes as flushed by the replicate batcher, gives visibility into batching at raft layer.

Both are internal metrics at partition scope.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Adds metric for number of quorum ack requests with write caching.

src/v/raft/probe.cc Outdated Show resolved Hide resolved
@bharathv bharathv self-assigned this Mar 19, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 19, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46415#018e5449-9ae8-4d83-ab4d-e32efc692a3a:

"rptest.tests.raft_availability_test.RaftAvailabilityTest.test_one_node_down"

new failures in https://buildkite.com/redpanda/redpanda/builds/46415#018e545a-543c-40be-a59c-79faa6c1fd15:

"rptest.tests.raft_availability_test.RaftAvailabilityTest.test_one_node_down"

new failures in https://buildkite.com/redpanda/redpanda/builds/46900#018e80f6-6c3a-4ea1-bc08-90cc59b9502a:

"rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_reset_from_cloud.cloud_storage_type=CloudStorageType.S3"

@bharathv
Copy link
Contributor Author

failure: #16561 unrelated.

dotnwat
dotnwat previously approved these changes Mar 23, 2024
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines 235 to 236
for (auto& b : batches) {
total_batch_size += b.size_bytes();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth a note in the commit message or here about the difference between "batch"-size and batching efficiency. I think we are measuring how to the batcher accumulates things, but it's a little ambiguous since it is also counting batch-size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to clarify a bit (renamed metric too), lmk if that makes it better.

but it's a little ambiguous since it is also counting batch-size.

I'm trying to measure how efficient batching is by measuring the size (in bytes) of a batch as that translates to size of a single append. So larger byte size batches translate to larger writes, this was the thought process. Added this info in the commit message.

@@ -156,6 +162,14 @@ void probe::setup_metrics(const model::ntp& ntp) {
[this] { return _full_heartbeat_requests; },
sm::description("Number of full heartbeats sent by the leader"),
labels),
sm::make_histogram(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember there have been warnings about adding new per-ntp histograms because they put undue stress on the metrics-collecting infrastructure (even when aggregated across partitions). As this is more of a performance diagnostics metric, maybe we can get by with a single shard-local histogram without per-ntp granularity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, removing this for now as discussed offline. Perhaps would be nice to have a mode that a user can enable on the fly (temporarily) to collect more granular metrics.

@bharathv
Copy link
Contributor Author

/ci-repeat 1

@bharathv bharathv merged commit d4b24d5 into redpanda-data:dev Mar 29, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants