CORE-2070: cloud storage probe for compacted away bytes vectorized_ntp_archiver_compacted_replaced_bytes
#17627
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47386#018ea9b4-89c3-4b74-9762-2ef3a8295976
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47386#018ea9b4-89cc-4937-b8bf-21230c9a94fa
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47575#018ec3f1-9a9e-416d-8620-4296986619af
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47575#018ec401-88ea-4bae-a6eb-40785091da72
vectorized_ntp_archiver_compacted_replaced_bytes
vectorized_ntp_archiver_compacted_replaced_bytes
Return a struct with the byte size of the replaced segments and the byte size of the new segment. Useful for monitoring the change in cloud storage size due to compacted reuploads.
Keep track, in a counter, of the result of the _manifest->add() operation when it is a segment reupload. The counter is not persisted, so its value resets every time the manifest is recreated.
vectorized_ntp_archiver_compacted_replaced_bytes
Update the metric after the manifest is uploaded. The intention is to not expose the value in the metrics until the manifest is visible in cloud_storage.
Monitor vectorized_ntp_archiver_compacted_replaced_bytes and check that it strictly increases if the leader hasn't changed.
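The check described in the last commit can be sketched in isolation. This is a minimal illustration, not the test code from this PR: the function name and the `(leader_id, value)` sample layout are assumptions made for the example. Samples taken across a leadership change are skipped, since the counter is not persisted and a new leader starts from its own value.

```python
def metric_strictly_increases(samples):
    """Return True if the metric strictly increases between consecutive
    samples taken under the same leader.

    `samples` is a list of (leader_id, value) tuples in observation order.
    Both the name and the tuple layout are hypothetical.
    """
    for (prev_leader, prev_val), (leader, val) in zip(samples, samples[1:]):
        if leader != prev_leader:
            # Leadership moved: the in-memory counter is not comparable
            # across leaders, so this pair is not checked.
            continue
        if not prev_val < val:
            return False
    return True
```

For example, `[(1, 10), (1, 25), (2, 0), (2, 5)]` passes because the drop from 25 to 0 coincides with a leader change, while `[(1, 10), (1, 10)]` fails because the value did not grow.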
Force-pushed from 663d85f to 165e9fc.
Addressed review feedback: hooked into a/tests/rptest/tests/shadow_indexing_compacted_topic_test.py to check that the metric strictly increases.
// compacted segment reuploaded note that this value does not correlate to
// the change in size in cloud storage until the original segment(s) are
// garbage collected.
size_t _compacted_replaced_bytes{0};
STM looks like the wrong place for archival metrics. Maybe, instead of maintaining an additional counter here, we can modify _active_operation_res to have type std::optional<ss::promise<result<add_segment_meta_result>>>. Then the value could be propagated to the ntp_archiver, and the ntp_archiver would update the metric.
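The shape of this suggestion can be sketched roughly as follows. It is a simplified Python model, not Redpanda code: `asyncio.Future` stands in for `ss::promise`, and the class names, method names, and fields of `AddSegmentMetaResult` are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class AddSegmentMetaResult:
    # Hypothetical fields; the real add_segment_meta_result may differ.
    bytes_replaced: int
    bytes_added: int


class StmModel:
    """Toy stand-in for the STM: the pending operation carries a result."""

    def __init__(self):
        self._active_operation_res = None  # asyncio.Future or None

    async def add_segment(self, bytes_replaced, bytes_added):
        loop = asyncio.get_running_loop()
        self._active_operation_res = loop.create_future()
        # Pretend the command is applied asynchronously after replication.
        loop.call_soon(self._apply,
                       AddSegmentMetaResult(bytes_replaced, bytes_added))
        try:
            return await self._active_operation_res
        finally:
            self._active_operation_res = None

    def _apply(self, result):
        fut = self._active_operation_res
        if fut is not None and not fut.done():
            fut.set_result(result)


class ArchiverModel:
    """Toy stand-in for the archiver: it owns the metric."""

    def __init__(self, stm):
        self._stm = stm
        self.compacted_replaced_bytes = 0

    async def reupload(self, bytes_replaced, bytes_added):
        res = await self._stm.add_segment(bytes_replaced, bytes_added)
        self.compacted_replaced_bytes += res.bytes_replaced
        return res
```

In this arrangement the STM model holds no metric state at all; the archiver model accumulates the value from the result it awaits.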
It seems like the plumbing required would not be small, and the current _active_operation_res mechanism is used by all the calls to archival_metadata_stm. I think it could be done, but it seems like a very big change just for this metric.
Another alternative is to wire the probe to the archival_metadata_stm so that it can update the counter with the result of partition_manifest::add. What do you think of this alternative?
The counter's new value would be visible before the manifest is uploaded, but eventually the two should catch up. However, there would be some misreporting when the leadership changes in the middle of the operation.
Let's keep it like this then.
compacted_replaced_bytes_increasing = m.evaluate([
(metric, lambda old, new: old < new)
])
new_topic_describe = list(self._rpk_client.describe_topic( |
In my experience this could be flaky because rpk can return an empty list in some cases. This could be fixed by using wait_until.
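A rough sketch of the retry pattern suggested here. To keep the example self-contained, `wait_until` below is a small local stand-in written for this illustration rather than the ducktape helper, and `describe_topic_with_retry` and its `describe` callable are hypothetical names, not part of rpk or rptest.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Local stand-in: poll `condition` until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def describe_topic_with_retry(describe, topic, timeout_sec=30, backoff_sec=0.1):
    """Call `describe(topic)` until it yields a non-empty list.

    `describe` is any callable returning an iterable of partition
    descriptions; an empty result is treated as "not ready yet".
    """
    result = []

    def non_empty():
        nonlocal result
        result = list(describe(topic))
        return len(result) > 0

    wait_until(non_empty, timeout_sec, backoff_sec,
               err_msg=f"describe of {topic} stayed empty")
    return result
```

The point of the pattern is that an empty answer is retried rather than treated as a test failure, so a transiently empty response no longer makes the test flaky.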
good point, fixed
new failures in https://buildkite.com/redpanda/redpanda/builds/47575#018ec3f1-9a9e-416d-8620-4296986619af:
failures are gtest_raft_rpunit and #16744
lgtm
vectorized_ntp_archiver_compacted_replaced_bytes is a gauge metric updated inside ntp_archiver::upload_manifest after the manifest is observable in cloud storage.
The value is updated when archival_metadata_stm adds a segment to the partition manifest and the operation results in a replacement operation.
The value is not part of the archival_metadata_stm state; it's only there for observability. As such, it is not replicated and is not part of the snapshot.
The value will reset when the manifest is created or deserialized.
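The behavior described above can be illustrated with a simplified Python model. This is a sketch under stated assumptions and not the Redpanda implementation: the class names, the offset-range rule used to decide which segments are replaced, and the dictionary used as the serialized form are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddSegmentResult:
    bytes_replaced: int  # total size of the segments superseded by the new one
    bytes_added: int     # size of the newly added segment


class ManifestModel:
    """Toy manifest: base_offset -> (last_offset, size_bytes)."""

    def __init__(self):
        self._segments = {}
        # Observability only: never serialized, so it starts at 0 again
        # whenever a manifest is created or deserialized.
        self.compacted_replaced_bytes = 0

    def add(self, base_offset, last_offset, size_bytes):
        # Assumption: a segment is replaced if the new one fully covers it.
        covered = [b for b, (last, _) in self._segments.items()
                   if base_offset <= b and last <= last_offset]
        bytes_replaced = sum(self._segments.pop(b)[1] for b in covered)
        self._segments[base_offset] = (last_offset, size_bytes)
        self.compacted_replaced_bytes += bytes_replaced
        return AddSegmentResult(bytes_replaced, size_bytes)

    def serialize(self):
        return dict(self._segments)  # the counter is deliberately left out

    @classmethod
    def deserialize(cls, data):
        manifest = cls()
        manifest._segments = dict(data)
        return manifest


class ArchiverModel:
    def __init__(self, manifest):
        self.manifest = manifest
        self.gauge = 0  # stands in for the exported metric

    def upload_manifest(self, upload):
        upload(self.manifest.serialize())
        # Publish only once the upload call has returned, so the metric
        # never runs ahead of what is visible in cloud storage.
        self.gauge = self.manifest.compacted_replaced_bytes
```

In this model, replacing two 100-byte segments with one 120-byte segment reports 200 replaced bytes, the gauge stays at 0 until `upload_manifest` runs, and a manifest rebuilt through `deserialize` starts counting from 0 again.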
Fixes https://github.com/redpanda-data/core-internal/issues/1236
Backports Required
Release Notes
Features