Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-2070: cloud storage probe for compacted away bytes vectorized_ntp_archiver_compacted_replaced_bytes #17627

Merged

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Apr 4, 2024

vectorized_ntp_archiver_compacted_replaced_bytes is a gauge metric updated inside ntp_archiver::upload_manifest after the manifest is observable in cloud storage.

The value is updated when archival_metadata_stm adds a segment to the partition manifest and the operation results in a replacement operation.

The value is not part of the archival_metadata_stm state; it's only there for observability. As such, it is not replicated and is not part of the snapshot.
The value will reset when the manifest is created or deserialized.

Fixes https://github.com/redpanda-data/core-internal/issues/1236

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Features

  • new metric vectorized_ntp_archiver_compacted_replaced_bytes

src/v/cloud_storage/partition_manifest.h Outdated Show resolved Hide resolved
src/v/cloud_storage/partition_manifest.cc Outdated Show resolved Hide resolved
src/v/cloud_storage/partition_manifest.h Show resolved Hide resolved
src/v/cloud_storage/partition_manifest.h Show resolved Hide resolved
@dotnwat dotnwat changed the title cloud storage probe for compacted away bytes vectorized_ntp_archiver_compacted_replaced_bytes CORE-2070: cloud storage probe for compacted away bytes vectorized_ntp_archiver_compacted_replaced_bytes Apr 8, 2024
return a struct with the byte-size of the replaced segments and the
byte-size for the new segment.

useful to monitor change in cloud storage size due to compacted
reuploads
keep track in a counter of the result of the _manifest->add() operation,
if it's a segment reupload operation.

the counter is not persisted, so the value is reset every time the
manifest is recreated
vectorized_ntp_archiver_compacted_replaced_bytes
Perform update of the metric after the manifest is uploaded.

The intention is to not show in the metrics the value until the manifest
is visible in cloud_storage.
monitor vectorized_ntp_archiver_compaction_replaced_bytes, check that it
strictly increases if the leader hasn't changed.
@andijcr andijcr force-pushed the feat/cloud_storage_compaction_metric branch from 663d85f to 165e9fc Compare April 9, 2024 16:24
@andijcr andijcr requested a review from Lazin April 9, 2024 16:28
@andijcr
Copy link
Contributor Author

andijcr commented Apr 9, 2024

review feedback, hooked in a/tests/rptest/tests/shadow_indexing_compacted_topic_test.py to check that metric strictly increases

// compacted segment reuploaded note that this value does not correlate to
// the change in size in cloud storage until the original segment(s) are
// garbage collected.
size_t _compacted_replaced_bytes{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STM looks like the wrong place for archival metrics. Maybe, instead of maintaining additional counter here we can modify _active_operation_res to have type std::optional<ss::promise<result<add_segment_meta_result>>>. Then the value could be propagated to the ntp_archiver. And the ntp_archiver will update the metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like the plumbing required would not be small, and the current mechanism of _active_operation_res is used by all the calls to archival_metadata_stm. I think it could be done, but it seems like a very big change just for this metric.

Another alternative is to wire the probe to the archival_metadata_stm so that it can update the counter with the result of partition_manifest::add. What do you think of this alternative?
The counter new value would be visible before the manifest is uploaded, but eventually, they should catch up. However, there would be some misreporting when the leadership changes in the middle of the operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it like this then.

compacted_replaced_bytes_increasing = m.evaluate([
(metric, lambda old, new: old < new)
])
new_topic_describe = list(self._rpk_client.describe_topic(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience this could be flaky because rpk can return an empty list in some cases. This could be fixed by using wait_until.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, fixed

@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/47575#018ec3f1-9a9e-416d-8620-4296986619af:

"rptest.tests.topic_delete_test.TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test"

@andijcr andijcr requested a review from Lazin April 10, 2024 08:07
@andijcr
Copy link
Contributor Author

andijcr commented Apr 10, 2024

failures are gtest_raft_rpunit and #16744

@piyushredpanda piyushredpanda merged commit c4e9ab2 into redpanda-data:dev Apr 10, 2024
17 of 23 checks passed
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants