Feat/cluster metadata manifest age metric #17404

Merged

Conversation

andijcr
Contributor

@andijcr andijcr commented Mar 26, 2024

define a new metric redpanda_cluster_latest_cluster_metadata_manifest_age

It tracks the age (in seconds) of the cluster_metadata_manifest saved in cloud storage.

It's updated by the controller and should produce a sawtooth pattern, rising steadily and dropping to 0 when a new cluster_metadata_manifest is uploaded.
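
To make the expected sawtooth concrete, here is a minimal sketch of an idealised model. It is not Redpanda code: the function names and the assumption that an upload succeeds exactly every `upload_interval`, starting at t = 0, are hypothetical and only serve the illustration.

```cpp
#include <chrono>
#include <cstdint>

// Idealised model (an assumption, not Redpanda code): an upload succeeds
// exactly every `upload_interval`, starting at t = 0.
constexpr std::chrono::seconds last_upload_before(
  std::chrono::seconds now, std::chrono::seconds upload_interval) {
    // Round `now` down to the most recent multiple of the interval.
    return now - (now % upload_interval);
}

// Age in seconds that the metric would report at time `now` under the model:
// it climbs by one per second and falls back to 0 at each upload.
constexpr std::int64_t modelled_age_s(
  std::chrono::seconds now, std::chrono::seconds upload_interval) {
    return (now - last_upload_before(now, upload_interval)).count();
}
```

Under this model the value never reaches the upload interval. In a real cluster, an age that keeps growing well past the configured interval would presumably indicate that uploads have stalled.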

Fixes https://github.com/redpanda-data/core-internal/issues/1207

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Feature

  • new public metric redpanda_cluster_latest_cluster_metadata_manifest_age to track the age of the cluster_metadata_manifest in cloud storage

@andijcr andijcr force-pushed the feat/cluster_metadata_manifest_age_metric branch from 0e97f85 to 6561ddd Compare March 27, 2024 11:09
@andijcr andijcr marked this pull request as ready for review March 27, 2024 11:24
@andijcr andijcr requested a review from andrwng March 27, 2024 11:24
@vbotbuildovich
Collaborator

vbotbuildovich commented Mar 27, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fd6-7ef4-41a6-8262-7230c841fe62:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=restart_recovery.compacted=False"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_correctness_test"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.partition_count_decreases_on_deletion_test"
"rptest.tests.controller_log_limiting_test.ControllerAclsAndUsersLimitTest.test_create_user_limit"
"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_present"
"rptest.tests.rpk_cluster_test.RpkClusterTest.test_debug_bundle"
"rptest.tests.tls_metrics_test.TLSMetricsTest.test_services"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fd6-7ef9-4b86-bda0-1ba5447f32ed:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=no_recovery.compacted=False"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_reported_only_by_leader_test"
"rptest.tests.controller_log_limiting_test.ControllerConfigLimitTest.test_alter_configs_limit_accumulate"
"rptest.tests.controller_log_limiting_test.TopicOperationsLimitingTest.test_create_partition_limit"
"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_logger_metrics_present"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fd6-7efc-49a9-b41f-1d6b69d03a5b:

"rptest.tests.audit_log_test.AuditLogTestAdminApi.test_audit_log_metrics"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.max_offset_matches_committed_group_offset_test"
"rptest.tests.controller_log_limiting_test.TopicOperationsLimitingTest.test_create_partition_limit_accumulation"
"rptest.tests.tls_metrics_test.TLSMetricsTest.test_public_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fd6-7ef7-4b95-8ec4-03159352ac39:

"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_disabled_by_config_test"
"rptest.tests.controller_log_limiting_test.ControllerConfigLimitTest.test_alter_configs_limit"
"rptest.tests.node_metrics_test.NodeMetricsTest.test_node_storage_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fe8-c3f8-419c-acbf-8fd97656ed7a:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=restart_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=restart_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=restart_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=restart_recovery.compacted=True"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_correctness_test"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.partition_count_decreases_on_deletion_test"
"rptest.tests.controller_log_limiting_test.TopicOperationsLimitingTest.test_create_partition_limit"
"rptest.tests.tls_metrics_test.TLSMetricsTest.test_services"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fe8-c3f5-4b5d-9ed6-473939561df9:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=restart_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=restart_recovery.compacted=False"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.max_offset_matches_committed_group_offset_test"
"rptest.tests.controller_log_limiting_test.ControllerConfigLimitTest.test_alter_configs_limit_accumulate"
"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_present"
"rptest.tests.rpk_cluster_test.RpkClusterTest.test_debug_bundle"
"rptest.tests.tls_metrics_test.TLSMetricsTest.test_public_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fe8-c3fb-4b08-bb65-53b6c166ee34:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=no_recovery.compacted=False"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_disabled_by_config_test"
"rptest.tests.controller_log_limiting_test.ControllerAclsAndUsersLimitTest.test_create_user_limit"
"rptest.tests.controller_log_limiting_test.TopicOperationsLimitingTest.test_create_partition_limit_accumulation"
"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_logger_metrics_present"
"rptest.tests.node_metrics_test.NodeMetricsTest.test_node_storage_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/46874#018e7fe8-c3fe-493c-8257-62a4f1562008:

"rptest.tests.audit_log_test.AuditLogTestAdminApi.test_audit_log_metrics"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=False.recovery=no_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=no_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=True"
"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=1.unclean_abort=True.recovery=no_recovery.compacted=True"
"rptest.tests.partition_balancer_test.PartitionBalancerTest.test_movement_cancellations"
"rptest.tests.cluster_metrics_test.ClusterMetricsTest.cluster_metrics_reported_only_by_leader_test"
"rptest.tests.controller_log_limiting_test.ControllerConfigLimitTest.test_alter_configs_limit"

new failures in https://buildkite.com/redpanda/redpanda/builds/46914#018e82ae-5f15-413b-b330-46f460aa8919:

"rptest.tests.e2e_iam_role_test.STSRoleFetchTests.test_write"

@andrwng andrwng left a comment

Looking good!

auto maybe_manifest_ref
  = _controller.metadata_uploader().manifest();
if (!maybe_manifest_ref.has_value()) {
    return int64_t{-1};

I think 0 will be a more natural value to aggregate over.

Let's also check if the manifest's timestamp is 0, and return 0 then too. It indicates we've never uploaded.
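
The two suggestions above (report 0 when there is no manifest, and 0 when the manifest timestamp is still 0) could be combined roughly as follows. This is a sketch under assumptions: `manifest_stub` and `manifest_age_or_zero` are hypothetical names, and modelling `upload_time_since_epoch` as `std::chrono::milliseconds` is a guess at the real field type.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the real manifest type; only the field used in
// the reviewed snippet is modelled, and its type is an assumption.
struct manifest_stub {
    std::chrono::milliseconds upload_time_since_epoch{0};
};

// Returns the manifest age in whole seconds. Reports 0 when there is no
// manifest, or when its timestamp is still 0 (never uploaded).
constexpr std::int64_t manifest_age_or_zero(
  const std::optional<manifest_stub>& maybe_manifest,
  std::chrono::milliseconds now_ts) {
    if (!maybe_manifest.has_value()) {
        return 0;
    }
    const auto uploaded_at = maybe_manifest->upload_time_since_epoch;
    if (uploaded_at == std::chrono::milliseconds{0}) {
        return 0;
    }
    return std::chrono::duration_cast<std::chrono::seconds>(
             now_ts - uploaded_at)
      .count();
}
```

One trade-off of using 0 as the "no data" value is that it is indistinguishable from a manifest uploaded within the last second; the upside, as noted in the review, is that it aggregates more naturally than -1.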

Comment on lines 116 to 119

auto age_s = std::chrono::duration_cast<std::chrono::seconds>(
  now_ts - manifest.upload_time_since_epoch);
return int64_t{age_s.count()};

If you're looking for a place to test this, I think cluster/cloud_metadata/tests/uploader_test.cc would be a good place. The full application is in that test, so we should have metrics.

field _metadata_uploader is a unique_ptr that gets initialized with a
value only if cloud storage is initialized.

change the return type to reflect the fact that the uploader might be
null
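
A rough sketch of what such an accessor might look like is shown below. The stub types and the choice of `std::optional<std::reference_wrapper<...>>` as the return type are assumptions for illustration; the commit message only says that the return type now reflects a possibly null uploader.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>

// Hypothetical stand-in for the real uploader type.
struct uploader_stub {
    int uploads = 0;
};

// Hypothetical stand-in for the controller: the uploader is only created
// when cloud storage is enabled, so the unique_ptr may stay null.
class controller_stub {
public:
    explicit controller_stub(bool cloud_storage_enabled) {
        if (cloud_storage_enabled) {
            _metadata_uploader = std::make_unique<uploader_stub>();
        }
    }

    // The return type makes the "might be null" case visible to callers
    // instead of handing out a reference that could dangle.
    std::optional<std::reference_wrapper<uploader_stub>> metadata_uploader() {
        if (_metadata_uploader) {
            return std::ref(*_metadata_uploader);
        }
        return std::nullopt;
    }

private:
    std::unique_ptr<uploader_stub> _metadata_uploader;
};

// Example caller: records an upload only when the uploader exists.
inline bool record_upload(controller_stub& c) {
    auto maybe_uploader = c.metadata_uploader();
    if (!maybe_uploader.has_value()) {
        return false;
    }
    maybe_uploader->get().uploads += 1;
    return true;
}
```

Returning a plain pointer would express the same thing with less ceremony; the optional reference is used here only because it forces an explicit check at the call site.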
@andijcr andijcr force-pushed the feat/cluster_metadata_manifest_age_metric branch from 6561ddd to e866d80 Compare March 27, 2024 18:13
expose how long ago (in seconds) the manifest was uploaded.

the metric is added only when the metadata_uploader is available
@andijcr andijcr force-pushed the feat/cluster_metadata_manifest_age_metric branch from e866d80 to 97b4131 Compare March 27, 2024 18:30
@@ -102,6 +104,41 @@ void controller_probe::setup_metrics() {
"Number of partitions that lack quorum among replicants"))
.aggregate({sm::shard_label}),
});

if (auto maybe_uploader = _controller.metadata_uploader()) {

please prefer if (auto x ...; x) instead of implicit cast to bool behavior.
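
For readers unfamiliar with the form requested here, this is a small self-contained sketch of the C++17 `if` statement with an init-statement. `find_port` and `port_or_default` are hypothetical names unrelated to the Redpanda code base.

```cpp
#include <optional>

// Hypothetical lookup used only for illustration.
constexpr std::optional<int> find_port(bool configured) {
    if (configured) {
        return 9092;
    }
    return std::nullopt;
}

constexpr int port_or_default(bool configured) {
    // Instead of `if (auto port = find_port(configured))`, which declares
    // and tests in one step, the init-statement declares `port` and the
    // condition after the semicolon states what is being tested.
    if (auto port = find_port(configured); port) {
        return *port;
    }
    return -1;
}
```

The variable is still scoped to the `if` statement, so nothing leaks into the enclosing function. Writing `port.has_value()` as the condition would be an even more explicit variant.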

@dotnwat
Member

dotnwat commented Mar 28, 2024

Failure is #17412 popping up in a different test. They are all using the same base test bits so it shows up in different tests.

Other failure is #14139

@dotnwat dotnwat merged commit 61802ef into redpanda-data:dev Mar 28, 2024
14 of 17 checks passed
@andijcr andijcr deleted the feat/cluster_metadata_manifest_age_metric branch March 28, 2024 08:30