rptest: tolerate missing topic metrics after start #16472
@@ -69,7 +69,6 @@ def fully_replicated(self, nodes):
metric = self.redpanda.metrics_sample("under_replicated_replicas",
                                      nodes)
metric = metric.label_filter(dict(namespace="kafka", topic=self.topic))
assert len(metric.samples) == len(nodes)
return all(map(lambda s: s.value == 0, metric.samples))
technically all([]) is True,
since fully_replicated is called inside wait_until, maybe it should be
return all(map(lambda s: s.value == 0, metric.samples))
if len(metric.samples) != len(nodes):
    return False
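To make the `all([])` point concrete, here is a minimal standalone sketch. The `Sample` type and both helper functions are hypothetical stand-ins written for illustration; they are not the rptest code.

```python
from collections import namedtuple

# Hypothetical stand-in for a metric sample; not the rptest type.
Sample = namedtuple("Sample", ["node", "value"])


def all_zero(samples):
    # all() over an empty iterable is vacuously True, so this returns
    # True even when no node has reported the metric yet.
    return all(s.value == 0 for s in samples)


def all_zero_and_complete(samples, expected_count):
    # Returning False (instead of asserting) lets a polling caller
    # simply retry until every expected sample is present.
    if len(samples) != expected_count:
        return False
    return all(s.value == 0 for s in samples)
```

Inside a polling loop, the second form treats "not enough samples yet" as "not ready" rather than as a test failure.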
Thanks for catching this. I have updated the method and believe it catches all edge cases in a generic way now.
maybe for correctness we should have a wait_until for all the nodes (or at least one) to produce the metric
During the recovery test in `test_prefix_truncate_recovery`, a node is restarted, and metrics are queried to wait until no partitions are under-replicated. The test occasionally fails at `assert len(metric.samples) == len(nodes)` because the metric for the test topic is missing on the just-restarted node. The redpanda server allows metrics to be queried very early in its startup flow, but the controller log, which tells the server which topic partitions it manages, is applied asynchronously.

To resolve this issue, we remove the assertion and instead rely on the other servers to correctly report under-replicated partitions via `all(map(lambda s: s.value == 0, metric.samples))`. Thus, the assert statement is both incorrect and redundant.

An extra check is added to ensure that metrics for all partitions under test are present and that the expression is evaluated for all of them. Metrics could be missing because of a bug, or because all nodes containing replicas were restarted and are not producing metrics yet. This situation is unlikely in this test, but handling it makes the `fully_replicated` method more general and future-proof.

Fixes redpanda-data#16115
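The approach described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: the `Sample` type, its `partition` field, the `expected_partitions` argument, and the `poll_until` helper are all hypothetical names introduced here, and `poll_until` only mimics the retry behaviour that a `wait_until`-style caller provides.

```python
import time
from collections import namedtuple

# Hypothetical sample shape; the real metric samples may differ.
Sample = namedtuple("Sample", ["node", "partition", "value"])


def fully_replicated(samples, expected_partitions):
    # Do not require one sample per node: a just-restarted node may
    # not report the topic yet. Require instead that every partition
    # under test is reported by at least one node.
    seen = {s.partition for s in samples}
    if seen != set(expected_partitions):
        return False
    # Every reported sample must show zero under-replicated replicas.
    return all(s.value == 0 for s in samples)


def poll_until(condition, timeout_sec, backoff_sec=0.1):
    # Hypothetical polling helper: retry condition() until it returns
    # True or the timeout elapses.
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)
```

With this shape, an empty sample set yields `False` whenever at least one partition is expected, so the caller keeps polling instead of passing vacuously.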
f0b83dc to 586584f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44806#018d837b-9274-4511-bd72-57ca4ec5a60e
neat solution
/backport v23.3.x
👍
During the recovery test in `test_prefix_truncate_recovery`, a node is restarted, and metrics are queried to wait until no partitions are under-replicated. The test occasionally fails at `assert len(metric.samples) == len(nodes)` due to a missing metric for the test topic on the just-restarted node. The redpanda server allows metrics to be queried very early in the server startup flow but the controller log which instructs the server which partition topics it manages is applied asynchronously.
To resolve this issue, we remove the assertion and instead rely on the fact that other servers will correctly report under-replicated partitions using `all(map(lambda s: s.value == 0, metric.samples))`. Thus, the assert statement is both incorrect and redundant.
Fixes #16115
Backports Required
Release Notes