rptest: tolerate missing topic metrics after start #16472

Merged

nvartolomei

During the recovery test in `test_prefix_truncate_recovery`, a node is restarted, and metrics are queried to wait until no partitions are under-replicated.

The test occasionally fails at `assert len(metric.samples) == len(nodes)` because the metric for the test topic is missing on the just-restarted node.

The Redpanda server allows metrics to be queried very early in its startup flow, but the controller log, which tells the server which topic partitions it manages, is applied asynchronously. A just-restarted node can therefore answer a metrics query before it knows about the test topic.

To resolve this issue, we remove the assertion and instead rely on the fact that other servers will correctly report under-replicated partitions using `all(map(lambda s: s.value == 0, metric.samples))`.

Thus, the assert statement is both incorrect and redundant.

Fixes #16115

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@@ -69,7 +69,6 @@ def fully_replicated(self, nodes):
         metric = self.redpanda.metrics_sample("under_replicated_replicas",
                                               nodes)
         metric = metric.label_filter(dict(namespace="kafka", topic=self.topic))
-        assert len(metric.samples) == len(nodes)
         return all(map(lambda s: s.value == 0, metric.samples))

Technically, `all([])` is `True`.
Since `fully_replicated` is called inside `wait_until`, maybe it should be:

Suggested change
-        return all(map(lambda s: s.value == 0, metric.samples))
+        if len(metric.samples) != len(nodes):
+            return False
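
The concern can be reproduced in isolation. The following is a minimal sketch, not the code that was merged: `Sample` is a hypothetical stand-in for a metric sample, and `fully_replicated` here takes the samples directly instead of querying a cluster. It shows that `all()` over an empty list is vacuously true and how a length guard placed before the final `all(...)` avoids a false positive.

```python
from collections import namedtuple

# Hypothetical stand-in for one metric sample; the real sample type used by
# rptest is not shown in this thread.
Sample = namedtuple("Sample", ["node", "value"])


def fully_replicated(samples, nodes):
    """Return True only if every node reported and every value is 0."""
    # Without this guard an empty sample list would pass vacuously,
    # because all() over an empty iterable is True.
    if len(samples) != len(nodes):
        return False
    return all(s.value == 0 for s in samples)
```

Because the check runs inside a polling loop, returning `False` on a short sample list simply causes another poll rather than a test failure.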

Contributor Author


Thanks for catching this. I have updated the method and believe it catches all edge cases in a generic way now.

tests/rptest/tests/prefix_truncate_recovery_test.py (outdated)

andijcr commented Feb 5, 2024

maybe for correctness we should have a wait_until for all the nodes (or at least one) to produce the metric
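
One way to picture this suggestion is sketched below, under stated assumptions: `wait_until` is a simplified stand-in written for this example (not the test framework's implementation), and `fetch_samples` is a hypothetical callable returning `(node, value)` pairs. The helper blocks until the metric is reported by every node, or by at least one node when `require_all=False`.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it returns True or the timeout expires.

    Simplified stand-in for a test framework helper, not its real code.
    """
    deadline = time.monotonic() + timeout_sec
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def wait_for_metric(fetch_samples, nodes, require_all=True, timeout_sec=30):
    """Block until the metric is reported by all nodes (or at least one).

    `fetch_samples` is a hypothetical callable returning (node, value) pairs.
    """
    expected = set(nodes)

    def metric_present():
        # Only count nodes we actually care about.
        reporting = {node for node, _ in fetch_samples()} & expected
        return reporting == expected if require_all else bool(reporting)

    wait_until(metric_present,
               timeout_sec=timeout_sec,
               backoff_sec=0.01,
               err_msg="metric not reported in time")
```

Separating "the metric exists" from "the metric is zero" keeps a missing sample from being mistaken for a healthy one.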

During the recovery test in `test_prefix_truncate_recovery`, a node is
restarted, and metrics are queried to wait until no partitions are
under-replicated.

The test occasionally fails at `assert len(metric.samples) ==
len(nodes)` due to a missing metric for the test topic on the
just-restarted node.

The redpanda server allows metrics to be queried very early in the
server startup flow, but the controller log, which tells the server
which topic partitions it manages, is applied asynchronously.

To resolve this issue, we remove the assertion and instead rely on the
fact that other servers will correctly report under-replicated
partitions using `all(map(lambda s: s.value == 0, metric.samples))`.

Thus, the assert statement is both incorrect and redundant.

An extra check is added to ensure that metrics for all partitions under
test are present and that the expression is evaluated for all of them.
Metrics could be missing due to a bug, or because all nodes containing
replicas were restarted and are not producing metrics yet. This is
unlikely to happen in this test, but the check makes the
`fully_replicated` method more general and future-proof.

Fixes redpanda-data#16115
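
The extra check described in this commit message might look roughly like the sketch below. This is an illustration under assumptions, not the merged code: it assumes each sample can be reduced to a `(partition, value)` pair (that is, that the metric carries a partition label), and `expected_partitions` is a hypothetical parameter naming the partitions under test.

```python
def fully_replicated(samples, expected_partitions):
    """Return True if every expected partition is reported with value 0.

    `samples` is an iterable of (partition_id, under_replicated_count)
    pairs gathered from whichever nodes currently expose the metric.
    """
    seen = set()
    for partition, under_replicated in samples:
        if under_replicated != 0:
            return False
        seen.add(partition)
    # A node that has not yet published the metric is tolerated, but every
    # partition under test must be covered by at least one sample.
    return set(expected_partitions) <= seen
```

With this shape, a just-restarted node with no samples does not fail the check, while an entirely empty result can never count as fully replicated.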
nvartolomei force-pushed the nv/test_prefix_truncate_recovery branch from f0b83dc to 586584f on February 7, 2024 11:01

andijcr left a comment

neat solution

nvartolomei merged commit d29c91d into redpanda-data:dev on Feb 7, 2024
17 checks passed
@vbotbuildovich

/backport v23.3.x

dotnwat left a comment

👍

Development

Successfully merging this pull request may close these issues.

CI Failure (key symptom) in PrefixTruncateRecoveryTest.test_prefix_truncate_recovery
4 participants