rptest: tolerate missing topic metrics after start #16472

nvartolomei · 2024-02-05T12:15:48Z

During the recovery test in test_prefix_truncate_recovery, a node is restarted, and metrics are queried to wait until no partitions are under-replicated.

The test occasionally fails at assert len(metric.samples) == len(nodes) due to a missing metric for the test topic on the just-restarted node.

The Redpanda server allows metrics to be queried very early in the server startup flow, but the controller log, which tells the server which topic partitions it manages, is applied asynchronously. Until that log is applied, the restarted node may not expose the per-topic metric at all.

To resolve this issue, we remove the assertion and instead rely on the fact that other servers will correctly report under-replicated partitions using all(map(lambda s: s.value == 0, metric.samples)).

Thus, the assert statement is both incorrect and redundant.
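The difference between the strict and the relaxed check can be illustrated with a small simulation. This is only a sketch: `Sample`, `strict_check`, `tolerant_check`, and the node names are hypothetical stand-ins, not the rptest API, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical stand-in for one metric sample scraped from one node.
    node: str
    value: int


def strict_check(samples, nodes):
    # Original behaviour: every node must already report the topic metric.
    assert len(samples) == len(nodes)
    return all(s.value == 0 for s in samples)


def tolerant_check(samples):
    # Relaxed behaviour: evaluate whichever samples are available.
    return all(s.value == 0 for s in samples)


nodes = ["n1", "n2", "n3"]
# Assumed scenario: n3 was just restarted and exposes no sample for the
# topic yet, while n1 still sees a lagging replica.
during_recovery = [Sample("n1", 1), Sample("n2", 0)]
after_recovery = [Sample("n1", 0), Sample("n2", 0)]

try:
    strict_check(during_recovery, nodes)
    strict_raised = False
except AssertionError:
    strict_raised = True
```

In this scenario the strict version aborts the test on the missing sample, whereas the relaxed version keeps returning False until the remaining nodes stop reporting under-replicated partitions.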

Fixes #16115

Backports Required

Release Notes

none

andijcr · 2024-02-05T15:41:56Z

tests/rptest/tests/prefix_truncate_recovery_test.py

@@ -69,7 +69,6 @@ def fully_replicated(self, nodes):
        metric = self.redpanda.metrics_sample("under_replicated_replicas",
                                              nodes)
        metric = metric.label_filter(dict(namespace="kafka", topic=self.topic))
-        assert len(metric.samples) == len(nodes)
        return all(map(lambda s: s.value == 0, metric.samples))


Technically, `all([])` is True.
Since `fully_replicated` is called inside `wait_until`, maybe it should be:

Suggested change

-        return all(map(lambda s: s.value == 0, metric.samples))
+        if len(metric.samples) != len(nodes):
+            return False
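The concern about an empty sample list can be checked directly. The sketch below assumes a simplified input (a plain list of under-replicated counts, one per reporting node); `fully_replicated_guarded` is a hypothetical name, not the method in the test.

```python
# all() over an empty iterable is vacuously True.
empty_is_true = all([])


def fully_replicated_guarded(values, expected_count):
    # values: under-replicated counts, one per reporting node (assumed shape).
    if len(values) != expected_count:
        # Not every node reports yet; return False so the caller retries.
        return False
    return all(v == 0 for v in values)
```

Because the method runs inside a retry loop, returning False on a count mismatch delays the decision instead of failing the test outright.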

nvartolomei · 2024-02-07

Thanks for catching this. I have updated the method and believe it now handles all edge cases in a generic way.


andijcr · 2024-02-05T15:50:42Z

Maybe for correctness we should have a `wait_until` that waits for all the nodes (or at least one) to produce the metric.
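A wait of this kind could look roughly like the following. It is a self-contained sketch: `poll_until` is a hypothetical helper written for this example rather than ducktape's `wait_until`, and `metric_present` simulates a scrape instead of querying a real node.

```python
import time


def poll_until(condition, timeout_sec, backoff_sec):
    # Hypothetical helper: retry `condition` until it returns True
    # or the timeout expires.
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


# Simulated scrape: the metric shows up on the third attempt.
attempts = {"n": 0}


def metric_present():
    attempts["n"] += 1
    return attempts["n"] >= 3


appeared = poll_until(metric_present, timeout_sec=5, backoff_sec=0.01)
```

Waiting for the metric first and checking its value afterwards separates "the node is not reporting yet" from "the node reports under-replicated partitions".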

During the recovery test in `test_prefix_truncate_recovery`, a node is restarted, and metrics are queried to wait until no partitions are under-replicated.

The test occasionally fails at `assert len(metric.samples) == len(nodes)` due to a missing metric for the test topic on the just-restarted node.

The Redpanda server allows metrics to be queried very early in the server startup flow, but the controller log, which tells the server which topic partitions it manages, is applied asynchronously.

To resolve this issue, we remove the assertion and instead rely on the fact that other servers will correctly report under-replicated partitions using `all(map(lambda s: s.value == 0, metric.samples))`. Thus, the assert statement is both incorrect and redundant.

An extra check is added to ensure that metrics for all partitions under test are present and that the expression is evaluated for all of them. Metrics could be missing because of a bug, or because all nodes containing replicas were restarted and are not producing metrics yet. This is unlikely to happen in this test, but the check makes the `fully_replicated` method more general and future-proof.

Fixes redpanda-data#16115
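The updated method itself is not reproduced in this thread. One possible shape of a per-partition presence check is sketched below, assuming each sample can be reduced to a `(partition, under_replicated_count)` pair; the function signature and input shape are assumptions made for illustration.

```python
from collections import defaultdict


def fully_replicated(samples, expected_partitions):
    # samples: iterable of (partition, under_replicated_count) pairs
    # gathered from the reporting nodes (assumed shape).
    by_partition = defaultdict(list)
    for partition, value in samples:
        by_partition[partition].append(value)

    # Every partition under test must be reported by at least one node,
    # otherwise all() below could pass without checking it.
    if not set(expected_partitions).issubset(by_partition):
        return False

    return all(v == 0 for values in by_partition.values() for v in values)
```

With this shape, a restarted node that reports nothing does not fail the check as long as another node reports each partition, and an entirely empty scrape returns False rather than passing vacuously.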

vbotbuildovich · 2024-02-07T13:33:44Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44806#018d837b-9274-4511-bd72-57ca4ec5a60e

andijcr

neat solution

vbotbuildovich · 2024-02-07T14:37:27Z

/backport v23.3.x

dotnwat

👍

nvartolomei requested review from dotnwat, bharathv, andijcr and abhijat February 5, 2024 14:56

andijcr reviewed Feb 5, 2024


nvartolomei force-pushed the nv/test_prefix_truncate_recovery branch from f0b83dc to 586584f Compare February 7, 2024 11:01

andijcr approved these changes Feb 7, 2024


nvartolomei merged commit d29c91d into redpanda-data:dev Feb 7, 2024
17 checks passed

This was referenced Feb 7, 2024

[v23.3.x] CI Failure (key symptom) in PrefixTruncateRecoveryTest.test_prefix_truncate_recovery #16514

Closed

[v23.3.x] rptest: tolerate missing topic metrics after start #16515

Merged

dotnwat reviewed Feb 7, 2024
