Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rptest: fix cluster healthy check #16337

Merged
merged 8 commits into from
Feb 8, 2024

Conversation

andrewhsu
Copy link
Member

@andrewhsu andrewhsu commented Jan 27, 2024

Fixes https://github.com/redpanda-data/core-internal/issues/1003

New methodRedpandaServiceCloud.cluster_healthy() according to instructions on ensuring cluster is healthy.

Added assertions with cluster_healthy() to the start and end of the heavy tests that target cloud clusters.

Also added a simple self test for the new method:

ducktape \
  --debug \
  --globals=/home/ubuntu/redpanda/tests/globals.json \
  --cluster=ducktape.cluster.json.JsonCluster \
  --cluster-file=/home/ubuntu/redpanda/tests/cluster.json \
  tests/rptest/redpanda_cloud_tests/cloud_self_test.py

output:

test_id:    rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_simple
status:     PASS
run time:   30.319 seconds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-02-08--008
run time:         30.335 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
========================================================================================================================================================================================

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@andrewhsu andrewhsu marked this pull request as ready for review January 27, 2024 04:51
@andrewhsu
Copy link
Member Author

example test run in description, ready for review

# Under-replicated partitions (0): []

lines = ret.decode().splitlines()
self.logger.debug(f'rpk cluster health lines: {lines}')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feedback from travis: compare this healthy check with the one that is performed in bare metal situation

@@ -478,6 +478,8 @@ def get_broker_pod(self):

@cluster(num_nodes=2)
def test_throughput_simple(self):
assert self.redpanda.cluster_healthy(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to go in one place which will be called automatically before and after the test. Maybe the "before" check in make_redpanda_service_cloud? and the after check in tearDown?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the make_redpanda_service_cloud() looks to be called in just the __init__() method, i think a more suitable place would be setup() which is called before every test_* method and failures in setup() will fail the test.

it looks like ducktape won't fail a test if there is a failure in the tearDown() method, however. it will just log a warning: https://github.com/redpanda-data/ducktape/blob/62e0285f6b3a2f22fd4a43b5fdbc13be8d4290d9/ducktape/tests/runner_client.py#L295

so, i'll leave it at the end of each individual test unless there is a more suitable place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like ducktape won't fail a test if there is a failure in the tearDown() method, however. it will just log a warning:

Ah, that explains something that John had written in our variation of the @cluster decorator: they use it exactly to avoid this problem with tearDown.

So let's do this for now but we can switch to a decorator soon enough.

return True
break

return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good for now, but ideally we want the unhealthy reasons in the error. A variation of this which threw an exception could include that info.

Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments.

@andrewhsu
Copy link
Member Author

@travisdowns PR updated; ready for review

@andrewhsu
Copy link
Member Author

i add git commit acc2b01 because it seems like it can be used to satisfy https://github.com/redpanda-data/core-internal/issues/1013

@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/44755#018d7f15-e115-4465-870c-3c3ec48a5043:

"rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=False.recovery=restart_recovery.compacted=False"

travisdowns
travisdowns previously approved these changes Feb 7, 2024
Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@travisdowns
Copy link
Member

/ci-repeat

@andrewhsu
Copy link
Member Author

rebased with tip of dev to resolve merge conflicts. no functionality change.

travisdowns
travisdowns previously approved these changes Feb 8, 2024
savex
savex previously approved these changes Feb 8, 2024
Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andrewhsu
Copy link
Member Author

rebased again with tip of dev to resolve merge conflicts. no functionality change.

savex
savex previously approved these changes Feb 8, 2024
Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andrewhsu
Copy link
Member Author

i had to redo the rebase because it looks like the previous rebase to 367c7ea was missing 4 commits (not sure why).

i've also re-run the tests and updated output in the PR description to verify.

@piyushredpanda piyushredpanda merged commit 9545c8e into redpanda-data:dev Feb 8, 2024
18 checks passed
@andrewhsu andrewhsu deleted the cloud-healthy branch February 8, 2024 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants