Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rptest: better health check #16765

Merged

Conversation

travisdowns
Copy link
Member

This series mostly deals with better health checks to catch cases where test failures (or tests that don't clean up after themselves even on success) may leave the cluster in an unusable state.

Individual commit messages have more details.

Backports Required

  • none - cloud DT test
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

Add a variation of RedpandaServiceCloud.cluster_healthy() that returns
the unhealhy reason line from RPK output when the cluster is unhealthy.

This will aid diagnosability.
@@ -834,12 +840,15 @@ def _select_cluster_id(self):
# if there is still no id, just copy what globals.json had
return self.config.id

def create(self, superuser: Optional[SaslCredentials] = None) -> str:
def create(self, superuser: SaslCredentials) -> str:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never supported None superuser here (we'd throw immediately since we access it w/o checking if it's null) so just remove the optionality which simplifies a lot of downstream code.

@vbotbuildovich
Copy link
Collaborator

savex
savex previously approved these changes Feb 28, 2024
Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one unnecessary import found

@@ -7,6 +7,7 @@
# the Business Source License, use of this software will be governed
# by the Apache License, Version 2.0

import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@savex - yes good call, I had used time for debugging. Fixed in push ac07e97

Copy link
Contributor

@rpdevmp rpdevmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Test the various "healthy" methods on RedpandaServiceCloud.
We had two (identical) definitions of this class, but identical doesn't
mean "same type", so this doesn't type check.

Move the definition into a new redpanda_types module (since we have
to share a lot of basic types now between modules) and put an alias
in redpanda.py (where it originally lived) to avoid having to change
all the importers.
Remove support for user-specified superuser in cloud tests as it did
not work (i.e., we immediately access a property on this parameter so
it could never have been None in practice). This simplifies downstream
code.
Add RedpandaServiceCloud.assert_cluster_is_reusable which is a more
comprehensive version of cluster_healthy(). In addition to checking
rpk's idea of health, it uses the CloudCluster _ensure_cluster health
as an additional check.

We run this at the end of cloud tests to ensure the cluster is still
healthy. This will catch more reasons the cluster may be damaged.

Issue redpanda-data#16331.
Use assert_cluster_is_reusable() as the end-of-test check in HTT and
omb validation cloud tests, to catch more failure cases.
Move the good error handling from HTT._init__ to RedpandaServiceCloud
so everyone can benefit and change HTT ctor to use that property.
In our end-of-test check for cloud tests, also check the broker counts.

We check the expected number of brokers from the tier definition against
the number of Redpanda pods reported by k8s as well as against the
/brokers endpoint from panda proxy, which is the number of brokers
the Redpanda cluster itself sees as part of the cluster.

This helps catch cases where tests may leave the cluster in not
reusable state because it has more or less brokers than the expected
number.
When HTT.test_decom_and_add times out, make the throughput and
exception a bit clearer.
Copy link
Member

@andrewhsu andrewhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, had a quick look over the commits and the additional details added would help with diagnose of health check errors we are seeing

if you get a chance, would be good to have an example run output on this to see it's doing the right thing:

ducktape \
  --debug \
  --globals=/home/ubuntu/redpanda/tests/globals.json \
  --cluster=ducktape.cluster.json.JsonCluster \
  --cluster-file=/home/ubuntu/redpanda/tests/cluster.json \
  tests/rptest/redpanda_cloud_tests/cloud_self_test.py::SelfRedpandaCloudTest.test_healthy

@travisdowns travisdowns merged commit 1c8edac into redpanda-data:dev Feb 29, 2024
16 checks passed
@travisdowns
Copy link
Member Author

@andrewhsu @rpdevmp

if you get a chance, would be good to have an example run output on this to see it's doing the right thing:

test_log.debug.gz

Here's the debug log for a run of that test, is that what you were looking for?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants