-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rptest: better health check #16765
rptest: better health check #16765
Conversation
Add a variation of RedpandaServiceCloud.cluster_healthy() that returns the unhealhy reason line from RPK output when the cluster is unhealthy. This will aid diagnosability.
@@ -834,12 +840,15 @@ def _select_cluster_id(self): | |||
# if there is still no id, just copy what globals.json had | |||
return self.config.id | |||
|
|||
def create(self, superuser: Optional[SaslCredentials] = None) -> str: | |||
def create(self, superuser: SaslCredentials) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never supported None
superuser here (we'd throw immediately since we access it w/o checking if it's null) so just remove the optionality which simplifies a lot of downstream code.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45426#018dedd8-120e-4f11-a4df-500db2d39ff6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one unnecessary import found
@@ -7,6 +7,7 @@ | |||
# the Business Source License, use of this software will be governed | |||
# by the Apache License, Version 2.0 | |||
|
|||
import time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Test the various "healthy" methods on RedpandaServiceCloud.
We had two (identical) definitions of this class, but identical doesn't mean "same type", so this doesn't type check. Move the definition into a new redpanda_types module (since we have to share a lot of basic types now between modules) and put an alias in redpanda.py (where it originally lived) to avoid having to change all the importers.
Remove support for user-specified superuser in cloud tests as it did not work (i.e., we immediately access a property on this parameter so it could never have been None in practice). This simplifies downstream code.
Add RedpandaServiceCloud.assert_cluster_is_reusable which is a more comprehensive version of cluster_healthy(). In addition to checking rpk's idea of health, it uses the CloudCluster _ensure_cluster health as an additional check. We run this at the end of cloud tests to ensure the cluster is still healthy. This will catch more reasons the cluster may be damaged. Issue redpanda-data#16331.
Use assert_cluster_is_reusable() as the end-of-test check in HTT and omb validation cloud tests, to catch more failure cases.
_metrics is a dict, not a list.
Move the good error handling from HTT._init__ to RedpandaServiceCloud so everyone can benefit and change HTT ctor to use that property.
In our end-of-test check for cloud tests, also check the broker counts. We check the expected number of brokers from the tier definition against the number of Redpanda pods reported by k8s as well as against the /brokers endpoint from panda proxy, which is the number of brokers the Redpanda cluster itself sees as part of the cluster. This helps catch cases where tests may leave the cluster in not reusable state because it has more or less brokers than the expected number.
When HTT.test_decom_and_add times out, make the throughput and exception a bit clearer.
e4c2154
to
ac07e97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, had a quick look over the commits and the additional details added would help with diagnose of health check errors we are seeing
if you get a chance, would be good to have an example run output on this to see it's doing the right thing:
ducktape \
--debug \
--globals=/home/ubuntu/redpanda/tests/globals.json \
--cluster=ducktape.cluster.json.JsonCluster \
--cluster-file=/home/ubuntu/redpanda/tests/cluster.json \
tests/rptest/redpanda_cloud_tests/cloud_self_test.py::SelfRedpandaCloudTest.test_healthy
Here's the debug log for a run of that test, is that what you were looking for? |
This series mostly deals with better health checks to catch cases where test failures (or tests that don't clean up after themselves even on success) may leave the cluster in an unusable state.
Individual commit messages have more details.
Backports Required
Release Notes