[Feature] Refactor test framework & test kuberay-operator chart with configuration framework #759
Conversation
cc @jasoonn Would you mind reviewing this PR? Thank you!
@kevin85421 Thanks for the contribution!
Could you give context on that decision? Is the test complete nonsense, or do we need to follow up -- do we need to fix it or write a test that covers similar logic?
Could you make a note in the code and on the RayService test issue that ...
Some minor style complaints above. Overall this looks great! The KubeRay CI should be much greener after these changes.
Force-pushed from 23cc212 to b76ca94.
REPO_ROOT = "/" + "/".join(os.path.abspath(os.path.dirname(__file__)).split('/')[1:-2]) + "/" | ||
HELM_CHART_ROOT = REPO_ROOT + "helm-chart" | ||
DEFAULT_KIND_CONFIG = REPO_ROOT + "tests/framework/config/kind-config.yaml" | ||
REPO_ROOT = Path(__file__).absolute().parent.parent.parent |
I think you can do .parents[2].
But .parent.parent.parent is fine too.
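For context, here is a minimal sketch of the Path-based computation under review. Note that pathlib's parents sequence is zero-indexed, so parents[2] names the same ancestor as three .parent hops; the helm-chart and kind-config paths are taken from the original constants above.

```python
from pathlib import Path

p = Path(__file__).absolute()
# parents is zero-indexed: parents[0] == p.parent,
# so parents[2] is the same directory as p.parent.parent.parent.
assert p.parent.parent.parent == p.parents[2]

REPO_ROOT = p.parents[2]
# Path's / operator replaces the error-prone string concatenation.
HELM_CHART_ROOT = REPO_ROOT / "helm-chart"
DEFAULT_KIND_CONFIG = REPO_ROOT / "tests" / "framework" / "config" / "kind-config.yaml"
```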
!!!
@DmitriGekhtman These are wrong. These two issues are related to ...
Why are these changes needed?
Test operator chart: This PR uses the kuberay-operator chart to install the KubeRay operator. Hence, the operator chart itself is tested.
Refactor: `class CONST` and `class KubernetesClusterManager` should be singleton classes. However, the singleton design pattern is not encouraged, so we need to consider it thoroughly before we convert these two classes into singleton classes.
Refactor: Replace `os` with `subprocess` (a sketch follows this paragraph). The following paragraph is from Python's official documentation: "The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes."
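As a minimal sketch of the `os`-to-`subprocess` swap (the helm command shown is illustrative, not necessarily the framework's actual call):

```python
import subprocess

# Before: os.system("helm install kuberay-operator helm-chart/kuberay-operator")
# subprocess.run exposes the return code, STDOUT, and STDERR for assertions.
result = subprocess.run(
    ["helm", "install", "kuberay-operator", "helm-chart/kuberay-operator"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"helm install failed: {result.stderr}")
print(result.stdout)
```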
Skip `test_kill_head` due to [Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed #638 and [Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed #634.
Refactor: Replace all existing k8s API clients with `K8S_CLUSTER_MANAGER` (a minimal sketch follows).
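A hedged sketch of the singleton-like shape being discussed, assuming the official `kubernetes` Python client; a module-level instance gives one shared client without full singleton machinery. The `list_pods` helper is hypothetical, for illustration only.

```python
from kubernetes import client, config

class KubernetesClusterManager:
    """Wraps the Kubernetes API clients that the tests share."""

    def __init__(self) -> None:
        config.load_kube_config()          # assumes a reachable kubeconfig
        self.core_v1 = client.CoreV1Api()  # one shared CoreV1 client

    def list_pods(self, namespace: str = "default"):
        return self.core_v1.list_namespaced_pod(namespace).items

# Module-level instance: every importer sees the same object, which is
# the singleton-like behavior the description weighs.
K8S_CLUSTER_MANAGER = KubernetesClusterManager()
```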
Refactor and relieve flakiness of `test_ray_serve_work`: `working_dir` is out-of-date (see this comment for more details), but the tests sometimes pass due to an error in the original test logic: it only checks the exit code rather than STDOUT. => Solution: Update `working_dir` in `ray-service.yaml.template`.
When Pods are READY and RUNNING, RayService still needs tens of seconds before it can serve requests. The `time.sleep(60)` call is a workaround and should be removed when [RayService] Track whether Serve app is ready before switching clusters #730 is merged (a polling sketch follows the next paragraph).
Remove the NodePort service in `RayServiceTestCase`. Use a `curl` Pod to communicate with Ray via the ClusterIP service directly (see the sketch below). Originally, using a Docker container with `network_mode='host'` together with a NodePort service was a very odd pattern.
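A sketch of the curl-Pod pattern combined with the readiness polling that could replace `time.sleep(60)`. The Pod name `curl` and the success check are assumptions; the service DNS name comes from the flaky logs quoted below.

```python
import subprocess
import time

SERVE_URL = "http://rayservice-sample-serve-svc.default.svc.cluster.local:8000/"

def query_serve_via_curl_pod(timeout_s: int = 120) -> str:
    """Exec curl inside a pre-created 'curl' Pod (hypothetical name) and
    check STDOUT, not just the exit code, until the Serve app responds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        proc = subprocess.run(
            ["kubectl", "exec", "curl", "--", "curl", "-s", SERVE_URL],
            capture_output=True,
            text=True,
        )
        # A zero exit code alone is not enough; require a non-empty body.
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout
        time.sleep(5)  # retry instead of a single fixed 60-second sleep
    raise TimeoutError("RayService did not serve a response in time")
```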
Refactor: Remove the unused RayService templates `ray-service-cluster-update.yaml.template` and `ray-service-serve-update.yaml.template`. The original buggy test logic only checks the exit code rather than the STDOUT of the curl commands. Hence, the different templates are useless in `RayServiceTestCase`.
Refactor: Because the APIServer is not tested by any test case, remove everything related to the APIServer Docker image from the compatibility test.
Follow-up issues (@jasoonn and I will finish them):
Related issue number
Part of #184
Checks
Flakiness (no retry)
- Flaky failure (`test_detach_actor_1.py`) => This is weird, and may be caused by the weird pattern of Docker container and Kubernetes NodePort service.
- `curl: (7) Failed to connect to rayservice-sample-serve-svc.default.svc.cluster.local port 8000: Connection refused` / `command terminated with exit code 7` => This should be fixed by [RayService] Track whether Serve app is ready before switching clusters #730.
- `Exception: There was an exception during the execution of test_ray_serve_1.py.` => May be caused by Docker container and NodePort service.
- `Exception: There was an exception during the execution of test_ray_serve_1.py.` with `curl: (7) Failed to connect to rayservice-sample-serve-svc.default.svc.cluster.local port 8000: Connection refused command terminated with exit code 7` => This should be fixed by [RayService] Track whether Serve app is ready before switching clusters #730.