
[Feature] Refactor test framework & test kuberay-operator chart with configuration framework #759

Merged: 18 commits, Dec 1, 2022

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Nov 24, 2022

Why are these changes needed?

  1. Test the operator chart: this PR uses the kuberay-operator Helm chart to install the KubeRay operator, so the chart itself is now exercised by the tests.

  2. Refactor: class CONST and class KubernetesClusterManager are conceptually singletons. However, the singleton design pattern is often discouraged, so we should weigh the trade-offs carefully before converting these two classes into true singletons.

  3. Refactor: replace os.system calls with subprocess (a minimal sketch follows this list). The following paragraph is from Python's official documentation for os.system:

    The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

  4. Skip test_kill_head due to:

    • I observed that workers sometimes crash when the head pod is killed.
  5. Refactor: Replace all existing k8s api clients with K8S_CLUSTER_MANAGER.

  6. Refactor and reduce the flakiness of test_ray_serve_work:

    • working_dir is out-of-date (see this comment for more details), but the tests sometimes passed anyway because of a flaw in the original test logic. => Solution: update working_dir in ray-service.yaml.template.
    • To elaborate, the flaw mentioned above is that the test only checks the exit code rather than STDOUT.
    • Even when the Pods are READY and RUNNING, the RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed when [RayService] Track whether Serve app is ready before switching clusters #730 is merged.
    • Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly via the ClusterIP service (see the second sketch after this list). The original combination of a Docker container with network_mode='host' and a NodePort service was convoluted.
  7. Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checked the exit code rather than the STDOUT of the curl commands, so the separate templates served no purpose in RayServiceTestCase.

  8. Refactor: because the APIServer is not exercised by any test case, remove everything related to the APIServer Docker image from the compatibility test.
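For illustration, a minimal sketch of the os-to-subprocess replacement from item 3, combined with the STDOUT check from item 6. The helper name shell_run and the URL are hypothetical rather than the framework's actual API; the point is that subprocess.run captures STDOUT, whereas os.system only surfaces an exit code:

    import subprocess

    def shell_run(command: str) -> str:
        """Run a shell command; raise CalledProcessError on a non-zero
        exit code and return the captured STDOUT for assertions."""
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, check=True
        )
        return result.stdout

    # os.system("curl ...") could only tell us that curl exited with 0;
    # capturing STDOUT lets the test assert on the response body itself.
    output = shell_run("curl -s http://127.0.0.1:8000")
    assert "Expected response" in output, f"unexpected response: {output!r}"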

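Likewise, a sketch of the curl-Pod approach from item 6: a one-shot Pod issues the request against the ClusterIP Service from inside the cluster, so neither a NodePort service nor network_mode='host' is required. The Serve service name matches the flakiness logs below; the Pod name, image, and expected response are illustrative:

    import subprocess

    # Launch a one-shot curl Pod inside the cluster; --rm and
    # --restart=Never make kubectl clean up the Pod after it exits.
    result = subprocess.run(
        [
            "kubectl", "run", "curl-test", "--rm", "-i", "--restart=Never",
            "--image=curlimages/curl", "--",
            "curl", "-s",
            "http://rayservice-sample-serve-svc.default.svc.cluster.local:8000",
        ],
        capture_output=True, text=True, check=True,
    )
    # Assert on the response body, not just the exit code (see item 6).
    assert "Expected response" in result.stdout, result.stdout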
Follow up issues (@jasoonn and I will finish them):

Related issue number

Part of #184

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Flakiness (no retry)

  • 445b61f: Pass
  • d599c71: Pass
  • d977b2b: Fail
    • Ray client cannot connect to the Ray cluster (test_detach_actor_1.py) => This is odd and may be caused by the unusual pattern of combining a Docker container with a Kubernetes NodePort service.
    • curl: (7) Failed to connect to rayservice-sample-serve-svc.default.svc.cluster.local port 8000: Connection refused command terminated with exit code 7 => This should be fixed by [RayService] Track whether Serve app is ready before switching clusters #730
  • 829195b: Pass
  • 78849d6: Fail
    • Exception: There was an exception during the execution of test_ray_serve_1.py. => May be caused by the Docker container and NodePort service pattern.
  • 66107f9: Fail

@kevin85421 kevin85421 marked this pull request as ready for review November 26, 2022 08:12
@kevin85421
Member Author

cc @jasoonn Would you mind reviewing this PR? Thank you!

Contributor

@jasoonn jasoonn left a comment


@kevin85421 Thanks for the contribution!

Resolved review threads: tests/compatibility-test.py (3, one outdated), tests/framework/prototype.py (1)
@DmitriGekhtman
Collaborator

Skip test_kill_head.

Could you give context on that decision? Is the test complete nonsense or do we need to follow up -- do we need to fix it or write a test that covers similar logic?

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 30, 2022

Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checked the exit code rather than the STDOUT of the curl commands, so the separate templates served no purpose in RayServiceTestCase.

Could you make a note in the code and on the RayService test issue that

  • The previous logic for testing updates was problematic
  • We need to test RayService updates

@DmitriGekhtman
Collaborator

Some minor style complaints above.

Overall this looks great! The KubeRay CI should be much greener after these changes.

REPO_ROOT = "/" + "/".join(os.path.abspath(os.path.dirname(__file__)).split('/')[1:-2]) + "/"
HELM_CHART_ROOT = REPO_ROOT + "helm-chart"
DEFAULT_KIND_CONFIG = REPO_ROOT + "tests/framework/config/kind-config.yaml"
REPO_ROOT = Path(__file__).absolute().parent.parent.parent
Collaborator

I think you can do .parents[3].
But .parent.parent.parent is fine too.
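For reference, pathlib's parents sequence is zero-indexed: parents[0] is the immediate parent, so .parent.parent.parent corresponds to parents[2]. A small sketch (the path is illustrative):

    from pathlib import Path

    p = Path("/repo/tests/framework/prototype.py")
    # parents[0] == p.parent, so three .parent hops equal parents[2].
    assert p.parent.parent.parent == p.parents[2]
    print(p.parents[2])  # -> /repo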

@kevin85421
Member Author

Skip test_kill_head.

Could you give context on that decision? Is the test complete nonsense or do we need to follow up -- do we need to fix it or write a test that covers similar logic?

Collaborator

@DmitriGekhtman DmitriGekhtman left a comment


!!!

@kevin85421
Member Author

@DmitriGekhtman Those issue links (#638 and #634, originally cited as the reason for skipping test_kill_head) are wrong: the two issues are related to test_detach_actor and test_ray_serve. However, I have observed that workers sometimes crash when the head pod is killed.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023:
[Feature] Refactor test framework & test kuberay-operator chart with configuration framework (ray-project#759)