
[kuberay][autoscaler] Update KubeRay version to v1.0.0 #40918

Merged

Conversation

@kevin85421 kevin85421 (Member) commented Nov 2, 2023

Why are these changes needed?

  • Create a YAML file ray-cluster.autoscaler-template.yaml for testing instead of using the YAML file ray-cluster.autoscaler.yaml from the KubeRay repository.

    • Note that the tests assume that both head and worker Pods have exactly 1 CPU. Hence, if we set num-cpus: "0" in the head's rayStartParams, the current test logic would not work (see the first sketch after this list).
  • Why did I remove the test for "Confirming that the operator and autoscaler ignore pods marked for termination"?

    • KubeRay tries to maintain the desired number of runningPods, but the definition of runningPods differs between KubeRay versions (see the second sketch after this list).
      • Definition 1: For KubeRay v0.6.0 and older, runningPods are the Pods that are running or pending and not terminating.
      • Definition 2: For KubeRay v1.0.0, runningPods are the Pods whose Ray containers have not actually terminated. See [GCS FT] Consider the case of sidecar containers kuberay#1386 for more details.
      • That is, under definition 1, KubeRay may create new Pods while some Pods are still terminating. Hence, it is possible to have more than maxReplicas Pods and Ray nodes from both the Kubernetes and Ray perspectives. Under definition 2, KubeRay only creates new Pods once the Ray nodes have actually terminated.
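
First sketch: a hypothetical check of the 1-CPU assumption. The path and the CR layout here are assumptions based on the standard RayCluster spec, not code from this PR:

    import yaml

    # Hypothetical path to the template added in this PR.
    TEMPLATE_PATH = "ray-cluster.autoscaler-template.yaml"

    with open(TEMPLATE_PATH) as f:
        config = yaml.safe_load(f)

    # The standard RayCluster CR keeps head start parameters here.
    head_params = config["spec"]["headGroupSpec"]["rayStartParams"]

    # The tests assume every Ray node advertises exactly 1 CPU, so the
    # head must not be started with num-cpus "0".
    assert head_params.get("num-cpus", "1") != "0"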

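Second sketch: a minimal comparison of the two runningPods definitions, using plain dicts as stand-ins for Pod objects. The field names are illustrative, not KubeRay's actual data model:

    def running_pods_v0_6(pods):
        """Definition 1 (KubeRay v0.6.0 and older): Pods that are running
        or pending and whose deletion has not started."""
        return [
            p for p in pods
            if p["phase"] in ("Running", "Pending")
            and p["deletion_timestamp"] is None
        ]

    def running_pods_v1_0(pods):
        """Definition 2 (KubeRay v1.0.0): Pods whose Ray container has not
        actually terminated, even if the Pod is already terminating."""
        return [p for p in pods if not p["ray_container_terminated"]]

    # A terminating Pod whose Ray container is still shutting down:
    pod = {"phase": "Running",
           "deletion_timestamp": "2023-11-02T00:00:00Z",
           "ray_container_terminated": False}
    assert running_pods_v0_6([pod]) == []     # definition 1 ignores it
    assert running_pods_v1_0([pod]) == [pod]  # definition 2 still counts it

Because definition 1 stops counting a Pod as soon as it starts terminating, KubeRay may create a replacement early, briefly exceeding maxReplicas; definition 2 waits until the Ray node is actually gone.
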
Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@@ -79,7 +79,12 @@ def _get_ray_cr_config(
    """
    with open(EXAMPLE_CLUSTER_PATH) as ray_cr_config_file:
        ray_cr_config_str = ray_cr_config_file.read()
    config = yaml.safe_load(ray_cr_config_str)

    kuberay_crd_sets = set(["RayCluster", "RayJob", "RayService"])
kevin85421 (Member Author) commented:

A YAML file may have multiple K8s objects.
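
As a sketch of what parsing such a multi-object manifest might look like (using PyYAML's safe_load_all; the PR's actual logic may differ):

    import yaml

    kuberay_crd_sets = set(["RayCluster", "RayJob", "RayService"])

    # Hypothetical path; EXAMPLE_CLUSTER_PATH plays this role in the diff above.
    with open("ray-cluster.autoscaler-template.yaml") as f:
        documents = list(yaml.safe_load_all(f.read()))

    # A single YAML file may hold several K8s objects (ConfigMaps, Services,
    # and the RayCluster itself); keep only the KubeRay custom resources.
    ray_crs = [d for d in documents if d and d.get("kind") in kuberay_crd_sets]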

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title [kuberay][autoscaler] Update KubeRay version to v1.0.0-rc.2 [kuberay][autoscaler] Update KubeRay version to v1.0.0 Nov 6, 2023
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 marked this pull request as ready for review November 6, 2023 14:47
@architkulkarni (Contributor) commented:

test_memory_pressure unrelated
Windows serve tests unrelated
Windows wheels failure unrelated: "RuntimeError: Detected Python version 3.7, which is not supported. Only Python 3.8, 3.9, 3.10, 3.11 are supported."

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 6, 2023
@architkulkarni architkulkarni merged commit 390738a into ray-project:master Nov 6, 2023
29 of 33 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023