Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for external Redis and GCS HA #294

Merged
merged 15 commits into from
Jul 20, 2022

Conversation

wilsonwang371
Copy link
Collaborator

@wilsonwang371 wilsonwang371 commented Jun 8, 2022

Why are these changes needed?

Initial support for GCS HA in kuberay.

Design document is at: https://github.com/wilsonwang371/kuberay/blob/wilson/ray2.0_gcs_ha/docs/guidance/gcs-ha.md

Changes

  1. added yaml for Ray GCS HA + External Redis Deployment
  2. a document describing Ray GCS HA details.
  3. introduced new annotation in RayCluster to enable Ray GCS HA
  4. Added Event reconcile function so that in case of Readiness Probe failure, we can try to recover head or worker node if possible.
  5. Added Liveness and Readiness Probes based on new features introduced in [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs ray#26408
  6. Added Ray detached actor and Ray serve basic tests in Compatibility Test.
  7. Updated github workflow to reduce its running time

Enable GCS HA

To enable GCS HA, in RayCluster yaml file, a new annotation is required.

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  annotations:
    ray.io/ha-enabled: "true" # <- add this annotation enable GCS HA
...

When this annotation is added to RayCluster yaml file, all newly created head node and worker nodes pods in this ray cluster will have the same annotation attached.

ray.io/ha-enabled: "true" 

Readiness Probe & Liveness Probe

Readiness Probe and Liveness Probe are both used by kuberay to help Ray cluster recover from failures.

Readiness Probe is used to discover early failures and recover if possible. (recovery action is not implemented yet.)
Liveness Probe is used as the last resort to recover the cluster by restarting failed worker/head node

By default, if GCS HA is enabled, default Readiness Probe and default Liveness Probe will be added to the newly created pods in this Ray cluster. If user specified different Readiness Probe & Liveness Probe, default ones will not be added to the head & worker nodes created.

Related issue number

#290

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@wilsonwang371
Copy link
Collaborator Author

cc @iycheng @Jeffwan @scarlet25151

@Jeffwan
Copy link
Collaborator

Jeffwan commented Jun 8, 2022

To be clear, does it work for 1.12.1 or just nightly version?

@wilsonwang371
Copy link
Collaborator Author

To be clear, does it work for 1.12.1 or just nightly version?

Right now it is for nightly because @iycheng 's changes are only available in nightly build. It might be available later in 1.13 or later versions. This need confirmation from Yi Cheng's side.

@wilsonwang371 wilsonwang371 force-pushed the wilson/ray2.0_gcs_ha branch 6 times, most recently from 92a57ff to 0444021 Compare June 22, 2022 23:04
@wilsonwang371 wilsonwang371 changed the title initial support for external redis and gcs ha [WIP]initial support for external redis and gcs ha Jun 23, 2022
@wilsonwang371 wilsonwang371 force-pushed the wilson/ray2.0_gcs_ha branch 20 times, most recently from 7d4436c to a4119e7 Compare June 23, 2022 23:48
@DmitriGekhtman
Copy link
Collaborator

@DmitriGekhtman for this.

@brucez-anyscale ,

@DmitriGekhtman for what? :)

pod.Spec.Containers[rayContainerIndex].LivenessProbe = probe
}
// add liveness probe exec command in case missing
initLivenessProbeHandler(pod.Spec.Containers[rayContainerIndex].LivenessProbe, rayNodeType)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought for, maybe for later -- I wonder if these probes are useful for restarting pods even if HA isn't enabled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. we can have some discussion on this later.

Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Left a couple of comments on the new docs.

@DmitriGekhtman DmitriGekhtman mentioned this pull request Jul 19, 2022
4 tasks
@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Jul 19, 2022

To justify the complexity of the readiness probe and event handling, we should, if we haven't yet, plan what kind of recovery logic we want to implement (besides deleting the head pod).

@DmitriGekhtman
Copy link
Collaborator

Are we good to merge this yet?

@brucez-anyscale
Copy link
Contributor

I think we can merge this and have followup prs to improve it

@Jeffwan
Copy link
Collaborator

Jeffwan commented Jul 20, 2022

@DmitriGekhtman @brucez-anyscale waiting for a test fix. I am asking @wilsonwang371 to help fix it and then we can merge. It should be good by this noon

remove duplicated constants

Update docs/guidance/gcs-ha.md

Co-Authored-By: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
@wilsonwang371 wilsonwang371 force-pushed the wilson/ray2.0_gcs_ha branch 3 times, most recently from c12e66a to 7e9c82c Compare July 20, 2022 18:40
@wilsonwang371
Copy link
Collaborator Author

@iycheng it seems like external storage namespace has some issue. I temporarily disabled this, we may need to do some debug later on this.

@Jeffwan Jeffwan merged commit 2e0e9cd into ray-project:master Jul 20, 2022
@brucez-anyscale
Copy link
Contributor

@iycheng it seems like external storage namespace has some issue. I temporarily disabled this, we may need to do some debug later on this.

@iycheng @wilsonwang371 Let's follow up and solve it.

@wilsonwang371
Copy link
Collaborator Author

@iycheng it seems like external storage namespace has some issue. I temporarily disabled this, we may need to do some debug later on this.

@iycheng @wilsonwang371 Let's follow up and solve it.

I have posted a new MR. #406

@wilsonwang371 wilsonwang371 deleted the wilson/ray2.0_gcs_ha branch August 18, 2022 04:35
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* add implementation and test cases for ray ha

* bug fix

add ray serve test

use label to specify ray ha enabled or not

* bug fix

* update github pipeline

* update readiness & liveness command

* add Ray GCS HA document

* bug fix

* address comment from @Jeffwan

* use annotation for ray gcs ha

* remove redis install command

* typo fix and create function initTemplateAnnotations

* fix probe initial parameter issue

* support customized ray external storage namespace

* update GCS HA document

remove duplicated constants

Update docs/guidance/gcs-ha.md

Co-Authored-By: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>

* try not use UID as storage namespace

Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants