Initial support for external Redis and GCS HA #294
cc @iycheng @Jeffwan @scarlet25151
To be clear, does it work for 1.12.1 or just the nightly version?
Right now it works only with nightly, because @iycheng's changes are only available in the nightly build. They might become available in 1.13 or a later version. This needs confirmation from Yi Cheng.
@DmitriGekhtman for what? :)
		pod.Spec.Containers[rayContainerIndex].LivenessProbe = probe
	}
	// add liveness probe exec command in case missing
	initLivenessProbeHandler(pod.Spec.Containers[rayContainerIndex].LivenessProbe, rayNodeType)
Just a thought, maybe for later: I wonder if these probes are useful for restarting pods even if HA isn't enabled.
Sounds good. We can discuss this later.
Looks great. Left a couple of comments on the new docs.
To justify the complexity of the readiness probe and event handling, we should, if we haven't yet, plan what kind of recovery logic we want to implement (besides deleting the head pod).
Are we good to merge this yet?
I think we can merge this and have follow-up PRs to improve it.
@DmitriGekhtman @brucez-anyscale We are waiting for a test fix. I am asking @wilsonwang371 to help fix it, and then we can merge. It should be ready by noon today.
remove duplicated constants
Update docs/guidance/gcs-ha.md
Co-Authored-By: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
@iycheng It seems the external storage namespace has an issue. I have temporarily disabled it; we may need to debug this later.
@iycheng @wilsonwang371 Let's follow up and solve it.
I have posted a new MR: #406
* add implementation and test cases for ray ha
* bug fix add ray serve test use label to specify ray ha enabled or not
* bug fix
* update github pipeline
* update readiness & liveness command
* add Ray GCS HA document
* bug fix
* address comment from @Jeffwan
* use annotation for ray gcs ha
* remove redis install command
* typo fix and create function initTemplateAnnotations
* fix probe initial parameter issue
* support customized ray external storage namespace
* update GCS HA document remove duplicated constants Update docs/guidance/gcs-ha.md Co-Authored-By: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
* try not use UID as storage namespace
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Why are these changes needed?
Initial support for GCS HA in KubeRay.
The design document is at: https://github.com/wilsonwang371/kuberay/blob/wilson/ray2.0_gcs_ha/docs/guidance/gcs-ha.md
Changes
Enable GCS HA
To enable GCS HA, a new annotation must be added to the RayCluster YAML file.
When this annotation is present in the RayCluster YAML file, every newly created head node and worker node pod in this Ray cluster will have the same annotation attached.
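As a rough illustration of the propagation described above, here is a minimal Go sketch. It is not the code in this PR: `propagateAnnotation` is a hypothetical helper, plain maps stand in for the Kubernetes object metadata, and the annotation key is passed in as a parameter rather than hard-coded.

```go
package main

// propagateAnnotation copies one annotation from the RayCluster's annotations
// onto a pod's annotations, if the cluster carries it.
// Hypothetical helper for illustration only; the real operator works on
// Kubernetes object metadata rather than bare maps.
func propagateAnnotation(clusterAnnotations, podAnnotations map[string]string, key string) map[string]string {
	value, ok := clusterAnnotations[key]
	if !ok {
		// The cluster does not opt in, so the pod is left untouched.
		return podAnnotations
	}
	if podAnnotations == nil {
		// Pod templates may have no annotations map yet.
		podAnnotations = make(map[string]string)
	}
	podAnnotations[key] = value
	return podAnnotations
}
```

Calling this once per newly built head or worker pod would give every pod the same marker as the cluster, which later steps (such as probe injection) could then check.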
Readiness Probe & Liveness Probe
KubeRay uses both the Readiness Probe and the Liveness Probe to help a Ray cluster recover from failures.
The Readiness Probe is used to discover failures early and recover if possible. (The recovery action is not implemented yet.)
The Liveness Probe is used as a last resort to recover the cluster by restarting the failed worker or head node.
If GCS HA is enabled, a default Readiness Probe and a default Liveness Probe are added to the newly created pods in this Ray cluster. If the user specifies their own Readiness Probe and Liveness Probe, the default ones are not added to the head and worker nodes created.
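The defaulting rule above could be sketched in Go as follows. This is a simplified sketch under stated assumptions, not the PR's implementation: `Probe` and `Container` are stand-in types rather than the Kubernetes API types, `applyDefaultProbes` is a hypothetical name, and treating each probe independently is an assumption about how the rule is applied.

```go
package main

// Probe is a simplified stand-in for a Kubernetes probe (assumption: only
// the fields needed for this illustration are modeled).
type Probe struct {
	Command             []string
	InitialDelaySeconds int
	PeriodSeconds       int
}

// Container is a simplified stand-in for the Ray container in a pod spec.
type Container struct {
	ReadinessProbe *Probe
	LivenessProbe  *Probe
}

// applyDefaultProbes fills in default probes only when GCS HA is enabled and
// the user has not supplied a probe of that kind. Hypothetical helper.
func applyDefaultProbes(c *Container, haEnabled bool, defaultReadiness, defaultLiveness Probe) {
	if !haEnabled {
		return // without HA, no default probes are added
	}
	if c.ReadinessProbe == nil {
		p := defaultReadiness // copy so containers do not share one struct
		c.ReadinessProbe = &p
	}
	if c.LivenessProbe == nil {
		p := defaultLiveness
		c.LivenessProbe = &p
	}
}
```

A user-specified probe is left exactly as written, so a cluster owner can tune timing or commands without the operator overwriting them.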
Related issue number
#290
Checks