Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler][kubernetes][operator] Rudimentary error handling, make "MODIFIED" -> update event work. #13756

Merged
merged 5 commits into from
Feb 4, 2021

Conversation

DmitriGekhtman
Copy link
Contributor

@DmitriGekhtman DmitriGekhtman commented Jan 28, 2021

Why are these changes needed?

(A)
This PR adds to the CRD a status subresource with a phase field.
The phase field is either Running when things are as normal or Error when there's an Python exception thrown processing one of the ADDED/MODIFIED/DELETED events for a RayCluster CR.
(Previously an error would completely stop the operator.)

The status is printed with kubectl get raycluster. For example:
Screen Shot 2021-01-27 at 9 17 38 PM

(B)
This PR also fixes the operator's update logic -- previously, upon receiving a "MODIFIED" event for a CR, it would relaunch the cluster without updating the cluster config . The config is now modified.

If this PR is merged, there would be an incompatibility between versions of operator.yaml slightly older than latest master and the operator code in rayproject/ray:nightly which is the image used in operator.yaml.
(The new operator code acts on rayclusters/status but the old operator.yaml doesn't grant that permission.)

Partly motivated by that, I've added a note to operator documentation saying that the operator is experimental and the config files from latest master should be used.

Related issue number

Closes #13669

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Added logic testing that cluster update works to operator unit test. Unit test passes locally.
Manually tested the Error status logic by inserting nonsense into the MODIFIED elif branch. Figuring out how to unit test this is too much effort for now.

@DmitriGekhtman DmitriGekhtman added this to the Serverless Autoscaling milestone Jan 28, 2021
@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 1, 2021
@DmitriGekhtman DmitriGekhtman changed the title [autoscaler][kubernetes][operator] Rudimentary error handling [autoscaler][kubernetes][operator] Rudimentary error handling, make "MODIFIED" -> update event work. Feb 2, 2021
@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 2, 2021
@DmitriGekhtman
Copy link
Contributor Author

@yiranwang52 @edoakes
I'm done debugging this -- could you take a quick look at some point?

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar with the k8s APIs used here but looks reasonable to me

@DmitriGekhtman
Copy link
Contributor Author

@edoakes Could you merge this?

@edoakes edoakes merged commit 1187d1d into ray-project:master Feb 4, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…, make "MODIFIED" -> update event work. (ray-project#13756)"

This reverts commit 3749d2d.
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…, make "MODIFIED" -> update event work. (ray-project#13756)"

This reverts commit 3749d2d.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[autoscaler][kubernetes][operator] Operator error handling
4 participants