Add RayJob CRD and controller logic #303

Merged 7 commits on Jul 18, 2022

Conversation

harryge00
Contributor

@harryge00 harryge00 commented Jun 14, 2022

Why are these changes needed?

Add the RayJob CRD and its controller.
This includes:

  1. Get or create the corresponding RayCluster for each RayJob.
  2. Connect to the Ray dashboard.
  3. Submit the Ray job and check its status.
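
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the controller code from this PR: the `Dashboard` and `ClusterStore` interfaces, the in-memory fakes, and the status strings are all hypothetical stand-ins for the real Kubernetes client and Ray dashboard HTTP client.

```go
package main

import (
	"errors"
	"fmt"
)

// JobStatus is a hypothetical stand-in for the status reported by the Ray dashboard.
type JobStatus string

const (
	JobPending   JobStatus = "PENDING"
	JobSucceeded JobStatus = "SUCCEEDED"
)

var errJobNotFound = errors.New("job not found")

// ClusterStore is a hypothetical interface for looking up and creating RayClusters.
type ClusterStore interface {
	Get(name string) (bool, error)
	Create(name string) error
}

// Dashboard is a hypothetical interface over the Ray dashboard's job API.
type Dashboard interface {
	SubmitJob(jobID, entrypoint string) error
	GetJobStatus(jobID string) (JobStatus, error)
}

// reconcileJob runs one pass: ensure the cluster exists, then submit the job
// if the dashboard does not know it yet, otherwise report its current status.
func reconcileJob(clusters ClusterStore, dash Dashboard, clusterName, jobID, entrypoint string) (JobStatus, error) {
	// Step 1: get or create the cluster.
	exists, err := clusters.Get(clusterName)
	if err != nil {
		return "", err
	}
	if !exists {
		if err := clusters.Create(clusterName); err != nil {
			return "", err
		}
	}
	// Steps 2 and 3: ask the dashboard for the job; submit only when it is unknown.
	status, err := dash.GetJobStatus(jobID)
	if errors.Is(err, errJobNotFound) {
		if err := dash.SubmitJob(jobID, entrypoint); err != nil {
			return "", err
		}
		return JobPending, nil
	}
	if err != nil {
		return "", err
	}
	return status, nil
}

// In-memory fakes used only for this illustration.
type fakeClusters struct{ names map[string]bool }

func (f *fakeClusters) Get(name string) (bool, error) { return f.names[name], nil }
func (f *fakeClusters) Create(name string) error      { f.names[name] = true; return nil }

type fakeDashboard struct{ jobs map[string]JobStatus }

func (f *fakeDashboard) SubmitJob(jobID, entrypoint string) error {
	f.jobs[jobID] = JobPending
	return nil
}

func (f *fakeDashboard) GetJobStatus(jobID string) (JobStatus, error) {
	s, ok := f.jobs[jobID]
	if !ok {
		return "", errJobNotFound
	}
	return s, nil
}

func main() {
	clusters := &fakeClusters{names: map[string]bool{}}
	dash := &fakeDashboard{jobs: map[string]JobStatus{}}
	// The second pass finds the job already submitted and does not resubmit it.
	for i := 0; i < 2; i++ {
		status, err := reconcileJob(clusters, dash, "demo-cluster", "job-1", "python train.py")
		fmt.Println(status, err)
	}
}
```

Checking the job status before submitting keeps a repeated reconcile pass from submitting the same job twice, which is the property the review discussion below is concerned with.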

This PR borrows a lot from @brucez-anyscale, thanks so much.

Related issue number

#106

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Jeffwan
Collaborator

Jeffwan commented Jun 14, 2022

Thanks for the contribution! I will take a look today.

Collaborator

@Jeffwan Jeffwan left a comment

It seems some core logic in the job controller is not ready yet. I left some comments and will come back later for another round of review.

Resolved (outdated) review threads:

  • ray-operator/apis/ray/v1alpha1/rayjob_types.go (4)
  • ray-operator/config/crd/bases/ray.io_rayservices.yaml (1)
  • ray-operator/controllers/ray/utils/dashboard_httpclient.go (1)
  • ray-operator/controllers/ray/rayservice_controller.go (1)
  • ray-operator/controllers/ray/rayjob_controller.go (3)

Contributor

@brucez-anyscale brucez-anyscale left a comment

Thanks. Please update the naming in the job controller and run some manual integration tests to make sure it works.

Resolved (outdated) review thread: ray-operator/apis/ray/v1alpha1/rayjob_types.go

@@ -0,0 +1,7 @@
# The following patch adds a directive for certmanager to inject CA into the CRD
apiVersion: apiextensions.k8s.io/v1
Contributor

+1

Resolved (outdated) review threads:

  • ray-operator/controllers/ray/rayjob_controller.go (7)

@Jeffwan
Collaborator

Jeffwan commented Jul 2, 2022

@harryge00 Did you get a chance to update the PR and address the comments? We are close to the v0.3.0 release and plan to include this feature in it.

@harryge00 harryge00 changed the title [WIP] Add RayJob CRD and controller logic Add RayJob CRD and controller logic Jul 12, 2022
@harryge00
Contributor Author

@Jeffwan @brucez-anyscale I have just updated this PR. Could you take another look?

@Jeffwan
Collaborator

Jeffwan commented Jul 12, 2022

@harryge00 Could you help fix the linter and build issues?

Resolved (outdated) review threads:

  • ray-operator/apis/ray/v1alpha1/rayjob_types.go (2)
  • ray-operator/controllers/ray/common/service.go (2)
  • ray-operator/controllers/ray/rayjob_controller.go (2)
  • ray-operator/controllers/ray/utils/dashboard_httpclient.go (1)

@Jeffwan
Collaborator

Jeffwan commented Jul 14, 2022

I think the polished version is close to complete. Great work! Can you help address my comments above, @harryge00?

@brucez-anyscale
Contributor

We should at least add unit tests. Also, please test it manually on an EKS, GKE, or kind cluster.

@Jeffwan
Collaborator

Jeffwan commented Jul 18, 2022

@harryge00 Please help fix the build and lint issues. The logic looks good to me, and we can file follow-up PRs to address potential issues.

// Set rayClusterName and rayJobId first, to avoid duplicate submission
err = r.setRayJobIdAndRayClusterNameIfNeed(ctx, rayJobInstance)
if err != nil {
Contributor Author

@harryge00 harryge00 Jul 18, 2022

@brucez-anyscale
#303 (comment)
I now save the cluster name and job ID before reconciling the RayJob. This way the RayCluster will not be created twice, because only the first status update will succeed.

Contributor

If you update the status in setRayJobIdAndRayClusterNameIfNeed, it is better to return from this reconcile loop right away.
Otherwise, you should get rayJobInstance again, since its resource version has changed.

@harryge00
Contributor Author

@Jeffwan @brucez-anyscale Could you take another look? I have addressed the remaining issues.

@Jeffwan
Collaborator

Jeffwan commented Jul 18, 2022

@harryge00 Awesome! There are a few issues that need to be improved later; I can help with them:

  1. jobDeploymentStatus seems to stay at the Running stage for successful runs.
  2. spec.jobId is not being used to set custom job IDs.
  3. spec.shutdownAfterJobFinishes is not implemented yet.
  4. Provide an easier way to fetch the job failure message from the CR.
  5. I see frequent status updates, which can be optimized by comparing the status before and after.
  6. Log consistency problems.
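
As a rough illustration of item 5, a status write can be skipped when nothing has changed. The `jobStatus` struct, its fields, and the `statusWriter` type below are assumptions made for this sketch; they do not reflect the actual RayJob status type or the Kubernetes client.

```go
package main

import (
	"fmt"
	"reflect"
)

// jobStatus is a hypothetical, simplified status struct.
type jobStatus struct {
	JobStatus           string
	JobDeploymentStatus string
	Message             string
}

// statusWriter stands in for the client that persists status updates.
type statusWriter struct {
	stored jobStatus
	writes int
}

// updateIfChanged persists next only when it differs from the stored status,
// and reports whether a write happened.
func (w *statusWriter) updateIfChanged(next jobStatus) bool {
	if reflect.DeepEqual(w.stored, next) {
		return false // identical status, skip the write
	}
	w.stored = next
	w.writes++
	return true
}

func main() {
	w := &statusWriter{}
	running := jobStatus{JobStatus: "RUNNING", JobDeploymentStatus: "Running"}
	fmt.Println(w.updateIfChanged(running)) // first observation is written
	fmt.Println(w.updateIfChanged(running)) // unchanged, so no write
	fmt.Println(w.writes)
}
```

`reflect.DeepEqual` is used here so the comparison would still work if the status later gained slice or map fields; for a struct of plain strings, `==` would do as well.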

dashboardAgentService.Name,
dashboardAgentService.Namespace,
dashboardPort)
log.V(1).Info("fetchDashboardAgentURL ", "dashboardURL", dashboardAgentURL)
Contributor

log.V(1).Info("fetchDashboardAgentURL ", "dashboardAgentURL", dashboardAgentURL)

Collaborator

@brucez-anyscale It seems the key change here is dashboardURL -> dashboardAgentURL? I can help update it in follow-up PRs, so let's merge this one. (We need this change downstream to unblock other stories.)

@brucez-anyscale
Contributor

Great! Generally looks good. Please address my comments and then feel free to merge.

@Jeffwan Jeffwan merged commit e1e41c3 into ray-project:master Jul 18, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* Generated RayJob CRD

* Add the RayJob controller

* Update vendor

* Update generated code

* Add unit tests

* Refactor rayJob controller

* Update to pass CI