
[Feature] Add job spec in raycluster CRD #106

Closed · 2 tasks done
chenk008 opened this issue Nov 29, 2021 · 13 comments
Labels: enhancement (New feature or request), operator

@chenk008 (Contributor) commented Nov 29, 2021

Search before asking

  • I have searched the issues and found no similar feature request.

Description

For now, we can use the RayCluster CR to set up a Ray cluster, and the head node runs a dashboard that can handle job submission.

We've been discussing the same issue recently. We could support starting a job when the Ray cluster is ready. Maybe it will look like a Flink job cluster.

@Jeffwan @pcmoritz @DmitriGekhtman WDYT?

Use case

Create a Ray cluster and run a job automatically.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
chenk008 added the enhancement label on Nov 29, 2021
@pcmoritz (Collaborator)

We should make sure the job format is compatible with https://docs.ray.io/en/master/ray-job-submission/overview.html
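
For reference, the core of that job format is an entrypoint command plus an optional runtime_env. Below is a minimal sketch of the payload as Go types; the field names follow the linked docs but may differ between Ray versions, so treat them as illustrative:

// Minimal sketch of the Ray Job Submission payload the CRD would need to
// stay compatible with. Field names follow the docs linked above and are
// illustrative, not authoritative.
package jobformat

type RayJobRequest struct {
	Entrypoint string                 `json:"entrypoint"`            // e.g. "python my_script.py"
	JobID      string                 `json:"job_id,omitempty"`      // generated when omitted
	RuntimeEnv map[string]interface{} `json:"runtime_env,omitempty"` // working_dir, pip, env_vars, ...
}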

@Jeffwan (Collaborator) commented Nov 30, 2021

@pcmoritz Any further follow-ups on the job-level support at the Kubernetes layer at Anyscale?

Jeffwan added this to the v0.3.0 release milestone on Jan 19, 2022
@Jeffwan (Collaborator) commented Feb 16, 2022

If we build the solution on top of https://docs.ray.io/en/latest/ray-job-submission/overview.html, that means:

  1. The operator launches the cluster.
  2a. The operator submits the job to the remote cluster. (If the operator is the submitter, the working dir has to be a remote URI, or everything has to be built into the container.)
  2b. Alternatively, the operator creates a dedicated object for job submission, probably a job client pod.
  3. The operator polls the job status when there are events. (Reconcile is triggered by events; is it acceptable to have delays?)

We've talked about 2a in the past. (https://docs.google.com/document/d/1aKet8Zt8FLeZvsJGJeF2G_-9u2_UWf-AXgNRSvSQxcI/edit)

Does it sound like a reasonable path? In this case, the operator does more than we originally expected.

The submit options can probably be reused in the CRD; see the sketch after the CLI help below.

➜  ray job submit --help
Usage: ray job submit [OPTIONS] ENTRYPOINT...

  Submit a job to be executed on the cluster.

Options:
  --address TEXT           Address of the Ray cluster to connect to. Can also
                           be specified using the RAY_ADDRESS environment
                           variable.
  --job-id TEXT            Job ID to specify for the job. If not provided, one
                           will be generated.
  --runtime-env TEXT       Path to a local YAML file containing a runtime_env
                           definition.
  --runtime-env-json TEXT  JSON-serialized runtime_env dictionary.
  --working-dir TEXT       Directory containing files that your job will run
                           in. Can be a local directory or a remote URI to a
                           .zip file (S3, GS, HTTP). If specified, this
                           overrides the option in --runtime-env.
  --help                   Show this message and exit.
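
To make that reuse concrete, here is a rough sketch of how those submit options might map onto CRD fields, written as Go API types in the operator's style. The names and shapes are illustrative, not a final API:

// Hypothetical Go API types for a jobSpec block in the RayCluster CRD,
// mirroring the `ray job submit` options above. Illustrative only.
package v1alpha1

// JobSpec describes a job to run once the Ray cluster is ready.
type JobSpec struct {
	// Entrypoint is the shell command to run, e.g. "python my_script.py".
	Entrypoint string `json:"entrypoint"`

	// JobId pins the job ID; one is generated when empty (cf. --job-id).
	JobId string `json:"jobId,omitempty"`

	// RuntimeEnv is a runtime_env definition, serialized as YAML or JSON
	// (cf. --runtime-env / --runtime-env-json).
	RuntimeEnv string `json:"runtimeEnv,omitempty"`

	// WorkingDir is a remote URI to a .zip file (S3, GS, HTTP); a local
	// path only works if the submitter can see it, which is exactly the
	// 2a vs. 2b question above (cf. --working-dir).
	WorkingDir string `json:"workingDir,omitempty"`
}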

@chenk008 (Contributor, Author)

I prefer the 2b proposal:

  1. The operator shouldn't be coupled to the job manager; otherwise, we would need to modify the operator whenever the job manager API changes in the future.
  2. I like running ray job submit in a dedicated object for job submission. The Flink operator uses a batch Job to submit jobs; see the sketch below.
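
To illustrate the 2b approach, the dedicated submitter could be a Kubernetes batch Job whose pod runs ray job submit against the cluster's dashboard. Here is a minimal sketch using client-go types; the head service name and dashboard port follow common KubeRay conventions but are assumptions, not a settled design:

// Sketch only: a dedicated submitter Job, so the operator never talks to
// the job manager directly (option 2b).
package submitter

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// submitterJob builds a Job whose single pod submits the entrypoint to the
// cluster's dashboard. The service name and port 8265 are assumed conventions.
func submitterJob(clusterName, namespace, entrypoint string) *batchv1.Job {
	address := fmt.Sprintf("http://%s-head-svc.%s.svc.cluster.local:8265",
		clusterName, namespace)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-job-submitter",
			Namespace: namespace,
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "ray-job-submitter",
						Image: "rayproject/ray:latest",
						Command: []string{"ray", "job", "submit",
							"--address", address, "--", entrypoint},
					}},
				},
			},
		},
	}
}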

@chenk008 (Contributor, Author) commented Feb 18, 2022

I will create some patches to finish this feature:
patch 1: Add job spec in raycluster CRD #153
patch 2: Let the ray-operator submit the job; this implementation detail is still under discussion
patch 3: Add some samples to the config and docs

@simon-mo (Collaborator)

Following up on the previous discussion in the meeting: how about this alternative, which decouples the job spec from the cluster CR?

Before:

kind: RayCluster
spec:
  jobSpec: 
    entryPoint: 
    runtimeEnv: ..

After (two YAMLs, but they can be stitched together):

kind: RayCluster
metadata:
  labels:
    name: myCluster
spec: ....
---
kind: RayJob
spec:
  jobSpec:
    entryPoint:
    runtimeEnv: ..
  clusterSelector:
    labels:
      name: myCluster

This requires a separate controller to manage the RayJob CR. Additionally, it supports separate versioning of RayJob and RayCluster, since RayJob is evolving faster than RayCluster.

The RayJob controller should also use the REST API for job submission and poll its status to ensure completion.
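
For concreteness, that submit-and-poll flow could look roughly like the sketch below, using the dashboard's Jobs REST endpoints (POST /api/jobs/ to submit, GET /api/jobs/<id> to poll). The exact response fields vary between Ray versions, and a real controller would requeue the reconcile instead of blocking in a loop:

// Sketch of submitting a job over the Ray Jobs REST API and polling until
// it reaches a terminal state. Field names are illustrative; error handling
// is minimal on purpose.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func submitAndWait(dashboardURL, entrypoint string) (string, error) {
	payload, _ := json.Marshal(map[string]string{"entrypoint": entrypoint})
	resp, err := http.Post(dashboardURL+"/api/jobs/", "application/json",
		bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var submitted struct {
		JobID string `json:"job_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&submitted); err != nil {
		return "", err
	}

	// Poll the job status until it is terminal. A controller would requeue
	// here rather than sleep in a loop.
	for {
		r, err := http.Get(dashboardURL + "/api/jobs/" + submitted.JobID)
		if err != nil {
			return "", err
		}
		var job struct {
			Status string `json:"status"` // PENDING, RUNNING, SUCCEEDED, ...
		}
		decodeErr := json.NewDecoder(r.Body).Decode(&job)
		r.Body.Close()
		if decodeErr != nil {
			return "", decodeErr
		}
		switch job.Status {
		case "SUCCEEDED", "FAILED", "STOPPED":
			return job.Status, nil
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	// Hypothetical dashboard address and entrypoint, for illustration only.
	status, err := submitAndWait("http://mycluster-head-svc:8265", "python my_script.py")
	fmt.Println(status, err)
}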

@harryge00 (Contributor)

I have written a new design doc: https://docs.google.com/document/d/1z8IBoc0yWAPDe01Im2zKDg6NVkLSwj580Y5g4RwGLqg/edit?usp=sharing

Would you like to take a look? @simon-mo @chenk008

@simon-mo (Collaborator)

Looks like we are going with the separate CR and separate controller approach? @Jeffwan are you OK with this?

@DmitriGekhtman (Collaborator)

cc @akanso
Re: Operator architecture

cc @brucez-anyscale
Re: Services

@DmitriGekhtman (Collaborator)

I don't think there have been any objections yet to the multi-CR, multi-controller, single-operator approach. It makes sense to me.
WDYT @akanso and @Jeffwan?

@Jeffwan (Collaborator) commented May 24, 2022

I missed the message. Yes, as we discussed in the community meeting, separate controllers manage their own CRDs, and we bake everything into a single operator.

/cc @simon-mo @DmitriGekhtman
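
As a sketch of that architecture: one operator binary registers a controller per CRD against a single controller-runtime manager. The reconcilers below are stubs, the API import path is an assumption, and scheme registration is omitted for brevity:

// Sketch of the agreed architecture: separate controllers for RayCluster
// and RayJob, sharing one controller-runtime manager in a single operator
// binary. Not the actual KubeRay wiring; illustrative only.
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // assumed import path
)

// RayClusterReconciler reconciles RayCluster objects (pods, services, ...).
type RayClusterReconciler struct{ client.Client }

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil // cluster reconciliation logic goes here
}

// RayJobReconciler reconciles RayJob objects (submit, then poll status).
type RayJobReconciler struct{ client.Client }

func (r *RayJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil // job submission and status polling go here
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Separate controllers, single operator: each controller watches its
	// own CRD, but they share the manager, cache, and process.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayCluster{}).
		Complete(&RayClusterReconciler{mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayJob{}).
		Complete(&RayJobReconciler{mgr.GetClient()}); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}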

@DmitriGekhtman (Collaborator)

@harryge00 @Jeffwan @simon-mo @edoakes @brucez-anyscale @shrekris-anyscale

I think it would be prudent to drop this item from the 0.3.0 release, given that the implementation is in draft form.

Is that fine? We can include this in a future release.

@Jeffwan (Collaborator) commented Jul 19, 2022

The PR is merged, and we can close this issue.

Jeffwan closed this as completed on Jul 19, 2022