Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature][RayJob] Support light-weight job submission #1893

Merged

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 31, 2024

Why are these changes needed?

Some users are not willing to allocate additional compute resources for the submitter Kubernetes Job. Hence, this PR implements the light-weight job submission which uses a HTTP request to submit the Ray job.

Related issue number

Closes #1558

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

WithEntrypoint("python /home/ray/jobs/counter.py").
WithRuntimeEnvYAML(`
env_vars:
counter_name: test_counter
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: If the indentation is set to 4 spaces, the YAML string will fail to unmarshal correctly. Maybe we should verify whether runtimeEnvYAML is a valid YAML or not.

@@ -197,10 +198,10 @@ type RayJobInfo struct {
}

// RayJobRequest is the request body to submit.
// Reference to https://docs.ray.io/en/latest/cluster/jobs-package-ref.html#jobsubmissionclient.
// Reference to https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html#ray-job-rest-api-spec
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OpenAPI doc is currently manually written and might itself have bugs; it might be better to use the Python as the source of truth.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And if we notice something is off, we can update the OpenAPI doc)

@@ -222,6 +223,7 @@ type RayJobLogsResponse struct {
}

// Note that RayJobInfo and error can't be nil at the same time.
// Please make sure if the Ray job with JobId can't be found. Return a BadRequest error.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat unusual, but the function NewNotFound requires schema.GroupResource as a parameter.

func NewNotFound(qualifiedResource schema.GroupResource, name string) *StatusError

@@ -222,6 +223,7 @@ type RayJobLogsResponse struct {
}

// Note that RayJobInfo and error can't be nil at the same time.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: KubeRay uses many different methods to create errors. We should try to unify them.

  • "k8s.io/apimachinery/pkg/api/errors"
  • "github.com/pkg/errors"
  • fmt.Errorf

@kevin85421 kevin85421 changed the title WIP [Feature][RayJob] Support light-weight job submission Feb 1, 2024
@kevin85421 kevin85421 marked this pull request as ready for review February 1, 2024 00:09
@kevin85421
Copy link
Member Author

cc @andrewsykim @astefanutti

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API looks reasonable and the code looks good to me! In the accompanying Ray doc PR, we can add a brief explanation of the pros and cons of setting this parameter.

@@ -197,6 +202,15 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
rayDashboardClient.InitClient(rayJobInstance.Status.DashboardURL)
jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
// If the Ray job was not found, GetJobInfo returns a BadRequest error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I would expect it to return a 404 according to the doc (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/api.html#/paths/~1api~1jobs~1%7Bsubmission_id%7D/get), but the doc is not automatically generated and it might be inaccurate. Do you think it's worth changing the behavior in Ray to make it return 404? I don't think any other users have reported it, so maybe it's not important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Ray RESTful API will return 404, but the function GetJobInfo returns BadRequest here. The reason that I decide not to use NewNotFound at this moment is that it requires to pass schema.GroupResource into the function (example). We should figure out a better way for "errors". See #1893 (comment) for more details.

@@ -197,10 +198,10 @@ type RayJobInfo struct {
}

// RayJobRequest is the request body to submit.
// Reference to https://docs.ray.io/en/latest/cluster/jobs-package-ref.html#jobsubmissionclient.
// Reference to https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html#ray-job-rest-api-spec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OpenAPI doc is currently manually written and might itself have bugs; it might be better to use the Python as the source of truth.

// If LightWeightSubmissionMode is true, KubeRay operator sends a request to the RayCluster to create a
// Ray job. Otherwise, the operator creates a submitter Kubernetes Job to submit the Ray job.
// +kubebuilder:default:=false
LightWeightSubmissionMode bool `json:"lightWeightSubmissionMode,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's cleaner to enumerate the submission type instead of a bool.

type RayJobSpec struct {
  ...
  ...

  SubmissionMode JobSubmissionMode  
}

type JobSubmissionMode string

const (
    // Submit job via Kubernetes Job
    BatchJob JobSubmissionMode = "batchv1.Job"
    // Submit job via HTTP request
    HTTP JobSubmissionMode = "HTTP"
    ...
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on whether we anticipate more modes of job submission going forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Updated.

type JobSubmissionMode string

const (
K8sJobMode JobSubmissionMode = "K8sJobMode" // Submit job via Kubernetes Job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is the "Mode" part neeeed? I think it's implied from the field name. Maybe just K8sJob and HTTP is enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to add "Mode". Without it, users might find the documentation confusing, as they would have to guess whether "HTTP" refers to the submission mode or the protocol. In addition, "K8sJob" and "K8s Job" can also be confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"HTTP" refers to the submission mode or the protocol

Users will rarely see "HTTP" by itself though. They will usually see submissionMode: HTTP and submissionMode: batchv1.Job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will usually see submissionMode: HTTP and submissionMode: batchv1.Job.

This is reasonable in the YAML files and in the KubeRay codebase, but it may not be accurate for the Ray documentation.

ray-operator/apis/ray/v1/rayjob_types.go Outdated Show resolved Hide resolved
docs/reference/api.md Outdated Show resolved Hide resolved
@kevin85421 kevin85421 merged commit fab00b5 into ray-project:master Feb 1, 2024
23 checks passed
@kevin85421 kevin85421 mentioned this pull request Feb 2, 2024
4 tasks
ryanaoleary pushed a commit to ryanaoleary/kuberay that referenced this pull request Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] [RayJob] support light-weight submitting job inner controller as a optional mode
3 participants