Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Job submission support to the API server #1639

Merged
merged 20 commits into from
Nov 21, 2023

Conversation

blublinsky
Copy link
Contributor

Why are these changes needed?

API server already provides support for Ray Job, but its usage is not very convenient for Python developer. Its not easy to get execution log, stop running job, etc. This functionality is provided by Ray job submission API, but requires knowledge about endpoint and corresponding security when used outside the cluster. This implementation provides all the functionality through the API server (with the well known URL) and corresponding Python client

Related issue number

Checks

  • [x ] I've made sure the tests are passing.
  • Testing Strategy
    • [x ] Unit tests
    • Manual tests
    • This PR is not tested :(

@blublinsky
Copy link
Contributor Author

@z103cb @tedhtchang @astefanutti, pls take a look
cc @kevin85421

Copy link
Contributor

@tedhtchang tedhtchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I was able to follow the README to test the PR. There are some typos.

@@ -0,0 +1,211 @@
# Job Submission using API server

Ray provides very convinient and powerful [Job Submission APIs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html). The issue is that it needs to access cluster's dashboard Ingress, which currently does not have any security implementations. Alternatively you can use Ray cluster head service for this, but in this case your Ray job management code has to run in the same kubernetes cluster as Ray. Job submission implementation in the API server leverages already exposed URL of the API server to locate cluster and use its head service to implement Job management functionality. Because you can access API server running on the remote kubernetes cluster you can use these APIs for managing remote Ray clusters without exposing them via Ingress.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Ray provides very convinient and powerful [Job Submission APIs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html). The issue is that it needs to access cluster's dashboard Ingress, which currently does not have any security implementations. Alternatively you can use Ray cluster head service for this, but in this case your Ray job management code has to run in the same kubernetes cluster as Ray. Job submission implementation in the API server leverages already exposed URL of the API server to locate cluster and use its head service to implement Job management functionality. Because you can access API server running on the remote kubernetes cluster you can use these APIs for managing remote Ray clusters without exposing them via Ingress.
Ray provides very convenient and powerful [Job Submission APIs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html). The issue is that it needs to access cluster's dashboard Ingress, which currently does not have any security implementations. Alternatively you can use Ray cluster head service for this, but in this case your Ray job management code has to run in the same kubernetes cluster as Ray. Job submission implementation in the API server leverages already exposed URL of the API server to locate cluster and use its head service to implement Job management functionality. Because you can access API server running on the remote kubernetes cluster you can use these APIs for managing remote Ray clusters without exposing them via Ingress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky these doc changes don't show up as fixed on Github, can you double check on your end?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky These fixes aren't showing up as fixed on Github, can you double check on your end? Are they being overwritten somehow?


### Deploy KubeRay operator and API server

Reffer to [readme](README.md) for setting up KubRay operator and API server.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Reffer to [readme](README.md) for setting up KubRay operator and API server.
Refer to [readme](README.md) for setting up KubRay operator and API server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not fixed still? Also KubRay -> KubeRay

apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Outdated Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
@blublinsky
Copy link
Contributor Author

Thank you @tedhtchang. Can you, please, approve it

apiserver/JobSubmission.md Outdated Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Show resolved Hide resolved
apiserver/JobSubmission.md Outdated Show resolved Hide resolved
@blublinsky
Copy link
Contributor Author

@kevin85421, I think its ready for commit

@blublinsky
Copy link
Contributor Author

@kevin85421 just a reminder that this one is ready for commit

@kevin85421
Copy link
Member

@architkulkarni can review this PR. I will review it next week due to being overloaded this week.

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few minor questions.

@blublinsky From what I can see it seems like all of Ted's doc comments haven't been fixed actually (maybe forgot to push or something got messed up when merging?)

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some issues that we need to figure out before merging this PR. @architkulkarni is the expert on both Ray and KubeRay. @architkulkarni, could you collaborate with @blublinsky to solve these? Thanks!

(1) Some fields are required in RayJobInfo and RayJobRequest in Ray's definition. However, all fields in the struct are optional.
(2) Double-check the types between KubeRay's dashboard_httpclient.go and Ray. For example, the type of start_time is uint64 in Ray, but it is int64 in this PR, which may cause data loss.
(3) Some fields are missing in the dashboard client.
(4) Is there any change in compatibility between Ray and KubeRay?

@blublinsky
Copy link
Contributor Author

@kevin85421, @architkulkarni addressed your comments

@kevin85421
Copy link
Member

e2e / ray-operator fails. @blublinsky could you take a look?

@@ -352,7 +367,7 @@ func (r *RayDashboardClient) GetJobInfo(ctx context.Context, jobId string) (*Ray
defer resp.Body.Close()

if resp.StatusCode == http.StatusNotFound {
return nil, nil
return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change breaks the RayJob controller, and the Ray cluster dashboard check that's performed prior to submitting the job, and that's been added as part of #1512, with a work-around for #1381.

I think the easier option in the short term is to update the call from the RayJob controller to check the error (better be NotFound error).

@kevin85421 @architkulkarni what do you think, you guys probably have more context on what's the best strategy to handle this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @astefanutti. For the time being, I rolled back the changes (preserving them as comments). Lets merge this one and move the issue of the RayJob controller to check the error to a separate PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, rolling back sounds good so we can merge this PR. There is another PR in flight which is dealing with this issue: #1623

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky It looks like most of the doc feedback for JobSubmission.md has been commented as "fixed" but is not actually fixed in the file. I'm okay with merging this PR to unblock your other PRs, but please help address the doc feedback either in this PR or a future PR.

@blublinsky
Copy link
Contributor Author

@kevin85421, can we merge, please

@kevin85421
Copy link
Member

@architkulkarni would you mind answering #1639 (review) before we merge the PR? Thanks!

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) Some fields are required in RayJobInfo and RayJobRequest in Ray's definition. However, all fields in the struct are optional.
(2) Double-check the types between KubeRay's dashboard_httpclient.go and Ray. For example, the type of start_time is uint64 in Ray, but it is int64 in this PR, which may cause data loss.
(3) Some fields are missing in the dashboard client.
(4) Is there any change in compatibility between Ray and KubeRay?

(1) I have double checked them and indicated the necessary changes in the below comments.
(2) This was fixed in 4c12da6 and then un-fixed in 5bb51cc, it's not clear why. @blublinsky can you check this? I verified that a millisecond unix timestamp cannot overflow int64, but we might as well be consistent and use uint64.
(3) @kevin85421 which fields did you notice were missing? Are you talking about things like driver_node_id and driver_exit_code in JobInfo? These were missing before this PR and it's easy to add them later if needed. I don't think it needs to be part of this PR, but let me know if I misinterpreted your comment.
(4) No, the changes to the dashboard API client are all for Ray Job API methods that have been supported since Ray 2.2.0 (DELETE was the most recent one). KubeRay 1.1.0 only supports Ray 2.8.0 and later, so this is fine.

proto/job_submission.proto Show resolved Hide resolved
proto/job_submission.proto Show resolved Hide resolved
@blublinsky
Copy link
Contributor Author

I have double checked them and indicated the necessary changes in the below comments.
(2) This was fixed in 4c12da6 and then un-fixed in 5bb51cc, it's not clear why. @blublinsky can you check this? I verified that a millisecond unix timestamp cannot overflow int64, but we might as well be consistent and use uint64

These fields were already defined as int64. I did not introduce them. We can change them, but it creates more changes outside this PR

@architkulkarni
Copy link
Contributor

We defined as required only the data that user provides data, doing it for data coming from the server does not make sense to me

Ok, that makes sense. Since we're not sending this data to the server there's no downside to leaving it as optional.

These fields were already defined as int64. I did not introduce them. We can change them, but it creates more changes outside this PR

That's fine with me, we can update it in a separate PR.

@kevin85421 thoughts? Is it okay to merge?

@architkulkarni
Copy link
Contributor

Discussed offline with @kevin85421 , it's okay to merge. The failing test Compatibility Test - Nightly (pull_request) is flaky and unrelated (FAIL: test_detached_actor (main.RayFTTestCase.test_detached_actor))

@architkulkarni architkulkarni merged commit 5c0e2e9 into ray-project:master Nov 21, 2023
24 of 25 checks passed
@blublinsky blublinsky deleted the apiserver-jobsubmission branch November 22, 2023 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants