Added Job submission support to the API server #1639
Conversation
@z103cb @tedhtchang @astefanutti, pls take a look
Looks good. I was able to follow the README to test the PR. There are some typos.
@@ -0,0 +1,211 @@
# Job Submission using API server
Ray provides very convinient and powerful [Job Submission APIs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html). The issue is that it needs to access cluster's dashboard Ingress, which currently does not have any security implementations. Alternatively you can use Ray cluster head service for this, but in this case your Ray job management code has to run in the same kubernetes cluster as Ray. Job submission implementation in the API server leverages already exposed URL of the API server to locate cluster and use its head service to implement Job management functionality. Because you can access API server running on the remote kubernetes cluster you can use these APIs for managing remote Ray clusters without exposing them via Ingress.
Suggested change: convinient → convenient
Fixed
@blublinsky these doc changes don't show up as fixed on Github, can you double check on your end?
@blublinsky These fixes aren't showing up as fixed on Github, can you double check on your end? Are they being overwritten somehow?
### Deploy KubeRay operator and API server
Reffer to [readme](README.md) for setting up KubRay operator and API server.
Suggested change: Reffer → Refer
fixed
Seems not fixed still? Also KubRay -> KubeRay
Thank you @tedhtchang. Can you please approve it?
@kevin85421, I think it's ready for commit
@kevin85421 just a reminder that this one is ready for commit |
Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
This PR adds e2e test following up ray-project#1539 and leverages ray-project#1575. It also factorises configuration across the RayJob e2e tests.
Signed-off-by: Boris Lublinsky <blublinsky@hotmail.com>
@architkulkarni can review this PR. I will review it next week due to being overloaded this week.
Looks good overall, just a few minor questions.
@blublinsky From what I can see it seems like all of Ted's doc comments haven't been fixed actually (maybe forgot to push or something got messed up when merging?)
Resolved review threads on clients/python-apiserver-client/python_apiserver_client/params/jobsubmission.py
There are some issues that we need to figure out before merging this PR. @architkulkarni is the expert on both Ray and KubeRay. @architkulkarni, could you collaborate with @blublinsky to solve these? Thanks!
(1) Some fields are required in `RayJobInfo` and `RayJobRequest` in Ray's definition. However, all fields in the struct are optional.
(2) Double-check the types between KubeRay's `dashboard_httpclient.go` and Ray. For example, the type of `start_time` is `uint64` in Ray, but it is `int64` in this PR, which may cause data loss.
(3) Some fields are missing in the dashboard client.
(4) Is there any change in compatibility between Ray and KubeRay?
@kevin85421, @architkulkarni addressed your comments
@@ -352,7 +367,7 @@ func (r *RayDashboardClient) GetJobInfo(ctx context.Context, jobId string) (*Ray
 	defer resp.Body.Close()

 	if resp.StatusCode == http.StatusNotFound {
-		return nil, nil
+		return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster")
That change breaks the RayJob controller and the Ray cluster dashboard check that is performed prior to submitting the job, which was added as part of #1512 as a work-around for #1381.
I think the easier option in the short term is to update the call from the RayJob controller to check the error (better be NotFound error).
@kevin85421 @architkulkarni what do you think, you guys probably have more context on what's the best strategy to handle this?
Thank you @astefanutti. For the time being, I rolled back the changes (preserving them as comments). Let's merge this one and move the issue of having the RayJob controller check the error to a separate PR.
Yes, rolling back sounds good so we can merge this PR. There is another PR in flight which is dealing with this issue: #1623
@blublinsky It looks like most of the doc feedback for JobSubmission.md has been commented as "fixed" but is not actually fixed in the file. I'm okay with merging this PR to unblock your other PRs, but please help address the doc feedback either in this PR or a future PR.
Resolved review thread on clients/python-apiserver-client/python_apiserver_client/params/jobsubmission.py
@kevin85421, can we merge, please?
@architkulkarni would you mind answering #1639 (review) before we merge the PR? Thanks!
> (1) Some fields are required in `RayJobInfo` and `RayJobRequest` in Ray's definition. However, all fields in the struct are optional.
> (2) Double-check the types between KubeRay's `dashboard_httpclient.go` and Ray. For example, the type of `start_time` is `uint64` in Ray, but it is `int64` in this PR, which may cause data loss.
> (3) Some fields are missing in the dashboard client.
> (4) Is there any change in compatibility between Ray and KubeRay?
(1) I have double checked them and indicated the necessary changes in the below comments.
(2) This was fixed in 4c12da6 and then un-fixed in 5bb51cc, it's not clear why. @blublinsky can you check this? I verified that a millisecond unix timestamp cannot overflow `int64`, but we might as well be consistent and use `uint64`.
(3) @kevin85421 which fields did you notice were missing? Are you talking about things like `driver_node_id` and `driver_exit_code` in `JobInfo`? These were missing before this PR and it's easy to add them later if needed. I don't think it needs to be part of this PR, but let me know if I misinterpreted your comment.
(4) No, the changes to the dashboard API client are all for Ray Job API methods that have been supported since Ray 2.2.0 (DELETE was the most recent one). KubeRay 1.1.0 only supports Ray 2.8.0 and later, so this is fine.
These fields were already defined as `int64`. I did not introduce them. We can change them, but it creates more changes outside this PR.
Ok, that makes sense. Since we're not sending this data to the server there's no downside to leaving it as optional.
That's fine with me, we can update it in a separate PR. @kevin85421 thoughts? Is it okay to merge?
Discussed offline with @kevin85421, it's okay to merge. The failing test Compatibility Test - Nightly (pull_request) is flaky and unrelated (FAIL: test_detached_actor (main.RayFTTestCase.test_detached_actor))
Why are these changes needed?
The API server already provides support for Ray Jobs, but its usage is not very convenient for Python developers. It's not easy to get execution logs, stop a running job, etc. This functionality is provided by the Ray job submission API, but that requires knowledge of the endpoint and the corresponding security setup when used outside the cluster. This implementation provides all of that functionality through the API server (with its well-known URL) and a corresponding Python client.
Related issue number
Checks