-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug][RayJob] Check dashboard readiness before creating job pod (#1381) #1429
[Bug][RayJob] Check dashboard readiness before creating job pod (#1381) #1429
Conversation
@@ -200,6 +200,15 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
rayDashboardClient := utils.GetRayDashboardClientFunc() | |||
rayDashboardClient.InitClient(clientURL) | |||
|
|||
// Check the dashboard readiness by checking the err return from rayDashboardClient.GetJobInfo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need to send a request to check the dashboard's readiness in each reconciliation. We can only check it when no Kubernetes Job exists.
if errors.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think line 331 is too far from the dashboard related code. Inserting readiness check there might be a little confusing.
What about checking the readiness right before setting the rayJobInstance.Status.DashboardURL
? And we don't check it again if rayJobInstance.Status.DashboardURL
is set.
If you plan not to add tests here, would you mind doing some experiments with and without this PR to show that this PR actually relieves this situation? |
d63177d
to
b0613b8
Compare
b0613b8
to
8e68570
Compare
Actually, I have managed to add tests. In order to test the situation, I have done:
In addition, the
Therefore, I have updated the |
…lar to those described in (ray-project#1381)
Hi @kevin85421, As we discussed offline, I have added an In my own experience of running end-to-end tests on my laptop with and without the patch to check the readiness of the dashboard, I confirm that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, nice tests! Will merge after @kevin85421's approval
@@ -194,12 +194,19 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusWaitForDashboard, err) | |||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | |||
} | |||
// Check the dashboard readiness by checking the err return from rayDashboardClient.GetJobInfo. | |||
// Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound.
A bit confused by this comment -- this means if we get http.StatusNotFound we'll proceed with the job submission? When does this situation happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this situation happen?
http.StatusNotFound
is 404. We should not proceed with the job submission if the dashboard URL is 404 not found.
kuberay/ray-operator/controllers/ray/utils/dashboard_httpclient.go
Lines 354 to 356 in 9265154
if resp.StatusCode == http.StatusNotFound { | |
return nil, nil | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so it sounds like the PR in its current state needs to be updated.
Should we update the dashboard client to return an error in the case of 404, or should we check if the response is 404 on the job controller side?
My guess is that updating it in the job controller will have fewer unexpected side effects, but it still seems a bit weird that the HTTP client returns nil, nil
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update the dashboard client to return an error in the case of 404?
This seems to be more make sense to me. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this makes more sense. We should just check that it doesn't break anything for other callers, but I don't think it will
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok for GetJobInfo to return nil, nil since callers can check the NotFound situation by checking the first return value is nil or not, and they should always do.
With that in mind, returning NotFound as an error seems a little bit redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok for GetJobInfo to return nil, nil since callers can check the NotFound situation by checking the first return value is nil or not, and they should always do.
Yes, users can examine the first returned value, but it seems somewhat irregular not to throw an error when the retrieval of job information fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Just note that some people may consider the NotFound as a successful result and I believe that is the reason why the original author chose to return nil err in the GetJobInfo.
Should I change its behavior in this PR? Or should I open another issue for it first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just note that some people may consider the NotFound as a successful result and I believe that is the reason why the original author chose to return nil err in the GetJobInfo.
If the function is DeleteJobInfo
, it makes sense to interpret a 404 as a successful result, as the job may have been deleted by the previous reconciliation. However, I cannot conceive of any justification for deeming a 404 in GetJobInfo as success. Can you think of any specific reasons?
Should I change its behavior in this PR? Or should I open another issue for it first?
I prefer not to include the change in this PR. We can open an issue to track the progress and discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I cannot conceive of any justification for deeming a 404 in GetJobInfo as success. Can you think of any specific reasons?
It is just another standpoint of designing API behavior, regardless of API naming, saying "Hey, I have finished what you asked for successfully without any error. And the result is no result".
// Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound. | ||
// This check is a workaround for https://github.com/ray-project/kuberay/issues/1381. | ||
rayDashboardClient.InitClient(clientURL) | ||
if _, err = rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity why use getJobInfo(jobID)? The job hasn't been submitted yet, so I guess there won't be any jobInfo available yet. Is this just an arbitrary HTTP request to make sure the server is working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just an arbitrary HTTP request to make sure the server is working?
Yes. Can we conclude that the dashboard is ready for the ray job submit
command if GetJobInfo
succeeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's a fair conclusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is just an arbitrary HTTP request to make sure the server is working.
So it is ok to go proceed with the job submission if we get 404 from the GetJobInfo here. That means the dashboard is working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it is ok to go proceed with the job submission if we get 404 from the GetJobInfo here. That means the dashboard is working.
Does this imply that a 404 response will be returned if the JobID does not exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not. But a 404 response implies the dashboard server is up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for the contribution! @rueian would you mind opening an issue for #1429 (comment)?
Thanks! I opened here #1453 |
…project#1381) (ray-project#1429) * [Bug][RayJob] Check dashboard readiness before creating job pod (ray-project#1381) * [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (ray-project#1381)
Why are these changes needed?
As discussed in the #1381 (comment), I added a readiness check to the dashboard server as a workaround before the release of KubeRay 1.0.0.
I didn't add new tests for these changes because it was hard to mock the
rayDashboardClient
accordingly to simulate the target behavior. But I did make sure these changes can pass the current tests.To further add tests for these, I would suggest adding mocking functionalities into the
FakeRayDashboardClient
first. For example:Related issue number
#1381
Checks