Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug][RayJob] Check dashboard readiness before creating job pod (#1381) #1429

Merged
merged 2 commits into from
Sep 26, 2023

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Sep 16, 2023

Why are these changes needed?

As discussed in the #1381 (comment), I added a readiness check to the dashboard server as a workaround before the release of KubeRay 1.0.0.

I didn't add new tests for these changes because it was hard to mock the rayDashboardClient accordingly to simulate the target behavior. But I did make sure these changes can pass the current tests.

To further add tests for these, I would suggest adding mocking functionalities into the FakeRayDashboardClient first. For example:

func (r *FakeRayDashboardClient) GetJobInfo(ctx context.Context, jobId string) (*RayJobInfo, error) {
	return r.GetJobInfoMock(ctx, jobId)
}

// setup mock
ts := time.Now()
r.GetJobInfoMock = func(ctx context.Context, jobId string) (*RayJobInfo, error) {
	if time.Since(ts) < time.Second*10 {
		return nil, errors.New("dashboard is not ready")
	}
	return nil, nil
}

Related issue number

#1381

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -200,6 +200,15 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
rayDashboardClient := utils.GetRayDashboardClientFunc()
rayDashboardClient.InitClient(clientURL)

// Check the dashboard readiness by checking the err return from rayDashboardClient.GetJobInfo.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to send a request to check the dashboard's readiness in each reconciliation. We can only check it when no Kubernetes Job exists.

if errors.IsNotFound(err) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line 331 is too far from the dashboard related code. Inserting readiness check there might be a little confusing.

What about checking the readiness right before setting the rayJobInstance.Status.DashboardURL? And we don't check it again if rayJobInstance.Status.DashboardURL is set.

@kevin85421
Copy link
Member

If you plan not to add tests here, would you mind doing some experiments with and without this PR to show that this PR actually relieves this situation?

@rueian rueian force-pushed the fix-rayjob-dashboard-readiness branch from d63177d to b0613b8 Compare September 17, 2023 03:43
@rueian rueian force-pushed the fix-rayjob-dashboard-readiness branch from b0613b8 to 8e68570 Compare September 17, 2023 03:44
@rueian
Copy link
Contributor Author

rueian commented Sep 17, 2023

If you plan not to add tests here, would you mind doing some experiments with and without this PR to show that this PR actually relieves this situation?

Actually, I have managed to add tests.

In order to test the situation, I have done:

  1. Adding the GetJobInfoMock field, which is an atomic.Pointer, to the FakeRayDashboardClient
  2. Introducing a new WaitForDashboardReady JobDeploymentStatus to test for.
  3. Writing test cases to test the transition of the WaitForDashboardReady status and rayJobInstance.Status.DashboardURL before and after mocking.

In addition, the go vet complained about copying atomic value in rayservice_controller_test.go after adding the GetJobInfoMock mock field:

controllers/ray/rayservice_controller_test.go:656:9: return copies lock value: github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.FakeRayDashboardClient contains sync/atomic.Pointer[func(_ context.Context, jobId string) (*github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.RayJobInfo, error)] contains sync/atomic.noCopy
make: *** [vet] Error 1

Therefore, I have updated the rayservice_controller_test.go as well.

@rueian
Copy link
Contributor Author

rueian commented Sep 25, 2023

Hi @kevin85421,

As we discussed offline, I have added an expected_job_pods = 1 check into the current end-to-end test to catch future bugs that are similar to those described in #1381.

In my own experience of running end-to-end tests on my laptop with and without the patch to check the readiness of the dashboard, I confirm that the first job submission pod has failed situation happened about ~20% of the time, like @architkulkarni said, without the patch. While with the patch, the situation never happened again.

@architkulkarni architkulkarni self-assigned this Sep 25, 2023
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, nice tests! Will merge after @kevin85421's approval

@@ -194,12 +194,19 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusWaitForDashboard, err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
// Check the dashboard readiness by checking the err return from rayDashboardClient.GetJobInfo.
// Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound.

A bit confused by this comment -- this means if we get http.StatusNotFound we'll proceed with the job submission? When does this situation happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this situation happen?

http.StatusNotFound is 404. We should not proceed with the job submission if the dashboard URL is 404 not found.

if resp.StatusCode == http.StatusNotFound {
return nil, nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so it sounds like the PR in its current state needs to be updated.

Should we update the dashboard client to return an error in the case of 404, or should we check if the response is 404 on the job controller side?

My guess is that updating it in the job controller will have fewer unexpected side effects, but it still seems a bit weird that the HTTP client returns nil, nil in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update the dashboard client to return an error in the case of 404?

This seems to be more make sense to me. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this makes more sense. We should just check that it doesn't break anything for other callers, but I don't think it will

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok for GetJobInfo to return nil, nil since callers can check the NotFound situation by checking the first return value is nil or not, and they should always do.

With that in mind, returning NotFound as an error seems a little bit redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok for GetJobInfo to return nil, nil since callers can check the NotFound situation by checking the first return value is nil or not, and they should always do.

Yes, users can examine the first returned value, but it seems somewhat irregular not to throw an error when the retrieval of job information fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Just note that some people may consider the NotFound as a successful result and I believe that is the reason why the original author chose to return nil err in the GetJobInfo.

Should I change its behavior in this PR? Or should I open another issue for it first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note that some people may consider the NotFound as a successful result and I believe that is the reason why the original author chose to return nil err in the GetJobInfo.

If the function is DeleteJobInfo, it makes sense to interpret a 404 as a successful result, as the job may have been deleted by the previous reconciliation. However, I cannot conceive of any justification for deeming a 404 in GetJobInfo as success. Can you think of any specific reasons?

Should I change its behavior in this PR? Or should I open another issue for it first?

I prefer not to include the change in this PR. We can open an issue to track the progress and discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I cannot conceive of any justification for deeming a 404 in GetJobInfo as success. Can you think of any specific reasons?

It is just another standpoint of designing API behavior, regardless of API naming, saying "Hey, I have finished what you asked for successfully without any error. And the result is no result".

// Note that rayDashboardClient.GetJobInfo returns no error in the case of http.StatusNotFound.
// This check is a workaround for https://github.com/ray-project/kuberay/issues/1381.
rayDashboardClient.InitClient(clientURL)
if _, err = rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity why use getJobInfo(jobID)? The job hasn't been submitted yet, so I guess there won't be any jobInfo available yet. Is this just an arbitrary HTTP request to make sure the server is working?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just an arbitrary HTTP request to make sure the server is working?

Yes. Can we conclude that the dashboard is ready for the ray job submit command if GetJobInfo succeeds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's a fair conclusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is just an arbitrary HTTP request to make sure the server is working.

So it is ok to go proceed with the job submission if we get 404 from the GetJobInfo here. That means the dashboard is working.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is ok to go proceed with the job submission if we get 404 from the GetJobInfo here. That means the dashboard is working.

Does this imply that a 404 response will be returned if the JobID does not exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not. But a 404 response implies the dashboard server is up.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for the contribution! @rueian would you mind opening an issue for #1429 (comment)?

@kevin85421 kevin85421 merged commit 5da4a04 into ray-project:master Sep 26, 2023
21 of 22 checks passed
@rueian
Copy link
Contributor Author

rueian commented Sep 27, 2023

LGTM! Thank you for the contribution! @rueian would you mind opening an issue for #1429 (comment)?

Thanks! I opened here #1453

kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Oct 17, 2023
…project#1381) (ray-project#1429)

* [Bug][RayJob] Check dashboard readiness before creating job pod (ray-project#1381)

* [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (ray-project#1381)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants