[Bug][RayJob] Avoid nil pointer dereference #1756

Merged 1 commit on Dec 15, 2023

@kevin85421 kevin85421 commented Dec 15, 2023

Why are these changes needed?

Reproduce the error:

  • Step 1: Deploy a KubeRay operator without this PR. I recommend using the nightly image [1].
  • Step 2: Create a RayJob using this YAML. It is configured to sleep for 1000s, giving you ample time to terminate the head Pod while the job is running.
  • Step 3: Delete the head Pod after the job is submitted.
  • Step 4: If the submitter Pod reaches the backoffLimit before the new head Pod is ready, the KubeRay operator will crash after the head Pod is restored. The crash occurs because, when the job is not found, both jobInfo and the error are returned as nil. Consequently, at L274, isJobPendingOrRunning(jobInfo.JobStatus) dereferences a nil pointer.

[1] To easily reproduce the issue, I recommend using the nightly image, as its backoffLimit is smaller than that of v1.0.0. Hence, "the submitter Pod reaches the backoffLimit before the new head Pod is ready" is more likely to happen.
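
The failure mode in Step 4 can be sketched in isolation. This is a minimal illustration, not the operator's code: `JobInfo`, `getJobInfo`, and `reconcileOnce` are hypothetical stand-ins, and the deferred `recover` exists only so the sketch can report the panic instead of terminating.

```go
package main

import "fmt"

// JobInfo is a hypothetical, trimmed-down stand-in for the job info
// returned by the dashboard client.
type JobInfo struct {
	JobStatus string
}

// getJobInfo mimics the pre-fix behavior described above: when the job
// is not found, it returns a nil *JobInfo together with a nil error.
func getJobInfo(found bool) (*JobInfo, error) {
	if !found {
		return nil, nil
	}
	return &JobInfo{JobStatus: "RUNNING"}, nil
}

// reconcileOnce checks only the error and then dereferences jobInfo.
// The deferred recover turns the resulting panic into an error so the
// sketch keeps running (assumption: added for demonstration only).
func reconcileOnce(found bool) (status string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	jobInfo, err := getJobInfo(found)
	if err != nil {
		return "", err
	}
	// Panics when jobInfo is nil, because err == nil does not imply jobInfo != nil.
	return jobInfo.JobStatus, nil
}

func main() {
	status, err := reconcileOnce(true)
	fmt.Println(status, err)

	status, err = reconcileOnce(false)
	fmt.Println(status, err)
}
```

Without the `recover`, the second call would terminate the process, which corresponds to the operator crash described in Step 4.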

Screen Shot 2023-12-15 at 1 22 54 PM

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

With this PR, the operator will return an error instead of crashing.

Screen Shot 2023-12-15 at 1 35 30 PM

@@ -271,9 +266,6 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
// Job may takes long time to start and finish, let's just periodically requeue the job and check status.
}
- if isJobPendingOrRunning(jobInfo.JobStatus) {

This is redundant. You can check L269 to L300 with this PR.

@@ -301,8 +293,9 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
}

- // TODO (kevin85421): Use the source of truth `jobInfo.JobStatus` instead.
- if isJobPendingOrRunning(rayJobInstance.Status.JobStatus) {
+ if isJobPendingOrRunning(jobInfo.JobStatus) {

jobInfo is the source of truth.

- // This does the right thing, but breaks E2E test
- // return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster")
- return nil, nil
+ return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster")

Avoid RayJobInfo and error being nil at the same time.

@kevin85421 kevin85421 marked this pull request as ready for review December 15, 2023 19:58
@kevin85421

Update the PR description. Merge this PR.

@kevin85421 kevin85421 merged commit 664b19a into ray-project:master Dec 15, 2023
25 checks passed