-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug][RayJob] Avoid nil pointer dereference #1756
[Bug][RayJob] Avoid nil pointer dereference #1756
Conversation
@@ -271,9 +266,6 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil | |||
// Job may takes long time to start and finish, let's just periodically requeue the job and check status. | |||
} | |||
if isJobPendingOrRunning(jobInfo.JobStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant. You can check L269 to L300 with this PR.
@@ -301,8 +293,9 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
} | |||
|
|||
// TODO (kevin85421): Use the source of truth `jobInfo.JobStatus` instead. | |||
if isJobPendingOrRunning(rayJobInstance.Status.JobStatus) { | |||
if isJobPendingOrRunning(jobInfo.JobStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jobInfo
is the source of truth.
// This does the right thing, but breaks E2E test | ||
// return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster") | ||
return nil, nil | ||
return nil, errors.NewBadRequest("Job " + jobId + " does not exist on the cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid RayJobInfo
and error
being nil at the same time.
Update the PR description. Merge this PR. |
Why are these changes needed?
Reproduce the error:
backoffLimit
before the new head Pod is ready, the KubeRay operator will fail after the head Pod is restored. This failure occurs because if the job is not found, bothjobInfo
anderror
will return asnil
. Consequently, at L274,isJobPendingOrRunning(jobInfo.JobStatus)
will attempt to dereference anil
pointer.[1] To easily reproduce the issue, I recommend using the nightly image, as its
backoffLimit
is smaller than that of v1.0.0. Hence, "the submitter Pod reaches thebackoffLimit
before the new head Pod is ready" is more likely to happen.Related issue number
Checks
With this PR, the operator will throw an exception instead of crashing.