[Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running #1583

Merged 6 commits into master on Oct 31, 2023

Conversation

@architkulkarni (Contributor) commented Oct 30, 2023:

The job reconciler pings the job status endpoint every 3 seconds. Previously, if this request failed at any point during job execution, the job deployment status was set to FailedToGetJobStatus and was never updated back to JobDeploymentStatusRunning, because, due to an oversight, the existing code only updated the JobDeploymentStatus when the JobStatus changed. So in the case where the JobStatus does not change (e.g. "RUNNING" -> "RUNNING") but there is an intermittent failure to get the job status, the JobDeploymentStatus stays stuck at JobDeploymentStatusFailedToGetJobStatus instead of returning to JobDeploymentStatusRunning.

This PR fixes the bug by explicitly updating the deployment status back to JobDeploymentStatusRunning whenever the reconcile loop successfully retrieves a JobStatus while the previous deployment status is FailedToGetJobStatus.
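A minimal sketch of the added transition, using a hypothetical helper and local stand-ins for KubeRay's status constants (the actual reconciler code in ray-operator differs):

package main

import "fmt"

// Local stand-ins for KubeRay's JobDeploymentStatus constants (illustration only).
const (
	deploymentRunning              = "Running"
	deploymentFailedToGetJobStatus = "FailedToGetJobStatus"
)

// nextDeploymentStatus is a hypothetical helper that captures the transition this
// PR adds: once a job status is fetched successfully again, a previous
// FailedToGetJobStatus is reset to Running, even when the underlying JobStatus
// (e.g. "RUNNING" -> "RUNNING") did not change.
func nextDeploymentStatus(current string, fetchErr error) string {
	if fetchErr != nil {
		// The dashboard could not be reached; surface the failure.
		return deploymentFailedToGetJobStatus
	}
	if current == deploymentFailedToGetJobStatus {
		// The transient failure has resolved; go back to Running.
		return deploymentRunning
	}
	return current
}

func main() {
	fmt.Println(nextDeploymentStatus(deploymentFailedToGetJobStatus, nil)) // prints "Running"
}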

Testing: unit test

Why are these changes needed?

Related issue number

Closes #1489

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

	err = r.updateState(ctx, rayJobInstance, jobInfo, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusFailedToGetJobStatus, err)
	// Dashboard service in head pod takes time to start, it's possible we get connection refused error.
	// Requeue after few seconds to avoid continuous connection errors.
	return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

// Update RayJob.Status (Kubernetes CR) from Ray Job Status from Dashboard service
if jobInfo != nil && jobInfo.JobStatus != rayJobInstance.Status.JobStatus {
@architkulkarni (Contributor, Author) commented on this code:

In the bug reproduction by killing the head pod, the previous and current JobStatus were both RUNNING so we wouldn't go into this branch.

@kevin85421 kevin85421 changed the title [Bugfix] [RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bugfix][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic Oct 30, 2023
@kevin85421 kevin85421 changed the title [Bugfix][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic Oct 30, 2023
@kevin85421 (Member) left a comment:

There is another question for the issue #1489. Does KubeRay respect TTL when the status is JobDeploymentStatusFailedToGetJobStatus?

@architkulkarni (Contributor, Author) replied:

> There is another question for the issue #1489. Does KubeRay respect TTL when the status is JobDeploymentStatusFailedToGetJobStatus?

@kevin85421 No, the shutdown logic is only entered if the JobStatus is SUCCEEDED or FAILED. The TTL check is currently done by comparing a timestamp against JobInfo.end_time, which is populated by Ray and retrieved via HTTP.

I will create a follow-up issue to respect TTL in the case where the JobStatus cannot be retrieved.
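For reference, a rough sketch of the check described above, with hypothetical names (KubeRay's actual implementation and field names may differ):

package main

import "fmt"

// shouldShutdown sketches the behavior described above using hypothetical names:
// the shutdown/TTL path only runs for terminal job statuses, so a job whose
// status cannot be retrieved never reaches the TTL comparison against
// JobInfo.end_time.
func shouldShutdown(jobStatus string, endTimeMs, ttlSeconds, nowMs int64) bool {
	if jobStatus != "SUCCEEDED" && jobStatus != "FAILED" {
		return false // also covers the case where the status could not be retrieved
	}
	return nowMs >= endTimeMs+ttlSeconds*1000
}

func main() {
	// TTL elapsed after a successful job: shutdown proceeds.
	fmt.Println(shouldShutdown("SUCCEEDED", 1_000, 30, 40_000)) // true
	// Status unknown: shutdown is skipped regardless of elapsed time.
	fmt.Println(shouldShutdown("", 1_000, 30, 40_000)) // false
}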

@kevin85421 (Member) commented:

Could you please open an issue to track the TTL issue? Thanks! Because this PR does not have tests, I will clone your fork to test it manually.

@kevin85421 (Member) commented:

Manual test

  1. Deploy a job which sleeps 3000s
  2. Kill the head pod
  3. The job automatically retries, and the status is successfully updated to Running (Previously it would be stuck at FailedToGetJobStatus)

I followed the reproduction script. When I killed the head Pod, the submitter K8s Job reached its backoffLimit within 3 seconds. Therefore, when the head Pod became ready again, the submitter K8s Job would no longer submit the Ray job.
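For reference, a hypothetical command sequence for the reproduction above (manifest and Pod names are placeholders):

# Deploy a RayJob whose entrypoint sleeps for 3000 seconds (manifest name assumed).
kubectl apply -f rayjob-sample.yaml

# Find and delete the head Pod to simulate a failure.
kubectl get pods
kubectl delete pod <rayjob-sample-raycluster-xxxxx-head-xxxxx>

# Watch the RayJob's Job Deployment Status as the cluster recovers.
kubectl describe rayjob rayjob-sample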

@kevin85421 (Member) commented:

Discussed with @architkulkarni offline. The reproduction in the PR description cannot test this PR properly. Hence, Archit will work on tests or reproduction scripts that simulate: (1) the head Pod staying healthy the whole time, and (2) a status check request from KubeRay being dropped due to a network issue.

@architkulkarni (Contributor, Author) commented:

@kevin85421 Adding mocks for an end-to-end test in CI turned out to be cumbersome. I also tried a manual test using a NetworkPolicy to block the GetJobInfo requests, but I couldn't successfully block them (even though the NetworkPolicy did block requests from a debug sidecar on the Ray operator pod...).

For now I have added a unit test; let me know if this is sufficient.

@architkulkarni architkulkarni changed the title [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running Oct 31, 2023
@kevin85421 (Member) commented:

Unit tests are not sufficient to prove this PR's correctness, but I am OK with not adding e2e tests and testing this PR manually instead, because (1) we plan to refactor the RayJob codebase after the v1.0.0 release and (2) this PR is a release blocker for KubeRay v1.0.0.

> I also tried a manual test using a NetworkPolicy to block the GetJobInfo requests, but I couldn't successfully block them (even though the NetworkPolicy did block requests from a debug sidecar on the Ray operator pod...).

Wow. I will try it.

@kevin85421 (Member) commented Oct 31, 2023:

@architkulkarni Could you try to use iptables? Thanks!

# Add the following securityContext to the Ray head configuration in your RayJob YAML.
securityContext:
  capabilities:
    add: ['NET_ADMIN']

# Log in to the head Pod and install `curl` and `iptables`.
sudo apt-get update; sudo apt-get install -y curl iptables

# Try to connect to the Ray dashboard
curl -X GET 127.0.0.1:8265
# <!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.1f147255.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

# Drop all incoming packets to the dashboard port.
sudo iptables -A INPUT -p tcp --dport 8265 -j DROP

# Try to connect to the Ray dashboard again.
curl -X GET 127.0.0.1:8265
# curl: (28) Failed to connect to 127.0.0.1 port 8265: Connection timed out

@architkulkarni (Contributor, Author) commented Oct 31, 2023:

@kevin85421 Thanks for the manual test instructions! They work, and the fix behaves correctly.

After dropping packets to the dashboard port and waiting about 1-2 minutes:

Status:
  Dashboard URL:          rayjob-sample-raycluster-plc9l-head-svc.default.svc.cluster.local:8265
  Job Deployment Status:  FailedToGetJobStatus
  Job Id:                 rayjob-sample-hrmmr
  Job Status:             RUNNING
  Message:                Job is currently running.
  Observed Generation:    2

After deleting the blocking rule with sudo iptables -D INPUT -p tcp --dport 8265 -j DROP:

Status:
  Dashboard URL:          rayjob-sample-raycluster-plc9l-head-svc.default.svc.cluster.local:8265
  Job Deployment Status:  Running
  Job Id:                 rayjob-sample-hrmmr
  Job Status:             RUNNING
  Message:                Job is currently running.
  Observed Generation:    2

kevin85421 merged commit 5a974fc into master on Oct 31, 2023
23 checks passed
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Nov 2, 2023
kevin85421 deleted the fix-rayjob-hang-status branch on December 27, 2023.
Closes: [Bug] RayJob hang with incorrect status when head pod restarts (#1489)