[Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running #1583

Merged 6 commits into master on Oct 31, 2023

Conversation

@architkulkarni (Contributor) commented Oct 30, 2023:

The job reconciler pings the job status endpoint every 3 seconds. Previously, if this request failed at any point during job execution, the job deployment status was set to FailedToGetJobStatus and was never updated back to JobDeploymentStatusRunning, because, due to an oversight, the existing code only updated the JobDeploymentStatus when the JobStatus changed. So in the case where the JobStatus does not change (e.g. "RUNNING" -> "RUNNING") but there is an intermittent failure to get the job status, the JobDeploymentStatus stays stuck at JobDeploymentStatusFailedToGetJobStatus instead of returning to JobDeploymentStatusRunning.

This PR fixes the bug by explicitly updating the deployment status back to JobDeploymentStatusRunning whenever the reconcile loop successfully retrieves a JobStatus while the previous deployment status is FailedToGetJobStatus.
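A minimal sketch of the added transition, using a hypothetical helper and local stand-ins for KubeRay's status constants (the actual reconciler code in ray-operator differs):

package main

import "fmt"

// Local stand-ins for KubeRay's JobDeploymentStatus constants (illustration only).
const (
	deploymentRunning              = "Running"
	deploymentFailedToGetJobStatus = "FailedToGetJobStatus"
)

// nextDeploymentStatus is a hypothetical helper that captures the transition this
// PR adds: once a job status is fetched successfully again, a previous
// FailedToGetJobStatus is reset to Running, even when the underlying JobStatus
// (e.g. "RUNNING" -> "RUNNING") did not change.
func nextDeploymentStatus(current string, fetchErr error) string {
	if fetchErr != nil {
		// The dashboard could not be reached; surface the failure.
		return deploymentFailedToGetJobStatus
	}
	if current == deploymentFailedToGetJobStatus {
		// The transient failure has resolved; go back to Running.
		return deploymentRunning
	}
	return current
}

func main() {
	fmt.Println(nextDeploymentStatus(deploymentFailedToGetJobStatus, nil)) // prints "Running"
}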

Testing: unit test

Why are these changes needed?

Related issue number

Closes #1489

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

	err = r.updateState(ctx, rayJobInstance, jobInfo, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusFailedToGetJobStatus, err)
	// Dashboard service in head pod takes time to start, it's possible we get connection refused error.
	// Requeue after few seconds to avoid continuous connection errors.
	return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

// Update RayJob.Status (Kubernetes CR) from Ray Job Status from Dashboard service
if jobInfo != nil && jobInfo.JobStatus != rayJobInstance.Status.JobStatus {
@architkulkarni (Contributor, Author) commented on this code:

In the bug reproduction by killing the head pod, the previous and current JobStatus were both RUNNING so we wouldn't go into this branch.

@kevin85421 kevin85421 changed the title [Bugfix] [RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bugfix][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic Oct 30, 2023
@kevin85421 kevin85421 changed the title [Bugfix][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic Oct 30, 2023
@kevin85421 (Member) left a comment:

There is another question for the issue #1489. Does KubeRay respect TTL when the status is JobDeploymentStatusFailedToGetJobStatus?

@architkulkarni (Contributor, Author) replied:

> There is another question for the issue #1489. Does KubeRay respect TTL when the status is JobDeploymentStatusFailedToGetJobStatus?

@kevin85421 No, the shutdown logic is only entered if the JobStatus is SUCCEEDED or FAILED. The TTL check is currently done by comparing a timestamp against JobInfo.end_time, which is populated by Ray and retrieved via HTTP.

I will create a follow-up issue to respect TTL in the case where the JobStatus cannot be retrieved.
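For reference, a rough sketch of the check described above, with hypothetical names (KubeRay's actual implementation and field names may differ):

package main

import "fmt"

// shouldShutdown sketches the behavior described above using hypothetical names:
// the shutdown/TTL path only runs for terminal job statuses, so a job whose
// status cannot be retrieved never reaches the TTL comparison against
// JobInfo.end_time.
func shouldShutdown(jobStatus string, endTimeMs, ttlSeconds, nowMs int64) bool {
	if jobStatus != "SUCCEEDED" && jobStatus != "FAILED" {
		return false // also covers the case where the status could not be retrieved
	}
	return nowMs >= endTimeMs+ttlSeconds*1000
}

func main() {
	// TTL elapsed after a successful job: shutdown proceeds.
	fmt.Println(shouldShutdown("SUCCEEDED", 1_000, 30, 40_000)) // true
	// Status unknown: shutdown is skipped regardless of elapsed time.
	fmt.Println(shouldShutdown("", 1_000, 30, 40_000)) // false
}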

@kevin85421 (Member) commented:

Could you please open an issue to track the TTL issue? Thanks! Because this PR does not have tests, I will clone your fork to test it manually.

@kevin85421 (Member) commented:

Manual test

  1. Deploy a job which sleeps 3000s
  2. Kill the head pod
  3. The job automatically retries, and the status is successfully updated to Running (Previously it would be stuck at FailedToGetJobStatus)

I followed the reproduction script. When I killed the head Pod, the submitter K8s Job reached its backoffLimit within 3 seconds. Therefore, when the head Pod became ready again, the submitter K8s Job would no longer submit the Ray job.
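For reference, a hypothetical command sequence for the reproduction above (manifest and Pod names are placeholders):

# Deploy a RayJob whose entrypoint sleeps for 3000 seconds (manifest name assumed).
kubectl apply -f rayjob-sample.yaml

# Find and delete the head Pod to simulate a failure.
kubectl get pods
kubectl delete pod <rayjob-sample-raycluster-xxxxx-head-xxxxx>

# Watch the RayJob's Job Deployment Status as the cluster recovers.
kubectl describe rayjob rayjob-sample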

@kevin85421 (Member) commented:

Discussed with @architkulkarni offline. The reproduction in the PR description cannot test this PR properly. Hence, Archit will work on tests or reproduction scripts that simulate: (1) the head Pod staying healthy the whole time, and (2) a status check request from KubeRay being dropped due to a network issue.

@architkulkarni (Contributor, Author) commented:

@kevin85421 Adding mocks for an end-to-end test in CI turned out to be cumbersome. I also tried a manual test using a NetworkPolicy to block the GetJobInfo requests, but I couldn't successfully block them (even though the NetworkPolicy did block requests from a debug sidecar on the Ray operator pod...).

For now I have added a unit test; let me know if this is sufficient.

@architkulkarni architkulkarni changed the title [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running and allowing shutdown logic [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running Oct 31, 2023
@kevin85421 (Member) commented:

Unit tests are not sufficient to prove this PR's correctness, but I am OK with not adding e2e tests and testing this PR manually instead, because (1) we plan to refactor the RayJob codebase after the v1.0.0 release and (2) this PR is a release blocker for KubeRay v1.0.0.

> I also tried a manual test using a NetworkPolicy to block the GetJobInfo requests, but I couldn't successfully block them (even though the NetworkPolicy did block requests from a debug sidecar on the Ray operator pod...).

Wow. I will try it.

@kevin85421 (Member) commented Oct 31, 2023:

@architkulkarni Could you try to use iptables? Thanks!

# Add the following securityContext to the Ray head configuration in your RayJob YAML.
securityContext:
  capabilities:
    add: ['NET_ADMIN']

# Log in to the head Pod and install `curl` and `iptables`.
sudo apt-get update; sudo apt-get install -y curl iptables

# Try to connect to the Ray dashboard
curl -X GET 127.0.0.1:8265
# <!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.1f147255.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

# Drop all incoming packets to the dashboard port.
sudo iptables -A INPUT -p tcp --dport 8265 -j DROP

# Try to connect to the Ray dashboard again.
curl -X GET 127.0.0.1:8265
# curl: (28) Failed to connect to 127.0.0.1 port 8265: Connection timed out

@architkulkarni (Contributor, Author) commented Oct 31, 2023:

@kevin85421 Thanks for the manual test instructions! They work, and the fix behaves correctly.

After dropping packets to the dashboard port and waiting about 1-2 minutes:

Status:
  Dashboard URL:          rayjob-sample-raycluster-plc9l-head-svc.default.svc.cluster.local:8265
  Job Deployment Status:  FailedToGetJobStatus
  Job Id:                 rayjob-sample-hrmmr
  Job Status:             RUNNING
  Message:                Job is currently running.
  Observed Generation:    2

After deleting the blocking rule with sudo iptables -D INPUT -p tcp --dport 8265 -j DROP:

Status:
  Dashboard URL:          rayjob-sample-raycluster-plc9l-head-svc.default.svc.cluster.local:8265
  Job Deployment Status:  Running
  Job Id:                 rayjob-sample-hrmmr
  Job Status:             RUNNING
  Message:                Job is currently running.
  Observed Generation:    2

kevin85421 merged commit 5a974fc into master on Oct 31, 2023
23 checks passed
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Nov 2, 2023
kevin85421 deleted the fix-rayjob-hang-status branch on December 27, 2023.
Closes: [Bug] RayJob hang with incorrect status when head pod restarts (#1489)