[Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED #1919

Merged
merged 2 commits into from
Feb 13, 2024


kevin85421
Member

@kevin85421 kevin85421 commented Feb 11, 2024

Why are these changes needed?

When ttlSecondsAfterFinished is set to 0, the submitter Kubernetes Job may fail and retry even though the RayJob's JobDeploymentStatus is Complete and its JobStatus is SUCCEEDED. Without this PR, once the Ray job reaches a terminal state (e.g. SUCCEEDED), the RayJob transitions from Running to Complete and then checks ttlSecondsAfterFinished to decide whether it is time to delete the RayCluster. However, the submitter Kubernetes Job may still need some time to receive the logs from the Ray head, so it fails if the head service has already been deleted.

In this PR, RayJob transitions from Running to Complete if and only if the submitter Job is Complete or Failed.
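
The rule above can be illustrated with a minimal Python sketch. This is not the KubeRay controller code (which is written in Go); the names `JobCondition`, `submitter_finished`, and `next_deployment_status` are hypothetical and exist only for illustration. The sketch assumes the submitter's state is read from the Kubernetes Job's `status.conditions`, where a finished Job carries a condition of type `Complete` or `Failed` with status `"True"`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobCondition:
    """Hypothetical stand-in for one entry of a Kubernetes Job's status.conditions."""
    type: str    # e.g. "Complete" or "Failed"
    status: str  # "True", "False", or "Unknown"


def submitter_finished(conditions: List[JobCondition]) -> bool:
    """Return True once the submitter Job reports Complete or Failed."""
    return any(
        c.type in ("Complete", "Failed") and c.status == "True"
        for c in conditions
    )


def next_deployment_status(current: str, conditions: List[JobCondition]) -> str:
    """Move Running -> Complete only when the submitter Job has finished."""
    if current == "Running" and submitter_finished(conditions):
        return "Complete"
    return current
```

With this gating, a RayJob whose submitter Job has no terminal condition yet stays in Running, so the ttlSecondsAfterFinished check (and any RayCluster deletion) cannot start while the submitter is still fetching logs.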

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Without this PR

  • Although all 25 RayJob custom resources reach JobDeploymentStatus Complete and JobStatus SUCCEEDED, one of their submitter Kubernetes Jobs fails.
[Screenshots: Screen Shot 2024-02-10 at 11 41 25 PM, Screen Shot 2024-02-10 at 11 54 53 PM]

With this PR

[Screenshots: Screen Shot 2024-02-10 at 11 30 20 PM, Screen Shot 2024-02-10 at 11 30 00 PM]

@kevin85421 kevin85421 changed the title from "WIP" to "[Bug] Submitter K8s Job fails although RayJob with JobDeploymentStatus Complete and JobStatus SUCCEEDED" Feb 11, 2024
@kevin85421 kevin85421 changed the title from "[Bug] Submitter K8s Job fails although RayJob with JobDeploymentStatus Complete and JobStatus SUCCEEDED" to "[Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED" Feb 11, 2024
@kevin85421 kevin85421 marked this pull request as ready for review February 11, 2024 08:06
@kevin85421
Member Author

cc @andrewsykim would you mind reviewing this PR? Thanks!

Contributor

@andrewsykim andrewsykim left a comment

LGTM

@kevin85421 kevin85421 merged commit 5dab94c into ray-project:master Feb 13, 2024
23 checks passed