Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Job] add driver exit code to job info #39675

Merged
merged 10 commits into from
Sep 24, 2023

Conversation

jonathan-anyscale
Copy link
Contributor

Why are these changes needed?

Add driver exit code to JobInfo class which handled by JobManager

Related issue number

Closes #31183

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
dashboard/modules/job/common.py Outdated Show resolved Hide resolved
dashboard/modules/job/job_manager.py Outdated Show resolved Hide resolved
dashboard/modules/job/job_manager.py Outdated Show resolved Hide resolved
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg

dashboard/modules/job/job_manager.py Outdated Show resolved Hide resolved
dashboard/modules/job/tests/test_job_manager.py Outdated Show resolved Hide resolved
@rkooo567 rkooo567 self-assigned this Sep 22, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just minor comments/questions

@@ -98,3 +98,7 @@ class JobDetails(BaseModel):
None,
description="The node ID of the node the job entrypoint command is running on.",
)
driver_exit_code: Optional[str] = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity why Optional[str] and not Optional[int]? Also, in the description we should also define when it would return None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, the actual output is int so shld update that

@@ -95,6 +95,8 @@ class JobInfo:
#: The node id that driver running on. It will be None only when the job status
# is PENDING, and this field will not be deleted or modified even if the driver dies
driver_node_id: Optional[str] = None
#: The driver process exit code after driver terminates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define in the comment when this would be None. I think this is what shows up in the auto-generated Ray docs, but it would be good to confirm this and see if there's anywhere else in the docs that we need to update

@@ -908,6 +922,8 @@ async def test_failed_job(self, job_manager):
await async_wait_for_condition_async_predicate(
check_job_failed, job_manager=job_manager, job_id=job_id
)
data = await job_manager.get_job_info(job_id)
assert data.driver_exit_code == -9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about the meaning of -9 and why it's expected?

@@ -380,6 +380,8 @@ message JobsAPIInfo {
// The node id that driver running on. It will be None only when the job status
// is PENDING, and this field will not be deleted or modified even if the driver dies
optional string driver_node_id = 13;
// The driver process exit code after driver terminates
optional string driver_exit_code = 14;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why string instead of int

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some test should have failed here, right? Or can a Python int silently be loaded into a string protobuf without any error? (Or is this line of code not currently run in any test?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't fail any tests before hmm

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale changed the title add driver exit code to job info [Job] add driver exit code to job info Sep 22, 2023
@jjyao jjyao merged commit f2eaea0 into ray-project:master Sep 24, 2023
103 of 108 checks passed
@jonathan-anyscale jonathan-anyscale linked an issue Sep 26, 2023 that may be closed by this pull request
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Sep 26, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@rkooo567
Copy link
Contributor

Btw, we should follow up with this. Adding the exit code itself is not very useful (we should integrate to prod or Ray dashboard)

@architkulkarni
Copy link
Contributor

Btw, we should follow up with this. Adding the exit code itself is not very useful (we should integrate to prod or Ray dashboard)

Adding it to the dashboard is tracked here #38142

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[jobs] Return status code after job is completed [job submission] Add driver exit code in JobInfo
4 participants