Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SkyBenchmark: fix job_status is None for failed candidates. #2767

Merged
merged 5 commits into from
Nov 28, 2023

Conversation

concretevitamin
Copy link
Collaborator

@concretevitamin concretevitamin commented Nov 10, 2023

Attempt to fix #2765.

Repro:

» cat test.yaml
run: |
  nvidia-smi

resources:
  cloud: gcp
  region: us-central1
  zone: us-central1-f

sky bench launch test.yaml --gpus A100:8,T4 --benchmark test1

Before: failed to parse --gpus; after fixing parsing and waiting for bench finished launching (A100:8)

With this PR: no error occurs

» sky bench  show test1
Legend:
- #STEPS: Number of steps taken.
- SEC/STEP, $/STEP: Average time (cost) per step.
- EST(hr), EST($): Estimated total time (cost) to complete the benchmark.

CLUSTER            RESOURCES                        STATUS    DURATION  SPENT($)  #STEPS  SEC/STEP  $/STEP  EST(hr)  EST($)
sky-bench-test1-1  1x GCP(n1-highmem-4, {'T4': 1})  FINISHED  < 1s      0.0001    -       -         -       -        -

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
    • pytest tests/test_smoke.py::test_sky_bench --generic-cloud gcp: passed on this PR; failed on master
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @concretevitamin! Left a question.

Comment on lines 324 to 327
if job_status is None:
benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
elif (cluster_status == status_lib.ClusterStatus.INIT or
job_status < job_lib.JobStatus.RUNNING):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if job_status is None:
benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
elif (cluster_status == status_lib.ClusterStatus.INIT or
job_status < job_lib.JobStatus.RUNNING):
if (cluster_status == status_lib.ClusterStatus.INIT or job_status is None or
job_status < job_lib.JobStatus.RUNNING):

Reason: when the cluster is being provisioned, the job_status will be None and the benchmark task should be initializing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a few cases

    # 'record' is not None:
    #   cluster_status None (e.g., preempted/never launched), job_status None
    #     --> BenchmarkStatus.TERMINATED or INIT
    #   cluster_status INIT (e.g., something's wrong), job_status None
    #     --> BenchmarkStatus ??
    #   cluster_status STOPPED (e.g., manually stopped or auto-stopped), job_status None
    #     --> BenchmarkStatus ??
    #   cluster_status UP (e.g., manually stopped or auto-stopped), job_status None
    #     --> BenchmarkStatus.INIT
    #   cluster_status UP (e.g., manually stopped or auto-stopped), job_status not-None
    #     --> handled below

The problem is BenchmarkStatus's definition doesn't seem too clear (e.g., for the first case, should we set it to TERMINATED or INIT; does it matter?). Wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it would be better to set the BenchmarkStatus to INIT for all the cases mentinoed, as it aligns with the semantic of INIT, i.e., the status is UNKNOWN or abnormal or initializing. I would not set it to TERMINATED as it may cause a transition from TERMINATED to a non-terminated state in the case cluster_status is None or cluster_status is UP while job_status is None, which can be quite suprising. The TERMINATED state should be a sink state.
Maybe we can specially handle the case where the cluster_status is STOPPED, but I think it should be fine to set all of the to BenchmarkStatus.INIT.

@concretevitamin
Copy link
Collaborator Author

Updated logic and PR description, PTAL.

    #   cluster_status None (e.g., preempted/never launched), job_status None
    #     --> BenchmarkStatus TERMINATED
    #   cluster_status INIT (e.g., something's wrong, or still launching), job_status None
    #     --> BenchmarkStatus INIT
    #   cluster_status STOPPED (e.g., auto-stopped), job_status None
    #     --> _determine_finished_or_terminated()
    #   cluster_status UP, job_status None
    #     --> BenchmarkStatus.INIT
    #   cluster_status UP, job_status not-None
    #     --> if job_status < RUNNING: BenchmarkStatus.INIT
    #     --> if job_status == RUNNING: BenchmarkStatus.RUNNING
    #     --> if job_status.is_terminal(): _determine_finished_or_terminated()

For the first case, it's ok to send it to the sink state TERMINATED, since there's no way to "rerun" the same benchmark name to retry the launch. Wdyt?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @concretevitamin! LGTM.

Comment on lines 334 to 336
if end_time is not None:
# The job has terminated with zero exit code.
benchmark_status = benchmark_state.BenchmarkStatus.FINISHED
return end_time, benchmark_state.BenchmarkStatus.FINISHED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems will never happen as we have checked end_time in L348?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Simplified the logic now.

Copy link
Collaborator Author

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified the logic + added a basic smoke test. PTAL.

Comment on lines 334 to 336
if end_time is not None:
# The job has terminated with zero exit code.
benchmark_status = benchmark_state.BenchmarkStatus.FINISHED
return end_time, benchmark_state.BenchmarkStatus.FINISHED
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Simplified the logic now.

@concretevitamin concretevitamin merged commit c1be039 into master Nov 28, 2023
19 checks passed
@concretevitamin concretevitamin deleted the fix-sky-bench branch November 28, 2023 17:03
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
…-org#2767)

* SkyBenchmark: fix job_status is None for failed candidates.

* Fix --gpus parsing

* Logic updates

* Simplify and add smoke test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Benchmark] TypeError: '<' not supported between instances of 'NoneType' and 'JobStatus'
2 participants