SkyBenchmark: fix job_status is None for failed candidates. #2767

concretevitamin · 2023-11-10T17:19:02Z

Attempt to fix #2765.

Repro:

» cat test.yaml
run: |
  nvidia-smi

resources:
  cloud: gcp
  region: us-central1
  zone: us-central1-f

sky bench launch test.yaml --gpus A100:8,T4 --benchmark test1

Before: --gpus failed to parse. After fixing the parsing and waiting for the benchmark to finish launching (A100:8):

sky bench show mybench reproduces the error in [Benchmark] TypeError: '<' not supported between instances of 'NoneType' and 'JobStatus' #2765
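For context, the failure mode in #2765 can be reproduced in isolation. The sketch below is a minimal illustration, not SkyPilot's code: `JobStatus` here is a hypothetical stand-in enum that defines `<` by declaration order, and `is_before_running` and `unguarded_compare` are illustrative helper names. Comparing `None` against such an enum raises a `TypeError`, while checking for `None` first avoids it.

```python
import enum


class JobStatus(enum.Enum):
    """Hypothetical stand-in for an ordered job status enum."""
    INIT = 'INIT'
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'

    def __lt__(self, other):
        # Only JobStatus values are comparable; anything else (e.g. None)
        # falls through to Python's default handling, which raises TypeError.
        if not isinstance(other, JobStatus):
            return NotImplemented
        members = list(JobStatus)
        return members.index(self) < members.index(other)


def unguarded_compare(job_status):
    """Compare without a None check; raises TypeError when job_status is None."""
    return job_status < JobStatus.RUNNING


def is_before_running(job_status):
    """Treat a missing job status as 'not running yet' instead of comparing it."""
    if job_status is None:
        return True
    return job_status < JobStatus.RUNNING
```

The guarded helper returns a value for every input, including `None`, which is the property the fix needs before any ordering comparison happens.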

With this PR: no error occurs

» sky bench show test1
Legend:
- #STEPS: Number of steps taken.
- SEC/STEP, $/STEP: Average time (cost) per step.
- EST(hr), EST($): Estimated total time (cost) to complete the benchmark.

CLUSTER            RESOURCES                        STATUS    DURATION  SPENT($)  #STEPS  SEC/STEP  $/STEP  EST(hr)  EST($)
sky-bench-test1-1  1x GCP(n1-highmem-4, {'T4': 1})  FINISHED  < 1s      0.0001    -       -         -       -        -

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_sky_bench --generic-cloud gcp: passed on this PR; failed on master
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Michaelvll

Thanks for fixing this @concretevitamin! Left a question.

Michaelvll · 2023-11-13T19:51:18Z

sky/benchmark/benchmark_utils.py

+    if job_status is None:
+        benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
+    elif (cluster_status == status_lib.ClusterStatus.INIT or
+          job_status < job_lib.JobStatus.RUNNING):


Suggested change

-    if job_status is None:
-        benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
-    elif (cluster_status == status_lib.ClusterStatus.INIT or
-          job_status < job_lib.JobStatus.RUNNING):
+    if (cluster_status == status_lib.ClusterStatus.INIT or job_status is None or
+            job_status < job_lib.JobStatus.RUNNING):

Reason: when the cluster is being provisioned, the job_status will be None and the benchmark task should be initializing.

There seem to be a few cases:

    # 'record' is not None:
    #   cluster_status None (e.g., preempted/never launched), job_status None
    #     --> BenchmarkStatus.TERMINATED or INIT
    #   cluster_status INIT (e.g., something's wrong), job_status None
    #     --> BenchmarkStatus ??
    #   cluster_status STOPPED (e.g., manually stopped or auto-stopped), job_status None
    #     --> BenchmarkStatus ??
    #   cluster_status UP (e.g., manually stopped or auto-stopped), job_status None
    #     --> BenchmarkStatus.INIT
    #   cluster_status UP (e.g., manually stopped or auto-stopped), job_status not-None
    #     --> handled below

The problem is BenchmarkStatus's definition doesn't seem too clear (e.g., for the first case, should we set it to TERMINATED or INIT; does it matter?). Wdyt?

IMO, it would be better to set the BenchmarkStatus to INIT for all the cases mentioned, as that aligns with the semantics of INIT, i.e., the status is UNKNOWN, abnormal, or initializing. I would not set it to TERMINATED, as that may cause a transition from TERMINATED to a non-terminated state when cluster_status is None, or when cluster_status is UP while job_status is None, which can be quite surprising. The TERMINATED state should be a sink state.
Maybe we can specially handle the case where cluster_status is STOPPED, but I think it should be fine to set all of them to BenchmarkStatus.INIT.
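To make the sink-state concern concrete, here is a minimal sketch of a guard that never leaves TERMINATED once it has been recorded. It assumes a simplified, hypothetical `BenchmarkStatus` enum and a hypothetical helper `next_status`; neither is taken from the SkyPilot codebase.

```python
import enum


class BenchmarkStatus(enum.Enum):
    """Hypothetical stand-in for the benchmark status enum."""
    INIT = 'INIT'
    RUNNING = 'RUNNING'
    FINISHED = 'FINISHED'
    TERMINATED = 'TERMINATED'


def next_status(current, proposed):
    """Return the status to record, keeping TERMINATED as a sink state.

    Once `current` is TERMINATED, any newly computed status is ignored,
    so a later refresh cannot move the benchmark back to INIT or RUNNING.
    """
    if current is BenchmarkStatus.TERMINATED:
        return current
    return proposed
```

With a guard like this, setting TERMINATED too eagerly is costly because the status can never recover, which is the argument for preferring INIT in ambiguous cases.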

concretevitamin · 2023-11-20T01:04:27Z

Updated logic and PR description, PTAL.

    #   cluster_status None (e.g., preempted/never launched), job_status None
    #     --> BenchmarkStatus TERMINATED
    #   cluster_status INIT (e.g., something's wrong, or still launching), job_status None
    #     --> BenchmarkStatus INIT
    #   cluster_status STOPPED (e.g., auto-stopped), job_status None
    #     --> _determine_finished_or_terminated()
    #   cluster_status UP, job_status None
    #     --> BenchmarkStatus.INIT
    #   cluster_status UP, job_status not-None
    #     --> if job_status < RUNNING: BenchmarkStatus.INIT
    #     --> if job_status == RUNNING: BenchmarkStatus.RUNNING
    #     --> if job_status.is_terminal(): _determine_finished_or_terminated()

For the first case, it's ok to send it to the sink state TERMINATED, since there's no way to "rerun" the same benchmark name to retry the launch. Wdyt?
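The case table above can be sketched as a single function. This is an illustration under stated assumptions, not the code in benchmark_utils.py: the three enums are hypothetical stand-ins, `classify` is an illustrative name, and `determine_finished_or_terminated` is a placeholder whose rule (FINISHED only for a succeeded job) is assumed for this sketch. Combinations the table does not list (e.g., cluster_status INIT or STOPPED with a non-None job_status) follow the nearest listed case.

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for the cluster status enum."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


class JobStatus(enum.Enum):
    """Hypothetical stand-in; the integer values encode an assumed ordering."""
    INIT = 0
    PENDING = 1
    RUNNING = 2
    SUCCEEDED = 3
    FAILED = 4

    def __lt__(self, other):
        if not isinstance(other, JobStatus):
            return NotImplemented
        return self.value < other.value


class BenchmarkStatus(enum.Enum):
    """Hypothetical stand-in for the benchmark status enum."""
    INIT = 'INIT'
    RUNNING = 'RUNNING'
    TERMINATED = 'TERMINATED'
    FINISHED = 'FINISHED'


def determine_finished_or_terminated(
        job_status: Optional[JobStatus]) -> BenchmarkStatus:
    # Assumed rule for this sketch only: a succeeded job is FINISHED;
    # anything else, including an unknown job status, is TERMINATED.
    if job_status is JobStatus.SUCCEEDED:
        return BenchmarkStatus.FINISHED
    return BenchmarkStatus.TERMINATED


def classify(cluster_status: Optional[ClusterStatus],
             job_status: Optional[JobStatus]) -> BenchmarkStatus:
    if cluster_status is None:
        # Preempted or never launched: send to the sink state.
        return BenchmarkStatus.TERMINATED
    if cluster_status is ClusterStatus.INIT:
        # Something is wrong, or the cluster is still launching.
        return BenchmarkStatus.INIT
    if cluster_status is ClusterStatus.STOPPED:
        return determine_finished_or_terminated(job_status)
    # cluster_status is UP from here on. The None check must come first,
    # otherwise the `<` comparison raises TypeError.
    if job_status is None or job_status < JobStatus.RUNNING:
        return BenchmarkStatus.INIT
    if job_status is JobStatus.RUNNING:
        return BenchmarkStatus.RUNNING
    # Remaining job statuses are terminal in this sketch.
    return determine_finished_or_terminated(job_status)
```

Handling `cluster_status` before `job_status` keeps every `job_status is None` row explicit, so no ordering comparison is ever made against `None`.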

Michaelvll

Thanks for the fix @concretevitamin! LGTM.

Michaelvll · 2023-11-21T23:13:17Z

sky/benchmark/benchmark_utils.py

        if end_time is not None:
            # The job has terminated with zero exit code.
-            benchmark_status = benchmark_state.BenchmarkStatus.FINISHED
+            return end_time, benchmark_state.BenchmarkStatus.FINISHED


It seems this will never happen, since we have already checked end_time in L348?


concretevitamin

Simplified the logic + added a basic smoke test. PTAL.

concretevitamin · 2023-11-22T16:54:22Z

sky/benchmark/benchmark_utils.py

        if end_time is not None:
            # The job has terminated with zero exit code.
-            benchmark_status = benchmark_state.BenchmarkStatus.FINISHED
+            return end_time, benchmark_state.BenchmarkStatus.FINISHED


Good catch! Simplified the logic now.

…-org#2767)
* SkyBenchmark: fix job_status is None for failed candidates.
* Fix --gpus parsing
* Logic updates
* Simplify and add smoke test.

SkyBenchmark: fix job_status is None for failed candidates.

fb265bb

concretevitamin requested a review from WoosukKwon November 10, 2023 17:19

concretevitamin mentioned this pull request Nov 10, 2023

[Benchmark] TypeError: '<' not supported between instances of 'NoneType' and 'JobStatus' #2765

Closed

Fix --gpus parsing

dd4f2fd

concretevitamin requested a review from Michaelvll November 11, 2023 17:22

Michaelvll reviewed Nov 13, 2023


concretevitamin added 2 commits November 19, 2023 09:53

Merge branch 'master' into fix-sky-bench

bf3648b

Logic updates

7f8a15b

concretevitamin requested a review from Michaelvll November 20, 2023 01:04

Michaelvll approved these changes Nov 21, 2023


Simplify and add smoke test.

5334347

concretevitamin commented Nov 22, 2023


concretevitamin merged commit c1be039 into master Nov 28, 2023
19 checks passed

concretevitamin deleted the fix-sky-bench branch November 28, 2023 17:03


concretevitamin commented Nov 10, 2023 •

edited

Michaelvll left a comment

Michaelvll Nov 13, 2023

concretevitamin Nov 14, 2023

Michaelvll Nov 15, 2023

concretevitamin commented Nov 20, 2023

Michaelvll left a comment

Michaelvll Nov 21, 2023

concretevitamin Nov 22, 2023

concretevitamin left a comment

concretevitamin Nov 22, 2023
