
Job Submission: Add support for job queue and management. #134

Merged: 62 commits merged into master from job-queue on Jan 10, 2022.

Conversation

@Michaelvll (Collaborator) commented Jan 5, 2022

This PR provides better support for job queues by building on Ray's job submission (ray job).

Main changes:

  1. Replace the program submission logic with ray job.
  2. New resource checking: task resources <= cluster.launched_resources.
  3. New CLIs for the job queue.
  4. Job queue information is stored in a per-cluster database (for multitenancy), backed by sqlite3; a minimal schema sketch follows this list.
  5. sky logs for tracking job logs.
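
The job table lives on the cluster itself so that multiple users share one queue. As a rough illustration only, a sqlite3-backed table could look like the sketch below; the table name, columns, and helper functions are assumptions for illustration, not the actual schema or API introduced by this PR.

# Illustrative sketch only: table name, columns, and helpers are assumptions,
# not the actual schema/API added in this PR.
import sqlite3
import time

_CONN = sqlite3.connect('jobs.db')
_CONN.execute("""CREATE TABLE IF NOT EXISTS jobs (
    job_id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT,
    submitted_at REAL,
    status TEXT,
    log_path TEXT)""")
_CONN.commit()


def add_job(username: str, log_path: str) -> int:
    """Record a newly submitted job as PENDING and return its job id."""
    cur = _CONN.execute(
        'INSERT INTO jobs (username, submitted_at, status, log_path) '
        'VALUES (?, ?, ?, ?)', (username, time.time(), 'PENDING', log_path))
    _CONN.commit()
    return cur.lastrowid


def set_status(job_id: int, status: str) -> None:
    """Advance a job through its lifecycle (PENDING/RUNNING/SUCCEEDED/...)."""
    _CONN.execute('UPDATE jobs SET status = ? WHERE job_id = ?',
                  (status, job_id))
    _CONN.commit()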

Added features:

  1. sky queue <cluster_name> [--all] (--all shows all jobs, including finished ones; without --all, each finished job is shown only once).
> sky queue test --all
Sky Job Queue of Cluster: test
+-----+-------------+-----------+-----------------------------------------+
| JOB |  SUBMITTED  |   STATUS  | LOG                                     |
+-----+-------------+-----------+-----------------------------------------+
| 117 | 31 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-15-57-40-379226 |
| 116 |  4 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-53-14-428524 |
| 115 |  5 mins ago |  STOPPED  | sky_logs/sky-2022-01-05-15-52-29-825712 |
| 114 |  5 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-52-12-210650 |
| 113 | 20 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-38-02-089872 |
+-----+-------------+-----------+-----------------------------------------+
  2. sky cancel -c <cluster_name> [<job_id>|--all]: cancel a job (or all jobs) on the cluster.
  3. sky logs -c <cluster_name> <job_id>: tail the log of a job in real time.
  4. Detached execution: sky exec -c <cluster_name> <yaml_file> --detach (a possible end-to-end workflow is shown right after this list).
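
Putting the new commands together, a possible end-to-end workflow (the cluster name, YAML file, and job id reuse the examples in this description):

> sky exec -c test examples/job_queue/job.yaml --detach
> sky queue test
> sky logs -c test 117
> sky cancel -c test 117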

TODOs:

  • Tested examples/minimal.yaml with multiple jobs (including accelerators).
  • Better job name for user's convenience.
  • Tested single nodes/multiple jobs with less demanding resources. (examples/job_queue/cluster.yaml; examples/job_queue/job.yaml: two jobs with V100: 0.5)
Sky Job Queue of Cluster: test
+-----+-------------+-----------+-----------------------------------------+
| JOB |  SUBMITTED  |   STATUS  | LOG                                     |
+-----+-------------+-----------+-----------------------------------------+
| 124 | 12 secs ago |  PENDING  | sky_logs/sky-2022-01-05-16-14-49-575592 |
| 123 | 20 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-16-14-41-608605 |
| 122 | 29 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-16-14-30-905822 |
| 121 |  2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-16-12-38-642178 |
+-----+-------------+-----------+-----------------------------------------+
  • Pipe logs and add hints.
  • Test a multi-node job on multiple nodes. (examples/job_queue/multinode.yaml; examples/job_queue/multinode_job.yaml: two jobs with 2x K80: 0.5)
  • Multiple nodes with single-node job. (examples/job_queue/multinode.yaml with 2x K80; examples/job_queue/job.yaml: two jobs with 1x K80: 0.5: 4 jobs in parallel)
  • Multiple nodes with single-node job and multi-node job.
  • Distributed training with multiple nodes.
  • Test examples/run_smoke_tests.sh
  • Deprecate ParTask and rewrite the ParTask examples.

Future:

  • (! Important) Fix job execution order. (Ray does not seem to preserve the FIFO order of the job queue; see the out-of-order queue below.)
Sky Job Queue of Cluster test2
+-----+------+------------+-----------+-----------------------------------------+
| JOB | USER | SUBMITTED  |   STATUS  | LOG                                     |
+-----+------+------------+-----------+-----------------------------------------+
| 103 | zhwu | 2 mins ago |  PENDING  | sky_logs/sky-2022-01-06-02-36-36-737105 |
| 105 | zhwu | 1 min ago  |  RUNNING  | sky_logs/sky-2022-01-06-02-37-03-308653 |
| 104 | zhwu | 1 min ago  |  RUNNING  | sky_logs/sky-2022-01-06-02-36-50-699314 |
| 102 | zhwu | 2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-06-02-36-23-662007 |
| 101 | zhwu | 2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-06-02-36-06-512089 |
+-----+------+------------+-----------+-----------------------------------------+
  • Fix executing a job with num_nodes=2 on a cluster with 3 nodes. (This is blocked by the post_setup_fn design for distributed training.)
  • Fix the log downloading logic. (When should we download the logs? Zongheng: maybe don't.)
  • Launching a cluster for a task with K80: 0.5 should round up to K80: 1 (a rounding sketch follows this list).
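
The last item could be handled by taking the ceiling of any fractional accelerator request at provisioning time; the helper below is only an illustrative sketch, not code from this PR.

# Illustrative sketch: round fractional accelerator demands up to whole
# devices when launching a cluster (e.g. {'K80': 0.5} -> {'K80': 1}), while
# the fractional demand can still be used to pack multiple jobs per device.
import math
from typing import Dict


def round_up_accelerators(accelerators: Dict[str, float]) -> Dict[str, int]:
    return {name: math.ceil(count) for name, count in accelerators.items()}


assert round_up_accelerators({'K80': 0.5}) == {'K80': 1}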

@gmittal (Collaborator) commented Jan 5, 2022

Really excited about this

@Michaelvll force-pushed the job-queue branch 2 times, most recently from 1ee5566 to 3eb9048, on January 7, 2022 00:57.
@Michaelvll changed the title from "Job Queue: Refactor the CloudVMRayBackend with ray job" to "Job Submission: Add support for job queue and management." on Jan 9, 2022.
@concretevitamin (Collaborator) left a comment:

Done with a complete pass.

Resolved review comments on: prototype/sky/backends/backend_utils.py, prototype/sky/backends/local_docker_backend.py, prototype/sky/task.py, prototype/sky/backends/remote_libs/log_lib.py, and prototype/sky/backends/cloud_vm_ray_backend.py (several threads).
@concretevitamin (Collaborator) left a comment:

Again, extraordinary work! Just a few nits.

Resolved review comments on: prototype/examples/multi_echo.py and prototype/sky/backends/cloud_vm_ray_backend.py.
assert len(task.resources) == 1, task.resources

launched_resources = handle.launched_resources.fill_accelerators()
task_resources = list(task.resources)[0]
@concretevitamin (Collaborator) commented on the snippet above:

Similar to the other comment.

This is tricky and could use a comment (something like: "the task may (e.g., sky run) or may not (e.g., sky exec) have undergone sky.optimize()..."?). Also, if task.best_resources is available, should we use it, and otherwise fall back to L63-64?

@Michaelvll (Collaborator, Author) replied:

For this one, we should compare the task's requested resources with the launched ones, instead of best_resources. That is because best_resources does an unnecessary narrowing-down of the requested resources (making them more demanding, e.g., a specific cloud, instance_type, etc.).
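
As a rough sketch of that check (comparing the requested resources against the cluster's launched resources); less_demanding_than is an assumed helper name for illustration, not necessarily the PR's actual API:

# Illustrative sketch: `less_demanding_than` is an assumed comparison helper,
# not necessarily the method used in this PR.
def check_task_fits_cluster(task, handle) -> None:
    assert len(task.resources) == 1, task.resources
    launched_resources = handle.launched_resources
    task_resources = list(task.resources)[0]
    # Compare the *requested* resources against what was actually launched.
    # task.best_resources would be too strict here: optimization narrows the
    # request down to a specific cloud / instance_type, which a `sky exec`
    # task has not necessarily gone through.
    if not task_resources.less_demanding_than(launched_resources):
        raise ValueError(
            f'Task requires {task_resources}, which is more demanding than '
            f'the launched cluster resources {launched_resources}.')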

Resolved review comments on: prototype/sky/backends/remote_libs/job_lib.py (two threads).


- def log_dir(job_id: int) -> Tuple[Optional[str], Optional[str]]:
+ def log_dir(job_id: int) -> Tuple[Optional[str], Optional[JobStatus]]:
@concretevitamin (Collaborator) commented on the signature change above:

Return type: they can't be Optional, can they?

@Michaelvll (Collaborator, Author) replied:

They can, since the job_id may fail to be found. :)
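
In other words, when the job_id is not found there is neither a log directory nor a status to return, so both elements of the tuple are Optional. A minimal sketch (the table shape mirrors the illustrative jobs table sketched in the PR description, not the actual job_lib code):

import sqlite3
from enum import Enum
from typing import Optional, Tuple


class JobStatus(Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    STOPPED = 'STOPPED'


# Illustrative in-memory jobs table with the same shape as the earlier sketch.
_CONN = sqlite3.connect(':memory:')
_CONN.execute('CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, '
              'status TEXT, log_path TEXT)')


def log_dir(job_id: int) -> Tuple[Optional[str], Optional[JobStatus]]:
    """Return (log_path, status) of a job, or (None, None) if it is unknown."""
    row = _CONN.execute(
        'SELECT log_path, status FROM jobs WHERE job_id = ?',
        (job_id,)).fetchone()
    if row is None:
        # The job id was never recorded, so neither value exists -- hence
        # both elements of the return type are Optional.
        return None, None
    log_path, status = row
    return log_path, JobStatus(status)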

@Michaelvll (Collaborator, Author) commented Jan 9, 2022

A logging problem was found: whenever the number of jobs submitted contains a hex digit in [a-f], the jobs' logs get mixed up together (see the referenced Ray PR). The problem is caused by a line in Ray that differs between v1.9.1 and the master branch, and it seems it will be fixed in v1.10.0. I raised issue #157 for this problem.
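
For reference, the affected cases are exactly the job counts whose hexadecimal form contains a letter digit; a purely illustrative check:

# Purely illustrative: list job counts whose hex form contains [a-f], i.e.
# the cases where the log mixing described above shows up.
affected = [n for n in range(1, 32) if any(c in 'abcdef' for c in f'{n:x}')]
print(affected)  # [10, 11, 12, 13, 14, 15, 26, 27, 28, 29, 30, 31]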

@Michaelvll (Collaborator, Author) commented:
Thanks @concretevitamin for the detailed reviews. I am merging this PR with a few more fixes for tpu_app and job failures. I raised issue #158 for the future TODOs.

@Michaelvll merged commit 5135a50 into master on Jan 10, 2022.
@Michaelvll deleted the job-queue branch on January 11, 2022 01:39.
gmittal pushed a commit that referenced this pull request on Mar 15, 2022:
* Change ray version to 1.9.1

* Add job status in cli

* Check resource-demanding less than cluster resources

* Add jobs table and fix to_provision in write_cluster_config

* Fix cancel jobs

* Nicer job id

* Fix running indicator

* Refactor the job fetching to cloud vm backend

* job queue in descending order

* Add job queue example

* Move job db to cluster to support multitenancy

* Fix util file mounting

* change 'RESERVED' to 'INIT'

* Fix job succeeded status

* Add logging for sky job

* Auto end for logging

* Fix auto exit for log tailing

* Add fixme

* Fix sky cancel

* Fix logging issue

* Fix 1 node job on multinode, fix job cancelling

* docs and format

* Fix write cluster config

* Fix logging path

* Replace ParTask with normal Tasks

* update smoke test for job cancel and logging

* fix file_mounts; wait for logging to finish

* Make job adding thread safe and rename libs

* Fix comments

* Add sky.run and sky.exec

* Replace the sky.execute with sky.run/sky.exec

* change __init__ for run and exec

* fix job_lib function

* Fix comments

* Improve robustness

* Fix TPU and comments

* Fix job failing by checking the returncode

* Format and change the exception