
Job Submission: Add support for job queue and management. #134

Merged: 62 commits merged into master from job-queue on Jan 10, 2022.

Conversation

@Michaelvll (Collaborator) commented Jan 5, 2022

This PR provides better support for job queues by building on Ray's job submission (ray job).

Main changes:

  1. Replace the program submission logic with ray job.
  2. New resource checking: task resources <= cluster.launched_resources.
  3. New CLIs for the job queue.
  4. Job queue information is stored in a per-cluster database (for multitenancy), backed by sqlite3; a minimal schema sketch follows this list.
  5. sky logs for tracking job logs.
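
The job table lives on the cluster itself so that multiple users share one queue. As a rough illustration only, a sqlite3-backed table could look like the sketch below; the table name, columns, and helper functions are assumptions for illustration, not the actual schema or API introduced by this PR.

# Illustrative sketch only: table name, columns, and helpers are assumptions,
# not the actual schema/API added in this PR.
import sqlite3
import time

_CONN = sqlite3.connect('jobs.db')
_CONN.execute("""CREATE TABLE IF NOT EXISTS jobs (
    job_id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT,
    submitted_at REAL,
    status TEXT,
    log_path TEXT)""")
_CONN.commit()


def add_job(username: str, log_path: str) -> int:
    """Record a newly submitted job as PENDING and return its job id."""
    cur = _CONN.execute(
        'INSERT INTO jobs (username, submitted_at, status, log_path) '
        'VALUES (?, ?, ?, ?)', (username, time.time(), 'PENDING', log_path))
    _CONN.commit()
    return cur.lastrowid


def set_status(job_id: int, status: str) -> None:
    """Advance a job through its lifecycle (PENDING/RUNNING/SUCCEEDED/...)."""
    _CONN.execute('UPDATE jobs SET status = ? WHERE job_id = ?',
                  (status, job_id))
    _CONN.commit()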

Added features:

  1. sky queue <cluster_name> [--all] (--all shows all jobs, including finished ones; without --all, each finished job is shown only once).
> sky queue test --all
Sky Job Queue of Cluster: test
+-----+-------------+-----------+-----------------------------------------+
| JOB |  SUBMITTED  |   STATUS  | LOG                                     |
+-----+-------------+-----------+-----------------------------------------+
| 117 | 31 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-15-57-40-379226 |
| 116 |  4 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-53-14-428524 |
| 115 |  5 mins ago |  STOPPED  | sky_logs/sky-2022-01-05-15-52-29-825712 |
| 114 |  5 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-52-12-210650 |
| 113 | 20 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-15-38-02-089872 |
+-----+-------------+-----------+-----------------------------------------+
  2. sky cancel -c <cluster_name> [<job_id>|--all]: cancel a job (or all jobs) on the cluster.
  3. sky logs -c <cluster_name> <job_id>: tail the log of a job in real time.
  4. Detached execution: sky exec -c <cluster_name> <yaml_file> --detach (a possible end-to-end workflow is shown right after this list).
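
Putting the new commands together, a possible end-to-end workflow (the cluster name, YAML file, and job id reuse the examples in this description):

> sky exec -c test examples/job_queue/job.yaml --detach
> sky queue test
> sky logs -c test 117
> sky cancel -c test 117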

TODOs:

  • Tested examples/minimal.yaml with multiple jobs (including accelerators).
  • Better job name for user's convenience.
  • Tested single nodes/multiple jobs with less demanding resources. (examples/job_queue/cluster.yaml; examples/job_queue/job.yaml: two jobs with V100: 0.5)
Sky Job Queue of Cluster: test
+-----+-------------+-----------+-----------------------------------------+
| JOB |  SUBMITTED  |   STATUS  | LOG                                     |
+-----+-------------+-----------+-----------------------------------------+
| 124 | 12 secs ago |  PENDING  | sky_logs/sky-2022-01-05-16-14-49-575592 |
| 123 | 20 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-16-14-41-608605 |
| 122 | 29 secs ago |  RUNNING  | sky_logs/sky-2022-01-05-16-14-30-905822 |
| 121 |  2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-05-16-12-38-642178 |
+-----+-------------+-----------+-----------------------------------------+
  • Pipe logs and add hints.
  • Test a multi-node job on multiple nodes. (examples/job_queue/multinode.yaml; examples/job_queue/multinode_job.yaml: two jobs with 2x K80: 0.5)
  • Multiple nodes with single-node job. (examples/job_queue/multinode.yaml with 2x K80; examples/job_queue/job.yaml: two jobs with 1x K80: 0.5: 4 jobs in parallel)
  • Multiple nodes with single-node job and multi-node job.
  • Distributed training with multiple nodes.
  • Test examples/run_smoke_tests.sh
  • Deprecate ParTask and rewrite the ParTask examples.

Future:

  • (! Important) Fix job execution order. (Ray does not seem to preserve the FIFO order of the job queue; see the out-of-order queue below.)
Sky Job Queue of Cluster test2
+-----+------+------------+-----------+-----------------------------------------+
| JOB | USER | SUBMITTED  |   STATUS  | LOG                                     |
+-----+------+------------+-----------+-----------------------------------------+
| 103 | zhwu | 2 mins ago |  PENDING  | sky_logs/sky-2022-01-06-02-36-36-737105 |
| 105 | zhwu | 1 min ago  |  RUNNING  | sky_logs/sky-2022-01-06-02-37-03-308653 |
| 104 | zhwu | 1 min ago  |  RUNNING  | sky_logs/sky-2022-01-06-02-36-50-699314 |
| 102 | zhwu | 2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-06-02-36-23-662007 |
| 101 | zhwu | 2 mins ago | SUCCEEDED | sky_logs/sky-2022-01-06-02-36-06-512089 |
+-----+------+------------+-----------+-----------------------------------------+
  • Fix executing a job with num_nodes=2 on a cluster with 3 nodes. (This is blocked by the post_setup_fn design for distributed training.)
  • Fix the log downloading logic. (When should we download the logs? Zongheng: maybe don't.)
  • Launching a cluster for a task with K80: 0.5 should round up to K80: 1 (a rounding sketch follows this list).
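
The last item could be handled by taking the ceiling of any fractional accelerator request at provisioning time; the helper below is only an illustrative sketch, not code from this PR.

# Illustrative sketch: round fractional accelerator demands up to whole
# devices when launching a cluster (e.g. {'K80': 0.5} -> {'K80': 1}), while
# the fractional demand can still be used to pack multiple jobs per device.
import math
from typing import Dict


def round_up_accelerators(accelerators: Dict[str, float]) -> Dict[str, int]:
    return {name: math.ceil(count) for name, count in accelerators.items()}


assert round_up_accelerators({'K80': 0.5}) == {'K80': 1}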

@gmittal (Collaborator) commented Jan 5, 2022

Really excited about this

@Michaelvll force-pushed the job-queue branch 2 times, most recently from 1ee5566 to 3eb9048, on January 7, 2022 00:57.
@Michaelvll changed the title from "Job Queue: Refactor the CloudVMRayBackend with ray job" to "Job Submission: Add support for job queue and management." on Jan 9, 2022.
@concretevitamin (Collaborator) left a comment:

Done with a complete pass.

Resolved review comments on: prototype/sky/backends/backend_utils.py, prototype/sky/backends/local_docker_backend.py, prototype/sky/task.py, prototype/sky/backends/remote_libs/log_lib.py, and prototype/sky/backends/cloud_vm_ray_backend.py (several threads).
@concretevitamin (Collaborator) left a comment:

Again, extraordinary work! Just a few nits.

Resolved review comments on: prototype/examples/multi_echo.py and prototype/sky/backends/cloud_vm_ray_backend.py.
assert len(task.resources) == 1, task.resources

launched_resources = handle.launched_resources.fill_accelerators()
task_resources = list(task.resources)[0]
@concretevitamin (Collaborator) commented on the snippet above:

Similar to the other comment.

This is tricky and could use a comment (something like: "the task may (e.g., sky run) or may not (e.g., sky exec) have undergone sky.optimize()..."?). Also, if task.best_resources is available, should we use it, and otherwise fall back to L63-64?

@Michaelvll (Collaborator, Author) replied:

For this one, we should compare the task's requested resources with the launched ones, instead of best_resources. That is because best_resources does an unnecessary narrowing-down of the requested resources (making them more demanding, e.g., a specific cloud, instance_type, etc.).
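
As a rough sketch of that check (comparing the requested resources against the cluster's launched resources); less_demanding_than is an assumed helper name for illustration, not necessarily the PR's actual API:

# Illustrative sketch: `less_demanding_than` is an assumed comparison helper,
# not necessarily the method used in this PR.
def check_task_fits_cluster(task, handle) -> None:
    assert len(task.resources) == 1, task.resources
    launched_resources = handle.launched_resources
    task_resources = list(task.resources)[0]
    # Compare the *requested* resources against what was actually launched.
    # task.best_resources would be too strict here: optimization narrows the
    # request down to a specific cloud / instance_type, which a `sky exec`
    # task has not necessarily gone through.
    if not task_resources.less_demanding_than(launched_resources):
        raise ValueError(
            f'Task requires {task_resources}, which is more demanding than '
            f'the launched cluster resources {launched_resources}.')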

Resolved review comments on: prototype/sky/backends/remote_libs/job_lib.py (two threads).


- def log_dir(job_id: int) -> Tuple[Optional[str], Optional[str]]:
+ def log_dir(job_id: int) -> Tuple[Optional[str], Optional[JobStatus]]:
@concretevitamin (Collaborator) commented on the signature change above:

Return type: they can't be Optional, can they?

@Michaelvll (Collaborator, Author) replied:

They can, since the job_id may fail to be found. :)
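
In other words, when the job_id is not found there is neither a log directory nor a status to return, so both elements of the tuple are Optional. A minimal sketch (the table shape mirrors the illustrative jobs table sketched in the PR description, not the actual job_lib code):

import sqlite3
from enum import Enum
from typing import Optional, Tuple


class JobStatus(Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    STOPPED = 'STOPPED'


# Illustrative in-memory jobs table with the same shape as the earlier sketch.
_CONN = sqlite3.connect(':memory:')
_CONN.execute('CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, '
              'status TEXT, log_path TEXT)')


def log_dir(job_id: int) -> Tuple[Optional[str], Optional[JobStatus]]:
    """Return (log_path, status) of a job, or (None, None) if it is unknown."""
    row = _CONN.execute(
        'SELECT log_path, status FROM jobs WHERE job_id = ?',
        (job_id,)).fetchone()
    if row is None:
        # The job id was never recorded, so neither value exists -- hence
        # both elements of the return type are Optional.
        return None, None
    log_path, status = row
    return log_path, JobStatus(status)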

@Michaelvll (Collaborator, Author) commented Jan 9, 2022

A logging problem was found: whenever the number of jobs submitted contains a hex digit in [a-f], the jobs' logs get mixed up together (see the referenced Ray PR). The problem is caused by a line in Ray that differs between v1.9.1 and the master branch, and it seems it will be fixed in v1.10.0. I raised issue #157 for this problem.
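
For reference, the affected cases are exactly the job counts whose hexadecimal form contains a letter digit; a purely illustrative check:

# Purely illustrative: list job counts whose hex form contains [a-f], i.e.
# the cases where the log mixing described above shows up.
affected = [n for n in range(1, 32) if any(c in 'abcdef' for c in f'{n:x}')]
print(affected)  # [10, 11, 12, 13, 14, 15, 26, 27, 28, 29, 30, 31]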

@Michaelvll (Collaborator, Author) commented:
Thanks @concretevitamin for the detailed reviews. I am merging this PR with a few more fixes for tpu_app and job failures. I raised issue #158 for the future TODOs.

@Michaelvll merged commit 5135a50 into master on Jan 10, 2022.
@Michaelvll deleted the job-queue branch on January 11, 2022 01:39.
gmittal pushed a commit that referenced this pull request on Mar 15, 2022:
* Change ray version to 1.9.1

* Add job status in cli

* Check resource-demanding less than cluster resources

* Add jobs table and fix to_provision in write_cluster_config

* Fix cancel jobs

* Nicer job id

* Fix running indicator

* Refactor the job fetching to cloud vm backend

* job queue in descending order

* Add job queue example

* Move job db to cluster to support multitenancy

* Fix util file mounting

* change 'RESERVED' to 'INIT'

* Fix job succeeded status

* Add logging for sky job

* Auto end for logging

* Fix auto exit for log tailing

* Add fixme

* Fix sky cancel

* Fix logging issue

* Fix 1 node job on multinode, fix job cancelling

* docs and format

* Fix write cluster config

* Fix logging path

* Replace ParTask with normal Tasks

* update smoke test for job cancel and logging

* fix file_mounts; wait for logging to finish

* Make job adding thread safe and rename libs

* Fix comments

* Add sky.run and sky.exec

* Replace the sky.execute with sky.run/sky.exec

* change __init__ for run and exec

* fix job_lib function

* Fix comments

* Improve robustness

* Fix TPU and comments

* Fix job failing by checking the returncode

* Format and change the exception