Job Submission: Add support for job queue and management. #134
Conversation
Really excited about this!
Force-pushed from 1ee5566 to 3eb9048.
Done with a complete pass.
Again, extraordinary work! Just a few nits.
assert len(task.resources) == 1, task.resources
launched_resources = handle.launched_resources.fill_accelerators()
task_resources = list(task.resources)[0]
Similar to the other comment
This is tricky and could use a comment (something like: "task may (e.g., sky run) or may not (e.g., sky exec) have undergone sky.optimize()..."?). Also, if task.best_resources is available, should it be used, and otherwise fall back to L63-64?
For this one, we should compare the task's requested resources with the launched ones, instead of best_resources. That is because best_resources unnecessarily narrows down the requested resources (making them more demanding, e.g., pinning a specific cloud, instance_type, etc.).
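The comparison argued for above can be sketched as follows. Resources and its less_demanding_than method here are simplified stand-ins for SkyPilot's internal classes, not the actual API:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Resources:
    # Simplified stand-in for the real Resources class: unset fields
    # mean "no preference" and match anything.
    cloud: Optional[str] = None
    instance_type: Optional[str] = None
    accelerators: Optional[Dict[str, float]] = None

    def less_demanding_than(self, other: 'Resources') -> bool:
        if self.cloud is not None and self.cloud != other.cloud:
            return False
        if (self.instance_type is not None
                and self.instance_type != other.instance_type):
            return False
        for acc, count in (self.accelerators or {}).items():
            if (other.accelerators or {}).get(acc, 0) < count:
                return False
        return True

# The requested resources are loose; best_resources would have pinned a
# cloud and instance type, making this feasibility check needlessly strict.
requested = Resources(accelerators={'V100': 1})
launched = Resources(cloud='gcp', instance_type='n1-highmem-8',
                     accelerators={'V100': 1})
assert requested.less_demanding_than(launched)
```

Checking the loose request against the launched cluster succeeds here, whereas a best_resources pinned to a different cloud or instance type would spuriously fail.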
- def log_dir(job_id: int) -> Tuple[Optional[str], Optional[str]]:
+ def log_dir(job_id: int) -> Tuple[Optional[str], Optional[JobStatus]]:
Return type: can they really be Optional?
They can, since the job_id may fail to be found. :)
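A minimal sketch of why both elements of the return tuple are Optional: a lookup for an unknown job_id has to return something. The table name and columns below are illustrative, not the actual schema:

```python
import sqlite3
from typing import Optional, Tuple

def log_dir(cursor: sqlite3.Cursor,
            job_id: int) -> Tuple[Optional[str], Optional[str]]:
    """Return (log_dir, status) for a job, or (None, None) if unknown."""
    row = cursor.execute(
        'SELECT log_dir, status FROM jobs WHERE job_id = ?',
        (job_id,)).fetchone()
    if row is None:
        # The job_id may not exist, hence the Optional return types.
        return None, None
    return row[0], row[1]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE jobs (job_id INTEGER, log_dir TEXT, status TEXT)')
cur.execute("INSERT INTO jobs VALUES (1, 'sky_logs/job-1', 'RUNNING')")
print(log_dir(cur, 1))   # ('sky_logs/job-1', 'RUNNING')
print(log_dir(cur, 99))  # (None, None)
```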
A logging problem found: whenever the number of jobs submitted contains [a-f] in hex, the jobs' logs get mixed together (Ray PR reference). The problem is caused by this line in Ray: v1.9.1 vs. the master branch. It seems the problem will be solved in v1.10.0. I raised an issue for this problem: #157.
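To make the trigger concrete, this plain-Python check (unrelated to Ray's internals) shows which job counters, rendered in hexadecimal, contain a letter a-f:

```python
def hex_has_letter(n: int) -> bool:
    # True if the hexadecimal representation of n contains a digit a-f.
    return any(c in 'abcdef' for c in format(n, 'x'))

# The first counters to trigger the problem are 10 ('a') through 15 ('f').
triggers = [n for n in range(1, 20) if hex_has_letter(n)]
print(triggers)  # [10, 11, 12, 13, 14, 15]
```

So the mix-up is hit quickly in practice: the tenth job submitted to a cluster already has an 'a' in its hex counter.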
Thanks @concretevitamin for the detailed reviews! I am merging this PR with some more fixes for the tpu_app and job failure. I raised an issue for the future TODOs: #158.
* Change ray version to 1.9.1
* Add job status in cli
* Check resource-demanding less than cluster resources
* Add jobs table and fix to_provision in write_cluster_config
* Fix cancel jobs
* Nicer job id
* Fix running indicator
* Refactor the job fetching to cloud vm backend
* job queue in descending order
* Add job queue example
* Move job db to cluster to support multitenancy
* Fix util file mounting
* change 'RESERVED' to 'INIT'
* Fix job succeeded status
* Add logging for sky job
* Auto end for logging
* Fix auto exit for log tailing
* Add fixme
* Fix sky cancel
* Fix logging issue
* Fix 1 node job on multinode, fix job cancelling
* docs and format
* Fix write cluster config
* Fix logging path
* Replace ParTask with normal Tasks
* update smoke test for job cancel and logging
* fix file_mounts; wait for logging to finish
* Make job adding thread safe and rename libs
* Fix comments
* Add sky.run and sky.exec
* Replace the sky.execute with sky.run/sky.exec
* change __init__ for run and exec
* fix job_lib function
* Fix comments
* Improve robustness
* Fix TPU and comments
* Fix job failing by checking the returncode
* Format and change the exception
This PR aims to provide better support for job queues, built on ray job.

Main changes:
* sky logs for tracking logs.

Added features:
* sky queue <cluster_name> [--all] (--all shows all the tasks including the finished ones; sky queue <cluster_name> will only show each finished task once).
* sky cancel -c <cluster_name> [<job_id>|--all]
* sky logs -c <cluster_name> <job_id>: tail the log of a job in real time.
* sky exec -c <cluster_name> yaml_file --detach
TODOs:
* examples/job_queue/cluster.yaml; examples/job_queue/job.yaml (two jobs with V100: 0.5)
* examples/job_queue/multinode.yaml; examples/job_queue/multinode_job.yaml (two jobs with 2x K80: 0.5)
* examples/job_queue/multinode.yaml with 2x K80; examples/job_queue/job.yaml (two jobs with 1x K80: 0.5; 4 jobs in parallel)
* examples/run_smoke_tests.sh
Future:
* post_setup_fn design for distributed training.
* K80: 0.5 should round to K80: 1.
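The rounding item above could look like this when validating a job's demand against a cluster. This is a hedged sketch; round_up_accelerators is a hypothetical helper, not SkyPilot code:

```python
import math
from typing import Dict

def round_up_accelerators(demand: Dict[str, float]) -> Dict[str, int]:
    """Round fractional accelerator demands up to whole devices.

    A job asking for K80: 0.5 still needs at least one physical K80
    on the cluster, so a feasibility check should use the ceiling.
    """
    return {acc: math.ceil(count) for acc, count in demand.items()}

print(round_up_accelerators({'K80': 0.5}))   # {'K80': 1}
print(round_up_accelerators({'V100': 2.0}))  # {'V100': 2}
```

The fractional count is still meaningful for sharing a device between jobs; the rounding only applies when checking that the cluster has enough physical accelerators.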