Auto stop for cluster #653

Merged
merged 40 commits into from
Mar 31, 2022

@Michaelvll Michaelvll commented Mar 26, 2022

This PR enables auto-stopping for clusters. Design Doc

Main Changes:

  1. sky autostop cluster-name -i 30: auto-stop the cluster after 30 minutes of inactivity.
  2. Refactor the skylet daemon to be event-based.
  3. Add an autostop event to skylet.
  4. Show the auto-stop setting in sky status.
  5. Update the status of clusters that have been auto-stopped.
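
To illustrate the event-based refactor in items 2 and 3, here is a minimal sketch, not the code in this PR. The names SkyletEvent, AutostopEvent, and run_daemon, the interval values, and the injected get_idle_seconds / stop_cluster callables are all hypothetical. The only ideas taken from the description are that the daemon periodically runs events and that one of them checks for idleness.

```python
import time


class SkyletEvent:
    """Hypothetical base class: runs `_run` at most once per interval."""
    interval_seconds = 20  # assumed value, for illustration only

    def __init__(self):
        self._last_run = None

    def step(self, now):
        """Run the event if it is due; return True if it ran."""
        due = (self._last_run is None or
               now - self._last_run >= self.interval_seconds)
        if due:
            self._last_run = now
            self._run()
        return due

    def _run(self):
        raise NotImplementedError


class AutostopEvent(SkyletEvent):
    """Hypothetical autostop check: stop once idle for `idle_minutes`."""
    interval_seconds = 60  # assumed value, for illustration only

    def __init__(self, idle_minutes, get_idle_seconds, stop_cluster):
        super().__init__()
        self.idle_minutes = idle_minutes
        self._get_idle_seconds = get_idle_seconds
        self._stop_cluster = stop_cluster

    def _run(self):
        if self.idle_minutes < 0:
            return  # a negative value is assumed to mean "autostop not set"
        if self._get_idle_seconds() >= self.idle_minutes * 60:
            self._stop_cluster()


def run_daemon(events, tick_seconds=1.0, max_ticks=None):
    """Poll every event once per tick; `max_ticks=None` loops forever."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        now = time.time()
        for event in events:
            event.step(now)
        time.sleep(tick_seconds)
        ticks += 1
```

Passing the idle-time probe and the stop action in as callables keeps the sketch testable without a real cluster.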

New API:

  1. sky autostop cluster_name1 cluster_name2: the clusters will stop as soon as they are idle.
  2. sky autostop cluster_name1 cluster_name2 -i t: the clusters will stop after t minutes of inactivity.
  3. sky autostop cluster_name1 cluster_name2 --cancel: cancel autostop on the given clusters.
  4. sky status --refresh: refresh the status of auto-stopped clusters.
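
The three sky autostop forms above can be read as one idle-minutes value. The following is a rough sketch of that reading, not the CLI code in this PR. The function name resolve_idle_minutes is hypothetical, and using -1 as the "no autostop" value is an assumption borrowed from the review discussion of this PR.

```python
from typing import Optional


def resolve_idle_minutes(idle_minutes: Optional[int], cancel: bool) -> int:
    """Map the (-i, --cancel) options to a single idle-minutes value.

    Hypothetical helper: -1 is assumed to mean "autostop disabled".
    """
    if cancel:
        if idle_minutes is not None:
            raise ValueError('-i and --cancel cannot be used together')
        return -1  # cancel any autostop setting
    if idle_minutes is None:
        return 0  # no -i given: stop as soon as the cluster is idle
    if idle_minutes < 0:
        raise ValueError('idle minutes must be non-negative')
    return idle_minutes
```

For example, resolve_idle_minutes(30, cancel=False) corresponds to sky autostop cluster_name -i 30.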

Problem:

  1. sky status will be slower if one of the clusters has autostop set: 0.767 s without --refresh vs. 4.369 s with --refresh in the runs below.
(sky-dev) ➜  sky-experiment-dev (auto-stop) time sky status                                                            
NAME               LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                                    
test-auto-azure-4  21 mins ago  4x Azure(Standard_D8_v4)  INIT    -         sky launch -c test-auto-azure-4 ...        
test-auto-4        15 mins ago  4x GCP(n1-highmem-8)      UP      -         sky launch -c test-auto-4 --cloud gcp ...  
sky status  1.31s user 0.33s system 213% cpu 0.767 total

(sky-dev) ➜  sky-experiment-dev (auto-stop) time sky status --refresh
NAME               LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                                    
test-auto-azure-4  21 mins ago  4x Azure(Standard_D8_v4)  INIT    -         sky launch -c test-auto-azure-4 ...        
test-auto-4        15 mins ago  4x GCP(n1-highmem-8)      UP      0 min     sky launch -c test-auto-4 --cloud gcp ...  
sky status  3.42s user 0.84s system 97% cpu 4.369 total
  2. The status shown by sky status will be set to STOPPED before all the nodes are stopped. (expected)

TODO:

  • Add tests
  • Add documentation

Tested:

  • Configure auto-stopping
  • Auto-stop after idle_minutes
  • Restarting the cluster resets the auto-stop setting
  • sky status when the remote instance has been stopped by auto-stopping
  • sky start after the cluster has been stopped by auto-stopping
  • run the following commands for auto-stopped clusters:
    • sky queue
    • sky status
    • sky autostop cluster_name -i 0: correctly skipped
    • sky exec cluster_name -- gpustat
    • sky launch -c cluster_name yaml
    • sky logs
  • Rendered locally

@concretevitamin concretevitamin left a comment

Incomplete look, will finish reading later - this is exciting @Michaelvll !

@concretevitamin concretevitamin left a comment

Done with the first pass.

Question: after a cluster has been autostopped and the user runs sky start, is there code that resets its autostop value to -1 minutes?

@Michaelvll
Collaborator Author

> Question: after a cluster has been autostopped and the user runs sky start, is there code that resets its autostop value to -1 minutes?

The local autostop value will be set to -1 when the client finds the cluster in STOPPED status.

The remote autostop config stores a boot_time together with the autostop idle_minutes. If the stored boot_time is not equal to the current boot time, the autostop idle_minutes will be disregarded.
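
As a rough sketch of the boot_time check described above (not the actual skylet code): the AutostopConfig dataclass and the effective_idle_minutes helper are hypothetical names, and returning -1 for a stale config is an assumption that mirrors the -1 value mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class AutostopConfig:
    """Hypothetical remote autostop record."""
    idle_minutes: int
    boot_time: float  # boot time of the machine when autostop was set


def effective_idle_minutes(config: AutostopConfig,
                           current_boot_time: float) -> int:
    """Return the idle minutes to honor, or -1 if the config is stale.

    A config written before the most recent reboot (for example before
    a sky start on a stopped cluster) carries an older boot_time, so it
    is disregarded.
    """
    if config.boot_time != current_boot_time:
        return -1  # assumed sentinel for "no autostop"
    return config.idle_minutes
```

On a real machine the current boot time could be obtained with something like psutil.boot_time(); it is passed in as a plain argument here so the sketch has no third-party dependency.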

@Michaelvll Michaelvll marked this pull request as ready for review March 29, 2022 20:25

@concretevitamin concretevitamin left a comment

Just a few nits!

@concretevitamin concretevitamin left a comment

LGTM!

Michaelvll commented Mar 31, 2022

Thank you @concretevitamin for the detailed reviews! I think the current version is good to ship. I am running tests/run_smoke_tests.sh one final time before merging.

  • tests/run_smoke_tests.sh (except azure_start_stop due to subscription issue)

@Michaelvll Michaelvll changed the title from "[WIP] Auto stop for cluster" to "Auto stop for cluster" Mar 31, 2022
@Michaelvll Michaelvll merged commit 569a09e into master Mar 31, 2022
@Michaelvll Michaelvll deleted the auto-stop branch March 31, 2022 06:58
michaelzhiluo added a commit that referenced this pull request Jun 6, 2022
* Fixes for document and README (#360)

* Update the installation and tutorial

* fix

* Add description in README

* fix

* Add link to examples

* Polish readme

* Update readme

* readme

* Fix comment

* Update links in documentation and README (#366)

* Improve adaptors (#359)

* gather adaptors

* give instructions

* Limit grpcio version for #369 (#370)

* Limit grpcio version

* rename disk_size

* format

* Remove typo from README (#371)

* Fix Azure credential errors on AWS (#372)

* fix typo (#376)

* Restore README to the original place (#374)

* restore README to the original place

* Support 'name:cnt' accelerators spec in YAML (#396)

* Support 'name:cnt' accelerators spec in YAML

* Fixes #373: 'sky start/down' should error out

* README example: change `K80:4` to `V100:1` (#399)

Reasons:
- `K80:4` is not available on AWS, arguably the most common cloud our target users use (so they will hit resource unavailable)
- `V100:1` is available on all three clouds and is a popular GPU

* Workdir Docs (#405)

* Ok

* Addressed all comments

* Changed to new git link

* ok

* Fix `check_local_gpus` (#411)

* Fix check_local_gpus

* Break a line to meet 80 char constraint

* Address the review comments

* Sky Storage CLI (#338)

* Initial Draft

* Delete bad file

* ?

* ???

* format

* sky storage status; sky storage down

* Fixed

* Done

* Addressed Comments

* Addressed Comments, TODO: Documentation

* Documentation added

* ok

* Sky Storage CLI - Polishing (#414)

* Initial Draft

* Delete bad file

* ?

* ???

* format

* sky storage status; sky storage down

* Fixed

* Done

* Addressed Comments

* Addressed Comments, TODO: Documentation

* Documentation added

* ok

* Addressed Zhanghao's comments

* Fix

* SGTM

* Decouple `ray up` and user's file_mounts / setup (#407)

* wip: Add setup in provision pipeline

* Fix gcp/azure

* remove useless variables

* minor fix

* Add some TODOs

* Fix comments

* Fix comments

* Fix gcp/azure initialization_commands

* Remove setup from template

* Fix setup directory

* Change rsync back to -Pavz

* Remove unused argument

* fix file_mount/dir_mount

* Minor fix (#419)

* Revert to using file_mounts for object mounting (#412)

* WIP Debug

* revert file_mounts using storage mounts and update docs

* remove print

* Fix credential cmd

* lint

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* [docker] Quick fix to remove already-dead docker containers from `sky status` (#427)

* A quick fix in killing sky docker containers

* Add a comment

* Update documentation with provisioning quickstart (#415)

* Add improved progress meters + log suppression (#425)

* Add cross-cloud failover to docs. (#433)

* Add document for `bash script.sh` cannot do `conda activate` problem (#422)

* Add conda activate support to bashrc

* Add doc and make sure conda activate works

* bring back conda activate command for GCP

* Move comment to quickstart

* format

* Fix comments

* Add test/example of using user_script

* Fix indents

* bash -i only for conda activate

* Fix the SKY_NODE_IPS fail to pass to the shell script

* Update readme

* update env_check

* Fix comments

* Change to head -n1

* Remove `-i` for `bash user_script.sh` (#436)

* Remove -i option

* Fix docs

* Fix run.sh

* add comments

* fix comment

* format

* Fix Azure key generation (#429)

* Fix az key generation

* Add private keys to retval

* [Storage/File-Mounting] Fix Symlink Issue + Fix File Mounting (#431)

* ok

* Shorten YAML

* Ok

* Done

* Nit added

* Romil's changes

* Fix error message when azure-cli is not installed (#424)

* Azure import

* Add docs.

* Fixing Relative Directory for Workdir and File Mounts (`~/sky_workdir/...` not `~/sky_workdir/workdir/...`) (#443)

* Fix

* Resnet Example working

* Fix

* Fix gitignore for rsync (#448)

* Fix gitignore in file_mounts

* Add test

* Update j2 for filter

* format

* Remove tracebacks for exceptions to improve UX (#441)

* Remove tracebacks

* Fix job fail color

* fix comments

* Hide tracebacks

* Fix #442

* fix `workdir` becomes `~/sky_workdir/workdir` #442

* add logging error for job_id problem

* format

* update error message for retry

* Update docs

* Fix login

* Add more checks

* format

* fix return type

* format

* refactor returncode handling

* Update return handling

* Fix filemount testing

* Fix delayed logs of `sky logs` locally vs `tail -f` remotely (#454)

* Switch to ray job logs

* Optimize log tailing

* format

* Fix exec logging

* Add comment

* Move back to run_with_log

* Bring back our tailing function for progress bar

* format

* Fix comments

* Remove check argument from run_with_log

* Add comment

* Add comment

* lint

* [Storage/File Mount] Check `workdir` and Filemount `src` Size (#440)

* Check  Workdir size

* Temp

* Done

* Addressed comments

* Fix assertion on Azure's A100/V100 and fuzzy resource search (#368)

* Fix assertion on Azure's A100/V100

* Fix azure catalog

* Add simple fuzzy search for accelerators

* Fix issues and add colors

* Update azure.csv to fix bug

* Only keep one candidate, cleaner msg

* yapf

* Make one line msg

* refactor cloud check

* Address comments and add more hints

* yapf

* fix space

* Address comments

* Fix using_file_mounts (#465)

* Improve docs / logging messages. (#466)

* Minor: improve setup logging.

* CLI messages fix.

* Update docs on code sync, artifacts

* Minor: remove a new line from a message

* Update docs

* Fix `sky logs` with interactive ssh causing job status wrongly set if ctrl-c'ed (#473)

* Add warning for ctrl-c

* Make `sky logs` read only

* Fix logs

* add comment

* Pricing information for `sky show-gpus` (#472)

* [Azure] Downgrade the image version for K80 instances (#460)

* Downgrade the image version for Azure K80 instances

* Remove a reference link

* Minor refactoring

* Consider non-gpu instances

* Add clouds=azure

* Minor fix

* Remove unnecessary if statement

* Minor fix

* Minor fix in var name

* Minor fix

* Address comments

* Centralize sky_logs, polish logging, and update `sky start` documentation (#432)

* Gcloud Authentication Bug Fix (#437)

* testing

* Fix

* No retry

* fix tail_logs (#476)

* Mention .gitignore for workdir sync in docs (#470)

* Suppress rsync output (#459)

* Suppress rsync output

* Fix spaces

* fix path

* fix

* Remove rsync logs

* fix logging

* Change order

* Small fix for the cloud_stores cli installation (#477)

* Refactor cloud storage

* Test aws before pip install

* refactor

* Add test for s3 bucket

* fix comment

* Revert "Small fix for the cloud_stores cli installation" (#479)

* Revert "Small fix for the cloud_stores cli installation (#477)"

This reverts commit 1a0bcc141c5cf3892ffa116eaf24e7fdb6bd7c5a.

* Add back the aws cli check

* Add back s3:// check in file_mounts

* format

* Fix hint for multiple instance candidates (#475)

* Add confirmation prompt for cluster management operations (#471)

* Minor fix for sky launch --gpus tpu-* (#481)

* Docs: polish installation & quickstart (#478)

* Update docs installation / README

* quickstart polish

* Polish initial messages

* Spelling

* Updates

* Fix comments

* Fix prompting for launching on a stopped cluster. (#487)

* Fix credential mounting (#483)

* Fix credential mounting

* format

* Change back to ~/.config/gcloud

* Add exclude files in gcp credential mounting

* format

* Change using_file_mounts to multinodes to let it check more things

* Fix docs

* format

* Fix gcp installation hints

* Parallel Setup + Filemounting (#458)

* Parallel Setup

* Done

* Done

* Fix color

* Fix

* wow

* Better parallel solution

* Better

* using imap

* Tested with failed setup

* format

* Update the workdir uploading logic

* format

* Fix indent

* Fix comments

* Update log

* Remove context manager

* Change logging

* Fix doc

* Use context manager

* Add exception

* Remove num_threads limit

* Update comment

* Format

* lint

* Add timing

* format

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Fix Azure Promo Instances (#485)

* Fix azure data fetcher

* Slightly safer

* Add safeguard for missing price

* Fix

* Simplify

* Better impl

* Polish workdir/file_mounts validation and logging. (#495)

* Polish workdir/file_mounts validation and logging.

* Fix cloud URIs being displayed with an extra slash

Previous:
   gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024/ -> /train-00001-of-01024
Now:
   gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024 -> /train-00001-of-01024

* Fail early for non-existent local file mount sources.

* !r

* Docs: polish quickstart, interactive-nodes (#491)

* Polish interactive-nodes.rst

* Polish quickstart, interactive-nodes

* Address comments

* Address comments

* Fix rsync_exclude for gcp credential file_mounts (#496)

* Fix rsync_exclude for gcp credential file_mounts

* fix comment

* address comment

* Fix: return when there are no matched clusters (#500)

* Make GPU/TPU names case-insensitive (#463)

* task.py: validate workdir by expanding full path (#505)

* Revamp docs (getting started; use cases). (#503)

* Revamp docs (getting started; use cases).

* task.py: validate workdir by expanding full path

* romil comments

* fixes

* comments

* zhanghao's comment

Co-authored-by: Romil <romil.bhardwaj@gmail.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Fix ulimit for provision (#512)

* fix ulimit for aws

* update templates for ulimit

* Add comment

* README: remove quick example.  (#521)

* README: remove quick example.  Added notes for developers.

* Remove pip install azure-cli==2.30.0 (already in setup.py)

* Reword

* [docker] Change remote working directory into `/sky_workdir` (#497)

* Fix docker workdir name

* Revert the change in create_dockerfile

* Minor fix

* Fix Azure resource leak (#501)

* fix

* comment and lint

* Timeout for multi-node launching (#504)

* add timeout for waiting cluster ready

* Timeout on launching nodes

* Fix gcp azure upscaling speed

* Timeout only checks node launching

* fix logging

* format

* Redirect the ray status to log file as well

* format

* replace ray exec

* fix comment

* add comment

* Fix ssh_credential

* Fix ip fetching

* Fix count _worker logic

* Refactor

* fix

* format

* bug with yapf

* add comment

* remove unused import

* wording

* Add retry for head ip fetching at the first time

* Fix get head ip

* fix

* fix

* fix assert

* address comments

* format

* change back to info mode

* address comments

* fix color

* [docker] Fix a bug in updating image list (#525)

* Fix help str for aws/azure credential check (#532)

* Fix help str for azure credential check

* Fix aws check

* format

* Fix TPU logging when failed to launch (#526)

* Fix tpu logging info

* format

* fix index

* fix

* change to zone_str

* Add progress for TPU launching

* change back to original logging

* Fix console

* Add clear

* move console back

* Indent

* restore unnecessary changes

* format

* Fix aws apt install (#529)

* Add kill dpkg

* Add comment

* optimize a bit

* Add tree back to using_file_mounts

* Docs: more polishing. (#523)

* Polish grid-search.rst

* Polish more; address #503 comments

* Revamp - Syncing Code and Artifacts

* Address comments

* Polish

* Add copy button to code blocks in documentation (#534)

* Fix non-interactive SSH commands in task setup (#533)

* Simple SSH into Multi-node Workers from Local (#469)

* Done

* format.sh

* temp

* Done

* Remote SSH

* Fix

* Addressed comments

* Addressed Zhanghao's edge case

* ok

* Fix

* Docs: significantly de-ink quickstart (#539)

* Docs: significantly de-ink quickstart

* More polishing.

* deleted:    source/reference/iterative-development.rst

* Address comments.

* Revert "Simple SSH into Multi-node Workers from Local (#469)" (#552)

This reverts commit d2774206e1eedb1b79a22cab06b2ccc1df807691.

* Fix the dpkg lock for apt install (#554)

* Disable unattended-upgrade

* minor: set ray version constraint

* Disable unattended-upgrades for azure as well

* Address comments

* fix comments

* Added FileLock on cluster launch and teardown (#510)

* added FileLock for cluster launch

* Run yapf and pylint

* made suggested changes to locking during provisioning and teardown

* run format

* removed use of pathlib

* test to see if updating filelock version works

* updated version to >=

* disabled pylint warning for filelock & left notes to upgrade when python version is upgraded

* applied formatting

* moved pylint disable to block level

* fixed what pylint ignored

* use handle.cluster_name

* Polish quickstart. (#560)

* Fix the lock logic for provision (#562)

* Fix the lock logic for provision.

* format

* lint

* Improve instance hints by moving messages to optimizer & speedup sky commands (#493)

* Improve instance hints by moving to optimizer

* Fix

* Fix

* Fix cpunode

* Remove traceback message

* Merge master and improve speed

* Fix tests

* Fix test and uppercase TPU

* case-insensitive check for string matching

* yapf

* Fix

* simplify

* Address comments

* Refactor

* Fix message and types

* AWS Ray Setup - Alias Python=Python3 if Alias does not exist (#558)

* Alias python for python3

* ok

* Pip

* ok

* Improve optimizer message (#571)

* Fix sky behavior when weird behavior of `ray status` for GCP occur (#574)

* Fix Weird behavior of `ray status` for GCP #573

* fix comment

* address comment

* Allowed forced teardown when called within locked code (#582)

* Allowed forced teardown when called within locked code

* hopefully fixed pylint issue

* removed forced down from try catch for timeout

* Set $TPU_NAME during provisioning (#587)

* Sky Installation for Mac <1.15 Warnings + Doc (#569)

* Mac fix

* Fix

* Fix

* done

* Ok

* quotes

* Fix prompt, add fallback retry for INIT cluster (#559)

* Fix loggings and fallback logic

* Fix fallback

* Fix loggings

* format

* Rename the variable

* format

* Add assert

* Fix dryrun

* remove assert

* address comments and handle UP status

* format

* Add comments

* lint

* fix lock for cluster status change

* format

* address comment

* address comment

* fix smoke tests

* Add a TODO

* fix stopped multi-node being terminated

* format

* fix function name

* stop/terminate for head_failed as well

* Query ip error

* format

* format

* add comment

* status back to up

* format

* add back assert

* fix tpu merge issue

* Fix CLI not installed for `sky check` (#592)

* add hint for installing gcloud

* fix indent

* Hint in sky check as well

* fix subprocess output

* Fix azure/gcp check

* fix import

* Fix azure check

* Remove output

* Address comment

* update info

* remove tpu gcloud dependency, due to the fix of `sky check`

* update optimizer info

* format

* Upgrade aws ami and add back us-west-1 (#564)

* Upgrade aws ami

* fallback to lower nvidia driver version for K80

* remove print

* add back us-west-1

* Fix order of setup (#593)

* unique cluster name list for sky down (#596)

* SSH into Worker Nodes from Local (#557)

* Done

* format.sh

* temp

* Done

* Remote SSH

* Fix

* Addressed comments

* Addressed Zhanghao's edge case

* ok

* Fix

* Fix

* Fix

* Doc changes

* Comments incorporated

* Glob cluster name for sky down (#598)

* glob search for `sky down`

* add doc

* format

* address comments

* Fix cuda version for tf training (#603)

* Add `conda init` to AWS setup commands (#604)

* Add conda init to aws setup cmds

* Move the line upward

* Fix typo

* Created and handled teardown success bool (#581)

* Created and handled teardown success bool

* better error messages + return true + include stop

* incorporate _force

* formatting

* minor fixes

* Add instructions in document for quota increase (#588)

* Add instructions for quota increase

* Add hint for azure subscriptions

* Address comments

* update

* Add fine-grained logging during provisioning (#565)

* Remove job_submit.log & clarify `sky exec --help` (#611)

* Remove job_submit.log

* Clarify `sky exec --help`

* Update comments

* Parallel runner of smoke tests. (#584)

* run_smoke_tests.py: parallel runner of smoke tests.

* gcp-tpu-delete.sh.j2: remove --async to avoid race conditions.

* Update.

* `sky logs --status`: exit with appropriate exit code.

* test script: echo trick; query job statuses; cluster names

* Use pytest for smoke_test

* ignore smoke for github

* Add file lock for the wheel building

* Fix file_mounts for ubuntu 20

* format

* Fix job_status

* Update readme

* Print logs while tests are running.

* Update test script

* git rm examples/run_smoke_tests.py

* Minor

* Fix 'sky down non_exist_cluster' message.

* move wheel lock to temp folder

* Update logs for testing

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update GCP image for OS consistency and fix GCP's spot CPU (#614)

* Update GCP image and Fix spot CPU

* Fix worker nodes and use same image for K80/V100

* Fix worker

* Fix cpunode

* Optimize the `sky job queue` when large amount of jobs running (#616)

* Only update job status during provision

* format

* Only update job_status when previously stopped

* address comments

* fix comment

* fix merge error

* longer controlpersist

* apt update for aws (#620)

* add apt update for aws

* address comment

* address comment

* [Docker] Install sudo in docker (#595)

* Alias sudo for docker

* lint

* fixes

* Install sudo instead

* Refactor JobLibCodeGen and fix stale job for restarting cluster (#621)

* Refactor JobLibCodeGen and fix job status update logic

* Fix job_lib

* fix job lib again

* Add comment

* fix job_lib

* fix update_status

* Fix INIT status update

* format

* fix

* Add comment

* add assertion

* address comments

* Address comment

* Make azure-start-stop use 1 node to speed up tests. (#619)

* Make azure-start-stop use 1 node to speed up tests.

* Add --num_nodes to both launch and exec.

Tested:
- sky launch --num_nodes=2 'echo $(hostname)' -c test
- sky exec test 'echo $(hostname)' --num_nodes=2
- sky exec test 'echo $(hostname)'

* Support test_azure_start_stop_two_nodes() under --runslow.

* Support passing a test name.

* Make Task.num_nodes a property and validate it.

* Fix hint msg (#634)

* Make test cluster names unique for (user, mac address). (#639)

* Make test cluster names unique for (user, mac address).

* Make sky logs --status print job id.

Ex:

Job 1 SUCCEEDED
Job 2 SUCCEEDED
Job 3 SUCCEEDED
Job 4 SUCCEEDED
Job 5 SUCCEEDED
Job 6 SUCCEEDED
Job 7 SUCCEEDED
Job 8 SUCCEEDED
Job 9 SUCCEEDED
Job 10 SUCCEEDED
Job 11 SUCCEEDED
Job 12 SUCCEEDED
Job 13 SUCCEEDED
Job 14 SUCCEEDED
Job 15 SUCCEEDED
Job 16 SUCCEEDED
Terminating cluster test-multi-echo-zongheng-fe6d...done.

* Roll back to debian-based image (#636)

* Fix the second `sky launch` assertion and ray job status for failed job (#638)

* Fix job fail status in ray job

* Fix second `sky launch`

* format

* add launch again in the smoke test

* address comments

* Wait all the logs

* Fix get_status

* fix print

* format

* Add disk_size to YAML ref; other minor cleanups. (#643)

Closes #546.

* Add a skylet daemon and fix job status problem (#623)

* Refactor JobLibCodeGen and fix job status update logic

* Fix job_lib

* fix job lib again

* Add comment

* fix job_lib

* fix update_status

* Fix INIT status update

* format

* add daemon

* fix

* Add comment

* start skylet

* format

* pylint

* fix skylet start in template

* Fix job fail status in ray job

* Fix second `sky launch`

* format

* fix skylet checking in the test

* fix skylet launching

* remove -v for ray up

* format

* address comments

* Only teardown the cluster when test succeeded

* fix space

* Align underscore / dash for num-nodes cli option (#654)

* align underscore with dash for num-nodes

* fix num_nodes

* Optimize backend_utils.get_node_ips(), esp. for Azure. (#630)

Azure's APIs are extremely slow; as a result, ray get-head-ip and the
like are very slow for Azure clusters.

The below is for a 1-node Azure cluster.

Before: takes 1min 14sec

  » sky exec b2 --workdir=. -- <cmd>
  I 03-22 15:45:54 cloud_vm_ray_backend.py:1296] In sync_workdir
  I 03-22 15:47:08 cloud_vm_ray_backend.py:1301] Done get_node_ips()

After: instant

 » sky exec b2 --workdir=. -- <cmd>
  I 03-22 15:54:59 cloud_vm_ray_backend.py:1296] In sync_workdir
  I 03-22 15:54:59 cloud_vm_ray_backend.py:1302] Done get_node_ips()

* Minor touches on docs + improve install UX (#660)

* Minor touches on docs.

* Remove awscli pinning to not download a bunch of boto3 versions.

* Remove awscli pinning in cloud_stores

* Refactor type_checking (#655)

* refactor type_checking

* Address comments

* Fix resources_lib

* Build Sky local wheel in a unique tempdir per launch. (#657)

* Build Sky local wheel in a unique tempdir per launch.

* Refactor wheel cleanup

* reorg statements

* Fix caller.

* Tear down head node even for HEAD_FAILED. (#661)

* Added sky down --purge (#635)

* added sky down --purge

* made suggested edits

* minor formatting and changes

* fixed force

* output formatting fix

* Parallel sky down (#659)

* fix multi-thread

* refactor

* Address comment

* format

* hidden variable

* Progress bar for termination

* fix

* format

* mitigate logging problem

* rename

* rsync: --filter on .git/info/exclude (#652)

* rsync: --filter on .git/info/exclude

* Update docs.

* Use --exclude-from, and check if git exclude exists

* Update docs

* Fix repeating IP Address bug (#663)

* Fix output for parallel down (#666)

* Fix output for parallel down

* format

* linting

* fix import

* Auto stop for cluster (#653)

* refactorize skylet

* implement autostop event without cluster stopping

* wip

* Remove autostop from yaml file

* fix naming

* fix config

* fix skylet

* add autostop to status

* fix state and name match

* Replace min_workers/max_workers for gcp

* using ray up / ray down process

* fix stopping

* set autostop in global user state

* update sky status

* format

* Add refresh to sky status

* address comments

* comment

* address comments

* Fix logging

* update help

* remove ssh config and bring cursor back

* Fix exec on stopped instance

* address comment

* format

* fix

* Add test for autostop

* Fix cancel

* address comment

* address comment

* Fix sky launch will change autostop to -1

* format

* Add docs

* update

* Refactor DAG Optimizer (#628)

* Refactor optimizer

* Remove unnecessary import

* yapf

* Minor fix

* Add NotImplementedError

* Minor

* Rename vars & Annotate types

* Minor fix

* Minor

* Minor fix

* Fix type annotation

* yapf

* [Minor] Address comment

* Add type alias & enhance comments

* yapf

* Fix minor error in dag_lib.Dag

* Add is_chain to Dag

* Address comments

* yapf

* yapf

* Address comments

* Add total in optimizer msg

* Add a comment in is_chain

* Address reviews & Fix egress msg

* yapf

* Minor fix

* Fix egress msg

* yapf

* obj -> objective

* pass yapf

* cost -> cost/time

* Improve UX for autostopping (#676)

* Add progress bar for status refreshing

* Keep autostop after refreshing

* Add glob for start

* Fix message for autostop

* Fix messages for autostop

* Improve logging in error conditions & update auto-stop.rst (#675)

* Log error for HEAD_FAILED; don't duplicate logging for no_retry=True.

* Minor touches on auto-stop.rst

* Revert to only printing errors on GANG_FAILED

* Add GLOB for sky queue (#678)

* add glob for sky queue and start

* format

* Added price to sky status (#561)

* Added price to sky status

* put region and hourly price behind -a in sky status

* removed whitespace

* cache cluster region

* some touches + added computation to constructor

* forgot one fix

* formatting

* Add line processor abstraction and fix gitignored path size (#615)

* ILP-based DAG Optimizer (#637)

* Refactor optimizer

* Remove unnecessary import

* yapf

* Minor fix

* Add NotImplementedError

* ILP-based optimization

* yapf

* Add pulp in setup.py

* Minor

* Rename vars & Annotate types

* Minor fix

* Minor

* Minor fix

* yapf

* Fix type annotation

* yapf

* [Minor] Address comment

* Add type alias & enhance comments

* yapf

* Fix minor error in dag_lib.Dag

* Add is_chain to Dag

* Address comments

* yapf

* yapf

* Address comments

* Add total in optimizer msg

* Add a comment in is_chain

* Address reviews & Fix egress msg

* yapf

* Minor fix

* Fix egress msg

* yapf

* obj -> objective

* pass yapf

* cost -> cost/time

* Add random DAG generator

* Add random DAG generator

* Change variable names

* Minor fix

* yapf on test_random_dag.py

* Add docstring

* Rename

* _optimize_cost -> _optimize_objective

* Minor

* Default num_tasks to 10

* Add docstrings & Fix variable names

* yapf

* Minor

* Improve test_optimizer_random_dag

* yapf

* Fix optimizer

* Add docstring about ILP objective

* fix typo

* yapf

* Minor

* Add monkeyptach

* Fix docstring

* yapf

* Touches on docs. (#684)

* Touches on docs.

* Touches

* touches on yaml-spec

* update --gpus=all

* extend underline

* Storage mounting (#658)

* squash

* fix

* yapf workaround

* Update artifact syncing docs (#689)

* Update docs

* comments

* Docker example and fix goofys-docker mounting (#686)

* Fix docker killing

* Add docker example

* Fix docker example

* Fix

* Fix the docker example for pytorch installation

* Use model caching

* Mount output folder

* Permission issue

* remove useless lines

* fix license

* Add storage mounting for output and fix the goofys mounting

* Minor touches

* examples/docker_app.yaml -> examples/detectron2_docker.yaml

* Minor

* Fix gcp fuse.conf

* simplify file_mount options

* remove wait

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add faq.rst; move CLI section to the bottom. (#690)

* Speedup ci/cd with parallelism (#693)

* Testing for different os and python version

* downgrade python

* speedup testing

* Remove 3.9

* generic workflow

* remove mac and add caching

* Verify accelerators and support float for inline GPU requirement V100:0.5 (#698)

* Verify accelerators and support float number for inline GPU specification V100:0.5

* format

* Add testing for the cli

* Remove unused function

* fix test

* Case insensitive gpu checking

* Storage Subdirectory Fix (#709)

* fix

* Ok

* Minor: suggest using a conda env when installing Sky. (#710)

* Fix optimizer messages (#711)

* Fix optimizer msg

* yapf

* Fix optimizer msg

* any -> all

* Delete hourly

* Minor

* Prompt Before Storage Initialization for `sky launch` (#701)

* Fix pushed

* Fix

* Fix sky start when cluster is autostopped without `sky status --refresh` (#713)

* Fix sky start when cluster is autostopped and `sky status --refresh` is not called

* format

* lint

* lint

* Rename status function

* Fix the hardcoded gcp project id (#692)

* Fix gcp project id

* Refactor

* Move azure subscription id to auth

* project id back to backend_utils

* Add todo

* Fix azure subscription id (#720)

* Skip cloud when instance type is provided and make resources immutable (#714)

* cloud is not required when instance_type set

* format

* Make Resources immutable to avoid verification problems

* update

* Fix version

* lint

* fix test

* Address comment

* Fix version

* oops

* Move accelerator_args setting to the _set_accelerators

* Remove default

* Instruction for file_mounts trick (#694)

* instruction for file_mounts trick

* address comment

* address comments

* Specify region in resources (#722)

* cloud is not required when instance_type set

* format

* Make Resources immutable to avoid verification problems

* update

* Fix version

* lint

* fix test

* wip

* Address comment

* Fix version

* Add region_limit to resources

* Add dryrun region test

* Add test_region

* Fix region

* format

* Add case-insensitivity check for region

* fix region in config_dict

* fix test

* address comments

* Update doc

* address comment

* Ship AWS cloud provider (#725)

* ship aws

* adjust

* update template

* update LICENCE

* Goofys memory optimizations (#726)

* comments

* yapf

* Make sky exec submit jobs for inline commands; guard Azure disabled subscription error. (#727)

* Guard against Azure disabled sub error; tested with: sky gpunode

* Make `sky exec` submit jobs for inline commands as well.

* YAPF

* Fix exec examples + check empty entrypoint

* Enable empty string entrypoint for 'sky launch'.

* Add job duration and resources (#729)

* Add job duration and resources

* address comments

* Fix job status terminal -> non-terminal

* Change end_at to null if status is not terminal

* Default value to null for end_at

* Address comments

* UX improvement for sky stop and down (#734)

* UX improvement for sky stop and down

* Change skipped to be yellow

* yapf

* Fix AWS multi-node failure (#736)

* Fix TPU naming issue (#737)

* Format duration for jobs (#744)

* Format duration

* format

* Address comments

* Shared default security group for AWS (#731)

* shared default security group for AWS

* update

* update name

* Add TPU pods (#739)

* Init fix

* Fix

* quick fix

* Add notes

* Add us-east1 region

* Add assertion on multi-node TPU

* Fix

* Fix small nits in code (#748)

* Fix race condition for skylet daemon (#747)

* Fix race condition

* update comment

* Fix submitted_at

* Add retry for sky logs

* format

* Fix retry for job log

* Fix the setup progress bar and conda confirmation message (#746)

* Fix setup progress bar and confirmation of conda

* minor fix

* Address comments

* Hack to remove bash warning

* rm

* Fix the pipe output after ctrl-c

* Get rid of cloud dependencies in the config template (#749)

* Add W&B setup in FAQ (#751)

* Add W&B setup in FAQ

* Reflect comments

* Fix typo

* Small refactor of CLI and cloud (#752)

* Fix resources check

* Automatic cloud registry and task_option override

* fix test

* provide option to reset the setting

* Rename the option adding function

* Fix dummy cloud

* fix

* cloud register

* address comments

* address comments

* Fix descendant processes termination for sky cancel (#758)

* fix children processes termination

* fix comment

* fix sig

* fix PIPE kill

* Fix PIPE kill

* format

* Minor logging fix. (#760)

* Add test_cancel() to smoke. (#761)

* Support glob patterns for jobs for `sky logs -s` (#685)

* allow globbing when calling sky logs -s

* formatting

* removed whitespace

* small fixes

* final fixes

* Minor: update README (#762)

* Reformat `sky status` codepath and minor fixes (#721)

* refactor sky status codepath and minor fixes

* moved things over to status_utils

* fix conflicts and styling errors

* final changes

* added inits

* A quick fix for CLOUD_REGISTRY.from_str for None (#766)

* Fix killing the whole session problem (#767)

* Fix kill tmux issue

* Add comments

* Fix gpu issue

* Add test and fix

* format

* Add comment

* Add doc for gcloud 400 error (#764)

* Add gcloud 400 error hint

* add command

* Fix sky cancel (#770)

* Fix format all (#768)

* fix format all

* Managed spot (alpha) (#759)

* wip

* Fix resources check

* Automatic cloud registry and task_option override

* fix test

* provide option to reset the setting

* Rename the option adding function

* Fix dummy cloud

* fix

* wip

* Add Spot CLI support

* fix spot controller logic

* Add spot_recovery example

* fix strategy

* add todo

* Fix status

* add spot status

* fix todo

* Fix merge error

* Fix status

* fix signal

* format

* wip: integrate sky spot launch

* wip: spot launch integration

* Add autostop for the task controller

* Add spot_status cli

* Fix spot status

* fix spot status

* wip: add spot cancel

* fix spot status

* Fix

* disable autoscaling for spot instance

* Add tests and fix yaml specs

* fix tests and make controller resources unspecified

* fix test

* format

* Fix test spot

* format

* Fix resources setstate

* Fix empty run

* format

* Skip empty run section

* Add network check when doing status refresh

* Fix logging

* Remove buggy job not submitted yet

* Fix status refresh

* Add field check for task

* Align job ids

* Fix failover when recovering

* Address part of the comments

* Address part of the comments

* address comment in backend_utils

* Address part of the comments

* Fix cancelled job duration

* address comments

* Rename status.submit

* Fix cancel

* logging info

* Allow sky spot status to be run when a spot job is launching

* Add status cache and show job status after launch

* Fix failover

* Remove spot-controller from status

* Disable azure use_spot

* Enable gcp

* Fix optimizer dag dummy node

* Fix setup and recovery

* Merge branch 'master' of github.com:concretevitamin/sky-experiments into managed-spot

* format

* Fix optimizer test

* Address comments

* address comments

* Fix exception catch for job cancel

* Handle unexpected failure

* address comments

* address comments

* Default to use_spot for spot_launch

* Fix smoke test

* Fix smoke test

* fix the message after all retry fails

* Add back status check to smoke test

* format

* fix managed spot test

* format

Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>

* Reuse sky wheels for efficiency (#769)

* cache sky wheels

* Add spot status --all, fix default value for use_spot and fix GCP dependency (#772)

* Add spot status --all and fix default value for use_spot

* Remove useless TODO

* make gcloud available interactively

* Fix gcp dependency installation and sky check

* Add test for managed spot instance

* format

* Fix security group mismatch issue with autostop (#780)

* fix

* delete unused global var

* restore the original Ray implementation

* Spot cancel -a and spot status --refresh (#776)

* add cancel -a

* add sky spot cancel -a and sky spot status --refresh

* fix

* fix refresh

* fix return

* fix refresh

* address comments

* Disallow long cluster names. (#781)

* Disallow long cluster names.

This fixed a smoke test failure.

* Fix another test name.

* Remove storage_demo.yaml from smoke tests (#782)

* Remove storage_demo.yaml from smoke

* add todo

* Fix smoke test for accelerators (#785)

* Fix yaml_spec test for accelerators

* format

* Fixing some errors encountered in smoke tests. (#787)

* spot_state: fix a SQL syntax error.

Previously:

» sky spot cancel -y -n test-managed-spot-zongheng-fe6d-1
E 05-02 14:37:02 backend_utils.py:989] Traceback (most recent call last):
E 05-02 14:37:02 backend_utils.py:989]   File "<string>", line 1, in <module>
E 05-02 14:37:02 backend_utils.py:989]   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_utils.py", line 68, in cancel_job_by_name
E 05-02 14:37:02 backend_utils.py:989]     job_ids = spot_state.get_nonterminal_job_ids_by_name(job_name)
E 05-02 14:37:02 backend_utils.py:989]   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_state.py", line 177, in get_nonterminal_job_ids_by_name
E 05-02 14:37:02 backend_utils.py:989]     rows = _CURSOR.execute(
E 05-02 14:37:02 backend_utils.py:989] sqlite3.OperationalError: near "job_name": syntax error
E 05-02 14:37:02 backend_utils.py:989]
E 05-02 14:37:02 backend_utils.py:994] Command failed with code 1: python3 -u -c 'from sky.spot import spot_utils; result = spot_utils.cancel_job_by_name('"'"'test-managed-spot-zongheng-fe6d-1'"'"'); print(result, end="", flush=True)'
E 05-02 14:37:02 backend_utils.py:995] Failed to cancel managed spot job

* Guard against `sky status --refresh` race

* Increase check_network_connection() timeout to 3s.

* Fixes

* Fix smoke

* Add a comment

* remove filter in aws command and fix spot failover (#789)

* Add status check when cancelling managed spot job and disable ambiguous termination of reserved clusters (#784)

* add cancel -a

* add sky spot cancel -a and sky spot status --refresh

* fix

* fix refresh

* fix return

* fix refresh

* address comments

* Fix cli prompt for cancel -a

* Add status check for job and controller on spot cancelling

* Remove output from setup for spot controller

* Disable `sky stop --all` to stop sky-spot-controller

* Fix operation str and remove controller from --all

* Check reserved cluster for termination operations

* fix name check

* Update message

* Add repr for name

* fix comment

* Disallow canceling on reserved clusters

* Add tests

* format

* address comments

* format

* fix storage dump/load

* fix smoke test for autostop

* Address comments

* format

* add assertion

* Fix output for cancel

* fix cancel handle

* Fix storage tests failing with parallel runners (#794)

* Change storage id to time_ns

* replace time_ns with time

* File mounts for managed spot jobs (#788)

* Fix storage from_yaml_config

* Add sky storage support for managed spot jobs

* add example for storage in managed spot job

* Fix testing for the spot storage

* format

* format

* Add todos

* Fix test

* Fix comments

* format

* Fix retry cnt

* Fix test name

* Fix test by adding flush

* delete storage after spot task finish

* persistent=false

* Add new line at the end

* address comments

* Minor: logging polishes. (#804)

* Minor: logging polishes.

* Revert storage.py to master; except StorageBucketGetError msg

* Minor fix to test_storage.py

* Logging fixes for `sky spot cancel` and `sky logs -s`.

* Fix copy_mount_str (#806)

* Fix TPU resource leak (#797)

* Fix tpu leak

* Add test

* TPU fixes: record `sky status` before provisioning tpu.

* Switch to v2 to save cost

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add BERT and Resnet spot examples (#792)

* Add BERT and Resnet spot examples

* Fix

* Add lightning example

* Fix

* Update spot examples comments

* file mount storage

* Add resnet spot codes for version control

* yapf

* remove comments

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add env option for sky launch and exec (#803)

* Add env option

* Fix task env config

* Add env doc and for spot_launch

* format

* add test for env vars

* address comments

* fix help str rendering

* Fail for unset env var

* format

* Fix race condition between job set_state and update_status (#805)

* Fix race condition between job and update_status

* address comments

* address comments

* Fix bert qa example (#808)

* Fix `tpu_rc` unassigned bug; minor fix on task resources_str. (#810)

* Fix `tpu_rc` not assigned error, seen in `sky down`

* Fix bug: "resources={'K80': n}" not included in generated program

* Support streaming logs from the spot cluster through spot controller (#798)

* Streaming logs from the spot cluster through spot controller

* fix comment

* format

* Wait for job running for spot logs

* Fix job lib status check

* Add support for `sky spot logs` showing the latest log

* sleep for sky spot status

* remove unnecessary argument from execution

* Refactor

* Fix the keyboard interruption of sky spot logs

* format

* Add comments

* fix comments

* Add wait for the controller and job to be started for spot logs

* address part of the comments

* Fix job id logging

* update type hint

* Fix race condition between job and update_status

* format

* address comment

* format

* Fix copy_mount_str

* Address comments

* fix logs

* address comments

* format

* Fix logs

* Fix repeat logs

* address comments

* Fix spot logs

* Fix logging

* refactor spot logs loop

* fix log

* Fix log_lib infinite loop printing "Job finished".

This appears to be an accidentally deleted line.

* UX: Remove a logging message

* fix spot_status

* fix comment

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Fix a rare failure of wheels build (#809)

There is a very narrow window in https://github.com/sky-proj/sky/blob/1fd81ef00884780cb476405e3f9da84fd05fbd47/sky/backends/wheel_utils.py#L43-L71
where ctrl+c would prevent cleaning up files. This causes a failure when the wheel is built again. This PR fixes this.

- [x] Unit tests
- [x] https://github.com/sky-proj/sky/issues/656
- [x] smoke tests
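
The failure mode described above (an interrupt arriving between creating build files and cleaning them up) can be illustrated with a minimal sketch. This is not the actual change to `wheel_utils.py`; `build_artifact`, the `fail` flag, and the `pkg.whl` file name are hypothetical stand-ins. It only shows the general pattern of putting cleanup in a `finally` block, which Python also runs on `KeyboardInterrupt`.

```python
import os
import shutil
import tempfile


def build_artifact(dest_dir: str, fail: bool = False) -> str:
    """Builds a placeholder artifact, always removing scratch files.

    Hypothetical helper for illustration only; not Sky's wheel builder.
    """
    os.makedirs(dest_dir, exist_ok=True)
    scratch = tempfile.mkdtemp(prefix='build-', dir=dest_dir)
    try:
        if fail:
            # Stand-in for the user pressing ctrl+c mid-build.
            raise KeyboardInterrupt
        built = os.path.join(scratch, 'pkg.whl')
        with open(built, 'w') as f:
            f.write('placeholder')
        final = os.path.join(dest_dir, 'pkg.whl')
        # Move the finished artifact into place in one step.
        os.replace(built, final)
        return final
    finally:
        # Runs on success, on errors, and on KeyboardInterrupt, so a
        # later build never sees leftover scratch files.
        shutil.rmtree(scratch, ignore_errors=True)
```

With this shape, an interrupted call leaves `dest_dir` free of scratch files, and a subsequent call can build again without tripping over stale state.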

* Avoid ray messing up terminal with misaligned outputs (#813)

* Fix ray mess up terminal output

* fix comment

* Fix tail_logs by remove stdin

* add ray's implementation link

* Remove input for subprocess daemon

* Install ray only when it is not installed on the remote instance (#811)

* Not install ray if ray already exists

* longer sleep time for cancel_pytorch

* Fix autoscaler benign assertion by patching (#815)

* Patch resource_demand_scheduler.py from Ray 1.10.0.

* Make multi_echo test autoscaler bug.

* Fix LICENSE, comments, test

* Change examples/multi_echo.py to use thread pool

* Make wheel paths stable to avoid disrupting certain running Sky tasks (#819)

* Make wheel paths stable to enable concurrent launch on same cluster.

* Message fixes in cli.py

* Make ray_patches real patch files (#821)

* WIP

* Make ray_patches/worker.py use Ray 1.10 formatting (but keep our changes)

* Make ray_patches real patch files.

* Fix logging

* Fix GCP project ID (#824)

* Fix GCP project ID

* yapf

* Move the STARTED column to sky spot status -a (#823)

* save and load spot status --all caches

* format

* swap path

* Removed different cached table for -a

* Fix signal handling with stdin=NULL (#818)

* fix signal handling with stdin=NULL

* Add ctrl-c message

* Handle ctrl-c and ctrl-z

* refactor

* revert refactor

* Fix spot logs with ctrl-c/ctrl-z

* Fix status showing

* remove catch

* format

* fix indent

* Disable process_stream by killing children processes

* Fix comment

* Add sleep for spot test to wait for status to be updated

* format

* address comments

* [Breaks existing AWS clusters!] Change AWS security group name (#826)

* [Breaks existing AWS clusters] Change AWS security group names

* typo

* Fix back incompat descriptions

* Error msg fix

* Fix sky spot status 'ago'

* Remove undesired autoscaling (#830)

* disable autoscale when upscaling_speed is 0

* fix patch file

* Fix the upscaling=0

* remove output

* fix patch

* Fixed multi_echo

* format

* Use real job time in the `sky spot status` (#827)

* use real job time

* fix

* address comments

* nit

* Fix compatibility of the patch (#838)

* Fix compatibility of the patch

* Add comment

* Fix file existence test

* Fix patching

* Fix comment

* patch again

* Remove useless comments

* Spot controller UX improvements (#839)

* Fix recovery_strategy.py missing import / var shadowing

* Change controller autostop to 30 mins

Useful for large scale launch debugging

* cli: better messages for downing controller

* Show in-progress counts in sky spot status

* Add a TODO about duplicate keys in task.py.

* yapf

* fix test

* UX for job duration (#840)

* Make job start/end/submit time to be float for accurate time

* Fix microseconds

* format

* Fix < 1 second

* format

* fix

* fix None problem (#842)

* Docs for spot jobs (#844)

* docs for spot jobs

* mention code/files sync

* typo

* typo

* Update docs

* fix duration

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add cli doc for spot (#846)

* Add cli reference for spot

* address comments

* Fix spot price for GCP (#847)

* fix spot price for GCP

* format

* Fix sky cancel for on-prem mode (#775)

* Replace ps forest with pstree to handle on-prem

* remove pgid

* Fix

* Add tests for all clouds

* Python based subprocess daemon

* yapf

* rm subprocess_daemon.sh

* comment

* add setup and fix test

* replace workdir with git clone for test_distributed_tf

* Fix sky spot issue

* Needs more time for sky cancel to work

* Fix bug

* address comments

* Fix subprocess daemon bug (#850)

* Fix

* Simplify

* Fix A100 provisioning on GCP (#829)

* Fix A100 on GCP

* yapf

* eof

* Fix 16x A100 and spot

* fix name

* Clean codes

* Fix pylint error

* Fix

* fix gcp return code

* Address comments

* comment

* Add notes

* update note

* Fix bug

* Cluster status meaning in `sky status --help` (#843)

* Better optimizer plan logs (#860)

* Better optimizer plan logs

* Add minimize logging option

* Change MINIMIZE_LOGGING to False by default

* address comments

* update output

* print plan in topo order

* Fix docs build warnings and remove code search (#834)

* Add Jupyter notebook tutorial to docs (#841)

* Fix sqlite3 rename problem with older version (#868)

* Fix on-demand price (#866)

* Fix on-demand price

* format

* Change the default cpu instance for aws

* address comments

* Enable timeline recording for Sky (#833)

* timeline

* Fix Ray autoscaler's failure of gpu auto detection (#848)

* Fix Ray autoscaler failure of detecting gpu

* rename

* ensure echo only once

* Add retry-until-up feature for launch and start (#863)

* Add retry until up

* fix message

* fix message

* fix

* Add exponential backoff

* Add backoff for spot recovery

* Address comments

* format

* Fix merge error

* Fix message

* Fix message

* Add comment

* Sky docker image (#869)

* Sky docker image

* docs

* Address comments

* optimize build order and remove deps

* add .sky mount

* fix docs

* fix docs

* Upload only necessary credentials and add gpu to cpu mapping for GCP (#853)

* upload only necessary credentials and add gpu to cpu mapping for GCP

* Fix comments

* Fix api

* rename

* refactor

* hide variables

* format

* fix test

* Add n2 instances

* Fix power of two

* format

* Fix azure cancel test

* fix azure smoke

* specify credential files

* Address comments

* format

* Address comment

* Fix default image (#874)

* add check before ping (#876)

* Minor: reformat `sky show-gpus` output. (#877)

* Distinguish spot failure for user code and cluster/controller failure. (#862)

* Distinguish controller failure and user failure

* Add hints for getting error messages

* Fix

* update message

* rename to cluster failure

* message for cluster failed as well

* Fix failing

* address comments

* Add id for end of logs

* Split resource failure and controller failure

* Fix terminal state

* Address comments

* fix typo

* Add YAML schema check (#680)

* Add explanations on spot docs (#852)

* Add some docs

* update

* fix

* fix

* update

* address comments

* reorg

* reorg and add fig

* Add imgs

* fix

* update

* Fix typo (#880)

* Make `sky launch` prompting consistent with interactive nodes (#867)

* Make the spot job pending as soon as the job is submitted (#870)

* Distinguish controller failure and user failure

* Add hints for getting error messages

* Fix

* update message

* rename to cluster failure

* message for cluster failed as well

* Fix failing

* Add pending state for spot jobs

* Fix job id

* format

* address comments

* Add id for end of logs

* fix pending

* Add name and resources

* format

* Add failed status check for spot state

* Refactor the backend interface

* address comments

* fix status

* address comment

* Fix comment

* remove azureTokens.json from the credential list (#883)

* Fast removal of buckets

* Fast removal of buckets

* Replace os.system with subprocess

* Replace os.system with subprocess

* fix syntax

* fix syntax

* Addressed Romil's comments

* fix comment

* Fix comments 2

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Gautam Mittal <gautam@mittal.net>
Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Co-authored-by: Zongheng Yang <concretevitamin@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <mickey584@naver.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>
Co-authored-by: Mehul Raheja <rahejamehul@gmail.com>
Co-authored-by: Wei-Lin Chiang <infwinston@gmail.com>