Improve UX for logs to include SSH name and rank #1380

iojw · 2022-11-06T03:05:14Z

As discussed in #1369, this updates the debug logs to make it easier to figure out which node is the head node, which host name to use to ssh into a specific node, and also the rank of the node. With this PR, the debug output becomes the following:

(node-2, head, rank=0, pid=23862) 0
(node-2, head, rank=0, pid=23862) 172.31.95.105 172.31.81.91 172.31.85.131 172.31.89.126
(node-0, worker1, rank=1, pid=22755, ip=172.31.81.91) 1
(node-0, worker1, rank=1, pid=22755, ip=172.31.81.91) 172.31.95.105 172.31.81.91 172.31.85.131 172.31.89.126
(node-1, worker3, rank=3, pid=22760, ip=172.31.89.126) 3
(node-1, worker3, rank=3, pid=22760, ip=172.31.89.126) 172.31.95.105 172.31.81.91 172.31.85.131 172.31.89.126
(node-3, worker2, rank=2, pid=22757, ip=172.31.85.131) 2
(node-3, worker2, rank=2, pid=22757, ip=172.31.85.131) 172.31.95.105 172.31.81.91 172.31.85.131 172.31.89.126

with the following config:

resources:
  cloud: aws
  disk_size: 1024
  instance_type: m6i.large

num_nodes: 4

run: |
  echo $SKY_NODE_RANK
  echo $SKY_NODE_IPS

Thus, if the cluster is named mycluster, the user can easily figure out from the logs which host name to use to ssh into a node, e.g. ssh mycluster-worker1 for the log lines prefixed with worker1.
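A minimal sketch of that lookup, assuming the prefix format shown in the output above. The function name `ssh_host_for_line` and the regular expressions are hypothetical and not part of SkyPilot. The host name used for the head node (the bare cluster name) is also an assumption, since only the worker form is stated here.

```python
import re
from typing import Optional

# Illustrative patterns written against the sample log output; they are
# not taken from the SkyPilot code base.
_PREFIX = re.compile(r'^\(([^)]*)\)')
_WORKER = re.compile(r'^worker\d+$')


def ssh_host_for_line(cluster_name: str, line: str) -> Optional[str]:
    """Return the ssh host name implied by a log line's prefix, if any."""
    match = _PREFIX.match(line)
    if match is None:
        return None
    for field in match.group(1).split(','):
        field = field.strip()
        if field == 'head':
            # Assumption: the head node is reachable under the cluster name.
            return cluster_name
        if _WORKER.match(field):
            return f'{cluster_name}-{field}'
    # Prefix without a node name, e.g. "(task, pid=32503)".
    return None


line = '(node-0, worker1, rank=1, pid=22755, ip=172.31.81.91) 1'
print('ssh', ssh_host_for_line('mycluster', line))
```

Lines whose prefix carries no node name return None, so a caller can skip them.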

To keep worker names stable, we first sort the workers by their external IP, then map internal IPs to external IPs in the code generation step to determine each node's name. This means that a node's name is always the same within a cluster, regardless of the value passed via the num_nodes argument.
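The naming scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `stable_node_names` is a hypothetical helper, the external IPs in the example are made up (documentation range 203.0.113.0/24), and the sketch sorts IPs numerically, which may differ from the ordering the PR actually uses.

```python
import ipaddress
from typing import Dict, List, Tuple

# Each node is an (internal_ip, external_ip) pair.
Node = Tuple[str, str]


def stable_node_names(head: Node, workers: List[Node]) -> Dict[str, str]:
    """Map every internal IP in the cluster to a stable node name."""
    names = {head[0]: 'head'}
    # Sort by external IP so the numbering does not depend on the order in
    # which the nodes happen to be listed.
    ordered = sorted(workers, key=lambda node: ipaddress.ip_address(node[1]))
    for index, (internal_ip, _) in enumerate(ordered, start=1):
        names[internal_ip] = f'worker{index}'
    return names


head = ('172.31.95.105', '203.0.113.5')
workers = [
    ('172.31.89.126', '203.0.113.30'),
    ('172.31.81.91', '203.0.113.10'),
    ('172.31.85.131', '203.0.113.20'),
]
names = stable_node_names(head, workers)
print(names)
```

Because the names are derived from the whole cluster rather than from the nodes a task runs on, they do not change with num_nodes.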

This also updates the logic for determining the rank of a node to use the node's external IP rather than its internal IP. However, as discussed in #1291, the rank is not constant for a cluster: it is assigned on a per-task basis, as can be seen from the following output with num_nodes set to 2:

(node-0, worker3, rank=1, pid=24355, ip=172.31.89.126) 1
(node-0, worker3, rank=1, pid=24355, ip=172.31.89.126) 172.31.95.105 172.31.89.126
(node-1, head, rank=0, pid=26036) 0
(node-1, head, rank=0, pid=26036) 172.31.95.105 172.31.89.126
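One way to read the two outputs is that a node's rank is its position among the nodes that the task runs on, taken in the cluster's stable order. The sketch below illustrates that reading only. `task_ranks` is a hypothetical helper, and the stable order used in the example is inferred from the four-node output earlier in this description rather than taken from the code.

```python
from typing import Dict, List


def task_ranks(stable_cluster_ips: List[str],
               task_ips: List[str]) -> Dict[str, int]:
    """Assign ranks 0..n-1 to the nodes of one task, in stable order."""
    selected = set(task_ips)
    unknown = selected - set(stable_cluster_ips)
    if unknown:
        raise ValueError(f'IPs not in cluster: {sorted(unknown)}')
    ordered = [ip for ip in stable_cluster_ips if ip in selected]
    return {ip: rank for rank, ip in enumerate(ordered)}


# Stable order inferred from the sample output: head, worker1..worker3.
cluster = ['172.31.95.105', '172.31.81.91', '172.31.85.131', '172.31.89.126']

four_node = task_ranks(cluster, cluster)
two_node = task_ranks(cluster, ['172.31.89.126', '172.31.95.105'])
print(four_node['172.31.89.126'], two_node['172.31.89.126'])
```

Here worker3 (172.31.89.126) has rank 3 in the four-node task and rank 1 in the two-node task, while its name stays the same.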

Questions

Do we still want to respect the task_name variable that is passed in and display it? The above output includes the task name.
Open to suggestions on how to further improve the UX!

Tested

Multi-node cluster
Single-node cluster

concretevitamin · 2022-11-06T03:14:21Z

This is awesome @iojw! Quick comments:

Maybe it's okay to get rid of node-<i> in the prefix now? It doesn't seem to help.
RE: "Do we still want to respect the task_name variable that is passed in and display it?" I feel it's okay to not display it in the logging prefix at all, for both multi- and single-node tasks. For the latter, it doesn't seem helpful to display head as well.

iojw · 2022-11-06T03:24:34Z

@concretevitamin I think one argument for keeping it in is that we currently use it to name the log path for multi-node clusters, and it is possible that users would want to figure out which log file contains the full logs for a node given the debug output. There might not be any better naming scheme for the logs since we do not have access to the rank of a node and its ip from outside the codegen class.

skypilot/sky/backends/cloud_vm_ray_backend.py

Line 3072 in a613b43

log_path = os.path.join(f'{log_dir}', f'{name}.log')

concretevitamin · 2022-11-06T04:50:20Z

Since log_path is eventually used inside the codegen

skypilot/sky/backends/cloud_vm_ray_backend.py

Line 402 in 3d3da86

log_path,

can we name the path using IP there?

iojw · 2022-11-06T05:40:52Z

@concretevitamin Ah yup! In that case I don't see any need for the notion of task name at all for multi-node tasks. I've updated the code to remove the notion of task name from the logs. The log files are now named according to the node name e.g. head.log, worker1.log etc.

The logs for multi-node tasks now look like so:

(worker1, rank=1, pid=23368, ip=172.31.87.209) 1
(worker1, rank=1, pid=23368, ip=172.31.87.209) 172.31.81.189 172.31.87.209 172.31.86.128 172.31.87.196
(worker3, rank=3, pid=23414, ip=172.31.87.196) 3
(worker3, rank=3, pid=23414, ip=172.31.87.196) 172.31.81.189 172.31.87.209 172.31.86.128 172.31.87.196
(head, rank=0, pid=26060) 0
(head, rank=0, pid=26060) 172.31.81.189 172.31.87.209 172.31.86.128 172.31.87.196
(worker2, rank=2, pid=23348, ip=172.31.86.128) 2
(worker2, rank=2, pid=23348, ip=172.31.86.128) 172.31.81.189 172.31.87.209 172.31.86.128 172.31.87.196

iojw · 2022-11-06T06:10:01Z

For single node tasks, the logs display the name of the task, or just task if no name is set.

(task, pid=32503) 0
(task, pid=32503) 172.31.95.145

concretevitamin · 2022-11-06T15:30:24Z

Thanks @iojw - one thought: how about we don't show the task name even for single-node tasks? This is to keep it consistent. Users can see their custom task names in sky queue.

iojw · 2022-11-06T18:02:43Z

@concretevitamin Ray does not play well with the name option being empty: if it is, the name in the logs defaults to the name of the remote function, run_bash_command_with_logs, instead. We can work around this by passing in a space " " as the name, but this leads to 2 spaces at the start of each log line, which is not ideal. Additionally, something to consider: a single-node task on a multi-node cluster is not always executed on the head node, so it may still make sense to include the node name. For example, a single-node task may be executed on worker1 or worker2.

What do you think?

concretevitamin · 2022-11-07T01:28:50Z

These are great points @iojw. How about for single-node tasks we just show head/worker<i> depending on which node is scheduled (no task.name)? This way we're consistent in both cases.

EDIT: thinking more, I think it may be better if for 1-node clusters, we just do what you suggested:

For single node tasks, the logs display the name of the task, or just task if no name is set.

(task, pid=32503) 0
(task, pid=32503) 172.31.95.145

For n-node clusters, for both 1-node tasks and m-node tasks we can go with the head/worker<i> prefix.

The rationale is if most users use 1-node clusters, there's one less concept ("head") to see and learn.

concretevitamin

Thanks @iojw! Did a pass.

sky/backends/backend_utils.py

sky/backends/cloud_vm_ray_backend.py

iojw · 2022-11-07T22:08:58Z

Thanks @concretevitamin! To summarize the changes, there are no longer any differences between single-node and multi-node tasks on multi-node clusters in the logging output and log path. The difference lies between single-node clusters and multi-node clusters.

For single-node clusters:

For logging, we display the task name, or just task if the name is not set. (no rank)
We store logs at run.log
This is so that users who primarily use single-node clusters need not know the concept of head nodes at all

For multi-node clusters:

For logging, we always display node name along with rank, regardless of num_nodes
We store logs at node_name-ip.log, where IP is the private ip of the node
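The rules in this summary can be sketched as a single helper. This is only an illustration: `log_name_and_path` and its parameters are hypothetical, and the backend assembles the prefix and path differently in practice.

```python
import os
from typing import Optional, Tuple


def log_name_and_path(log_dir: str,
                      cluster_num_nodes: int,
                      task_name: Optional[str] = None,
                      node_name: Optional[str] = None,
                      rank: Optional[int] = None,
                      internal_ip: Optional[str] = None) -> Tuple[str, str]:
    """Return (display name for the log prefix, log file path)."""
    if cluster_num_nodes == 1:
        # Single-node cluster: show the task name (or "task"), no rank,
        # and store everything in run.log.
        return task_name or 'task', os.path.join(log_dir, 'run.log')
    # Multi-node cluster: always show the node name and rank, regardless
    # of how many nodes the task itself uses.
    if node_name is None or rank is None or internal_ip is None:
        raise ValueError('node_name, rank and internal_ip are required')
    display = f'{node_name}, rank={rank}'
    return display, os.path.join(log_dir, f'{node_name}-{internal_ip}.log')


single = log_name_and_path('logs', 1)
multi = log_name_and_path('logs', 4, node_name='worker1', rank=1,
                          internal_ip='172.31.87.209')
print(single, multi)
```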

I've also updated ResourceHandle so that we are able to cache the value of the stable_cluster_internal_ips variable after it is retrieved the first time. I'm unsure whether this could cause an issue when workers get restarted, since we do not invalidate the cache.
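For illustration, the caching pattern and the invalidation it would need look roughly like this. `CachedIPList` is a hypothetical stand-in, not the actual ResourceHandle, and the fetch function is a placeholder for whatever queries the cluster.

```python
from typing import Callable, List, Optional


class CachedIPList:
    """Fetch a list of IPs once and reuse it until explicitly invalidated."""

    def __init__(self, fetch: Callable[[], List[str]]) -> None:
        self._fetch = fetch
        self._cached: Optional[List[str]] = None

    def get(self) -> List[str]:
        if self._cached is None:
            # First access, or first access after invalidate().
            self._cached = list(self._fetch())
        # Return a copy so callers cannot mutate the cached list.
        return list(self._cached)

    def invalidate(self) -> None:
        """Drop the cached value, e.g. after nodes have been restarted."""
        self._cached = None


calls = []


def fake_fetch() -> List[str]:
    # Placeholder for a real (slow) query against the cluster.
    calls.append(1)
    return ['172.31.95.105', '172.31.81.91']


ips = CachedIPList(fake_fetch)
first = ips.get()
second = ips.get()       # served from the cache, no second fetch
ips.invalidate()
third = ips.get()        # fetched again after invalidation
print(len(calls))
```

Without a call such as invalidate() after a restart, get() would keep returning the IPs from before the restart.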

Side note: I think this also makes the code a lot clearer!

Michaelvll

Thanks for the quick fix @iojw! These changes can potentially reduce the overhead of our system for launching tasks on multi-node clusters. Left several comments.

sky/backends/backend_utils.py

sky/backends/cloud_vm_ray_backend.py

iojw · 2022-11-22T18:03:19Z

@Michaelvll I can run the smoke tests for AWS and Azure but not GCP since I don't have GCP credentials - could you help me run them for GCP?

sky/backends/cloud_vm_ray_backend.py

Michaelvll · 2022-11-22T21:30:49Z

We got several smoke test failures, caused by incorrect cluster status refresh and an internal/external IP length mismatch:
https://gist.github.com/Michaelvll/c7f3891653ded20f0aa3cd41562bc285
https://gist.github.com/Michaelvll/1bfeeb379c355f7fde3baeab1a04f90d
Let's hold this PR from merging until we fix the test errors. : )

Michaelvll · 2022-11-29T22:40:42Z

Thanks for the quick fix @iojw! Just tried sky status with an existing cluster, and it worked. I am running the smoke tests for this, as well as tests/backward_comaptibility_tests.sh.

Tested (c5ade55):

tests/run_smoke_tests.sh
tests/backward_comaptibility_tests.sh

sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

sky/backends/cloud_vm_ray_backend.py

Michaelvll

Thank you for the awesome work @iojw! It passed all the tests on my side. I think it is good to go.

sky/backends/cloud_vm_ray_backend.py

iojw · 2022-11-30T22:10:07Z

Thank you @Michaelvll @concretevitamin for the reviews!

* Messy WIP
* Fixes two more yamls
* Improve log UX and ensure stableness
* Remove print statement
* Remove task name from logs
* Fix name for single-node tasks
* Update var names and comments for clarity
* Update logic for single and multi-node clusters
* Cache stable cluster IP list in ResourceHandle
* Properly cache and invalidate stable list
* Add back SKYPILOT_NODE_IPS
* Update log file name
* Refactor backend to use cached stable IP list
* Fix spot test
* Fix formatting
* Refactor ResourceHandle
* Fixes for correctness
* Remove unneeded num_nodes arg
* Fix _gang_schedule_ray_up
* Ensure stable IP list is cached
* Formatting fixes
* Refactor updating stable IPs to be part of handle
* Merge max attempts constant
* Fix ordering for setting TPU name
* Fix bugs and clean up code
* Fix backwards compatibility
* Fix bug with old autostopped clusters
* Fix comment
* Fix assertion statement
* Update assertion message
  Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* Fix linting
* Fix retrieving IPs for TPU vm
* Add optimization for updating IPs
* Linting fix
* Update comment

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>


concretevitamin and others added 3 commits November 4, 2022 09:08

Messy WIP

ed3cf06

Fixes two more yamls

ad61a1e

Improve log UX and ensure stableness

3d3da86

iojw requested review from concretevitamin and Michaelvll November 6, 2022 03:05

iojw added 2 commits November 5, 2022 22:20

Remove print statement

ead021b

Remove task name from logs

9d6aeb1

Fix name for single-node tasks

531171f

concretevitamin reviewed Nov 7, 2022


iojw added 3 commits November 6, 2022 20:04

Update var names and comments for clarity

4f49828

Update logic for single and multi-node clusters

fed1dd4

Cache stable cluster IP list in ResourceHandle

c7ae34f

iojw linked an issue Nov 7, 2022 that may be closed by this pull request

Debug UX: easily know SSH name or public IP from log outputs #1369

Closed

iojw added 4 commits November 7, 2022 14:39

Properly cache and invalidate stable list

31a9faa

Merge branch 'master' into worker-id

2e3523a

Merge branch 'master' into worker-id

a5ab207

Add back SKYPILOT_NODE_IPS

f7ec774

Michaelvll reviewed Nov 8, 2022


iojw added 3 commits November 8, 2022 13:42

Update log file name

965a1ab

Refactor backend to use cached stable IP list

ac6c4c9

Fix spot test

396eb09

Fix ordering for setting TPU name

20e6e5a

Michaelvll reviewed Nov 22, 2022


Michaelvll added the "do not merge" label Nov 22, 2022

iojw added 5 commits November 27, 2022 20:47

Fix bugs and clean up code

7a38ccd

Fix backwards compatibility

1ac6eb4

Fix bug with old autostopped clusters

a7b812c

Fix comment

28ad6ea

Merge branch 'master' into worker-id

621aff9

Fix assertion statement

2625788

Michaelvll reviewed Nov 29, 2022


iojw and others added 2 commits November 29, 2022 16:07

Update assertion message

2300b10

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

Fix linting

66986e9

Michaelvll reviewed Nov 30, 2022


iojw added 3 commits November 30, 2022 08:39

Fix retrieving IPs for TPU vm

3ca0e56

Add optimization for updating IPs

19228f6

Linting fix

c5ade55

Michaelvll approved these changes Nov 30, 2022


Michaelvll removed the "do not merge" label Nov 30, 2022

Update comment

26ead20

iojw merged commit e2cddf9 into master Nov 30, 2022

iojw deleted the worker-id branch November 30, 2022 22:14

infwinston mentioned this pull request Dec 1, 2022

[TPU] Can't launch a TPU Pod #1480

Closed

iojw mentioned this pull request Dec 1, 2022

Fix ResourceHandle semantics #1481

Merged

1 task
