
[Core] Make multi-node job fail fast when one fails, and output segment fault #3081

Merged
merged 25 commits into master from print-segmentation-fault on Feb 9, 2024

Conversation

@Michaelvll (Collaborator) commented Feb 4, 2024

Fixes #3080
The Ray driver somehow fails to print the segmentation fault output when the job crashes with that error. This PR improves the UX for this case by checking whether the return code is 139 and printing out the likely reason.
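For context, a process killed by SIGSEGV conventionally surfaces as return code 139 (128 + signal 11). A minimal sketch of mapping such a return code to a human-readable hint (illustrative only, not SkyPilot's actual code; `explain_returncode` is a hypothetical helper):

```python
import signal
import subprocess

SIGSEGV_RETURNCODE = 128 + signal.SIGSEGV  # 139 on Linux

def explain_returncode(returncode: int) -> str:
    """Map a shell-style return code to a human-readable hint."""
    if returncode == SIGSEGV_RETURNCODE:
        return (f'Job failed with return code {returncode}: '
                'the program likely crashed with a segmentation fault.')
    return f'Job failed with return code {returncode}.'

# Example: a shell that exits with 139, as the kernel reports for SIGSEGV.
proc = subprocess.run(['sh', '-c', 'exit 139'])
print(explain_returncode(proc.returncode))
```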

This PR also fixes #3116: in the multi-node case where one of the workers fails, we should fail the entire job quickly instead of waiting for the other workers. For example:

```yaml
num_nodes: 4

run: |
  if [ "$SKYPILOT_NODE_RANK" == "2" ]; then
      exit 1
  fi
  sleep 10000
```
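Outside of Ray, the fail-fast behavior can be sketched with `concurrent.futures`: when any node's task fails, signal the others to abort instead of letting them run to completion (an illustrative sketch only, not SkyPilot's implementation; `node_task` and the stop event are made up for the demo):

```python
import concurrent.futures as cf
import threading

stop = threading.Event()

def node_task(rank: int) -> int:
    if rank == 2:
        stop.set()  # signal the other nodes that a peer failed
        raise RuntimeError(f'node {rank} failed')
    # Long-running work; interrupted as soon as any node fails.
    stop.wait(timeout=100)
    if stop.is_set():
        raise RuntimeError(f'node {rank} aborted: a peer failed')
    return rank

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(node_task, r): r for r in range(4)}
    done, _ = cf.wait(futures, return_when=cf.ALL_COMPLETED)
    failed = sorted(futures[f] for f in done if f.exception() is not None)
    print('failed ranks:', failed)  # all ranks abort quickly, not after 10000s
```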

This PR also makes job submission a bit more efficient by releasing the resources earlier.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_multi_node_failure
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title [minor] Make job scheduling more efficient and output segment fault [Core] Make multi-node job fail fast when one fails, and output segment fault Feb 7, 2024
@concretevitamin (Collaborator) left a comment
Thanks @Michaelvll. Could we also add a basic smoke test to cover the multi-node job failure case?

@concretevitamin (Collaborator) left a comment
Thanks @Michaelvll, LGTM.

@Michaelvll (Collaborator, Author) commented Feb 8, 2024
@concretevitamin In order to support different setup commands on different nodes, I added SKYPILOT_SETUP_NODE_IPS and SKYPILOT_SETUP_NODE_RANK env vars for the setup phase (we don't reuse the names of the env vars from the run section, to avoid confusion about why the same env vars would differ between setup and run). Fixes #2546. Wdyt?
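A small sketch of how a setup step could branch on these env vars, assuming the semantics described above (rank 0 is the head node; the IP list has one address per line, ordered by rank). The `setup_role` helper and the sample IPs are made up for illustration:

```python
import os

def setup_role(environ=os.environ) -> str:
    """Describe this node's role during setup, based on the setup env vars."""
    rank = int(environ.get('SKYPILOT_SETUP_NODE_RANK', '0'))
    ips = environ.get('SKYPILOT_SETUP_NODE_IPS', '').splitlines()
    head_ip = ips[0] if ips else None  # rank 0's IP is first in the list
    if rank == 0:
        return f'head (ip={head_ip})'
    return f'worker-{rank} (head at {head_ip})'

# Simulated environment for a 3-node cluster, as seen on the rank-2 node.
env = {'SKYPILOT_SETUP_NODE_RANK': '2',
       'SKYPILOT_SETUP_NODE_IPS': '10.0.0.1\n10.0.0.2\n10.0.0.3'}
print(setup_role(env))  # worker-2 (head at 10.0.0.1)
```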

@concretevitamin (Collaborator) left a comment
Mostly LGTM with some nits, thanks @Michaelvll.

* - ``SKYPILOT_SETUP_NODE_IPS``
- A string of IP addresses of the nodes in the cluster, in the same order as the node ranks, with one IP address per line.
- 1.2.3.4

@concretevitamin (Collaborator) commented

Add something like:

Since setup commands always run on all nodes of a cluster, SkyPilot ensures both of these environment variables (the ranks and the IP list) never change across multiple setups on the same cluster.

What about after restarting a cluster?

@Michaelvll (Collaborator, Author) commented Feb 9, 2024

Good point! Added the sentence.

For restarted clusters, we only guarantee that NODE_RANK==0 is always the head node; the order of the other instances depends on the cloud's implementation for assigning external IPs to the restarted nodes. We sort the nodes by external IP, so if the external IPs change, the node ranks of the workers may be reordered:

```python
# Keep the head node first; sort the workers by external IP (index 1).
stable_internal_external_ips = [internal_external_ips[0]] + sorted(
    internal_external_ips[1:], key=lambda x: x[1])
```

I guess the order of the worker nodes should be fine as long as we keep the head node first, since most heterogeneity after restarting the cluster comes from the head node vs. the workers, e.g., we only store the job state and SkyPilot logs on the head node.
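The head-first property of that sort can be checked with a toy example using made-up (internal, external) IP pairs: the head node stays at rank 0 even when its external IP would sort after the workers', while the workers are re-ordered by external IP.

```python
def stable_node_order(internal_external_ips):
    """Keep the head node first; sort workers by external IP (index 1)."""
    return [internal_external_ips[0]] + sorted(
        internal_external_ips[1:], key=lambda x: x[1])

# Made-up (internal_ip, external_ip) pairs; the first entry is the head node.
ips = [('10.0.0.1', '54.0.0.9'),   # head: never re-sorted
       ('10.0.0.3', '54.0.0.2'),
       ('10.0.0.2', '54.0.0.5')]
order = stable_node_order(ips)
print(order[0])  # head stays at rank 0 despite its external IP sorting last
```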

@Michaelvll Michaelvll merged commit b042741 into master Feb 9, 2024
19 checks passed
@Michaelvll Michaelvll deleted the print-segmentation-fault branch February 9, 2024 08:20