
[Core] Make multi-node job fail fast when one fails, and output segment fault #3081

Merged
merged 25 commits into master from print-segmentation-fault on Feb 9, 2024

Conversation

@Michaelvll (Collaborator) commented Feb 4, 2024

Fixes #3080
The Ray driver somehow fails to print the segmentation fault output when the job crashes with that error. This PR improves the UX for this case by checking whether the return code is 139 and printing out the likely reason.
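For context, a process killed by SIGSEGV conventionally surfaces as return code 139 (128 + signal 11). A minimal sketch of mapping such a return code to a human-readable hint (illustrative only, not SkyPilot's actual code; `explain_returncode` is a hypothetical helper):

```python
import signal
import subprocess

SIGSEGV_RETURNCODE = 128 + signal.SIGSEGV  # 139 on Linux

def explain_returncode(returncode: int) -> str:
    """Map a shell-style return code to a human-readable hint."""
    if returncode == SIGSEGV_RETURNCODE:
        return (f'Job failed with return code {returncode}: '
                'the program likely crashed with a segmentation fault.')
    return f'Job failed with return code {returncode}.'

# Example: a shell that exits with 139, as the kernel reports for SIGSEGV.
proc = subprocess.run(['sh', '-c', 'exit 139'])
print(explain_returncode(proc.returncode))
```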

This PR also fixes #3116: in the multi-node case where one of the workers fails, we should fail the entire job quickly instead of waiting for the other workers. For example:

```yaml
num_nodes: 4

run: |
  if [ "$SKYPILOT_NODE_RANK" == "2" ]; then
      exit 1
  fi
  sleep 10000
```
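Outside of Ray, the fail-fast behavior can be sketched with `concurrent.futures`: when any node's task fails, signal the others to abort instead of letting them run to completion (an illustrative sketch only, not SkyPilot's implementation; `node_task` and the stop event are made up for the demo):

```python
import concurrent.futures as cf
import threading

stop = threading.Event()

def node_task(rank: int) -> int:
    if rank == 2:
        stop.set()  # signal the other nodes that a peer failed
        raise RuntimeError(f'node {rank} failed')
    # Long-running work; interrupted as soon as any node fails.
    stop.wait(timeout=100)
    if stop.is_set():
        raise RuntimeError(f'node {rank} aborted: a peer failed')
    return rank

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(node_task, r): r for r in range(4)}
    done, _ = cf.wait(futures, return_when=cf.ALL_COMPLETED)
    failed = sorted(futures[f] for f in done if f.exception() is not None)
    print('failed ranks:', failed)  # all ranks abort quickly, not after 10000s
```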

This PR also makes job submission a bit more efficient by releasing the resources earlier.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_multi_node_failure
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title [minor] Make job scheduling more efficient and output segment fault [Core] Make multi-node job fail fast when one fails, and output segment fault Feb 7, 2024
@concretevitamin (Collaborator) left a comment
Thanks @Michaelvll. Could we also add a basic smoke test to cover the multi-node job failure case?

@concretevitamin (Collaborator) left a comment
Thanks @Michaelvll, LGTM.

@Michaelvll (Collaborator, Author) commented Feb 8, 2024
@concretevitamin In order to support different setup commands on different nodes, I added SKYPILOT_SETUP_NODE_IPS and SKYPILOT_SETUP_NODE_RANK env vars for the setup phase (we don't reuse the names of the env vars from the run section, to avoid confusion about why the same env vars would differ between setup and run). Fixes #2546. Wdyt?
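A small sketch of how a setup step could branch on these env vars, assuming the semantics described above (rank 0 is the head node; the IP list has one address per line, ordered by rank). The `setup_role` helper and the sample IPs are made up for illustration:

```python
import os

def setup_role(environ=os.environ) -> str:
    """Describe this node's role during setup, based on the setup env vars."""
    rank = int(environ.get('SKYPILOT_SETUP_NODE_RANK', '0'))
    ips = environ.get('SKYPILOT_SETUP_NODE_IPS', '').splitlines()
    head_ip = ips[0] if ips else None  # rank 0's IP is first in the list
    if rank == 0:
        return f'head (ip={head_ip})'
    return f'worker-{rank} (head at {head_ip})'

# Simulated environment for a 3-node cluster, as seen on the rank-2 node.
env = {'SKYPILOT_SETUP_NODE_RANK': '2',
       'SKYPILOT_SETUP_NODE_IPS': '10.0.0.1\n10.0.0.2\n10.0.0.3'}
print(setup_role(env))  # worker-2 (head at 10.0.0.1)
```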

@concretevitamin (Collaborator) left a comment
Mostly LGTM with some nits, thanks @Michaelvll.

* - ``SKYPILOT_SETUP_NODE_IPS``
- A string of IP addresses of the nodes in the cluster, in the same order as the node ranks, with one IP address per line.
- 1.2.3.4

@concretevitamin (Collaborator) commented

Add something like:

Since setup commands always run on all nodes of a cluster, SkyPilot ensures both of these environment variables (the ranks and the IP list) never change across multiple setups on the same cluster.

What about after restarting a cluster?

@Michaelvll (Collaborator, Author) commented Feb 9, 2024

Good point! Added the sentence.

For restarted clusters, we only guarantee that NODE_RANK==0 is always the head node; the order of the other instances depends on the cloud's implementation for assigning external IPs to the restarted nodes. We sort the nodes by external IP, so if the external IPs change, the node ranks of the workers may be reordered:

```python
# Keep the head node first; sort the workers by external IP (index 1).
stable_internal_external_ips = [internal_external_ips[0]] + sorted(
    internal_external_ips[1:], key=lambda x: x[1])
```

I guess the order of the worker nodes should be fine as long as we keep the head node first, since most heterogeneity after restarting the cluster comes from the head node vs. the workers, e.g., we only store the job state and SkyPilot logs on the head node.
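The head-first property of that sort can be checked with a toy example using made-up (internal, external) IP pairs: the head node stays at rank 0 even when its external IP would sort after the workers', while the workers are re-ordered by external IP.

```python
def stable_node_order(internal_external_ips):
    """Keep the head node first; sort workers by external IP (index 1)."""
    return [internal_external_ips[0]] + sorted(
        internal_external_ips[1:], key=lambda x: x[1])

# Made-up (internal_ip, external_ip) pairs; the first entry is the head node.
ips = [('10.0.0.1', '54.0.0.9'),   # head: never re-sorted
       ('10.0.0.3', '54.0.0.2'),
       ('10.0.0.2', '54.0.0.5')]
order = stable_node_order(ips)
print(order[0])  # head stays at rank 0 despite its external IP sorting last
```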

@Michaelvll Michaelvll merged commit b042741 into master Feb 9, 2024
19 checks passed
@Michaelvll Michaelvll deleted the print-segmentation-fault branch February 9, 2024 08:20