GCP/provisioner: Handle the RESOURCE_NOT_FOUND error. #1842

concretevitamin · 2023-04-08T17:23:39Z

I tried for a few days and finally hit a set_trace() inside need_ray_up(), where I verified that the re.search() there matches. This doesn't test things end-to-end, however.

sky launch -y --gpus A100-80GB:8 --cloud gcp --down -c dbg -r
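
For reference, the kind of check exercised above can be sketched roughly as follows. This is a minimal illustration, not the actual body of need_ray_up(): the helper name `looks_retryable`, the pattern list, and the sample stderr strings are all assumptions made for the example.

```python
import re

# Hypothetical pattern list; the real patterns used in need_ray_up() may differ.
_RETRYABLE_PATTERNS = [
    r"RESOURCE_NOT_FOUND",
]


def looks_retryable(stderr: str) -> bool:
    """Return True if the provisioning stderr matches a retryable pattern."""
    return any(re.search(pattern, stderr) for pattern in _RETRYABLE_PATTERNS)


if __name__ == "__main__":
    # Made-up stderr snippets, for illustration only.
    print(looks_retryable("{'code': 'RESOURCE_NOT_FOUND', 'message': '...'}"))
    print(looks_retryable("{'code': 'QUOTA_EXCEEDED', 'message': '...'}"))
```

A check like this only confirms that the pattern matches a given string; it says nothing about whether the retry path behaves correctly end-to-end.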

Tested (run the relevant ones):

Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Michaelvll

Nice! Thanks for fixing this issue @concretevitamin!

Michaelvll · 2023-04-08T17:37:38Z

sky/backends/cloud_vm_ray_backend.py

+                    # https://github.com/skypilot-org/skypilot/issues/1797
+                    # The VM may be alive on console. In the inner provision
+                    # loop we have used retries to recover but failed. The
+                    # provision loop will terminate the potentially live VMs
+                    # and move onto the next zone. Since the VM may have been
+                    # provisioned in this zone, it doesn't seem right to block
+                    # the current zone.
+                    pass


Since we have already retried in _gang_schedule_ray_up by adding that error to need_ray_up, we should add the zone to the block list here, so that the outer failover loop does not go to that zone again.

concretevitamin

IIRC, without -r the outer loop will do a single pass over all zones, so this zone will not be tried again. So blocking vs. not blocking has the same effect. With -r, it's desirable to retry this zone. Is this the case?

Michaelvll

Our yield zones will not loop through all the zones in a region for spot, as we currently rely on the optimizer to generate per-zone optimization results. That is to say, the failover for zones will happen here, but not here.
However, since we have added to_provision to the block list here, that should explain why it does not go to that zone again. I still prefer adding the zone to the block list here for consistency with the other parts. Wdyt?

concretevitamin

Oh, that's right. I got caught up thinking _yield_zones() would go through the zones. Added now.
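
To make the thread above concrete, here is a minimal sketch of a failover loop with inner retries and a zone block list. It is not the code in cloud_vm_ray_backend.py: `provision_with_failover`, `try_zone`, and the zone names are hypothetical, and the real loop also involves the optimizer, which is left out here.

```python
from typing import Callable, Iterable, Optional, Set


def provision_with_failover(
    zones: Iterable[str],
    try_zone: Callable[[str], bool],
    max_retries: int = 2,
    blocked: Optional[Set[str]] = None,
) -> Optional[str]:
    """Try each non-blocked zone; block a zone once its retries are exhausted.

    Returns the zone that succeeded, or None if every zone failed.
    """
    blocked = set() if blocked is None else blocked
    for zone in zones:
        if zone in blocked:
            # An earlier pass already gave up on this zone.
            continue
        # Inner loop: one initial attempt plus `max_retries` retries.
        for _ in range(max_retries + 1):
            if try_zone(zone):
                return zone
        # Retries exhausted: block the zone so later passes skip it.
        blocked.add(zone)
    return None
```

Passing the same `blocked` set into successive calls models an outer loop that must not revisit a zone that already failed after retries.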

Michaelvll

Thanks for fixing this! LGTM.

GPC/provisioner: Mitigate RESOURCE_NOT_FOUND error for A100s.

1767083

concretevitamin requested a review from Michaelvll April 8, 2023 17:23

Michaelvll reviewed Apr 8, 2023

concretevitamin changed the title ~~GPC/provisioner: Mitigate RESOURCE_NOT_FOUND error for A100s.~~ GCP/provisioner: Handle the RESOURCE_NOT_FOUND error. Apr 8, 2023

Block zone.

116083d

Michaelvll approved these changes Apr 8, 2023

concretevitamin merged commit 15f01d6 into master Apr 8, 2023
15 checks passed

concretevitamin deleted the fix-gcp-not-found branch April 8, 2023 18:18
