Avoid attempting to connect to GCP instance launched by other users #71

Michaelvll · 2021-11-29T21:01:06Z

When I run the resnet_app.py example with Resources(sky.GCP(), accelerator={'V100': 4}), Sky tries to connect to an instance that another user has already launched with only 1 V100 GPU, and then raises the following error (the f0eb6cb1 in the error message is the ID of the instance launched by the other user). The likely cause is that we use the same cluster_name in gcp-ray.yml for everyone.

Traceback (most recent call last):
  File "/data/zhwu/miniconda3/envs/sky/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1970, in main
    return cli()
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 963, in up
    create_or_update_cluster(
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 242, in create_or_update_cluster
    get_or_create_head_node(config, config_file, no_restart, restart_only, yes,
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 634, in get_or_create_head_node
    provider.terminate_node(head_node)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 36, in method_with_retries
    return method(self, *args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 162, in terminate_node
    resource = self._get_resource_depending_on_node_name(node_id)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 85, in _get_resource_depending_on_node_name
    return self.resources[GCPNodeType.name_to_type(node_name)]
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 125, in name_to_type
    return GCPNodeType(name.split("-")[-1])
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 384, in __call__
    return cls.__new__(cls, value)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 702, in __new__
    raise ve_exc
ValueError: 'f0eb6cb1' is not a valid GCPNodeType
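
To illustrate the last frames of the traceback, here is a minimal sketch of how the error can arise. The `NodeType` enum and `try_parse` helper are simplified stand-ins written for this example, not Ray's actual classes. The instance names are hypothetical; only the `name.split("-")[-1]` parsing step is taken from the traceback.

```python
from enum import Enum


class NodeType(Enum):
    """Simplified stand-in for the GCPNodeType enum named in the traceback."""
    COMPUTE = "compute"
    TPU = "tpu"


def name_to_type(name: str) -> NodeType:
    # Same parsing step as in the traceback: the last dash-separated
    # token of the instance name is interpreted as the node type.
    return NodeType(name.split("-")[-1])


def try_parse(name: str) -> str:
    """Return the parsed node type, or the error text if parsing fails."""
    try:
        return name_to_type(name).value
    except ValueError as exc:
        return f"ValueError: {exc}"


# Hypothetical instance names, for illustration only.
# A name whose last token is a node type parses cleanly.
print(try_parse("ray-shared-head-0a1b2c3d-compute"))
# A name whose last token is an instance ID fails, as in the traceback.
print(try_parse("ray-shared-head-f0eb6cb1"))
```

Under these assumptions, any instance that shares the cluster name but ends in a token other than a known node type triggers the same `ValueError`.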


concretevitamin · 2021-12-01T15:55:47Z

Two quick thoughts:

1. Arguably this is a feature. On the naming side, since we all use the same cluster names, they collide. On the user side, since GCP and Azure put users together, one user can access another user's VMs.
2. We will "fix" this in the in-flight CLI PR, where cluster names can be auto-generated.
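
As a rough sketch of what auto-generated cluster names could look like (this is not the code from the CLI PR; `make_cluster_name`, the `sky` prefix, and the name layout are assumptions made for illustration), each launch can combine the username with a short random suffix so that two users no longer share a name:

```python
import re
import uuid


def make_cluster_name(user: str, prefix: str = "sky") -> str:
    """Build a per-user cluster name with a random suffix (hypothetical scheme)."""
    # Keep only lowercase letters and digits so the result stays within
    # the character set GCP accepts in resource names.
    safe_user = re.sub(r"[^a-z0-9]", "", user.lower()) or "user"
    # 8 hex characters of randomness to separate launches by the same user.
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{safe_user}-{suffix}"


# Example with a hypothetical username; in practice the caller might
# pass getpass.getuser() instead.
print(make_cluster_name("Alice.Smith"))
```

The generated name would then be written into the cluster_name field of the generated YAML instead of a fixed value.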

Michaelvll · 2021-12-03T20:38:37Z

This is now solved by the CLI.

Michaelvll closed this as completed Dec 3, 2021
