Avoid attempting to connect to GCP instance launched by other users #71

Closed
Michaelvll opened this issue Nov 29, 2021 · 2 comments

Comments

@Michaelvll
Collaborator

Michaelvll commented Nov 29, 2021

When I run the resnet_app.py example with Resources(sky.GCP(), accelerator={'V100': 4}), sky tries to connect to an instance that another user already launched with only 1 V100 GPU, and raises the following error (the f0eb6cb1 in the error message is the ID of the instance launched by the other user). The likely cause is that we use the same cluster_name in gcp-ray.yml for everyone.

Traceback (most recent call last):
  File "/data/zhwu/miniconda3/envs/sky/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1970, in main
    return cli()
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 963, in up
    create_or_update_cluster(
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 242, in create_or_update_cluster
    get_or_create_head_node(config, config_file, no_restart, restart_only, yes,
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 634, in get_or_create_head_node
    provider.terminate_node(head_node)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 36, in method_with_retries
    return method(self, *args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 162, in terminate_node
    resource = self._get_resource_depending_on_node_name(node_id)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 85, in _get_resource_depending_on_node_name
    return self.resources[GCPNodeType.name_to_type(node_name)]
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 125, in name_to_type
    return GCPNodeType(name.split("-")[-1])
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 384, in __call__
    return cls.__new__(cls, value)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 702, in __new__
    raise ve_exc
ValueError: 'f0eb6cb1' is not a valid GCPNodeType
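
For anyone reading the traceback: the last frames show that the node type is parsed from the final hyphen-separated piece of the node name (`name.split("-")[-1]`), so a name that ends in an instance ID instead of a type suffix fails the enum lookup. Below is a minimal sketch that reproduces the failure. `FakeNodeType`, its member values, and both node names are hypothetical stand-ins, not Ray's actual definitions.

```python
from enum import Enum


class FakeNodeType(Enum):
    # Hypothetical stand-in for GCPNodeType; the member values are assumed.
    COMPUTE = "compute"
    TPU = "tpu"


def name_to_type(name: str) -> FakeNodeType:
    # Same parsing pattern as the line shown in the traceback:
    # take the last "-"-separated piece and look it up as an enum value.
    return FakeNodeType(name.split("-")[-1])


# Hypothetical node names, used only for illustration.
good_name = "ray-cluster-head-f0eb6cb1-compute"
bad_name = "ray-cluster-head-f0eb6cb1"

print(name_to_type(good_name))

try:
    name_to_type(bad_name)
except ValueError as exc:
    # The last piece is the instance ID, which is not a valid enum value.
    print("lookup failed:", exc)
```

This only illustrates the parsing step. It says nothing about why a node with an unexpected name shows up under the shared cluster_name in the first place.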
@concretevitamin
Collaborator

Two quick thoughts:

  1. Arguably this is a feature. On naming: since we use the same cluster name for everyone, the names collide. On the user side: since GCP and Azure put users together, one user can access another’s VMs.

  2. We will “fix” this in the in-flight CLI PR, where cluster names can be auto-generated.
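
To make the second point concrete, here is a rough sketch of one way a cluster name could be auto-generated so that two users do not collide. The helper `generate_cluster_name`, the `sky-` prefix, and the name format are all assumptions for illustration; this is not taken from the CLI PR.

```python
import getpass
import re
import uuid
from typing import Optional


def generate_cluster_name(user: Optional[str] = None) -> str:
    """Return a cluster name of the form sky-<4 hex chars>-<user>.

    Hypothetical helper: the format is an assumption, not the CLI's scheme.
    """
    # Fall back to the local login name when no user is given.
    user = user or getpass.getuser()
    # Restrict to a conservative character set (lowercase letters,
    # digits, hyphens) so the name is safe to use as a resource name.
    safe_user = re.sub(r"[^a-z0-9-]", "-", user.lower())
    # A short random suffix separates multiple clusters from the same user.
    suffix = uuid.uuid4().hex[:4]
    return f"sky-{suffix}-{safe_user}"


print(generate_cluster_name("alice"))
```

With a scheme like this, the user part keeps different users' clusters apart and the random part keeps one user's clusters apart, at the cost of names that are harder to remember.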

@Michaelvll
Collaborator Author

This is now solved by the CLI.
