When I tried to run the resnet_app.py example with Resources(sky.GCP(), accelerator={'V100': 4}), Sky tried to connect to an instance that another user had already launched with only 1 V100 GPU, and raised the following error (the f0eb6cb1 in the error message is the ID of the instance launched by the other user). The likely cause is that we use the same cluster_name in gcp-ray.yml for everyone.
Traceback (most recent call last):
File "/data/zhwu/miniconda3/envs/sky/bin/ray", line 8, in <module>
sys.exit(main())
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1970, in main
return cli()
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 963, in up
create_or_update_cluster(
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 242, in create_or_update_cluster
get_or_create_head_node(config, config_file, no_restart, restart_only, yes,
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 634, in get_or_create_head_node
provider.terminate_node(head_node)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 36, in method_with_retries
return method(self, *args, **kwargs)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 162, in terminate_node
resource = self._get_resource_depending_on_node_name(node_id)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 85, in _get_resource_depending_on_node_name
return self.resources[GCPNodeType.name_to_type(node_name)]
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 125, in name_to_type
return GCPNodeType(name.split("-")[-1])
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 384, in __call__
return cls.__new__(cls, value)
File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 702, in __new__
raise ve_exc
ValueError: 'f0eb6cb1' is not a valid GCPNodeType
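The last frames of the traceback show the mechanism: name_to_type takes the text after the final "-" in the node name and passes it to the GCPNodeType enum. The sketch below reproduces that step in isolation. NodeType is a stand-in enum with assumed members, and the node names are hypothetical; only the split expression is taken from the traceback.

```python
from enum import Enum


class NodeType(Enum):
    # Stand-in for Ray's GCPNodeType; these member values are assumptions.
    COMPUTE = "compute"
    TPU = "tpu"


def name_to_type(name: str) -> NodeType:
    # Same expression as in the traceback: the last "-"-separated
    # piece of the node name must be a valid enum value.
    return NodeType(name.split("-")[-1])


def describe(name: str) -> str:
    """Return the parsed type, or the error message if parsing fails."""
    try:
        return name_to_type(name).value
    except ValueError as exc:
        return str(exc)


# Hypothetical node names, for illustration only.
print(describe("ray-sky-head-abcd1234-compute"))
print(describe("ray-sky-head-f0eb6cb1"))
```

A node whose name ends in an instance ID rather than a type suffix fails at the enum lookup, which matches the ValueError above.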
Arguably this is a feature. On naming: since we all use the same cluster name, the names collide. On the user side: since GCP and Azure put users together, one user can access another user's VMs.
We will “fix” this in the in-flight CLI PR, where cluster names can be auto-generated.
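As a rough illustration of what auto-generated cluster names could look like, here is a minimal sketch. The function generate_cluster_name, the "sky" prefix, and the name layout are all hypothetical and are not taken from the CLI PR; the idea is only that a per-user, randomized name keeps two users from resolving to the same cluster.

```python
import getpass
import uuid
from typing import Optional


def generate_cluster_name(user: Optional[str] = None, prefix: str = "sky") -> str:
    """Build a cluster name that is unlikely to collide across users.

    Hypothetical helper: the layout <prefix>-<user>-<8 hex chars> is an
    assumption made for this example.
    """
    if user is None:
        # Fall back to the local login name when no user is given.
        user = getpass.getuser()
    suffix = uuid.uuid4().hex[:8]  # random component, differs per call
    return f"{prefix}-{user}-{suffix}"


print(generate_cluster_name(user="alice"))
```

The generated value would replace the shared cluster_name in gcp-ray.yml, so each launch targets its own cluster instead of one created by someone else.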