New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[autoscaler] Enable creating multiple clusters in one resource group … #22997
[autoscaler] Enable creating multiple clusters in one resource group … #22997
Conversation
@gramhagen @richardliaw Please take a view, thanks a lot. |
>>> provider.non_terminated_nodes({TAG_RAY_NODE_KIND: "worker", TAG_RAY_CLUSTER_NAME: "cluster_name"}) | ||
["node-1", "node-2"] | ||
""" | ||
tag_cluster_name = {TAG_RAY_CLUSTER_NAME: self.cluster_name} | ||
tag_filters.update(tag_cluster_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this be better to move under _get_filtered_nodes? also I would just add a direct check for that tag vs. updating the tag_filters dict to avoid undesired side effects to the dictionary reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I have moved it under _get_filtered_nodes
, and also make further modifications to avoid undesired side effects to the dictionary reference.
@DmitriGekhtman Could you please help review / approve this PR? Thanks a lot. |
Hi, I have come across this issue of not being able to run multiple clusters in a single resource group. What's the status of this PR, it looks like it's yet to be approved? Ed. |
Yes, we need at least 1 approving review by reviewers with write access. So, @ijrsvt @DmitriGekhtman Could you please help review / approve this PR? Thanks a lot. |
One additional issue I seem to have uncovered is the additional resources The resources I am referring to are:
By default these resources are created into a resource group, but given identical names, as above in the square brackets. When creating a new cluster using Traceback (most recent call last):
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2269, in main
return cli()
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
return f(*args, **kwargs)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1158, in up
create_or_update_cluster(
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 276, in create_or_update_cluster
get_or_create_head_node(
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 641, in get_or_create_head_node
nodes = provider.non_terminated_nodes(head_node_tags)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 142, in non_terminated_nodes
nodes = self._get_filtered_nodes(tag_filters=tag_filters)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 38, in wrapper
return f(self, *args, **kwargs)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in _get_filtered_nodes
nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in <listcomp>
nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 129, in __next__
return next(self._page_iterator)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 76, in __next__
self._response = self._get_next(self.continuation_token)
File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 1096, in get_next
raise HttpResponseError(response=response, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed
Message: The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials. On the checking the error more carefully, I was able to determine that the issue was that although a new managed identity was created in the new resource group, the identity that was being used for the creation of the resources was in fact the older identity created for my original cluster, in a different resource group. I was able to determine this by looking at the |
yeah good point @ecm200. out of curiosity why do you want to deploy these resources into the same resource group? |
Hi @gramhagen, I can actually see value in having multiples clusters in a single resource group, or indeed being able to have multiple clusters in separate resource groups. Right now, I am not able to do either. If I try to deploy into the same resource group, then due to the above identified problem, execution of the creation of the second cluster using If I try to create a new cluster in a separate resource group, the issue I am encountering is one of permissions. It would appear that the managed identity that is created as part of the cluster creation process has fixed name, because it would appear to be hardwired in the Is there a way these objects could have modified names which associate it with the resource group? |
yes, this is my question on the use case. we can implement it such that all resources have a unique string appended to them. I just wanted to better understand if the reason for having multiple cluster was to re-use the vnet so they could talk to each other or something like that? or it might just be a user is restricted from creating multiple resource groups and wants to have multiple independent experiments. |
@gramhagen ok, I understand you angle. From our point of view, the reason for wanting to create clusters in separate resource groups but on the same subscription, then one good use case for that is having a "production" cluster, which perhaps is serving more front end facing compute jobs. However, we'd also like a "research and development" cluster, which can be modified as the research requires whilst not worrying about disrupting a production service. Additionally, the multiple clusters per resource group would serve the "research and development" paradigm, as I can see a case for wanting to have multiple clusters for different projects etc. I would think in the context of multiple clusters per resource group, then a single set of VNET / NSG / MI per resource group would suffice. |
Programmatically I would say that's probably easier as well, given all you need to know is the resource group name, and then you can reconstruct the resource names when they are needed. I can see the way the IDs for the resources are provided are to search on a name, and of course, the issue comes when two resources of the same name have been created. For example, setting the NSG using So having unique names would appear to solve this issue. |
Hi @gramhagen, What's the status of these changes? Any idea if they will make it into a release coming soon? Would be eager to try out this functionality as it's something we'd very much like to be able to do quite soon. |
pretty swamped unfortunately, if you or @HongW2019 want to try adding a unique string I'll probably be able to take a look over the weekend. not sure when I'll be able to get to it though. |
Happy to try and help if I get the time. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue! |
@gramhagen do you have time to pick this one up again? |
yeah, i have some time today to test adding a unique id for each cluster deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@gramhagen let me know when you have tested this :)
@DmitriGekhtman I suggest we close this and use #26392 as the support for multiple cluster per rg is more extensively supported there. @HongW2019 thanks for the additions, I've incorporated them into the other PR! |
…source group. Using unique id to set subnet random seed, change msi and vnet names, logging unique id, and adding it to filter vms in cluster. Example template files updated with comments. (#26392) Why are these changes needed? Adding support for deploying multiple clusters into the same azure resource group Changes: Adding unique_id field to provider section of yaml, if not provided one will be created based on hashing the resource group and cluster name. This will be appended to the name of all resources deployed to azure so they can co-exist in the same resource group (provided the cluster name is changed) Pulled in changes from [autoscaler] Enable creating multiple clusters in one resource group … #22997 to use cluster name when filtering vms to retrieve nodes only in the current cluster Added option to explicitly specify the subnet mask, otherwise use the resource group and cluster name as a seed and randomly choose a subnet to use for the vnet (to avoid collisions with other vnets) Updated yaml example files with new provider values and explanations Pulling resource_ids from initial azure-config-template deployment to pass into vm deployment to avoid matching hard-coded resource names across templates Related issue number Closes #22996 Supersedes #22997 Signed-off-by: Scott Graham <scgraham@microsoft.com> Signed-off-by: Scott Graham <scgraham@microsoft.com> Co-authored-by: Scott Graham <scgraham@microsoft.com>
…on Azure
Why are these changes needed?
This PR is to fix #22996, then enables Ray autoscaler creating multiple clusters in the same resource group on Azure.
When Ray autoscaler decides whether to create or update a cluster, it will list/filter non-terminated node. On GCP and AWS, there is an extra filter condition about cluster name, but there misses such condition on Azure. So when there are existing cluster nodes, autoscaler will not create a new head node because it detect there is already a node with tag " TAG_RAY_NODE_KIND: "head"
Related issue number
Closes #22996
Checks
scripts/format.sh
to lint the changes in this PR.