Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler] Enable creating multiple clusters in one resource group … #22997

Closed

Conversation

HongW2019
Copy link

@HongW2019 HongW2019 commented Mar 10, 2022

…on Azure

Why are these changes needed?

This PR is to fix #22996, then enables Ray autoscaler creating multiple clusters in the same resource group on Azure.

When Ray autoscaler decides whether to create or update a cluster, it will list/filter non-terminated node. On GCP and AWS, there is an extra filter condition about cluster name, but there misses such condition on Azure. So when there are existing cluster nodes, autoscaler will not create a new head node because it detect there is already a node with tag " TAG_RAY_NODE_KIND: "head"

Related issue number

Closes #22996

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@HongW2019
Copy link
Author

@gramhagen @richardliaw Please take a view, thanks a lot.

Comment on lines 139 to 143
>>> provider.non_terminated_nodes({TAG_RAY_NODE_KIND: "worker", TAG_RAY_CLUSTER_NAME: "cluster_name"})
["node-1", "node-2"]
"""
tag_cluster_name = {TAG_RAY_CLUSTER_NAME: self.cluster_name}
tag_filters.update(tag_cluster_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this be better to move under _get_filtered_nodes? also I would just add a direct check for that tag vs. updating the tag_filters dict to avoid undesired side effects to the dictionary reference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have moved it under _get_filtered_nodes, and also make further modifications to avoid undesired side effects to the dictionary reference.

@HongW2019
Copy link
Author

@DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

@ecm200
Copy link

ecm200 commented Mar 23, 2022

Hi,

I have come across this issue of not being able to run multiple clusters in a single resource group.

What's the status of this PR, it looks like it's yet to be approved?

Ed.

@HongW2019
Copy link
Author

Hi,

I have come across this issue of not being able to run multiple clusters in a single resource group.

What's the status of this PR, it looks like it's yet to be approved?

Ed.

Yes, we need at least 1 approving review by reviewers with write access. So, @ijrsvt @DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

@ecm200
Copy link

ecm200 commented Mar 23, 2022

Hi,
I have come across this issue of not being able to run multiple clusters in a single resource group.
What's the status of this PR, it looks like it's yet to be approved?
Ed.

Yes, we need at least 1 approving review by reviewers with write access. So, @ijrsvt @DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

One additional issue I seem to have uncovered is the additional resources ray creates when bringing up a cluster on Azure. Due to issue with having multiple clusters in a resource group, I have attempted to create a second cluster, in a new resource group, and this has encountered problems as these additional resources have the same name.

The resources I am referring to are:

  1. Managed Identity [ray-msi-user-identity]
  2. Network Security Group [ray-nsg]
  3. Vnet [ray-vnet]

By default these resources are created into a resource group, but given identical names, as above in the square brackets.

When creating a new cluster using ray up it fails due to permissions errors of the managed identity. It threw the following error:

Traceback (most recent call last):
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2269, in main
    return cli()
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1158, in up
    create_or_update_cluster(
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 276, in create_or_update_cluster
    get_or_create_head_node(
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 641, in get_or_create_head_node
    nodes = provider.non_terminated_nodes(head_node_tags)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 142, in non_terminated_nodes
    nodes = self._get_filtered_nodes(tag_filters=tag_filters)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 38, in wrapper
    return f(self, *args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in _get_filtered_nodes
    nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in <listcomp>
    nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 129, in __next__
    return next(self._page_iterator)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 76, in __next__
    self._response = self._get_next(self.continuation_token)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 1096, in get_next
    raise HttpResponseError(response=response, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed
Message: The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials.

On the checking the error more carefully, I was able to determine that the issue was that although a new managed identity was created in the new resource group, the identity that was being used for the creation of the resources was in fact the older identity created for my original cluster, in a different resource group. I was able to determine this by looking at the object id throwing the error and those of the managed identities. So it would appear that the first managed identity is being used, which does not the correct scope, as it is created with permissions only for the specified resource group.

@gramhagen
Copy link
Contributor

yeah good point @ecm200. out of curiosity why do you want to deploy these resources into the same resource group?
trying to understand if we should enable re-use of these resources or the deployments should be completely independent.

@ecm200
Copy link

ecm200 commented Mar 23, 2022

yeah good point @ecm200. out of curiosity why do you want to deploy these resources into the same resource group? trying to understand if we should enable re-use of these resources or the deployments should be completely independent.

Hi @gramhagen,
Thanks for your reply.
It goes without saying, but we really love Ray, and welcome any chance to feed back, so thanks!

I can actually see value in having multiples clusters in a single resource group, or indeed being able to have multiple clusters in separate resource groups.

Right now, I am not able to do either.

If I try to deploy into the same resource group, then due to the above identified problem, execution of the creation of the second cluster using ray up config_cluster2.yaml simply results in the already running cluster being updated.

If I try to create a new cluster in a separate resource group, the issue I am encountering is one of permissions. It would appear that the managed identity that is created as part of the cluster creation process has fixed name, because it would appear to be hardwired in the JSON templates in the package files. In azure-config-template.json I can see that all the fixed names for the managed identity, the vnet and the network security group. Therefore, when computational resources are attempted to be created in the new resource group for the new cluster, it uses the previously created managed identity from the FIRST resource group, which doesn't have the scope to create the resources in the NEW resource group.

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.
Otherwise, could the resources be referenced by their unique ID rather than the name?

@gramhagen
Copy link
Contributor

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.

yes, this is my question on the use case. we can implement it such that all resources have a unique string appended to them. I just wanted to better understand if the reason for having multiple cluster was to re-use the vnet so they could talk to each other or something like that? or it might just be a user is restricted from creating multiple resource groups and wants to have multiple independent experiments.

@ecm200
Copy link

ecm200 commented Mar 23, 2022

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.

yes, this is my question on the use case. we can implement it such that all resources have a unique string appended to them. I just wanted to better understand if the reason for having multiple cluster was to re-use the vnet so they could talk to each other or something like that? or it might just be a user is restricted from creating multiple resource groups and wants to have multiple independent experiments.

@gramhagen ok, I understand you angle.

From our point of view, the reason for wanting to create clusters in separate resource groups but on the same subscription, then one good use case for that is having a "production" cluster, which perhaps is serving more front end facing compute jobs.

However, we'd also like a "research and development" cluster, which can be modified as the research requires whilst not worrying about disrupting a production service.

Additionally, the multiple clusters per resource group would serve the "research and development" paradigm, as I can see a case for wanting to have multiple clusters for different projects etc.

I would think in the context of multiple clusters per resource group, then a single set of VNET / NSG / MI per resource group would suffice.

@ecm200
Copy link

ecm200 commented Mar 23, 2022

ecm200

Programmatically I would say that's probably easier as well, given all you need to know is the resource group name, and then you can reconstruct the resource names when they are needed.

I can see the way the IDs for the resources are provided are to search on a name, and of course, the issue comes when two resources of the same name have been created. For example, setting the NSG using resourceId('Microsoft.Network/networkSecurityGroups', 'ray-nsg').

So having unique names would appear to solve this issue.

@ecm200
Copy link

ecm200 commented Mar 31, 2022

Hi @gramhagen,

What's the status of these changes?

Any idea if they will make it into a release coming soon?
Would they at least be put into the v2.0.0dev branch?

Would be eager to try out this functionality as it's something we'd very much like to be able to do quite soon.

@gramhagen
Copy link
Contributor

pretty swamped unfortunately, if you or @HongW2019 want to try adding a unique string I'll probably be able to take a look over the weekend. not sure when I'll be able to get to it though.

@ecm200
Copy link

ecm200 commented Mar 31, 2022

pretty swamped unfortunately, if you or @HongW2019 want to try adding a unique string I'll probably be able to take a look over the weekend. not sure when I'll be able to get to it though.

Happy to try and help if I get the time.

@stale
Copy link

stale bot commented May 1, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 1, 2022
@stale
Copy link

stale bot commented Jun 24, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@DmitriGekhtman
Copy link
Contributor

@gramhagen do you have time to pick this one up again?
More users are running into this.
https://discuss.ray.io/t/what-makes-a-cluster-unique-in-the-resource-group/6710

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jul 5, 2022
@gramhagen
Copy link
Contributor

yeah, i have some time today to test adding a unique id for each cluster deployment

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@gramhagen let me know when you have tested this :)

@gramhagen
Copy link
Contributor

@DmitriGekhtman I suggest we close this and use #26392 as the support for multiple cluster per rg is more extensively supported there.

@HongW2019 thanks for the additions, I've incorporated them into the other PR!

wuisawesome pushed a commit that referenced this pull request Aug 17, 2022
…source group. Using unique id to set subnet random seed, change msi and vnet names, logging unique id, and adding it to filter vms in cluster. Example template files updated with comments. (#26392)

Why are these changes needed?
Adding support for deploying multiple clusters into the same azure resource group

Changes:

Adding unique_id field to provider section of yaml, if not provided one will be created based on hashing the resource group and cluster name. This will be appended to the name of all resources deployed to azure so they can co-exist in the same resource group (provided the cluster name is changed)
Pulled in changes from [autoscaler] Enable creating multiple clusters in one resource group … #22997 to use cluster name when filtering vms to retrieve nodes only in the current cluster
Added option to explicitly specify the subnet mask, otherwise use the resource group and cluster name as a seed and randomly choose a subnet to use for the vnet (to avoid collisions with other vnets)
Updated yaml example files with new provider values and explanations
Pulling resource_ids from initial azure-config-template deployment to pass into vm deployment to avoid matching hard-coded resource names across templates
Related issue number
Closes #22996
Supersedes #22997

Signed-off-by: Scott Graham <scgraham@microsoft.com>

Signed-off-by: Scott Graham <scgraham@microsoft.com>
Co-authored-by: Scott Graham <scgraham@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] [Autoscaler] [Azure] Can't create multiple clusters on Azure
6 participants