[autoscaler] Enable creating multiple clusters in one resource group … #22997

HongW2019 · 2022-03-10T08:26:21Z

…on Azure

Why are these changes needed?

This PR is to fix #22996, then enables Ray autoscaler creating multiple clusters in the same resource group on Azure.

When Ray autoscaler decides whether to create or update a cluster, it will list/filter non-terminated node. On GCP and AWS, there is an extra filter condition about cluster name, but there misses such condition on Azure. So when there are existing cluster nodes, autoscaler will not create a new head node because it detect there is already a node with tag " TAG_RAY_NODE_KIND: "head"

Related issue number

Closes #22996

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…on Azure

HongW2019 · 2022-03-10T08:32:33Z

@gramhagen @richardliaw Please take a view, thanks a lot.

gramhagen · 2022-03-11T15:52:46Z

python/ray/autoscaler/_private/_azure/node_provider.py

+            >>> provider.non_terminated_nodes({TAG_RAY_NODE_KIND: "worker", TAG_RAY_CLUSTER_NAME: "cluster_name"})
            ["node-1", "node-2"]
        """
+        tag_cluster_name = {TAG_RAY_CLUSTER_NAME: self.cluster_name}
+        tag_filters.update(tag_cluster_name)


wouldn't this be better to move under _get_filtered_nodes? also I would just add a direct check for that tag vs. updating the tag_filters dict to avoid undesired side effects to the dictionary reference.

Thanks. I have moved it under _get_filtered_nodes, and also make further modifications to avoid undesired side effects to the dictionary reference.

python/ray/autoscaler/_private/_azure/node_provider.py

HongW2019 · 2022-03-18T02:34:52Z

@DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

ecm200 · 2022-03-23T11:29:47Z

Hi,

I have come across this issue of not being able to run multiple clusters in a single resource group.

What's the status of this PR, it looks like it's yet to be approved?

Ed.

HongW2019 · 2022-03-23T12:33:53Z

Hi,

I have come across this issue of not being able to run multiple clusters in a single resource group.

What's the status of this PR, it looks like it's yet to be approved?

Ed.

Yes, we need at least 1 approving review by reviewers with write access. So, @ijrsvt @DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

ecm200 · 2022-03-23T14:38:05Z

Hi,
I have come across this issue of not being able to run multiple clusters in a single resource group.
What's the status of this PR, it looks like it's yet to be approved?
Ed.

Yes, we need at least 1 approving review by reviewers with write access. So, @ijrsvt @DmitriGekhtman Could you please help review / approve this PR? Thanks a lot.

One additional issue I seem to have uncovered is the additional resources ray creates when bringing up a cluster on Azure. Due to issue with having multiple clusters in a resource group, I have attempted to create a second cluster, in a new resource group, and this has encountered problems as these additional resources have the same name.

The resources I am referring to are:

Managed Identity [ray-msi-user-identity]
Network Security Group [ray-nsg]
Vnet [ray-vnet]

By default these resources are created into a resource group, but given identical names, as above in the square brackets.

When creating a new cluster using ray up it fails due to permissions errors of the managed identity. It threw the following error:

Traceback (most recent call last):
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2269, in main
    return cli()
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1158, in up
    create_or_update_cluster(
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 276, in create_or_update_cluster
    get_or_create_head_node(
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 641, in get_or_create_head_node
    nodes = provider.non_terminated_nodes(head_node_tags)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 142, in non_terminated_nodes
    nodes = self._get_filtered_nodes(tag_filters=tag_filters)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 38, in wrapper
    return f(self, *args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in _get_filtered_nodes
    nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 83, in <listcomp>
    nodes = [self._extract_metadata(vm) for vm in filter(match_tags, vms)]
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 129, in __next__
    return next(self._page_iterator)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/core/paging.py", line 76, in __next__
    self._response = self._get_next(self.continuation_token)
  File "/home/azureuser/.conda/envs/py38_ray_v2.0.0dev/lib/python3.8/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 1096, in get_next
    raise HttpResponseError(response=response, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed
Message: The client '86e9aa23-0f73-46fd-85e9-ee5821d743de' with object id '86e9aa23-0f73-46fd-85e9-ee5821d743de' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/3da03240-09fb-4d1e-8631-c507fcd3dd9b/resourceGroups/data-science-itc-ray-cluster-rg/providers/Microsoft.Compute' or the scope is invalid. If access was recently granted, please refresh your credentials.

On the checking the error more carefully, I was able to determine that the issue was that although a new managed identity was created in the new resource group, the identity that was being used for the creation of the resources was in fact the older identity created for my original cluster, in a different resource group. I was able to determine this by looking at the object id throwing the error and those of the managed identities. So it would appear that the first managed identity is being used, which does not the correct scope, as it is created with permissions only for the specified resource group.

gramhagen · 2022-03-23T14:46:47Z

yeah good point @ecm200. out of curiosity why do you want to deploy these resources into the same resource group?
trying to understand if we should enable re-use of these resources or the deployments should be completely independent.

ecm200 · 2022-03-23T14:56:55Z

yeah good point @ecm200. out of curiosity why do you want to deploy these resources into the same resource group? trying to understand if we should enable re-use of these resources or the deployments should be completely independent.

Hi @gramhagen,
Thanks for your reply.
It goes without saying, but we really love Ray, and welcome any chance to feed back, so thanks!

I can actually see value in having multiples clusters in a single resource group, or indeed being able to have multiple clusters in separate resource groups.

Right now, I am not able to do either.

If I try to deploy into the same resource group, then due to the above identified problem, execution of the creation of the second cluster using ray up config_cluster2.yaml simply results in the already running cluster being updated.

If I try to create a new cluster in a separate resource group, the issue I am encountering is one of permissions. It would appear that the managed identity that is created as part of the cluster creation process has fixed name, because it would appear to be hardwired in the JSON templates in the package files. In azure-config-template.json I can see that all the fixed names for the managed identity, the vnet and the network security group. Therefore, when computational resources are attempted to be created in the new resource group for the new cluster, it uses the previously created managed identity from the FIRST resource group, which doesn't have the scope to create the resources in the NEW resource group.

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.
Otherwise, could the resources be referenced by their unique ID rather than the name?

gramhagen · 2022-03-23T15:08:42Z

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.

yes, this is my question on the use case. we can implement it such that all resources have a unique string appended to them. I just wanted to better understand if the reason for having multiple cluster was to re-use the vnet so they could talk to each other or something like that? or it might just be a user is restricted from creating multiple resource groups and wants to have multiple independent experiments.

ecm200 · 2022-03-23T15:18:58Z

Is there a way these objects could have modified names which associate it with the resource group?
This would at least ensure that the name is unique.

yes, this is my question on the use case. we can implement it such that all resources have a unique string appended to them. I just wanted to better understand if the reason for having multiple cluster was to re-use the vnet so they could talk to each other or something like that? or it might just be a user is restricted from creating multiple resource groups and wants to have multiple independent experiments.

@gramhagen ok, I understand you angle.

From our point of view, the reason for wanting to create clusters in separate resource groups but on the same subscription, then one good use case for that is having a "production" cluster, which perhaps is serving more front end facing compute jobs.

However, we'd also like a "research and development" cluster, which can be modified as the research requires whilst not worrying about disrupting a production service.

Additionally, the multiple clusters per resource group would serve the "research and development" paradigm, as I can see a case for wanting to have multiple clusters for different projects etc.

I would think in the context of multiple clusters per resource group, then a single set of VNET / NSG / MI per resource group would suffice.

ecm200 · 2022-03-23T15:28:24Z

ecm200

Programmatically I would say that's probably easier as well, given all you need to know is the resource group name, and then you can reconstruct the resource names when they are needed.

I can see the way the IDs for the resources are provided are to search on a name, and of course, the issue comes when two resources of the same name have been created. For example, setting the NSG using resourceId('Microsoft.Network/networkSecurityGroups', 'ray-nsg').

So having unique names would appear to solve this issue.

ecm200 · 2022-03-31T15:14:01Z

Hi @gramhagen,

What's the status of these changes?

Any idea if they will make it into a release coming soon?
Would they at least be put into the v2.0.0dev branch?

Would be eager to try out this functionality as it's something we'd very much like to be able to do quite soon.

gramhagen · 2022-03-31T15:16:28Z

pretty swamped unfortunately, if you or @HongW2019 want to try adding a unique string I'll probably be able to take a look over the weekend. not sure when I'll be able to get to it though.

ecm200 · 2022-03-31T15:18:01Z

pretty swamped unfortunately, if you or @HongW2019 want to try adding a unique string I'll probably be able to take a look over the weekend. not sure when I'll be able to get to it though.

Happy to try and help if I get the time.

stale · 2022-05-01T01:17:18Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale · 2022-06-24T04:34:43Z

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

DmitriGekhtman · 2022-07-05T16:37:58Z

@gramhagen do you have time to pick this one up again?
More users are running into this.
https://discuss.ray.io/t/what-makes-a-cluster-unique-in-the-resource-group/6710

gramhagen · 2022-07-06T13:09:37Z

yeah, i have some time today to test adding a unique id for each cluster deployment

ijrsvt

LGTM!

@gramhagen let me know when you have tested this :)

gramhagen · 2022-07-08T15:00:46Z

@DmitriGekhtman I suggest we close this and use #26392 as the support for multiple cluster per rg is more extensively supported there.

@HongW2019 thanks for the additions, I've incorporated them into the other PR!

…source group. Using unique id to set subnet random seed, change msi and vnet names, logging unique id, and adding it to filter vms in cluster. Example template files updated with comments. (#26392) Why are these changes needed? Adding support for deploying multiple clusters into the same azure resource group Changes: Adding unique_id field to provider section of yaml, if not provided one will be created based on hashing the resource group and cluster name. This will be appended to the name of all resources deployed to azure so they can co-exist in the same resource group (provided the cluster name is changed) Pulled in changes from [autoscaler] Enable creating multiple clusters in one resource group … #22997 to use cluster name when filtering vms to retrieve nodes only in the current cluster Added option to explicitly specify the subnet mask, otherwise use the resource group and cluster name as a seed and randomly choose a subnet to use for the vnet (to avoid collisions with other vnets) Updated yaml example files with new provider values and explanations Pulling resource_ids from initial azure-config-template deployment to pass into vm deployment to avoid matching hard-coded resource names across templates Related issue number Closes #22996 Supersedes #22997 Signed-off-by: Scott Graham <scgraham@microsoft.com> Signed-off-by: Scott Graham <scgraham@microsoft.com> Co-authored-by: Scott Graham <scgraham@microsoft.com>

[autoscaler] Enable creating multiple clusters in one resource group …

5410c10

…on Azure

gramhagen requested changes Mar 11, 2022

View reviewed changes

[autoscaler] Refine updating Azure tag_filters

3208b9a

gramhagen requested changes Mar 14, 2022

View reviewed changes

python/ray/autoscaler/_private/_azure/node_provider.py Outdated Show resolved Hide resolved

python/ray/autoscaler/_private/_azure/node_provider.py Outdated Show resolved Hide resolved

refine the way to update tag_filters

f332deb

gramhagen approved these changes Mar 15, 2022

View reviewed changes

stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 1, 2022

stale bot closed this Jun 24, 2022

wuisawesome mentioned this pull request Jul 5, 2022

[Bug] [Autoscaler] [Azure] Can't create multiple clusters on Azure #22996

Closed

2 tasks

wuisawesome reopened this Jul 5, 2022

stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jul 5, 2022

ijrsvt approved these changes Jul 7, 2022

View reviewed changes

gramhagen mentioned this pull request Jul 8, 2022

[autoscaler] Add support for multiple cluster per Azure resource group #26392

Merged

7 tasks

DmitriGekhtman closed this Jul 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[autoscaler] Enable creating multiple clusters in one resource group … #22997

[autoscaler] Enable creating multiple clusters in one resource group … #22997

HongW2019 commented Mar 10, 2022 •

edited

HongW2019 commented Mar 10, 2022

gramhagen Mar 11, 2022

HongW2019 Mar 14, 2022

HongW2019 commented Mar 18, 2022

ecm200 commented Mar 23, 2022

HongW2019 commented Mar 23, 2022

ecm200 commented Mar 23, 2022

gramhagen commented Mar 23, 2022

ecm200 commented Mar 23, 2022 •

edited

gramhagen commented Mar 23, 2022

ecm200 commented Mar 23, 2022

ecm200 commented Mar 23, 2022

ecm200 commented Mar 31, 2022

gramhagen commented Mar 31, 2022

ecm200 commented Mar 31, 2022

stale bot commented May 1, 2022

stale bot commented Jun 24, 2022

DmitriGekhtman commented Jul 5, 2022

gramhagen commented Jul 6, 2022

ijrsvt left a comment

gramhagen commented Jul 8, 2022

[autoscaler] Enable creating multiple clusters in one resource group … #22997

[autoscaler] Enable creating multiple clusters in one resource group … #22997

Conversation

HongW2019 commented Mar 10, 2022 • edited

Why are these changes needed?

Related issue number

Checks

HongW2019 commented Mar 10, 2022

gramhagen Mar 11, 2022

Choose a reason for hiding this comment

HongW2019 Mar 14, 2022

Choose a reason for hiding this comment

HongW2019 commented Mar 18, 2022

ecm200 commented Mar 23, 2022

HongW2019 commented Mar 23, 2022

ecm200 commented Mar 23, 2022

gramhagen commented Mar 23, 2022

ecm200 commented Mar 23, 2022 • edited

gramhagen commented Mar 23, 2022

ecm200 commented Mar 23, 2022

ecm200 commented Mar 23, 2022

ecm200 commented Mar 31, 2022

gramhagen commented Mar 31, 2022

ecm200 commented Mar 31, 2022

stale bot commented May 1, 2022

stale bot commented Jun 24, 2022

DmitriGekhtman commented Jul 5, 2022

gramhagen commented Jul 6, 2022

ijrsvt left a comment

Choose a reason for hiding this comment

gramhagen commented Jul 8, 2022

HongW2019 commented Mar 10, 2022 •

edited

ecm200 commented Mar 23, 2022 •

edited