fix bug: only deploy ovf to first host of cluster #42258

Merged
merged 1 commit into from Jan 9, 2024

@huchen2021 huchen2021 commented Jan 9, 2024

Description

When deploying an OVF to a datastore that is on the second host of the cluster, for example with the following configuration,

provider:
  vsphere_config:
    datacenter: Datacenter
    frozen_vm:
      library_item: 18-ubuntu-2204-frozen-vm-1
      cluster: cluster-hs2-d0202
      datastore: 202-datastore2
      name: ubuntu-2204-frozen-vm-2

it reports the following error:

2023-12-29 06:18:59,689	INFO vsphere_sdk_provider.py:431 -- Found an OVF template: 18-ubuntu-2204-frozen-vm-1 to deploy.
2023-12-29 06:19:03,823	ERROR vsphere_sdk_provider.py:461 -- OVF error: {category : INPUT, issues : None, name : DatastoreMappingParams.target_datastore, value : datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b, message : {id : com.vmware.ovfs.ovfs-main.ovfs.invalid_ovf_parameter, default_message : Invalid value for DatastoreMappingParams.target_datastore: datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b., args : ['DatastoreMappingParams.target_datastore', 'datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b'], params : None, localized : None}, error : None}
Traceback (most recent call last):
  File "/home/ecl/.conda/envs/py38/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1298, in up
    create_or_update_cluster(
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 317, in create_or_update_cluster
    get_or_create_head_node(
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 763, in get_or_create_head_node
    provider.create_node(head_node_config, head_node_tags, 1)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/vsphere/node_provider.py", line 131, in create_node
    created_nodes_dict = self._create_node(
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/vsphere/node_provider.py", line 356, in _create_node
    frozen_vm_obj = self.create_new_or_fetch_existing_frozen_vms(node_config)
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/vsphere/node_provider.py", line 334, in create_new_or_fetch_existing_frozen_vms
    frozen_vm_obj = self.create_frozen_vm_from_ovf(
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/vsphere/node_provider.py", line 289, in create_frozen_vm_from_ovf
    vm_name = self.get_vsphere_sdk_provider().deploy_ovf(
  File "/home/ecl/.conda/envs/py38/lib/python3.8/site-packages/ray/autoscaler/_private/vsphere/vsphere_sdk_provider.py", line 463, in deploy_ovf
    raise ValueError(
ValueError: OVF deployment failed for VM ubuntu-2204-frozen-vm-3, reason: {succeeded : False, resource_id : None, error : {errors : [OvfError(category=Category(string='INPUT'), issues=None, name='DatastoreMappingParams.target_datastore', value='datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b', message=LocalizableMessage(id='com.vmware.ovfs.ovfs-main.ovfs.invalid_ovf_parameter', default_message='Invalid value for DatastoreMappingParams.target_datastore: datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b.', args=['DatastoreMappingParams.target_datastore', 'datastore-16:a354121d-bd03-4b0a-8038-9379a78fa92b'], params=None, localized=None), error=None)], warnings : [], information : []}}

This happens because we always fetch the first host of the cluster, which is host1, while the datastore is on host2.

Solution

The deployment should choose a host that is common to both the datastore and the cluster, which is host2 in this scenario.
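The host selection described above can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: `pick_common_host` and the plain string host names are hypothetical, and the real provider works with vSphere SDK objects rather than strings.

```python
from typing import Optional, Sequence


def pick_common_host(
    cluster_hosts: Sequence[str], datastore_hosts: Sequence[str]
) -> Optional[str]:
    """Return the first cluster host that also has the datastore, else None.

    Hypothetical helper: hosts are plain strings here for illustration only.
    """
    mounted = set(datastore_hosts)
    for host in cluster_hosts:
        if host in mounted:
            return host
    return None


# Scenario from this PR: the cluster has two hosts, the datastore is on host2.
cluster_hosts = ["host1", "host2"]
datastore_hosts = ["host2"]

old_choice = cluster_hosts[0]  # previous behavior: always the first host
new_choice = pick_common_host(cluster_hosts, datastore_hosts)
print(old_choice, new_choice)  # prints "host1 host2"
```

Returning `None` when no host is shared lets the caller raise a clear error before attempting the OVF deployment, instead of failing later with an invalid `target_datastore`.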

Test

When deploying an OVF to a datastore that is on the second host, the frozen VM is also created on the second host, as expected.

The cluster is provisioned successfully.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Chen Hui <huchen@vmware.com>

@architkulkarni architkulkarni left a comment

An ML test is failing in premerge and blocking this PR from merging. I restarted the test, but if it keeps failing, you can try merging master again.

@architkulkarni architkulkarni merged commit 9d09b47 into ray-project:master Jan 9, 2024
9 checks passed
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024
…f cluster (ray-project#42258)
