Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vSphere Provider] Fix vc conn timout issue #40516

Merged

Conversation

JingChen23
Copy link
Contributor

@JingChen23 JingChen23 commented Oct 20, 2023

Why are these changes needed?

Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown.

Issue here is soap request session is timing out after 30 mins on the server side.

Increasing connectionPoolTimeout Value on the client side or setting it to -1 doesn't solve the issue. Tested it. Issue still persists.This is because connectionPooltimeout parameter only close idle connections from the client side, and if the client connection was idle longer than the timeout value, a new connection will be created automatically. Passing connectionPoolTimeout = -1 does not have any effect on the session timeout. -1 only means client will never close an idle connection. Also, it doesn't mean that the connection will be valid forever. A soap request session will be timeout after 30min on server side, regardless whether or not the client closes the connection.

Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown. The stub starts off in the "unauthenticated" state, so it will call the loginMethod on the first invocation of a method. If a communication error is encountered, the stub will wait for retryDelay seconds and then try to call the method again. If the server throws an exception that is in the SESSION_EXCEPTIONS tuple, it will be caught and the stub will transition back into the "unauthenticated" state so that another login will be performed.

Modified the code to use SmartStubAdapter using below as reference: https://github.com/vmware/pyvmomi/blob/912e365564c927dc9201eb6f46262924dc1dd275/pyVmomi/SoapAdapter.py#L1530

https://developer.vmware.com/samples/6889/esxi_certtool?h=connectionPoolTimeout#code

vmware/pyvmomi#347

Test

After doing Ray up, left the env there for more than 30 mins, 1 hour and 18 hours respectively. Then delete one worker node VM from the vSphere client. Verified that a new worker node is created by autoscaler calling the vSphere API, which proves that the session is still alive.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Chen Jing <jingch@vmware.com>
@JingChen23 JingChen23 changed the title [vSphere Provider]Fix vc conn timout issue [vSphere Provider] Fix vc conn timout issue Oct 20, 2023
@architkulkarni architkulkarni self-assigned this Oct 20, 2023
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor questions, but otherwise looks good to me

if not obj:
raise RuntimeError(
f"Unexpected: cannot find vSphere object {vimtype} with moid: {moid}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • What's the reason for the changes to this function?
  • Previously get_pyvmomi_obj_by_* could not return None (instead it would raise an exception), but now it can return None. Do the callers need to handle the case where it's None?

Copy link
Contributor Author

@JingChen23 JingChen23 Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think the reason is to simplify the function's logic. Actually when I do the cherry pick for the bug fix, I find that this file has been changed in another PR by some other teammate. So I also cherry-picked that commit.
  • Yes, I agree with you that we should never change the behaviour like this when no change to the callers of this function. So let me optimize the function to make sure it will throw an exception when things are not correct.

if not obj:
raise RuntimeError(
f"Unexpected: cannot find vSphere object {vimtype} with name: {name}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the same code changes needed to be made in two places; I still think it would be a bit better to unify the functions, maybe like def get_myvmomi_obj(self, vimtype, name=None, moid=None), but it's not strictly related to the bug fixed in this PR and doesn't need to block the PR.

Signed-off-by: Chen Jing <jingch@vmware.com>
@architkulkarni
Copy link
Contributor

Book linkcheck failure unrelated, this PR does not touch the docs.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 23, 2023
@architkulkarni architkulkarni merged commit e060dc2 into ray-project:master Oct 23, 2023
48 of 49 checks passed
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request Oct 25, 2023
…roject#40516)

Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
vitsai pushed a commit that referenced this pull request Oct 26, 2023
…sue and support GPU nodes (#40667)

* [Cluster launcher] [vSphere] Fix the bug that multiple worker types doesn't work (#40487)

Currently our code assumes that there is only one worker node type.
In this change I fix the bug to let it support multiple worker node types.

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [cluster launcher] [vSphere Provider] Fix vc conn timout issue (#40516)

Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere Provider] Support GPU Ray nodes on vSphere (#40616)

This is for supporting passthrough the GPU on vSphere ESXi host into the Ray nodes.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere] Do not fetch runtime-info of vm from cached_nodes (#40655)

Power-on-off status is runtime info of VM, should not fetch it from cached-nodes, which is probably dirty data.
It should query by pyvmomi_sdk every time.

Signed-off-by: Chen Hui <huchen@vmware.com>

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: huchen2021 <85480625+huchen2021@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants