
Timeout for exchange_peer_info and fix for AM tests #994

Merged
merged 6 commits into from
Oct 11, 2023

Conversation

pentschev
Copy link
Member

@pentschev pentschev commented Oct 11, 2023

After some debugging of Distributed tests with UCX it was observed that sometimes `exchange_peer_info` hangs indefinitely, specifically when executing `stream_recv` on the client side. The cause is unknown, but it is believed that messages are being lost, either when multiple stream messages are transferred simultaneously among various endpoints or when the receiving end takes too long to launch `stream_recv`; see #509 for a similar issue related to the stream API. Adding a timeout doesn't allow recovery, but it at least allows a UCX-Py client to retry upon failure to establish the endpoint.
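The idea can be sketched with plain asyncio: bound the receive with a timeout so a lost message raises instead of hanging, and let the client retry. This is a minimal illustration only; `_stream_recv`, `exchange_peer_info`, and `connect_with_retry` here are hypothetical stand-ins, not the actual UCX-Py API.

```python
import asyncio

async def _stream_recv(delay):
    # Hypothetical stand-in for a UCX stream receive; the real call
    # can hang forever if the stream message was lost.
    await asyncio.sleep(delay)
    return b"peer-info"

async def exchange_peer_info(delay, timeout=0.1):
    # Bound the receive with a timeout so a lost message raises
    # asyncio.TimeoutError instead of hanging indefinitely.
    return await asyncio.wait_for(_stream_recv(delay), timeout=timeout)

async def connect_with_retry(attempts=3):
    # The client can now retry endpoint establishment upon failure.
    for attempt in range(attempts):
        try:
            # The first attempt simulates a lost/slow message.
            return await exchange_peer_info(delay=1.0 if attempt == 0 else 0.0)
        except asyncio.TimeoutError:
            continue
    raise RuntimeError("failed to establish endpoint")

result = asyncio.run(connect_with_retry())
```

The timeout doesn't recover the lost message; it only converts a silent hang into an exception the caller can act on.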

This change seems to resolve dask/distributed#5229; at least, the issue is no longer reproducible locally with this change.

Additionally, do a roundtrip message transfer for `test_send_recv_am`, which should resolve #797; that failure seems to be caused by checking received messages too early, before they are actually received by the listener. A roundtrip ensures the client receives the reply and thus prevents us from checking for a transfer that hasn't completed yet.
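The roundtrip pattern can be sketched with asyncio queues standing in for the AM transport (the names `echo_listener` and `send_recv_am_roundtrip` are illustrative, not the test's real code): the listener echoes the message back, so the client's receipt of the reply proves the listener actually got the original message before any assertion runs.

```python
import asyncio

async def echo_listener(inbox, outbox):
    # Hypothetical stand-in for the AM listener: reply with the same
    # message so the client can confirm the listener received it.
    msg = await inbox.get()
    await outbox.put(msg)

async def send_recv_am_roundtrip():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(echo_listener(inbox, outbox))
    sent = b"hello"
    await inbox.put(sent)        # client -> listener
    reply = await outbox.get()   # listener -> client (the roundtrip)
    await task
    # Only now is it safe to validate: the reply proves the listener
    # received the original message, so there's no race on the check.
    return sent, reply

sent, reply = asyncio.run(send_recv_am_roundtrip())
```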

Also ensure that the listener is closed before validating the `test_close_callback` conditions, which was also flaky.

Finally, ensure we close the event loop in the test fixture, preventing `DeprecationWarning`s from pytest-asyncio, which currently closes unclosed event loops but will stop doing so in future releases.
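The fixture change amounts to closing the loop explicitly during teardown. A minimal sketch of the pattern (generator-style fixture, with pytest's driving of it emulated inline so it runs standalone; details of the real fixture may differ):

```python
import asyncio

def event_loop():
    # Generator-style pytest fixture sketch: create a fresh loop per
    # test and close it explicitly, so pytest-asyncio never has to
    # clean up (and warn about) an unclosed event loop.
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()

# Emulate pytest driving the fixture around a test body:
fixture = event_loop()
loop = next(fixture)
loop.run_until_complete(asyncio.sleep(0))  # the test body would run here
next(fixture, None)  # teardown: runs loop.close()
```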

Closes #797

@pentschev
Copy link
Member Author

@charlesbluca could you plan on rebuilding gpuCI's docker image once this is merged in? It should fix dask/distributed#5229, or at least it seems to.

@charlesbluca
Copy link
Member

Yup can rebuild once this is reflected in ucx-py nightlies

@pentschev pentschev changed the title Add a timeout to exchange_peer_info Timeout for exchange_peer_info and fix for AM tests Oct 11, 2023
@pentschev pentschev requested a review from a team as a code owner October 11, 2023 15:12
@pentschev
Copy link
Member Author

@rapidsai/ucxx-python-codeowners could I please get your review here ASAP? I have the expectation that this may improve the situation with Dask-CUDA tests, thus getting it merged soon will let us verify that.

Copy link
Member

@charlesbluca charlesbluca left a comment


LGTM; small non-blocking suggestion

ci/test_wheel.sh Outdated
@@ -14,4 +14,4 @@ cd tests
python -m pytest --cache-clear -vs .
cd ../ucp
# skipped test context: https://github.com/rapidsai/ucx-py/issues/797
Copy link
Member


Should be good to remove this if this closes out #797:

Suggested change
# skipped test context: https://github.com/rapidsai/ucx-py/issues/797

Copy link
Member Author


Thanks @charlesbluca , fixed in 6a8ebf2. It seems that you're not in the ucxpy-python-codeowners group though, was this always the case? I really thought you were in that group.

Copy link
Contributor

@wence- wence- left a comment


Makes sense

@pentschev
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 438e5a2 into rapidsai:branch-0.35 Oct 11, 2023
33 checks passed
@pentschev pentschev deleted the exchange_peer_info-timeout branch October 12, 2023 08:07
pentschev added a commit to pentschev/distributed that referenced this pull request Oct 13, 2023
Some time ago `test_ucx_config_w_env_var` started failing
intermittently, and the cause was still unknown. After some
investigation it seems that, in certain cases, exchanging UCX-Py peer
information causes some of the underlying communication calls to never
complete, producing a hang that Distributed can't recover from.
With rapidsai/ucx-py#994, UCX-Py now has a timeout on those calls that
allows Distributed to catch the failure and retry establishing the
connection, which seems to resolve the problem.
Development

Successfully merging this pull request may close these issues.

Possible flaky AM test
distributed.comm.tests.test_ucx_config.test_ucx_config_w_env_var flaky
4 participants