-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add an option to run RPC tests with TCP init #48248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with [todo]. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with [todo]. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! ghstack-source-id: 117107423 Pull Request resolved: #48248
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
Pull Request resolved: #48248 We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. ghstack-source-id: 117107619 Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
use_tcp_init = os.environ.get("RPC_INIT_WITH_TCP", None) | ||
if use_tcp_init== "1": | ||
os.environ["MASTER_ADDR"] = '127.0.0.1' | ||
os.environ["MASTER_PORT"] = str(find_free_port()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, this actually may not be enough to prevent port collisions if there are multiple tests running in parallel? Test 1 could get port X, but before it creates the store and binds the port, Test 2 could also get port X and we would still have the race. If my understanding is correct, we probably need to write currently in use ports to a shared/locked data structure to completely avoid races?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This probably won't be an issue anymore now that we are not enabling it for all tests)
💊 CI failures summary and remediationsAs of commit a8f9d8b (more details on the Dr. CI page):
🚧 1 ongoing upstream failure:These were probably caused by upstream breakages that are not fixed yet:
🚧 1 fixed upstream failure:These were probably caused by upstream breakages that were already fixed.
Please rebase on the
|
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
Pull Request resolved: #48248 We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. ghstack-source-id: 117112750 Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
My main comment on this is whether we need to re-run the entire RPC suite to check for differences between file-based and TCP-based init. The suite already takes quite a while to run, and once we'll add GPU tests and tests for other backends (like shared memory) it might lead to a "combinatorial explosion" of the time it takes to run it. Would there be a way to replicate what RPC does during its init, and add that as a test case for the Stores? I'll also take a look at the issue you referenced in the comments to get more context. |
@lw Thanks for bringing that up, my line of thinking was that it could be worth it to run with TCP init as this change already caught ~4 tests that are broken with TCP init but fine with file init. We could always leave it as a flag and developers would have the option to set the flag to run with TCP init, but this probably doesn't eliminate the chance of checking in changes that may not work with TCP init. By running it in CI, there's probably a higher chance of preventing these bugs. If we're not sure about this, maybe we can just keep it as an env variable for now and not land the code that adds it as part of CI (at least in OSS, I think it's fine to run these internally)? Also for clarification, for the increased time to run RPC test suite, is the concern mainly from developer time/productivity perspective or CI env time/cost perspective (or both)? For context around shared memory, that will be a transport built into TensorPipe right? Would we need to run all RPC tests under that transport or would it be sufficient to test that at the TensorPipe level? |
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
Pull Request resolved: #48248 We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs. Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init. To avoid port collisions, we use `common.find_free_port()`. ghstack-source-id: 117209436 Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
I was mainly thinking in terms of time-to-signal for developers. To ground this discussion in data, I looked at the CI logs of a recent job from master and from this PR:
It's just a single data point but it suggests that this PR would cause every PR to take 25 minutes longer to receive the green light from the CI. This may slow down developers (or encourage to land without waiting for CI). Let me just clarify that I think we should absolutely run some tests with TCP. As you said, even just using it in our current tests revealed a bunch of errors that we should have caught earlier. However, those tests that fail do so because they're doing some "unusual" operations around initing and shutting down. My point is that tests of that type are the only ones that should be enrolled in TCP init, as they're the ones most likely to reveal something interesting. The vast majority of RPC tests do a "standard" init and then just test a wide variety of RPC-only features. It should be enough to run only one test of that type, as they all have effectively the same behavior when it comes to file-vs-TCP init. Do you think we could take a pass over the RPC tests and try to select the ones we think would stress "unusual" code paths in init and tag them with a decorator, and then run only those with the TCP store?
We didn't figure it out yet, but indeed if we were to run the entire RPC suite with the shared memory transports we'd also greatly increase the total PyTorch CI test time (by ~13.5 minutes, as only the TensorPipe suite needs to be repeated). This is certainly not great and ideally we should do as you say and test this transport in isolation within TensorPipe and then just "trust" that it works in PyTorch, without re-testing it. Unfortunately for now the PyTorch test suite is much more comprehensive than TensorPipe's own one... :/ |
Sounds good, so I guess the proposal is the following:
|
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
Pull Request resolved: #48248 We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. To avoid port collisions, we use `common.find_free_port()`. ghstack-source-id: 117428756 Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)
Codecov Report
@@ Coverage Diff @@
## gh/rohan-varma/201/base #48248 +/- ##
===========================================================
+ Coverage 80.46% 80.77% +0.31%
===========================================================
Files 1855 1855
Lines 200297 200229 -68
===========================================================
+ Hits 161160 161741 +581
+ Misses 39137 38488 -649 |
@lw Updated the PR to just add the flag as an environment variable which will control whether the test runs with TCP or not, so that we can just run then under tcp init on an ad-hoc basis. I disabled a couple of the tests that don't work under it which we should fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm fine with landing it as it is. It would be nice if we could enable some of these tests to run with TCP init in the CI at a later point.
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. To avoid port collisions, we use `common.find_free_port()`. Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)! [ghstack-poisoned]
Pull Request resolved: #48248 We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios). Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. To avoid port collisions, we use `common.find_free_port()`. ghstack-source-id: 117553039 Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)
@lw Sounds good, there are 2 tests right now that probably should run with TCP init but don't work, I'll check and see if there are any additional. |
This pull request has been merged in 8b2ca28. |
Summary: We added this option in #48248, but it would be good to document it somewhere as well, hence adding it to this contributing doc. Pull Request resolved: #50861 Reviewed By: mrshenli Differential Revision: D26014505 Pulled By: rohan-varma fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
Stack from ghstack:
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over
file
which is the init that we use in our test in multi-machine scenarios).Due to this motivation, this PR adds an environment variable
RPC_INIT_WITH_TCP
that allows us to run any RPC test with TCP initialization.To avoid port collisions, we use
common.find_free_port()
.Differential Revision: D25085458
NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!