Skip to content

Conversation

rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Nov 19, 2020

Stack from ghstack:

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over file which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds an environment variable RPC_INIT_WITH_TCP that allows us to run any RPC test with TCP initialization.

To avoid port collisions, we use common.find_free_port().
Differential Revision: D25085458

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with [todo]. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added cla signed oncall: jit Add this issue/PR to JIT oncall triage queue oncall: distributed Add this issue/PR to distributed oncall triage queue labels Nov 19, 2020
rohan-varma added a commit that referenced this pull request Nov 19, 2020
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with [todo]. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

ghstack-source-id: 117107423
Pull Request resolved: #48248
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 19, 2020
Pull Request resolved: #48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117107619

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
use_tcp_init = os.environ.get("RPC_INIT_WITH_TCP", None)
if use_tcp_init== "1":
os.environ["MASTER_ADDR"] = '127.0.0.1'
os.environ["MASTER_PORT"] = str(find_free_port())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this actually may not be enough to prevent port collisions if there are multiple tests running in parallel? Test 1 could get port X, but before it creates the store and binds the port, Test 2 could also get port X and we would still have the race. If my understanding is correct, we probably need to write currently in use ports to a shared/locked data structure to completely avoid races?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This probably won't be an issue anymore now that we are not enabling it for all tests)

@dr-ci
Copy link

dr-ci bot commented Nov 19, 2020

💊 CI failures summary and remediations

As of commit a8f9d8b (more details on the Dr. CI page):


  • 1/3 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 2/3 broken upstream at merge base ccd20e9 since Nov 30

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is newer than viable/strict, you can try basing on an older, stable commit:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)

If your commit is older than viable/strict:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 30 times.

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
@rohan-varma rohan-varma changed the title [WIP] Run RPC tests with TCP init in addition to file Run RPC tests with TCP init in addition to file Nov 19, 2020
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
@rohan-varma rohan-varma requested a review from lw November 19, 2020 10:36
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 19, 2020
Pull Request resolved: #48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117112750

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
@lw
Copy link
Contributor

lw commented Nov 19, 2020

My main comment on this is whether we need to re-run the entire RPC suite to check for differences between file-based and TCP-based init. The suite already takes quite a while to run, and once we'll add GPU tests and tests for other backends (like shared memory) it might lead to a "combinatorial explosion" of the time it takes to run it.

Would there be a way to replicate what RPC does during its init, and add that as a test case for the Stores? I'll also take a look at the issue you referenced in the comments to get more context.

@rohan-varma
Copy link
Contributor Author

rohan-varma commented Nov 19, 2020

@lw Thanks for bringing that up, my line of thinking was that it could be worth it to run with TCP init as this change already caught ~4 tests that are broken with TCP init but fine with file init.

We could always leave it as a flag and developers would have the option to set the flag to run with TCP init, but this probably doesn't eliminate the chance of checking in changes that may not work with TCP init. By running it in CI, there's probably a higher chance of preventing these bugs. If we're not sure about this, maybe we can just keep it as an env variable for now and not land the code that adds it as part of CI (at least in OSS, I think it's fine to run these internally)?

Also for clarification, for the increased time to run RPC test suite, is the concern mainly from developer time/productivity perspective or CI env time/cost perspective (or both)?

For context around shared memory, that will be a transport built into TensorPipe right? Would we need to run all RPC tests under that transport or would it be sufficient to test that at the TensorPipe level?

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 20, 2020
Pull Request resolved: #48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds testing with TCP init to the rpc tests. We disable/modify a few tests that would be broken under TCP initialization and will fix them in follow up PRs.

Running the tests with TCP init can be controlled with the env var setting `RPC_INIT_WITH_TCP="1"`. In OSS, we add a custom handler to run tests with the file init and TCP init in CI. OSS tests can be run with `python test/run_test.py --verbose -i distributed/rpc/test_process_group_agent`. See internal test plan for internal details on testing with TCP init.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117209436

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!
@lw
Copy link
Contributor

lw commented Nov 20, 2020

is the concern mainly from developer time/productivity perspective or CI env time/cost perspective (or both)?

I was mainly thinking in terms of time-to-signal for developers.

To ground this discussion in data, I looked at the CI logs of a recent job from master and from this PR:

It's just a single data point but it suggests that this PR would cause every PR to take 25 minutes longer to receive the green light from the CI. This may slow down developers (or encourage to land without waiting for CI).

Let me just clarify that I think we should absolutely run some tests with TCP. As you said, even just using it in our current tests revealed a bunch of errors that we should have caught earlier. However, those tests that fail do so because they're doing some "unusual" operations around initing and shutting down. My point is that tests of that type are the only ones that should be enrolled in TCP init, as they're the ones most likely to reveal something interesting. The vast majority of RPC tests do a "standard" init and then just test a wide variety of RPC-only features. It should be enough to run only one test of that type, as they all have effectively the same behavior when it comes to file-vs-TCP init.

Do you think we could take a pass over the RPC tests and try to select the ones we think would stress "unusual" code paths in init and tag them with a decorator, and then run only those with the TCP store?

For context around shared memory, that will be a transport built into TensorPipe right? Would we need to run all RPC tests under that transport or would it be sufficient to test that at the TensorPipe level?

We didn't figure it out yet, but indeed if we were to run the entire RPC suite with the shared memory transports we'd also greatly increase the total PyTorch CI test time (by ~13.5 minutes, as only the TensorPipe suite needs to be repeated). This is certainly not great and ideally we should do as you say and test this transport in isolation within TensorPipe and then just "trust" that it works in PyTorch, without re-testing it. Unfortunately for now the PyTorch test suite is much more comprehensive than TensorPipe's own one... :/

@rohan-varma
Copy link
Contributor Author

Do you think we could take a pass over the RPC tests and try to select the ones we think would stress "unusual" code paths in init and tag them with a decorator, and then run only those with the TCP store?

Sounds good, so I guess the proposal is the following:

  • Don't run all tests with TCP init in addition to file in CI
  • Add a decorator to run certain tests with the TCP init as well

@rohan-varma rohan-varma changed the title Run RPC tests with TCP init in addition to file Add an option to run RPC tests with TCP init Nov 27, 2020
We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. 

To avoid port collisions, we use `common.find_free_port()`.
Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 29, 2020
Pull Request resolved: #48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117428756

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)
@codecov
Copy link

codecov bot commented Nov 29, 2020

Codecov Report

Merging #48248 (21e9af6) into gh/rohan-varma/201/base (ccd20e9) will increase coverage by 0.31%.
The diff coverage is 31.74%.

@@                     Coverage Diff                     @@
##           gh/rohan-varma/201/base   #48248      +/-   ##
===========================================================
+ Coverage                    80.46%   80.77%   +0.31%     
===========================================================
  Files                         1855     1855              
  Lines                       200297   200229      -68     
===========================================================
+ Hits                        161160   161741     +581     
+ Misses                       39137    38488     -649     

@rohan-varma
Copy link
Contributor Author

@lw Updated the PR to just add the flag as an environment variable which will control whether the test runs with TCP or not, so that we can just run then under tcp init on an ad-hoc basis. I disabled a couple of the tests that don't work under it which we should fix.

Copy link
Contributor

@lw lw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm fine with landing it as it is. It would be nice if we could enable some of these tests to run with TCP init in the CI at a later point.

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization. 

To avoid port collisions, we use `common.find_free_port()`.
Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25085458/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 1, 2020
Pull Request resolved: #48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups does not work as expected, usually under TCP/env initialization (which is used over `file` which is the init that we use in our test in multi-machine scenarios).

Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117553039

Differential Revision: [D25085458](https://our.internmc.facebook.com/intern/diff/D25085458/)
@rohan-varma
Copy link
Contributor Author

@lw Sounds good, there are 2 tests right now that probably should run with TCP init but don't work, I'll check and see if there are any additional.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 8b2ca28.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/201/head branch December 5, 2020 15:30
facebook-github-bot pushed a commit that referenced this pull request Jan 22, 2021
Summary:
We added this option in #48248, but it would be good to document it somewhere as well, hence adding it to this contributing doc.

Pull Request resolved: #50861

Reviewed By: mrshenli

Differential Revision: D26014505

Pulled By: rohan-varma

fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants