[RPC tests] Avoid decorators to skip tests #40819

lw · 2020-06-30T20:56:19Z

Stack from ghstack:

[RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too #42528
[RPC tests] Generate test classes automatically #42527
[RPC tests] Enroll TensorPipe in missing test suites #40823
[RPC tests] Remove global TEST_CONFIG #40822
[RPC tests] Move some functions to methods of fixture #40821
[RPC tests] Make generic fixture an abstract base class #40820
[RPC tests] Avoid decorators to skip tests #40819
[RPC tests] Merge process group tests into single entry point #40818
[RPC tests] Merge tests for faulty agent into single script #40817
[RPC tests] Merge TensorPipe tests into single entry point #40816

Summary of the entire stack:

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:

Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
These two ways lead to having two separate decorators (@requires_process_group_agent and @_skip_if_tensorpipe_agent) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
Thrift must override the TEST_CONFIG global variable before any other import (in order for the @requires_process_group_agent decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in @dist_init).
There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
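
To make the import-order problem above concrete, here is a minimal sketch. It is not the code in dist_utils.py: the `_Config` class and the decorator body are assumptions chosen to mimic a skip decorator that reads a global when it is applied.

```python
import unittest


class _Config:
    # Hypothetical stand-in for the global TEST_CONFIG described above.
    rpc_backend_name = "PROCESS_GROUP"


TEST_CONFIG = _Config()


def requires_process_group_agent(func):
    # Assumed behaviour: the global is read when the decorator is applied,
    # which happens while the test module is being imported.
    return unittest.skipUnless(
        TEST_CONFIG.rpc_backend_name == "PROCESS_GROUP",
        "requires the process group agent",
    )(func)


class RpcTest(unittest.TestCase):
    @requires_process_group_agent
    def test_process_group_options(self):
        pass


# Overriding the global after the class body has run is too late:
# the skip decision was already taken with the old value.
TEST_CONFIG.rpc_backend_name = "TENSORPIPE"

still_enabled = not getattr(
    RpcTest.test_process_group_options, "__unittest_skip__", False
)
```

In this sketch `still_enabled` stays true even though the configured agent is no longer process group, which is why an override has to happen before any other import.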

This refactoring aims to address these problems by:

Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
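
The approach in the list above could look roughly like the following sketch. All names here (`RpcAgentTestFixture`, `TensorPipeFixture`, `make_test_classes`, the placeholder regex, the toy tests) are illustrative assumptions, not the actual classes in the PyTorch test suite.

```python
import abc
import unittest


class RpcAgentTestFixture(abc.ABC):
    """Generic fixture: every agent must provide these properties itself."""

    @property
    @abc.abstractmethod
    def rpc_backend_name(self):
        ...

    @abc.abstractmethod
    def get_shutdown_error_regex(self):
        ...


class TensorPipeFixture(RpcAgentTestFixture):
    """Concrete fixture for one agent; no global lookup, no if/else chain."""

    @property
    def rpc_backend_name(self):
        return "TENSORPIPE"

    def get_shutdown_error_regex(self):
        return "placeholder shutdown message"  # not the real error text


class RpcTest(RpcAgentTestFixture):
    """Agent-agnostic suite; not a TestCase, so it is never collected alone."""

    def test_backend_is_named(self):
        self.assertTrue(self.rpc_backend_name)


class DistAutogradTest(RpcAgentTestFixture):
    def test_error_regex_is_str(self):
        self.assertIsInstance(self.get_shutdown_error_regex(), str)


# The "master list" of suites this entry point wants to run on its agent.
GENERIC_SUITES = [RpcTest, DistAutogradTest]


def make_test_classes(fixture, suites, prefix):
    """Mix the concrete fixture into each suite and return the new classes."""
    classes = {}
    for suite in suites:
        name = prefix + suite.__name__
        classes[name] = type(name, (fixture, suite, unittest.TestCase), {})
    return classes


TENSORPIPE_TESTS = make_test_classes(TensorPipeFixture, GENERIC_SUITES, "TensorPipe")
# Expose the generated classes at module level so test discovery finds them.
globals().update(TENSORPIPE_TESTS)
```

An entry point written this way only needs the fixture and the list of suites, so adding a suite or an agent is a one-line change rather than a new copy-pasted file.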

It provides further advantages:

It puts all the agents on an equal footing by not making any of them the default, which makes it easier to migrate from process group to TensorPipe.
It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit

This diff removes the two decorators (@requires_process_group_agent and @_skip_if_tensorpipe_agent) which were used to skip tests. They were only used to prevent the TensorPipe agent from running tests that were using the process group agent's options. The converse (preventing the PG agent from using the TP options) is achieved by having those tests live in a TensorPipeAgentRpcTest class. So here we're doing the same for process group, by moving those tests to a ProcessGroupAgentRpcTest class.
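
As a rough sketch of the idea (the test method names and the `...TestCase` classes below are hypothetical, only the two `*AgentRpcTest` class names come from this PR), moving agent-specific tests into subclasses means each agent collects only its own tests and no skip decorator is needed:

```python
import unittest


class RpcTest:
    """Generic tests that every agent is expected to pass."""

    def test_generic(self):
        self.assertEqual(1 + 1, 2)


class ProcessGroupAgentRpcTest(RpcTest):
    """Tests that rely on the process group agent's options."""

    def test_process_group_options(self):
        pass


class TensorPipeAgentRpcTest(RpcTest):
    """Tests that rely on the TensorPipe agent's options."""

    def test_tensorpipe_options(self):
        pass


# Each agent's entry point instantiates only the subclass that applies to it.
class ProcessGroupRpcTestCase(ProcessGroupAgentRpcTest, unittest.TestCase):
    pass


class TensorPipeRpcTestCase(TensorPipeAgentRpcTest, unittest.TestCase):
    pass


def collected(case):
    """Return the test method names the unittest loader finds on a class."""
    return list(unittest.defaultTestLoader.getTestCaseNames(case))


pg_tests = collected(ProcessGroupRpcTestCase)
tp_tests = collected(TensorPipeRpcTestCase)
```

Here `pg_tests` contains the generic test plus the process group one, and `tp_tests` the generic test plus the TensorPipe one; neither agent ever sees the other's tests.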

Differential Revision: D22283179

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!


dr-ci · 2020-06-30T21:18:30Z

💊 CI failures summary and remediations

As of commit 37098c2 (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 1/3 non-CircleCI failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_profiling_test (1/2)

Step: "Run tests"

Aug 05 09:53:54 caused by: Connection refused (os error 111)

Aug 05 09:53:54 ++++ extract_trap_cmd 
Aug 05 09:53:54 ++++ printf '%s\n' '' 
Aug 05 09:53:54 +++ printf '%s\n' cleanup 
Aug 05 09:53:54 ++ trap -- ' 
Aug 05 09:53:54 cleanup' EXIT 
Aug 05 09:53:54 ++ [[ pytorch-linux-xenial-cuda10.1-cudnn7-ge_config_profiling-test != *pytorch-win-* ]] 
Aug 05 09:53:54 ++ which sccache 
Aug 05 09:53:54 ++ sccache --stop-server 
Aug 05 09:53:54 Stopping sccache server... 
Aug 05 09:53:54 error: couldn't connect to server 
Aug 05 09:53:54 caused by: Connection refused (os error 111) 
Aug 05 09:53:54 ++ true 
Aug 05 09:53:54 ++ rm /var/lib/jenkins/sccache_error.log 
Aug 05 09:53:54 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Aug 05 09:53:54 ++ SCCACHE_IDLE_TIMEOUT=1200 
Aug 05 09:53:54 ++ RUST_LOG=sccache::server=error 
Aug 05 09:53:54 ++ sccache --start-server 
Aug 05 09:53:54 Starting sccache server... 
Aug 05 09:53:54 ++ sccache --zero-stats 
Aug 05 09:53:54 Compile requests                 0 
Aug 05 09:53:54 Compile requests executed        0

pytorch_linux_xenial_py3_6_gcc5_4_ge_config_profiling_test (2/2)

Step: "Run tests"

Aug 05 09:40:51 caused by: Connection refused (os error 111)

Aug 05 09:40:51 ++++ extract_trap_cmd 
Aug 05 09:40:51 ++++ printf '%s\n' '' 
Aug 05 09:40:51 +++ printf '%s\n' cleanup 
Aug 05 09:40:51 ++ trap -- ' 
Aug 05 09:40:51 cleanup' EXIT 
Aug 05 09:40:51 ++ [[ pytorch-linux-xenial-py3.6-gcc5.4-ge_config_profiling-test != *pytorch-win-* ]] 
Aug 05 09:40:51 ++ which sccache 
Aug 05 09:40:51 ++ sccache --stop-server 
Aug 05 09:40:51 Stopping sccache server... 
Aug 05 09:40:51 error: couldn't connect to server 
Aug 05 09:40:51 caused by: Connection refused (os error 111) 
Aug 05 09:40:51 ++ true 
Aug 05 09:40:51 ++ rm /var/lib/jenkins/sccache_error.log 
Aug 05 09:40:51 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Aug 05 09:40:51 ++ SCCACHE_IDLE_TIMEOUT=1200 
Aug 05 09:40:51 ++ RUST_LOG=sccache::server=error 
Aug 05 09:40:51 ++ sccache --start-server 
Aug 05 09:40:51 Starting sccache server... 
Aug 05 09:40:51 ++ sccache --zero-stats 
Aug 05 09:40:51 Compile requests                 0 
Aug 05 09:40:51 Compile requests executed        0

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-xenial-rocm3.5.1-py3.6

This comment was automatically generated by Dr. CI (expand for details).

This comment has been revised 43 times.


facebook-github-bot · 2020-08-06T00:20:06Z

This pull request has been merged in a94039f.

lw requested review from mrshenli, pritamdamania87 and zhaojuanmao as code owners June 30, 2020 20:56

lw mentioned this pull request Jul 1, 2020

[RPC tests] Avoid decorators to skip tests #40777

Closed

lw mentioned this pull request Jul 1, 2020

[RPC tests] Fix @_skip_if_tensorpipe always skipping for all agents #40860

Closed

lw added 4 commits July 1, 2020 09:00

lw mentioned this pull request Jul 2, 2020

[RPC tests] Fix file descriptor leak #40913

Closed

lw added 5 commits July 3, 2020 07:08

This was referenced Aug 4, 2020

[RPC tests] Generate test classes automatically #42527

Closed

[RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too #42528

Closed

facebook-github-bot closed this in a94039f Aug 5, 2020

facebook-github-bot added the merged label Aug 6, 2020

lw deleted the gh/lw/51/head branch August 6, 2020 16:28

mruberry added the Merged label Oct 28, 2020
