[RPC tests] Merge tests for faulty agent into single script #40817

lw · 2020-06-30T20:55:58Z

Stack from ghstack:

[RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too #42528 [RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too
[RPC tests] Generate test classes automatically #42527 [RPC tests] Generate test classes automatically
[RPC tests] Enroll TensorPipe in missing test suites #40823 [RPC tests] Enroll TensorPipe in missing test suites
[RPC tests] Remove global TEST_CONFIG #40822 [RPC tests] Remove global TEST_CONFIG
[RPC tests] Move some functions to methods of fixture #40821 [RPC tests] Move some functions to methods of fixture
[RPC tests] Make generic fixture an abstract base class #40820 [RPC tests] Make generic fixture an abstract base class
[RPC tests] Avoid decorators to skip tests #40819 [RPC tests] Avoid decorators to skip tests
[RPC tests] Merge process group tests into single entry point #40818 [RPC tests] Merge process group tests into single entry point
[RPC tests] Merge tests for faulty agent into single script #40817 [RPC tests] Merge tests for faulty agent into single script
[RPC tests] Merge TensorPipe tests into single entry point #40816 [RPC tests] Merge TensorPipe tests into single entry point

Summary of the entire stack:

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:

Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
These two ways lead to having two separate decorators (@requires_process_group_agent and @_skip_if_tensorpipe_agent) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
Thrift must override the TEST_CONFIG global variable before any other import (in order for the @requires_process_group_agent decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in @dist_init).
There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:

Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:

It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit

This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script.

Differential Revision: D22283178

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

Summary of the entire stack: -- This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems: - Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one. - These two ways lead to having two separate decorators (`@requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given. - Thrift must override the TEST_CONFIG global variable before any other import (in order for the `@requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents. - Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `@dist_init`). - There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS. - Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts. - There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out). - All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste. This refactoring aims to address these problems by: - Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite. - Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to. - Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here). It provides further advantages: - It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe. - It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ... Summary of this commit -- This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script. Differential Revision: [D22283178](https://our.internmc.facebook.com/intern/diff/D22283178/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22283178/)! [ghstack-poisoned]

pritamdamania87 · 2020-07-01T01:13:21Z

test/distributed/rpc/test_faulty_agent.py

+#!/usr/bin/env python3
+import unittest
+
+from torch.testing._internal.common_distributed import MultiProcessTestCase


Similar to my comment in #40816 (comment), is it possible to avoid having separate files and enable different agents based on env variables?

See my comment there.

Summary of the entire stack: -- This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems: - Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one. - These two ways lead to having two separate decorators (`@requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given. - Thrift must override the TEST_CONFIG global variable before any other import (in order for the `@requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents. - Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `@dist_init`). - There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS. - Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts. - There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out). - All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste. This refactoring aims to address these problems by: - Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite. - Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to. - Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here). It provides further advantages: - It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe. - It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ... Summary of this commit -- This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script. Differential Revision: [D22283178](https://our.internmc.facebook.com/intern/diff/D22283178/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22283178/)! [ghstack-poisoned]

dr-ci · 2020-07-01T17:46:43Z

💊 CI failures summary and remediations

As of commit 05cc4f8 (more details on the Dr. CI page):

1/3 failures possibly* introduced in this PR
- 1/1 non-CircleCI failure(s)
2/3 broken upstream at merge base 0f358fa from Aug 04 until Aug 05 (4 commits; 882ad11 - 924a1db)

🚧 2 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.

pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_profiling_test from Aug 04 until Aug 05 (4 commits; 882ad11 - 924a1db)
- 🔁 rerun
pytorch_linux_xenial_py3_6_gcc5_4_ge_config_profiling_test from Aug 04 until Aug 05 (4 commits; 882ad11 - 924a1db)
- 🔁 rerun

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-xenial-rocm3.5.1-py3.6

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 34 times.

Summary of the entire stack: -- This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems: - Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one. - These two ways lead to having two separate decorators (`@requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given. - Thrift must override the TEST_CONFIG global variable before any other import (in order for the `@requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents. - Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `@dist_init`). - There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS. - Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts. - There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out). - All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste. This refactoring aims to address these problems by: - Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite. - Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to. - Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here). It provides further advantages: - It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe. - It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ... Summary of this commit -- This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script. Differential Revision: [D22283178](https://our.internmc.facebook.com/intern/diff/D22283178/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22283178/)! [ghstack-poisoned]

facebook-github-bot · 2020-08-06T00:20:06Z

This pull request has been merged in b93c7c5.

lw requested review from mrshenli, pritamdamania87 and zhaojuanmao as code owners June 30, 2020 20:55

facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Jun 30, 2020

pritamdamania87 reviewed Jul 1, 2020

View reviewed changes

lw mentioned this pull request Jul 1, 2020

[RPC tests] Merge tests for faulty agent into single script #40775

Closed

lw mentioned this pull request Jul 1, 2020

[RPC tests] Fix @_skip_if_tensorpipe always skipping for all agents #40860

Closed

lw added 2 commits July 1, 2020 09:00

lw added 2 commits July 1, 2020 11:41

lw mentioned this pull request Jul 2, 2020

[RPC tests] Fix file descriptor leak #40913

Closed

lw added 4 commits July 3, 2020 07:08

pritamdamania87 approved these changes Jul 31, 2020

View reviewed changes

This was referenced Aug 4, 2020

[RPC tests] Generate test classes automatically #42527

Closed

[RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too #42528

Closed

lw added 2 commits August 5, 2020 01:52

facebook-github-bot closed this in b93c7c5 Aug 5, 2020

facebook-github-bot added the merged label Aug 6, 2020

lw deleted the gh/lw/49/head branch August 6, 2020 16:28

mruberry added the Merged label Oct 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RPC tests] Merge tests for faulty agent into single script #40817

[RPC tests] Merge tests for faulty agent into single script #40817

Uh oh!

lw commented Jun 30, 2020 •

edited

Loading

Uh oh!

pritamdamania87 Jul 1, 2020

Uh oh!

lw Jul 1, 2020

Uh oh!

dr-ci bot commented Jul 1, 2020 •

edited

Loading

Uh oh!

facebook-github-bot commented Aug 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[RPC tests] Merge tests for faulty agent into single script #40817

[RPC tests] Merge tests for faulty agent into single script #40817

Uh oh!

Conversation

lw commented Jun 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of the entire stack:

Summary of this commit

Uh oh!

pritamdamania87 Jul 1, 2020

Choose a reason for hiding this comment

Uh oh!

lw Jul 1, 2020

Choose a reason for hiding this comment

Uh oh!

dr-ci bot commented Jul 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

🚧 2 fixed upstream failures:

ci.pytorch.org: 1 failed

Uh oh!

facebook-github-bot commented Aug 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lw commented Jun 30, 2020 •

edited

Loading

dr-ci bot commented Jul 1, 2020 •

edited

Loading