
Fix Process Group for tensors shared across processes #21449

Closed · wants to merge 13 commits

Conversation

mrshenli (Contributor) commented Jun 6, 2019

Ops on a Process Group (pg) instance will hit an error when input/output tensors are created by a different process, because the pg calls recordStream on the CUDACachingAllocator, which only knows about tensors created within the same process.

The proposed solution is to add a suppressError arg to recordStream (suggestions for better names are welcome). See the comments in the code for the reasoning.

CC @pichuang1984
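
For reference, here is a rough, hypothetical sketch of the failing scenario (not the actual test in this PR), assuming 2 GPUs, NCCL, and a file:// init method: the parent allocates CUDA tensors, shares them with the child processes via torch.multiprocessing (CUDA IPC), and each child runs an allgather on its shared tensor, which triggers recordStream on memory that the child's CUDACachingAllocator never allocated.

import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def child(rank, world_size, init_file, shared):
    dist.init_process_group(
        "nccl", init_method="file://" + init_file, rank=rank, world_size=world_size
    )
    tensor = shared[rank]  # CUDA tensor allocated by the parent process
    outputs = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(outputs, tensor)  # the process group calls recordStream on `tensor`


if __name__ == "__main__":
    world_size = 2
    # One tensor per GPU, allocated in the parent and shared with the children.
    shared = [torch.full((2, 2), float(r), device="cuda:{}".format(r)) for r in range(world_size)]
    with tempfile.NamedTemporaryFile() as f:
        mp.spawn(child, args=(world_size, f.name, shared), nprocs=world_size)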

@pytorchbot added the module: cuda, oncall: distributed, module: internals, and module: nccl labels on Jun 6, 2019
facebook-github-bot (Contributor): @mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mrshenli changed the title from "[WIP][Don't Review Yet] Fix Process Group for tensors shared across processes" to "Fix Process Group for tensors shared across processes" on Jun 6, 2019
mrshenli requested review from ezyang and colesbury on June 6, 2019 17:34
@pytorchbot added the module: rocm label on Jun 6, 2019
mrshenli requested a review from bddppq on June 6, 2019 21:56
ezyang (Contributor) commented Jun 7, 2019

Out of curiosity, why didn't we test for this situation using cast_context on the DataPtr? It seems like a safer test to do.

ezyang (Contributor) commented Jun 7, 2019

You seem to have modified nearly every call site to recordStream; I saw maybe only two missing sites:

torch/csrc/distributed/c10d/ddp.cpp
165:      c10::cuda::CUDACachingAllocator::recordStream(
209:    c10::cuda::CUDACachingAllocator::recordStream(

torch/csrc/autograd/generated/python_variable_methods.cpp
390:  c10::cuda::CUDACachingAllocator::recordStream(data, at::cuda::CUDAStream::unpack(((THCPStream*)arg)->cdata));

and I suspect that these should not error if the pointer is not found. Maybe we should do away with the option / flip the default?


@property
def world_size(self):
    return 2
Contributor: Wouldn't world_size = 2 be a much more idiomatic way to write this?

Contributor Author: Yes, let me make the change.
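
For illustration, a minimal sketch of the suggested form (the test class name here is hypothetical, not the one in this PR):

import unittest


class ExampleProcessGroupTest(unittest.TestCase):
    world_size = 2  # plain class attribute instead of the @property above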

@unittest.skipIf(not TEST_MULTIGPU, "At least 2 CUDA GPUS needed")
@unittest.skipIf(NO_NCCL, "NCCL needed")
def test_shared_allgather_nccl(self):
    with tempfile.NamedTemporaryFile(delete=False) as file:
Contributor: Generally speaking, it's better not to leak temporary files (which is what delete=False does, unless you explicitly delete it later). What is the reasoning for using delete=False?

Contributor Author: I might be wrong, but if I don't add delete=False, it complains that it cannot find the tmp file on delete. I was thinking that maybe the with context and tempfile both try to delete the file when exiting the context, so one of them hits the error. But let me double check whether that is the case, and I will add a comment if so.

Contributor: I would be extremely surprised if that were the case. Here's a simple test:

macbook-pro-116:~ ezyang$ cat test.py
import tempfile

with tempfile.NamedTemporaryFile() as f:
    pass
macbook-pro-116:~ ezyang$ python test.py
macbook-pro-116:~ ezyang$ 

mrshenli (Contributor Author) commented Jun 7, 2019

@ezyang thanks a lot for reviewing!

> Out of curiosity, why didn't we test for this situation using cast_context on the DataPtr? It seems like a safer test to do.

I agree this will test the changes to CUDACachingAllocator, but this is not what users are experiencing. I could add that test, but I still want to keep the multiprocessing test, as the goal of the test is to make sure users don't hit the error anymore when they use a ProcessGroup to work on tensors shared across processes. If we don't have the multiprocessing test, we would not know whether we need additional changes other than modifying CUDACachingAllocator.

> You seem to have modified nearly every call site to recordStream; I saw maybe only two missing sites:

  1. For the recordStream in DDP, I don't see how users would get a gradient tensor from a different process. If they do, I would rather expose the error than silently suppress it, because it could mean we need a bigger change in DDP for certain cases.

  2. For the Python API, I have a similar concern as above. The recordStream there is used in the Scatter function, where the input should not be a shared tensor.

def assert_equal(cls, expected, value):
    assert (expected == value).all().item() == 1, (
        "Expecting tensor value {} but got {}."
    ).format(expected, value)
Contributor: This is a bit suboptimal, because the way this will look to the test runner is that the multiprocessing spawned subprocess died unceremoniously, and you have to look at the tea leaves to see, "Ah, it failed due to an assert." The way test_multiprocessing arranges this is to have the parent process (aka the test runner) always be responsible for doing the actual asserts (where you can do a normal self.assertEquals), and just have the child process pass back the tensors for the parent process to check (or, if the child process must do the test, pass back a boolean saying whether the result worked or not).

What does the test suite output look like when this assert fails?
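
For illustration, a rough sketch of the pattern described above (hypothetical names, not the PR's actual test): the children only report results back, and the parent, i.e. the test runner, performs the asserts with the usual unittest machinery. The worker is a module-level function so the spawn start method can pickle it.

import unittest

import torch
import torch.multiprocessing as mp


def worker(rank, result_queue):
    value = torch.full((2, 2), float(rank))  # stand-in for a real collective result
    result_queue.put((rank, value))


class ExampleTest(unittest.TestCase):
    def test_parent_does_the_asserts(self):
        ctx = mp.get_context("spawn")
        queue = ctx.SimpleQueue()
        procs = [ctx.Process(target=worker, args=(rank, queue)) for rank in range(2)]
        for p in procs:
            p.start()
        for _ in procs:
            rank, value = queue.get()
            self.assertTrue((value == float(rank)).all().item())  # assert in the parent
        for p in procs:
            p.join()


if __name__ == "__main__":
    unittest.main()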

Contributor: Hmm, I guess because you are using multiprocessing.spawn, the exceptions will get propagated backwards.

Contributor Author: I agree. I don't like reinventing the wheel either. Will make the change.

> What does the test suite output look like when this assert fails?

I have only tried error cases in local pytest runs, where the error messages are clear. Let me add a deliberate error to see what the CI test output shows.

).format(expected, value)

# Why classmethod? multiprocessing cannot pickle TestCase subclass when in
# spawn mode. See https://bugs.python.org/issue33884.
ezyang (Contributor), Jun 7, 2019:

FWIW, test_multiprocessing.py does this by just having the test runner methods as honest to goodness top-level functions. This is just an informational comment, since a class method is just as good.

ezyang (Contributor) left a comment:
Please think about whether we should get rid of the suppressError argument and just suppress errors always. All the other comments are just nits.

ezyang (Contributor) commented Jun 7, 2019

> I agree this will test the changes to CUDACachingAllocator, but this is not what users are experiencing. I could add that test, but I still want to keep the multiprocessing test, as the goal of the test is to make sure users don't hit the error anymore when they use a ProcessGroup to work on tensors shared across processes. If we don't have the multiprocessing test, we would not know whether we need additional changes other than modifying CUDACachingAllocator.

Sorry, miscommunication here. I'm referring to how, in the logic of the PR, you determine whether a pointer was produced by the CUDA caching allocator by attempting to look it up in the block map; if it's not in the block map, it's not a CUDA caching allocator pointer. My point is that another way to test whether it was produced by the CUDA caching allocator is to check whether the deleter for the DataPtr in question is the CUDA caching allocator's (you can perform this test using cast_context). I'm not referring to the test suite.

> For the recordStream in DDP, I don't see how users would get a gradient tensor from a different process. If they do, I would rather expose the error than silently suppress it, because it could mean we need a bigger change in DDP for certain cases.

Remember, "different process" isn't the root cause of the problem; if any CUDA tensor we allocate comes from an allocation source that is not the caching allocator, recordStream will fail. So I could very well have defined a backward function that uses some external system (e.g., Numba) to perform the computation and shares back some memory that was allocated by it. This seems perfectly legitimate to me.

Morally, the CUDA caching allocator's recordStream annotations solve a problem that exists specifically because of how the caching allocator is implemented. Arguably, it is a generic interface for all CUDA allocators; it just so happens that other allocators don't need to do anything.
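
To make the point concrete, here is a hypothetical sketch (using CuPy via DLPack as one example of an external allocator; Numba would be analogous): the resulting CUDA tensor's memory is not owned by PyTorch's caching allocator, yet recordStream would still be invoked on it if it flowed through a process group collective or through the Scatter function.

import cupy
import torch
from torch.utils.dlpack import from_dlpack

external = cupy.arange(4, dtype=cupy.float32)  # device memory owned by CuPy's allocator
tensor = from_dlpack(external.toDlpack())      # PyTorch CUDA tensor wrapping that memory

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    result = tensor * 2  # use the externally allocated memory on a non-default stream
# A collective on `tensor` would now call recordStream on memory for which the
# CUDACachingAllocator has no block-map entry.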

mrshenli (Contributor Author) commented Jun 7, 2019

@ezyang if I feel timid about a PR, can I force the test suite to run all 73 tests?

@pytorchbot added the module: ci and module: tests labels on Jun 10, 2019

facebook-github-bot (Contributor): @mrshenli merged this pull request in 25d1496.
