
Fix Process Group for tensors shared across processes #21449

Closed · wants to merge 13 commits

Conversation

mrshenli (Contributor) commented Jun 6, 2019

Ops on a Process Group (pg) instance will hit an error when input/output tensors are created by a different process, because the pg calls recordStream on the CUDACachingAllocator, which only knows about tensors created within the same process.

The proposed solution is to add a suppressError arg to recordStream (suggestions for better names are welcome). See the comments in the code for the reasoning.

CC @pichuang1984
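
For reference, here is a rough, hypothetical sketch of the failing scenario (not the actual test in this PR), assuming 2 GPUs, NCCL, and a file:// init method: the parent allocates CUDA tensors, shares them with the child processes via torch.multiprocessing (CUDA IPC), and each child runs an allgather on its shared tensor, which triggers recordStream on memory that the child's CUDACachingAllocator never allocated.

import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def child(rank, world_size, init_file, shared):
    dist.init_process_group(
        "nccl", init_method="file://" + init_file, rank=rank, world_size=world_size
    )
    tensor = shared[rank]  # CUDA tensor allocated by the parent process
    outputs = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(outputs, tensor)  # the process group calls recordStream on `tensor`


if __name__ == "__main__":
    world_size = 2
    # One tensor per GPU, allocated in the parent and shared with the children.
    shared = [torch.full((2, 2), float(r), device="cuda:{}".format(r)) for r in range(world_size)]
    with tempfile.NamedTemporaryFile() as f:
        mp.spawn(child, args=(world_size, f.name, shared), nprocs=world_size)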

@pytorchbot added the module: cuda, oncall: distributed, module: internals, and module: nccl labels on Jun 6, 2019
facebook-github-bot (Contributor): @mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mrshenli changed the title from "[WIP][Don't Review Yet] Fix Process Group for tensors shared across processes" to "Fix Process Group for tensors shared across processes" on Jun 6, 2019
mrshenli requested review from ezyang and colesbury on June 6, 2019 17:34
@pytorchbot added the module: rocm label on Jun 6, 2019
mrshenli requested a review from bddppq on June 6, 2019 21:56
ezyang (Contributor) commented Jun 7, 2019

Out of curiosity, why didn't we test for this situation using cast_context on the DataPtr? It seems like a safer test to do.

ezyang (Contributor) commented Jun 7, 2019

You seem to have modified nearly every call site to recordStream; I saw maybe only two missing sites:

torch/csrc/distributed/c10d/ddp.cpp
165:      c10::cuda::CUDACachingAllocator::recordStream(
209:    c10::cuda::CUDACachingAllocator::recordStream(

torch/csrc/autograd/generated/python_variable_methods.cpp
390:  c10::cuda::CUDACachingAllocator::recordStream(data, at::cuda::CUDAStream::unpack(((THCPStream*)arg)->cdata));

and I suspect that these should not error if the pointer is not found. Maybe we should do away with the option / flip the default?


@property
def world_size(self):
    return 2
Contributor: Wouldn't world_size = 2 be a much more idiomatic way to write this?

Contributor Author: Yes, let me make the change.
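
For illustration, a minimal sketch of the suggested form (the test class name here is hypothetical, not the one in this PR):

import unittest


class ExampleProcessGroupTest(unittest.TestCase):
    world_size = 2  # plain class attribute instead of the @property above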

@unittest.skipIf(not TEST_MULTIGPU, "At least 2 CUDA GPUS needed")
@unittest.skipIf(NO_NCCL, "NCCL needed")
def test_shared_allgather_nccl(self):
    with tempfile.NamedTemporaryFile(delete=False) as file:
Contributor: Generally speaking, it's better not to leak temporary files (which is what delete=False does, unless you explicitly delete it later). What is the reasoning for using delete=False?

Contributor Author: I might be wrong, but if I don't add delete=False, it complains that it cannot find the tmp file on delete. I was thinking that maybe the with context and tempfile both try to delete the file when exiting the context, so one of them hits the error. But let me double check whether that is the case, and I will add a comment if so.

Contributor: I would be extremely surprised if that were the case. Here's a simple test:

macbook-pro-116:~ ezyang$ cat test.py
import tempfile

with tempfile.NamedTemporaryFile() as f:
    pass
macbook-pro-116:~ ezyang$ python test.py
macbook-pro-116:~ ezyang$ 

mrshenli (Contributor Author) commented Jun 7, 2019

@ezyang thanks a lot for reviewing!

> Out of curiosity, why didn't we test for this situation using cast_context on the DataPtr? It seems like a safer test to do.

I agree this will test the changes to CUDACachingAllocator, but this is not what users are experiencing. I could add that test, but I still want to keep the multiprocessing test, as the goal of the test is to make sure users don't hit the error anymore when they use a ProcessGroup to work on tensors shared across processes. If we don't have the multiprocessing test, we would not know whether we need additional changes other than modifying CUDACachingAllocator.

> You seem to have modified nearly every call site to recordStream; I saw maybe only two missing sites:

  1. For the recordStream in DDP, I don't see how users would get a gradient tensor from a different process. If they do, I would rather expose the error than silently suppress it, because it could mean we need a bigger change in DDP for certain cases.

  2. For the Python API, I have a similar concern as above. The recordStream there is used in the Scatter function, where the input should not be a shared tensor.

def assert_equal(cls, expected, value):
    assert (expected == value).all().item() == 1, (
        "Expecting tensor value {} but got {}."
    ).format(expected, value)
Contributor: This is a bit suboptimal, because the way this will look to the test runner is that the multiprocessing spawned subprocess died unceremoniously, and you have to look at the tea leaves to see, "Ah, it failed due to an assert." The way test_multiprocessing arranges this is to have the parent process (aka the test runner) always be responsible for doing the actual asserts (where you can do a normal self.assertEquals), and just have the child process pass back the tensors for the parent process to check (or, if the child process must do the test, pass back a boolean saying whether the result worked or not).

What does the test suite output look like when this assert fails?
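
For illustration, a rough sketch of the pattern described above (hypothetical names, not the PR's actual test): the children only report results back, and the parent, i.e. the test runner, performs the asserts with the usual unittest machinery. The worker is a module-level function so the spawn start method can pickle it.

import unittest

import torch
import torch.multiprocessing as mp


def worker(rank, result_queue):
    value = torch.full((2, 2), float(rank))  # stand-in for a real collective result
    result_queue.put((rank, value))


class ExampleTest(unittest.TestCase):
    def test_parent_does_the_asserts(self):
        ctx = mp.get_context("spawn")
        queue = ctx.SimpleQueue()
        procs = [ctx.Process(target=worker, args=(rank, queue)) for rank in range(2)]
        for p in procs:
            p.start()
        for _ in procs:
            rank, value = queue.get()
            self.assertTrue((value == float(rank)).all().item())  # assert in the parent
        for p in procs:
            p.join()


if __name__ == "__main__":
    unittest.main()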

Contributor: Hmm, I guess because you are using multiprocessing.spawn, the exceptions will get propagated backwards.

Contributor Author: I agree. I don't like reinventing the wheel either. Will make the change.

> What does the test suite output look like when this assert fails?

I have only tried error cases in local pytest runs, where the error messages are clear. Let me add a deliberate error to see what the CI test output shows.

).format(expected, value)

# Why classmethod? multiprocessing cannot pickle TestCase subclass when in
# spawn mode. See https://bugs.python.org/issue33884.
ezyang (Contributor), Jun 7, 2019:

FWIW, test_multiprocessing.py does this by just having the test runner methods as honest to goodness top-level functions. This is just an informational comment, since a class method is just as good.

ezyang (Contributor) left a comment:
Please think about whether we should get rid of the suppressError argument and just suppress errors always. All the other comments are just nits.

ezyang (Contributor) commented Jun 7, 2019

> I agree this will test the changes to CUDACachingAllocator, but this is not what users are experiencing. I could add that test, but I still want to keep the multiprocessing test, as the goal of the test is to make sure users don't hit the error anymore when they use a ProcessGroup to work on tensors shared across processes. If we don't have the multiprocessing test, we would not know whether we need additional changes other than modifying CUDACachingAllocator.

Sorry, miscommunication here. I'm referring to how, in the logic of the PR, you determine whether a pointer was produced by the CUDA caching allocator by attempting to look it up in the block map; if it's not in the block map, it's not a CUDA caching allocator pointer. My point is that another way to test whether it was produced by the CUDA caching allocator is to check whether the deleter for the DataPtr in question is the CUDA caching allocator's (you can perform this test using cast_context). I'm not referring to the test suite.

> For the recordStream in DDP, I don't see how users would get a gradient tensor from a different process. If they do, I would rather expose the error than silently suppress it, because it could mean we need a bigger change in DDP for certain cases.

Remember, "different process" isn't the root cause of the problem; if any CUDA tensor we allocate comes from an allocation source that is not the caching allocator, recordStream will fail. So I could very well have defined a backward function that uses some external system (e.g., Numba) to perform the computation and shares back some memory that was allocated by it. This seems perfectly legitimate to me.

Morally, the CUDA caching allocator's recordStream annotations solve a problem that exists specifically because of how the caching allocator is implemented. Arguably, it is a generic interface for all CUDA allocators; it just so happens that other allocators don't need to do anything.
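
To make the point concrete, here is a hypothetical sketch (using CuPy via DLPack as one example of an external allocator; Numba would be analogous): the resulting CUDA tensor's memory is not owned by PyTorch's caching allocator, yet recordStream would still be invoked on it if it flowed through a process group collective or through the Scatter function.

import cupy
import torch
from torch.utils.dlpack import from_dlpack

external = cupy.arange(4, dtype=cupy.float32)  # device memory owned by CuPy's allocator
tensor = from_dlpack(external.toDlpack())      # PyTorch CUDA tensor wrapping that memory

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    result = tensor * 2  # use the externally allocated memory on a non-default stream
# A collective on `tensor` would now call recordStream on memory for which the
# CUDACachingAllocator has no block-map entry.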

mrshenli (Contributor Author) commented Jun 7, 2019

@ezyang if I feel timid about a PR, can I force the test suite to run all 73 tests?

@pytorchbot added the module: ci and module: tests labels on Jun 10, 2019

facebook-github-bot (Contributor): @mrshenli merged this pull request in 25d1496.
