
torch.cuda.synchronize blocks CUDA execution on other threads using other devices. #24963

Open
heiner opened this issue Aug 21, 2019 · 8 comments
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@heiner

heiner commented Aug 21, 2019

🐛 Bug

In a situation in which different Python threads execute CUDA operations on different devices, calling torch.cuda.synchronize blocks CUDA execution on all threads, including those using other CUDA devices.

To Reproduce

  1. git clone https://gist.github.com/c812a38a338878f5c02f6193511afc6a.git cudasync
  2. cd cudasync/
  3. OMP_NUM_THREADS=1 python cudasync.py (produces trace file)

Expected behavior

torch.cuda.synchronize(device=my_device) should not affect execution of CUDA operations on devices other than my_device.
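
A minimal sketch of the setup being described (assumed, simplified from the linked gist; the worker function and tensor sizes here are illustrative, not the gist's exact contents): two threads issue work on different GPUs, and each synchronizes only its own device.

    import threading
    import torch

    def worker(device_str):
        device = torch.device(device_str)
        x = torch.randn(1024, 1024, device=device)
        for _ in range(100):
            y = x @ x  # queue some work on this device's default stream
            # Synchronizing only this device should, in principle, leave the
            # other thread's device untouched.
            torch.cuda.synchronize(device=device)

    threads = [threading.Thread(target=worker, args=(f"cuda:{i}",)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()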

Environment

PyTorch version: 1.3.0.dev20190816
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.16.4
[pip] torch==1.3.0.dev20190816
[pip] torchvision==0.5.0a0+19315e3
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl_fft                   1.0.12           py37ha843d7b_0
[conda] mkl_random                1.0.2            py37hd81dba3_0
[conda] pytorch                   1.3.0.dev20190816 py3.7_cuda10.0.130_cudnn7.6.2_0    pytorch-nightly
[conda] torchvision               0.5.0.dev20190816      py37_cu100    pytorch-nightly

Additional context

Trace file: cudasync.trace.gz

This probably isn't a GIL issue as it doesn't seem to happen when the other threads execute CPU PyTorch operations.

Perfetto link to trace: https://ui.perfetto.dev/#!/?s=76397c96cea6a47c45aed36cd84586cf54469d34089d3578afb7e795219229

Screenshot: (image attachment)

@zhangguanheng66
Contributor

@VitalyFedyunin

@zhangguanheng66 added the module: cuda and triaged labels Aug 21, 2019
@soumith
Member

soumith commented Aug 23, 2019

As far as I know, these are expected CUDA semantics: it synchronizes the entire device context in the process, at the driver level.
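
For reference, torch.cuda.synchronize(device=...) waits for all work on that device; if a device-wide wait turns out to be too coarse, a narrower alternative is to wait on a single stream. A minimal sketch, assuming the default stream is the one carrying the work:

    import torch

    device = torch.device("cuda:0")
    stream = torch.cuda.current_stream(device)  # the stream this thread enqueues work on
    stream.synchronize()                        # blocks the caller only until work on this stream finishes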

@heiner
Author

heiner commented Aug 23, 2019

Block execution on a different device?

@soumith
Member

soumith commented Aug 23, 2019

I misread that. That sounds suspicious. cc @csarofeen @ptrblck, any ideas what's up?

@csarofeen
Contributor

Does sound suspicious, we'll have to take a look.

@csarofeen
Contributor

@ptrblck will take a look at this.

@ptrblck
Collaborator

ptrblck commented Dec 21, 2019

I've taken multiple shots at this issue and tried to reproduce it.
However, I cannot reproduce any cross-device blocking of CUDA ops when using torch.multiprocessing, so my best guess is that it is related to Python's multi-threading.

@heiner I also cannot see the synchronizations in the provided profile, so I used nsight-systems instead. Also, it seems you've just profiled the randint creation, not the complete forward/backward pass. Could you give me some more information about the use case, so that I could continue debugging?
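
For context, a sketch of the kind of multi-process check described above (the exact script is not shown in the thread, so the worker and tensor sizes below are assumptions): each process owns one device and has its own interpreter, so a synchronize in one process cannot be held up by another process's Python threads.

    import torch
    import torch.multiprocessing as mp

    def worker(rank):
        device = torch.device(f"cuda:{rank}")
        x = torch.randn(1024, 1024, device=device)
        for _ in range(100):
            y = x @ x
            torch.cuda.synchronize(device=device)

    if __name__ == "__main__":
        mp.spawn(worker, nprocs=2)  # one process per GPU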

@heiner
Author

heiner commented Dec 23, 2019

Hey @ptrblck, thanks for taking a stab at this!

I am not surprised using multiple devices works fine with torch.multiprocessing. This bug is about multi-threading. In our use case, the data we consume and learn from is itself generated using a PyTorch module. This is common in reinforcement learning (multiple "actors" consume environment outputs and produce actions, while a centralized "learner" consumes all the actor inputs and outputs and updates the weights). In that setting, multi-threading is a much more natural fit, while getting this setup to work well with multiprocessing is tricky and probably requires additional memcopies.

As for only profiling the "randint": Note that the line in https://gist.github.com/heiner/c812a38a338878f5c02f6193511afc6a#file-cudasync-py-L76

    with torch.autograd.profiler.record_function("randint"):

is only an (optional) annotation of that statement, not profiling only that block. The statement that requests profiling of the overall program should be https://gist.github.com/heiner/c812a38a338878f5c02f6193511afc6a#file-cudasync-py-L137

    with torch.autograd.profiler.profile() as prof:
        train()
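
To illustrate how the two context managers relate (a hypothetical snippet, not taken from the gist): profile() captures everything executed inside it, while record_function() only attaches a label to a sub-region of that capture.

    import torch

    with torch.autograd.profiler.profile() as prof:                # profiles the whole block
        with torch.autograd.profiler.record_function("randint"):   # just a named annotation
            x = torch.randint(0, 10, (4, 4))
        y = x * 2                                                  # still profiled, merely unlabeled

    print(prof.key_averages().table())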

Now I agree with your assessment that this bug might not be an issue with CUDA synchronization but rather about the GIL. Notice though that not using CUDA creates a different profiling picture, namely one where not all threads are blocked at the same time. Could it be the case that some CUDA-specific codepath in PyTorch is holding the GIL in a situation where that's not necessary?
