
torch.cuda.synchronize blocks CUDA execution on other threads using other devices. #24963

Open
heiner opened this issue Aug 21, 2019 · 8 comments
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@heiner

heiner commented Aug 21, 2019

🐛 Bug

In a situation in which different Python threads execute CUDA operations on different devices, calling torch.cuda.synchronize blocks CUDA execution on all threads, including those using other CUDA devices.

To Reproduce

  1. git clone https://gist.github.com/c812a38a338878f5c02f6193511afc6a.git cudasync
  2. cd cudasync/
  3. OMP_NUM_THREADS=1 python cudasync.py (produces trace file)

Expected behavior

torch.cuda.synchronize(device=my_device) should not affect execution of CUDA operations on devices other than my_device.
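
A minimal sketch of the setup being described (assumed, simplified from the linked gist; the worker function and tensor sizes here are illustrative, not the gist's exact contents): two threads issue work on different GPUs, and each synchronizes only its own device.

    import threading
    import torch

    def worker(device_str):
        device = torch.device(device_str)
        x = torch.randn(1024, 1024, device=device)
        for _ in range(100):
            y = x @ x  # queue some work on this device's default stream
            # Synchronizing only this device should, in principle, leave the
            # other thread's device untouched.
            torch.cuda.synchronize(device=device)

    threads = [threading.Thread(target=worker, args=(f"cuda:{i}",)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()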

Environment

PyTorch version: 1.3.0.dev20190816
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.16.4
[pip] torch==1.3.0.dev20190816
[pip] torchvision==0.5.0a0+19315e3
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl_fft                   1.0.12           py37ha843d7b_0
[conda] mkl_random                1.0.2            py37hd81dba3_0
[conda] pytorch                   1.3.0.dev20190816 py3.7_cuda10.0.130_cudnn7.6.2_0    pytorch-nightly
[conda] torchvision               0.5.0.dev20190816      py37_cu100    pytorch-nightly

Additional context

Trace file: cudasync.trace.gz

This probably isn't a GIL issue as it doesn't seem to happen when the other threads execute CPU PyTorch operations.

Perfetto link to trace: https://ui.perfetto.dev/#!/?s=76397c96cea6a47c45aed36cd84586cf54469d34089d3578afb7e795219229

Screenshot: (image attachment)

@zhangguanheng66
Contributor

@VitalyFedyunin

@zhangguanheng66 added the module: cuda and triaged labels Aug 21, 2019
@soumith
Member

soumith commented Aug 23, 2019

As far as I know, these are expected CUDA semantics: it synchronizes the entire device context in the process, at the driver level.
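
For reference, torch.cuda.synchronize(device=...) waits for all work on that device; if a device-wide wait turns out to be too coarse, a narrower alternative is to wait on a single stream. A minimal sketch, assuming the default stream is the one carrying the work:

    import torch

    device = torch.device("cuda:0")
    stream = torch.cuda.current_stream(device)  # the stream this thread enqueues work on
    stream.synchronize()                        # blocks the caller only until work on this stream finishes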

@heiner
Author

heiner commented Aug 23, 2019

Block execution on a different device?

@soumith
Member

soumith commented Aug 23, 2019

I misread that. That sounds suspicious. cc @csarofeen @ptrblck, any ideas what's up?

@csarofeen
Contributor

Does sound suspicious, we'll have to take a look.

@csarofeen
Contributor

@ptrblck will take a look at this.

@ptrblck
Collaborator

ptrblck commented Dec 21, 2019

I've taken multiple shots at this issue and tried to reproduce it.
However, I cannot reproduce any cross-device blocking of CUDA ops when using torch.multiprocessing, so my best guess is that it is related to Python's multi-threading.

@heiner I also cannot see the synchronizations in the provided profile, so I used nsight-systems instead. Also, it seems you've just profiled the randint creation, not the complete forward/backward pass. Could you give me some more information about the use case, so that I could continue debugging?
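
For context, a sketch of the kind of multi-process check described above (the exact script is not shown in the thread, so the worker and tensor sizes below are assumptions): each process owns one device and has its own interpreter, so a synchronize in one process cannot be held up by another process's Python threads.

    import torch
    import torch.multiprocessing as mp

    def worker(rank):
        device = torch.device(f"cuda:{rank}")
        x = torch.randn(1024, 1024, device=device)
        for _ in range(100):
            y = x @ x
            torch.cuda.synchronize(device=device)

    if __name__ == "__main__":
        mp.spawn(worker, nprocs=2)  # one process per GPU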

@heiner
Author

heiner commented Dec 23, 2019

Hey @ptrblck, thanks for taking a stab at this!

I am not surprised using multiple devices works fine with torch.multiprocessing. This bug is about multi-threading. In our use case, the data we consume and learn from is itself generated using a PyTorch module. This is common in reinforcement learning (multiple "actors" consume environment outputs and produce actions, while a centralized "learner" consumes all the actor inputs and outputs and updates the weights). In that setting, multi-threading is a much more natural fit, while getting this setup to work well with multiprocessing is tricky and probably requires additional memcopies.

As for only profiling the "randint": Note that the line in https://gist.github.com/heiner/c812a38a338878f5c02f6193511afc6a#file-cudasync-py-L76

    with torch.autograd.profiler.record_function("randint"):

is only an (optional) annotation of that statement, not profiling only that block. The statement that requests profiling of the overall program should be https://gist.github.com/heiner/c812a38a338878f5c02f6193511afc6a#file-cudasync-py-L137

    with torch.autograd.profiler.profile() as prof:
        train()
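
To illustrate how the two context managers relate (a hypothetical snippet, not taken from the gist): profile() captures everything executed inside it, while record_function() only attaches a label to a sub-region of that capture.

    import torch

    with torch.autograd.profiler.profile() as prof:                # profiles the whole block
        with torch.autograd.profiler.record_function("randint"):   # just a named annotation
            x = torch.randint(0, 10, (4, 4))
        y = x * 2                                                  # still profiled, merely unlabeled

    print(prof.key_averages().table())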

Now I agree with your assessment that this bug might not be an issue with CUDA synchronization but rather about the GIL. Notice though that not using CUDA creates a different profiling picture, namely one where not all threads are blocked at the same time. Could it be the case that some CUDA-specific codepath in PyTorch is holding the GIL in a situation where that's not necessary?
