
Significant perf reduction due to Python GIL contention with dataloader pinning thread. #77139

Closed
suffiank opened this issue May 10, 2022 · 15 comments
Labels
module: cuda Related to torch.cuda, and CUDA support in general module: dataloader Related to torch.utils.data.DataLoader and Sampler module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@suffiank

suffiank commented May 10, 2022

🐛 Describe the bug

I am observing 50+ ms bubbles in the GPU streams due to the pinning thread freezing out the main threads. The entire training pass would be 335 ms if the freezing were not present, so it makes for a significant performance reduction. It's worse still in a many-GPU environment where the GPUs need to synchronize each pass.

Environment: 8 x Nvidia A100.
Kernel: Linux 5.13.0-1024-gcp and Ubuntu 20.04.
Docker container: nvcr.io/nvidia/pytorch:22.03-py3 (contains torch 1.12.0a0+2c916ef)
Script: Running ViT-small training from DINO with ImageNet data.

This screen grab from Nvidia Nsight traces the bubbles in the GPU stream back to the pinning thread (labeled "receive data" on left) freezing out the main threads (labeled "PyT main" and "PyT b/w" on left):
[Nsight screenshot]

From observing the backtrace, the pinning thread likely holds the Python GIL while invoking a tensor destructor on this line:

It looks to me like the tensors on the data queue go out of scope here. These tensors are passed to the pinning thread from one of the worker processes and live in a shared memory region. The tensor destructor spends a long time in the munmap syscall.

To test the theory I made a hack to the torch dataloader that coerces tensor deletion to happen in the worker process rather than the pinning thread. The bubbles went away.

Here's the GPU activity on all eight A100 streams before (bubbles circled):
[Nsight screenshot]

Here's after the hack (no observable bubbles):
[Nsight screenshot]

As an aside, if you further replace the torch AdamW optimizer with Nvidia's Apex version, the timeline is clean:
[Nsight screenshot]

Can we ensure there is no thread contention between the pinning thread and the main threads? Is there any reason the worker process can't destruct the tensors on the data queue instead of the pinning thread?

Versions

Collecting environment information...
PyTorch version: 1.12.0a0+2c916ef
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-1024-gcp-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
GPU 4: A100-SXM4-40GB
GPU 5: A100-SXM4-40GB
GPU 6: A100-SXM4-40GB
GPU 7: A100-SXM4-40GB

Nvidia driver version: 450.172.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.12.0a0+2c916ef
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchtext==0.12.0a0
[pip3] torchvision==0.13.0a0
[conda] magma-cuda110 2.5.2 5 local
[conda] mkl 2019.5 281 conda-forge
[conda] mkl-include 2019.5 281 conda-forge
[conda] numpy 1.22.3 py38h05e7239_0 conda-forge
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] torch 1.12.0a0+2c916ef pypi_0 pypi
[conda] torch-tensorrt 1.1.0a0 pypi_0 pypi
[conda] torchtext 0.12.0a0 pypi_0 pypi
[conda] torchvision 0.13.0a0 pypi_0 pypi

cc @VitalyFedyunin @ngimel @ssnl @ejguan @NivekT

@samdow samdow added module: performance Issues related to performance, either of kernel code or framework glue module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 10, 2022
@ngimel
Collaborator

ngimel commented May 10, 2022

cc @ptrblck

@suffiank
Author

@VitalyFedyunin @ptrblck Any thoughts or plans? I suspect this thread contention affects other workloads as well. I recall a past performance drop for HuggingFace BERT on a single V100 that was also due to data loader thread contention.

Another potential fix, apart from having the data worker destruct the tensors on the data queue, might be dropping the Python GIL during tensor destruction, though I'm not sure how safe the latter is.

@suffiank
Author

suffiank commented Jun 8, 2022

@ngimel Is anyone interested in fixing this?

@ngimel ngimel added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label Jun 13, 2022
@ngimel
Collaborator

ngimel commented Jun 13, 2022

We will accept a PR fixing this

@ngimel
Collaborator

ngimel commented Jun 13, 2022

cc @ejguan

@suffiank
Author

Thanks @ngimel. I will see if I can submit a simple but viable fix in which the data worker does the tensor destruction. It might be a week before I get time to work on it.

@zhengwy888
Contributor

Curious if anyone has an idea on how to get around tensor destruction blocking the GIL? We found the GIL was held for 200 ms during each iteration because a couple of big tensors had to be munmap'd. Since I am not familiar with PyTorch internals, these ideas might be really bad:

  1. Create another process for the purpose of tensor destruction.
  2. Transfer ownership of the tensors to a C++ Tensor object, so that no GIL is held during destruction. I am not sure if this ownership is even transferable, though.
  3. Use TorchScript to destroy the storage and hope that TorchScript doesn't hold the GIL.

@ejguan
Contributor

ejguan commented Aug 16, 2022

IIUC, this question is more about how to delete a Tensor without blocking the main process. Do you have any recommendations on this topic? @ezyang

@ezyang
Contributor

ezyang commented Aug 16, 2022

If it's fine for the thread doing the munmap to block on it, then the fix is very simple: just release the GIL before doing the munmap. Does anyone know which munmap is the relevant one from the traces?

If we need to arrange for a different thread to actually do the munmap, that's a little more complicated, but still doable. In any case the lesson is clear: don't hold a lock when doing expensive operations.
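
A minimal sketch of the first option, assuming the caller currently holds the GIL and that nothing in the guarded region touches Python state (the helper name and signature below are hypothetical, not PyTorch code):

#include <pybind11/pybind11.h>
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: unmap a large shared-memory block with the GIL
// released, so other Python threads (e.g. the training loop) keep running
// while the slow munmap syscall completes.
void unmap_shared_block_without_gil(void* base, std::size_t length) {
  pybind11::gil_scoped_release no_gil;  // GIL released until end of scope
  munmap(base, length);                 // expensive syscall; needs no GIL
}

The guard reacquires the GIL when it goes out of scope, so callers see no behavioral change beyond the lock not being held during the syscall.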

@igozali

igozali commented Aug 16, 2022

Happened to be debugging this, seems like it might be this one?

#0  0x00001555552aaadb in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x000015550857936c in at::MapAllocator::close (this=0x154df15b5a80) at ../aten/src/ATen/MapAllocator.cpp:411
#2  0x00001555085795bf in at::MapAllocator::~MapAllocator (this=0x154df15b5a80, __in_chrg=<optimized out>) at ../aten/src/ATen/MapAllocator.cpp:609
#3  0x000015550857960d in at::MapAllocator::~MapAllocator (this=0x154df15b5a80, __in_chrg=<optimized out>) at ../aten/src/ATen/MapAllocator.cpp:607
#4  0x000015550f758c7c in std::unique_ptr<void, void (*)(void*)>::reset (__p=<optimized out>, this=0x154df15b5ce0) at /usr/include/c++/9/bits/move.h:182
#5  std::unique_ptr<void, void (*)(void*)>::operator=(decltype(nullptr)) (this=0x154df15b5ce0) at /usr/include/c++/9/bits/unique_ptr.h:336
#6  c10::detail::UniqueVoidPtr::clear (this=0x154df15b5cd8) at ../c10/util/UniqueVoidPtr.h:54
#7  c10::DataPtr::clear (this=0x154df15b5cd8) at ../c10/core/Allocator.h:36
#8  c10::StorageImpl::release_resources (this=0x154df15b5cc0) at ../c10/core/StorageImpl.h:89
#9  0x00001554eb73ed25 in c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::reset_ (this=<synthetic pointer>)
    at ../c10/util/Exception.h:437
#10 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::~intrusive_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>)
    at ../c10/util/intrusive_ptr.h:366
#11 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::operator=<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&) & (rhs=..., this=0x154df15b5d98)
    at ../c10/util/intrusive_ptr.h:378
#12 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::operator=(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&) & (rhs=..., this=0x154df15b5d98) at ../c10/util/intrusive_ptr.h:370
#13 c10::Storage::operator= (this=0x154df15b5d98) at ../c10/core/Storage.h:7
#14 c10::TensorImpl::release_resources (this=0x154df15b5d80) at ../c10/core/TensorImpl.cpp:372
#15 0x000015550f654439 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_ (this=0x1550c59ef918) at /usr/include/c++/9/bits/atomic_base.h:549
#16 0x000015550f95f2b3 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr (this=0x1550c59ef918, __in_chrg=<optimized out>)
    at ../c10/util/intrusive_ptr.h:365
#17 at::TensorBase::~TensorBase (this=0x1550c59ef918, __in_chrg=<optimized out>) at ../aten/src/ATen/core/TensorBase.h:77
#18 at::Tensor::~Tensor (this=0x1550c59ef918, __in_chrg=<optimized out>) at aten/src/ATen/core/TensorBody.h:85
#19 c10::MaybeOwned<at::Tensor>::operator= (rhs=..., this=0x1550c59ef910) at ../c10/util/MaybeOwned.h:145
#20 c10::MaybeOwned<at::Tensor>::operator= (rhs=..., this=0x1550c59ef910) at ../c10/util/MaybeOwned.h:138
#21 THPVariable_clear (self=self@entry=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:292
#22 0x000015550f95f635 in THPVariable_subclass_dealloc (self=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:1383
#23 THPVariable_subclass_dealloc (self=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:1296
#24 0x000055555566a888 in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.8.13/Objects/object.c:2207
#25 _Py_DECREF (op=<optimized out>, lineno=541, filename=<synthetic pointer>) at /usr/local/src/conda/python-3.8.13/Include/object.h:478
#26 _Py_XDECREF (op=<optimized out>) at /usr/local/src/conda/python-3.8.13/Include/object.h:541
...

@ezyang
Contributor

ezyang commented Aug 17, 2022

hmm, I wonder if we can safely release gil in tp_clear...

@igozali

igozali commented Aug 17, 2022

THPVariable_clear seems like a decent place to put it, if there's no concern that someone would accidentally add code that touches Python objects around there. Based on discussions with some other folks, a good place to slap in a pybind11::gil_scoped_release no_gil; seems to be right before this line?

At the same time, I'm wondering whether it would be safer to minimize the scope of the GIL release to the munmap call, e.g. by putting the GIL release in at::MapAllocator::close or at::MapAllocator::~MapAllocator?

@ezyang
Contributor

ezyang commented Aug 17, 2022

Give it a try. I'm not sure it's entirely kosher but maybe you can play around with it.

A slightly safer thing to do is to move the Variable out of the location (setting it to empty while the GIL is held), and THEN destruct it without the GIL.
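
A rough sketch of that pattern, with hypothetical names (not the code from the eventual PR); it assumes the caller holds the GIL and that dropping the last tensor reference touches no Python objects:

#include <pybind11/pybind11.h>
#include <ATen/ATen.h>
#include <utility>

// Empty the slot while the GIL is still held, then let the potentially
// expensive destruction (refcount drop -> StorageImpl release -> munmap)
// run with the GIL released.
void clear_tensor_slot_without_gil(at::Tensor& slot) {
  at::Tensor doomed = std::move(slot);  // GIL held: slot no longer owns it
  slot = at::Tensor();                  // leave an explicitly empty tensor
  pybind11::gil_scoped_release no_gil;  // GIL released for the slow part
  doomed = at::Tensor();                // last reference dropped here
}

Because no_gil reacquires the GIL at the end of the scope, the remaining destruction of the now-empty doomed local happens with the GIL held again, which is cheap and harmless.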

@albanD
Collaborator

albanD commented Aug 17, 2022

hmm, I wonder if we can safely release gil in tp_clear...

Sam says that it is ok to locally release the gil there.

pytorchmergebot pushed a commit that referenced this issue Aug 17, 2022
Fixes #77139, where deallocating large tensors with munmap takes a significant amount of time while holding the GIL. This causes the pin_memory thread to interfere with the main thread = performance sadness.

Thanks @igozali @zhengwy888 @colesbury as well.
Pull Request resolved: #83623
Approved by: https://github.com/albanD
pytorchmergebot pushed a commit to cchan/pytorch that referenced this issue Aug 18, 2022
@suffiank
Author

Thanks folks. It's nice to see this likely fixed. Sorry this fell off my radar after I wasn't able to focus on it.

facebook-github-bot pushed a commit that referenced this issue Aug 19, 2022
Summary:
Fixes #77139, where deallocating large tensors with munmap takes a significant amount of time while holding the GIL. This causes the pin_memory thread to interfere with the main thread = performance sadness.

Thanks igozali zhengwy888 colesbury as well.

Pull Request resolved: #83623
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/abcf01196cd27805349aa892db847f9a61f52c0e

Reviewed By: atalman

Differential Revision: D38831020

Pulled By: atalman

fbshipit-source-id: ed3d7f20e2872e2aa677d4d36abed7d73e9bbae1
pruthvistony added a commit to ROCm/pytorch that referenced this issue Nov 24, 2022
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2022
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2022
jithunnair-amd added a commit to ROCm/pytorch that referenced this issue Mar 8, 2023
jithunnair-amd added a commit to ROCm/pytorch that referenced this issue Mar 8, 2023
pruthvistony added a commit to ROCm/pytorch that referenced this issue Mar 10, 2023
…ch#77139 (pytorch#83623)""

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.
pruthvistony added a commit to ROCm/pytorch that referenced this issue Mar 13, 2023
…ch#77139 (pytorch#83623)"" (#1199)

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.
pruthvistony added a commit to ROCm/pytorch that referenced this issue May 3, 2023
jataylo pushed a commit to ROCm/pytorch that referenced this issue Jul 12, 2023
…ch#77139 (pytorch#83623)"" (#1199)

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.