
Significant perf reduction due to Python GIL contention with dataloader pinning thread. #77139

Closed
suffiank opened this issue May 10, 2022 · 15 comments
Labels
module: cuda Related to torch.cuda, and CUDA support in general module: dataloader Related to torch.utils.data.DataLoader and Sampler module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@suffiank

suffiank commented May 10, 2022

🐛 Describe the bug

I am observing 50+ ms bubbles in the GPU streams due to the pinning thread freezing out the main threads. The entire training pass would be 335 ms if the freezing were not present, so it makes for a significant performance reduction. It's worse still in a many-GPU environment where the GPUs need to synchronize each pass.

Environment: 8 x Nvidia A100.
Kernel: Linux 5.13.0-1024-gcp and Ubuntu 20.04.
Docker container: nvcr.io/nvidia/pytorch:22.03-py3 (contains torch 1.12.0a0+2c916ef)
Script: Running ViT-small training from DINO with ImageNet data.

This screen grab from Nvidia Nsight traces the bubbles in the GPU stream back to the pinning thread (labeled "receive data" on left) freezing out the main threads (labeled "PyT main" and "PyT b/w" on left):
[Nsight screenshot]

From observing the backtrace, the pinning thread likely holds the Python GIL while invoking a tensor destructor on this line:

It looks to me like the tensors on the data queue go out of scope here. These tensors are passed to the pinning thread from one of the worker processes and live in a shared memory region. The tensor destructor spends a long time in the munmap syscall.

To test the theory I made a hack to the torch dataloader that coerces tensor deletion to happen in the worker process rather than the pinning thread. The bubbles went away.

Here's the GPU activity on all eight A100 streams before (bubbles circled):
[Nsight screenshot]

Here's after the hack (no observable bubbles):
[Nsight screenshot]

As an aside, if you further replace the torch AdamW optimizer with Nvidia's Apex version, the timeline is clean:
[Nsight screenshot]

Can we ensure there is no thread contention between the pinning thread and the main threads? Is there any reason the worker process can't destruct the tensors on the data queue instead of the pinning thread?

Versions

Collecting environment information...
PyTorch version: 1.12.0a0+2c916ef
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-1024-gcp-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
GPU 4: A100-SXM4-40GB
GPU 5: A100-SXM4-40GB
GPU 6: A100-SXM4-40GB
GPU 7: A100-SXM4-40GB

Nvidia driver version: 450.172.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.12.0a0+2c916ef
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchtext==0.12.0a0
[pip3] torchvision==0.13.0a0
[conda] magma-cuda110 2.5.2 5 local
[conda] mkl 2019.5 281 conda-forge
[conda] mkl-include 2019.5 281 conda-forge
[conda] numpy 1.22.3 py38h05e7239_0 conda-forge
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] torch 1.12.0a0+2c916ef pypi_0 pypi
[conda] torch-tensorrt 1.1.0a0 pypi_0 pypi
[conda] torchtext 0.12.0a0 pypi_0 pypi
[conda] torchvision 0.13.0a0 pypi_0 pypi

cc @VitalyFedyunin @ngimel @ssnl @ejguan @NivekT

@samdow samdow added module: performance Issues related to performance, either of kernel code or framework glue module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 10, 2022
@ngimel
Collaborator

ngimel commented May 10, 2022

cc @ptrblck

@suffiank
Author

@VitalyFedyunin @ptrblck Any thoughts or plans? I suspect this thread contention affects other workloads as well. I recall a past performance drop for HuggingFace BERT on a single V100 that was also due to data loader thread contention.

Another potential fix, apart from having the data worker destruct the tensors on the data queue, might be dropping the Python GIL during tensor destruction, though I'm not sure how safe the latter is.

@suffiank
Author

suffiank commented Jun 8, 2022

@ngimel Is anyone interested in fixing this?

@ngimel ngimel added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label Jun 13, 2022
@ngimel
Collaborator

ngimel commented Jun 13, 2022

We will accept a PR fixing this

@ngimel
Collaborator

ngimel commented Jun 13, 2022

cc @ejguan

@suffiank
Author

Thanks @ngimel. I will see if I can submit a simple but viable fix in which the data worker does the tensor destruction. It might be a week before I get time to work on it.

@zhengwy888
Contributor

Curious if anyone has an idea on how to get around tensor destruction blocking the GIL? We found the GIL was held for 200 ms during each iteration because a couple of big tensors had to be munmap'd. Since I am not familiar with PyTorch internals, these ideas might be really bad:

  1. Create another process for the purpose of tensor destruction.
  2. Transfer ownership of the tensors to a C++ Tensor object, so that no GIL is held during destruction. I am not sure if this ownership is even transferable, though.
  3. Use TorchScript to destroy the storage and hope that TorchScript doesn't hold the GIL.

@ejguan
Contributor

ejguan commented Aug 16, 2022

IIUC, this question is more about how to delete a Tensor without blocking the main process. Do you have any recommendations on this topic? @ezyang

@ezyang
Contributor

ezyang commented Aug 16, 2022

If it's fine for the thread doing the munmap to block on it, then the fix is very simple: just release the GIL before doing the munmap. Does anyone know which munmap is the relevant one from the traces?

If we need to arrange for a different thread to actually do the munmap, that's a little more complicated, but still doable. In any case the lesson is clear: don't hold a lock when doing expensive operations.
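
A minimal sketch of the first option, assuming the caller currently holds the GIL and that nothing in the guarded region touches Python state (the helper name and signature below are hypothetical, not PyTorch code):

#include <pybind11/pybind11.h>
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: unmap a large shared-memory block with the GIL
// released, so other Python threads (e.g. the training loop) keep running
// while the slow munmap syscall completes.
void unmap_shared_block_without_gil(void* base, std::size_t length) {
  pybind11::gil_scoped_release no_gil;  // GIL released until end of scope
  munmap(base, length);                 // expensive syscall; needs no GIL
}

The guard reacquires the GIL when it goes out of scope, so callers see no behavioral change beyond the lock not being held during the syscall.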

@igozali

igozali commented Aug 16, 2022

Happened to be debugging this, seems like it might be this one?

#0  0x00001555552aaadb in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x000015550857936c in at::MapAllocator::close (this=0x154df15b5a80) at ../aten/src/ATen/MapAllocator.cpp:411
#2  0x00001555085795bf in at::MapAllocator::~MapAllocator (this=0x154df15b5a80, __in_chrg=<optimized out>) at ../aten/src/ATen/MapAllocator.cpp:609
#3  0x000015550857960d in at::MapAllocator::~MapAllocator (this=0x154df15b5a80, __in_chrg=<optimized out>) at ../aten/src/ATen/MapAllocator.cpp:607
#4  0x000015550f758c7c in std::unique_ptr<void, void (*)(void*)>::reset (__p=<optimized out>, this=0x154df15b5ce0) at /usr/include/c++/9/bits/move.h:182
#5  std::unique_ptr<void, void (*)(void*)>::operator=(decltype(nullptr)) (this=0x154df15b5ce0) at /usr/include/c++/9/bits/unique_ptr.h:336
#6  c10::detail::UniqueVoidPtr::clear (this=0x154df15b5cd8) at ../c10/util/UniqueVoidPtr.h:54
#7  c10::DataPtr::clear (this=0x154df15b5cd8) at ../c10/core/Allocator.h:36
#8  c10::StorageImpl::release_resources (this=0x154df15b5cc0) at ../c10/core/StorageImpl.h:89
#9  0x00001554eb73ed25 in c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::reset_ (this=<synthetic pointer>)
    at ../c10/util/Exception.h:437
#10 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::~intrusive_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>)
    at ../c10/util/intrusive_ptr.h:366
#11 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::operator=<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&) & (rhs=..., this=0x154df15b5d98)
    at ../c10/util/intrusive_ptr.h:378
#12 c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::operator=(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&) & (rhs=..., this=0x154df15b5d98) at ../c10/util/intrusive_ptr.h:370
#13 c10::Storage::operator= (this=0x154df15b5d98) at ../c10/core/Storage.h:7
#14 c10::TensorImpl::release_resources (this=0x154df15b5d80) at ../c10/core/TensorImpl.cpp:372
#15 0x000015550f654439 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_ (this=0x1550c59ef918) at /usr/include/c++/9/bits/atomic_base.h:549
#16 0x000015550f95f2b3 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr (this=0x1550c59ef918, __in_chrg=<optimized out>)
    at ../c10/util/intrusive_ptr.h:365
#17 at::TensorBase::~TensorBase (this=0x1550c59ef918, __in_chrg=<optimized out>) at ../aten/src/ATen/core/TensorBase.h:77
#18 at::Tensor::~Tensor (this=0x1550c59ef918, __in_chrg=<optimized out>) at aten/src/ATen/core/TensorBody.h:85
#19 c10::MaybeOwned<at::Tensor>::operator= (rhs=..., this=0x1550c59ef910) at ../c10/util/MaybeOwned.h:145
#20 c10::MaybeOwned<at::Tensor>::operator= (rhs=..., this=0x1550c59ef910) at ../c10/util/MaybeOwned.h:138
#21 THPVariable_clear (self=self@entry=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:292
#22 0x000015550f95f635 in THPVariable_subclass_dealloc (self=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:1383
#23 THPVariable_subclass_dealloc (self=0x1550c59ef900) at ../torch/csrc/autograd/python_variable.cpp:1296
#24 0x000055555566a888 in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.8.13/Objects/object.c:2207
#25 _Py_DECREF (op=<optimized out>, lineno=541, filename=<synthetic pointer>) at /usr/local/src/conda/python-3.8.13/Include/object.h:478
#26 _Py_XDECREF (op=<optimized out>) at /usr/local/src/conda/python-3.8.13/Include/object.h:541
...

@ezyang
Contributor

ezyang commented Aug 17, 2022

hmm, I wonder if we can safely release gil in tp_clear...

@igozali

igozali commented Aug 17, 2022

THPVariable_clear seems like a decent place to put it, if there's no concern that someone would accidentally add code that touches Python objects around there. Based on discussions with some other folks, a good place to slap in a pybind11::gil_scoped_release no_gil; seems to be right before this line?

At the same time, I'm wondering whether it would be safer to minimize the scope of the GIL release to the munmap call, e.g. by putting the GIL release in at::MapAllocator::close or at::MapAllocator::~MapAllocator?

@ezyang
Contributor

ezyang commented Aug 17, 2022

Give it a try. I'm not sure it's entirely kosher but maybe you can play around with it.

A slightly safer thing to do is to move the Variable out of the location (setting it to empty while the GIL is held), and THEN destruct it without the GIL.
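
A rough sketch of that pattern, with hypothetical names (not the code from the eventual PR); it assumes the caller holds the GIL and that dropping the last tensor reference touches no Python objects:

#include <pybind11/pybind11.h>
#include <ATen/ATen.h>
#include <utility>

// Empty the slot while the GIL is still held, then let the potentially
// expensive destruction (refcount drop -> StorageImpl release -> munmap)
// run with the GIL released.
void clear_tensor_slot_without_gil(at::Tensor& slot) {
  at::Tensor doomed = std::move(slot);  // GIL held: slot no longer owns it
  slot = at::Tensor();                  // leave an explicitly empty tensor
  pybind11::gil_scoped_release no_gil;  // GIL released for the slow part
  doomed = at::Tensor();                // last reference dropped here
}

Because no_gil reacquires the GIL at the end of the scope, the remaining destruction of the now-empty doomed local happens with the GIL held again, which is cheap and harmless.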

@albanD
Collaborator

albanD commented Aug 17, 2022

hmm, I wonder if we can safely release gil in tp_clear...

Sam says that it is ok to locally release the gil there.

pytorchmergebot pushed a commit that referenced this issue Aug 17, 2022
Fixes #77139, where deallocating large tensors with munmap takes a significant amount of time while holding the GIL. This causes the pin_memory thread to interfere with the main thread = performance sadness.

Thanks @igozali @zhengwy888 @colesbury as well.
Pull Request resolved: #83623
Approved by: https://github.com/albanD
pytorchmergebot pushed a commit to cchan/pytorch that referenced this issue Aug 18, 2022
@suffiank
Author

Thanks folks. It's nice to see this likely fixed. Sorry this fell off my radar after I wasn't able to focus on it.

facebook-github-bot pushed a commit that referenced this issue Aug 19, 2022
Summary:
Fixes #77139, where deallocating large tensors with munmap takes a significant amount of time while holding the GIL. This causes the pin_memory thread to interfere with the main thread = performance sadness.

Thanks igozali zhengwy888 colesbury as well.

Pull Request resolved: #83623
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/abcf01196cd27805349aa892db847f9a61f52c0e

Reviewed By: atalman

Differential Revision: D38831020

Pulled By: atalman

fbshipit-source-id: ed3d7f20e2872e2aa677d4d36abed7d73e9bbae1
pruthvistony added a commit to ROCm/pytorch that referenced this issue Nov 24, 2022
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2022
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2022
jithunnair-amd added a commit to ROCm/pytorch that referenced this issue Mar 8, 2023
jithunnair-amd added a commit to ROCm/pytorch that referenced this issue Mar 8, 2023
pruthvistony added a commit to ROCm/pytorch that referenced this issue Mar 10, 2023
…ch#77139 (pytorch#83623)""

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.
pruthvistony added a commit to ROCm/pytorch that referenced this issue Mar 13, 2023
…ch#77139 (pytorch#83623)"" (#1199)

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.
pruthvistony added a commit to ROCm/pytorch that referenced this issue May 3, 2023
jataylo pushed a commit to ROCm/pytorch that referenced this issue Jul 12, 2023
…ch#77139 (pytorch#83623)"" (#1199)

- This reverts commit 8f73708.
- SWDEV-386723 - superbench perf regression.