Small CPU model forward pass extremely slow #13757
Comments
Thanks for reporting this, especially the repro code. The issue is that the MKL library (which we use for matrix multiplication) is creating and destroying threads for every call to …. For now, you can disable this by setting the environment variable …. @soumith, perhaps we should set ….
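(The exact variable named in this comment was not preserved. As a hedged sketch of this style of workaround: MKL_DYNAMIC and OMP_DYNAMIC are the standard MKL/OpenMP settings for dynamic thread scaling, shown here as plausible stand-ins rather than as the variable the comment actually named.)

```python
# Hedged sketch: disable dynamic thread scaling via environment variables.
# MKL_DYNAMIC / OMP_DYNAMIC are standard MKL / OpenMP settings; the exact
# variable the original comment recommended was lost in extraction.
# They must be set before torch (and hence MKL/OpenMP) is first imported.
import os
os.environ["MKL_DYNAMIC"] = "FALSE"   # stop MKL from resizing its thread pool
os.environ["OMP_DYNAMIC"] = "false"   # stop OpenMP's dynamic thread adjustment

import torch  # import only after the environment is configured
```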
Also, this looks like a problem particular to GNU's OpenMP; Clang/LLVM's implementation doesn't seem to suffer from it.
This was fixed in #13868.
@colesbury, even after …, it was observed that disabling dynamic scaling of OpenMP threads fixed the issue, so that workaround was used. However, the reason why unused OpenMP threads were exiting seems to be unknown, and might be of interest to you. In PyTorch 1.4.0, the threads of the OpenMP thread pool were created only when actually needed. For instance, only one additional OpenMP thread was created (so, a total of 2 threads in the OpenMP thread pool) for the following snippet of code:
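(The original snippet was not preserved; below is a hypothetical reconstruction, assuming a small matrix multiply with the intra-op thread count capped at 2.)

```python
# Hypothetical reconstruction -- the original snippet was lost in extraction.
# Cap the intra-op pool at 2 threads and run one parallel op, so that on
# PyTorch 1.4.0 only a single extra OpenMP thread needed to be created.
import torch

torch.set_num_threads(2)    # request an intra-op thread pool of size 2
x = torch.randn(1024, 1024)
y = torch.mm(x, x)          # first parallel op: spawns just 1 additional thread
```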
On the other hand, the current implementation creates the threads of the OpenMP thread pool when the first parallel operation is performed, so it would create threads equal to the number of physical cores but use only 2 of them for this snippet. Do you happen to have any insights into why MKL could be causing an issue in #32008, and why disabling auto-scaling of OpenMP threads would've fixed it? I tried to reproduce the issue but couldn't. Thank you!
@imaginary-person, if this is still affecting master, I suggest putting this in a new issue.
Thanks for your response, @ezyang! @peterbell10 reported today in #52815 that if dynamic scaling of OpenMP threads (not assigning wasteful work to some threads of the OpenMP thread pool) is enabled in …. I'll try to get access to an AMD machine so that I can figure out why that's happening, and I will submit a GNU bug report or patch and, if necessary, an issue here. Thank you!
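(As an aside, not from the thread itself: threading behavior like the above can be inspected from Python. This diagnostic sketch assumes a recent PyTorch release, where torch.__config__.parallel_info() is available.)

```python
# Diagnostic sketch: inspect PyTorch's intra-op threading configuration.
import torch

print(torch.get_num_threads())           # current intra-op thread count
print(torch.__config__.parallel_info())  # OpenMP/MKL build and thread settings
```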
Issue description
I have an 80x256x256x10 FC network I'm using for policy gradient, but when I do a forward pass on it, it takes 70-100 ms (!!) to execute. Puzzlingly, after some condition that I can't identify, the forward pass suddenly speeds up permanently to 1-3 ms (I couldn't get a reproducible example of that, though). Exiting the code during execution seems to imply that it's spending a lot of time in torch.addmm(bias, input, weight.t()). Adding torch.set_num_threads(1) seems to fix it, but I don't know why. Sorry if this is a duplicate (I suspect it may be), but I looked and couldn't find anything in the issues.
Code example
Here's a (sort of) minimal example that reproduces the slowness:
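(The original example was not preserved. Below is a hypothetical reconstruction based on the description above: an 80 -> 256 -> 256 -> 10 fully connected network timed over repeated forward passes; the activations, batch size, and timing loop are guesses.)

```python
# Hypothetical reconstruction -- the original repro was lost in extraction.
import time
import torch
import torch.nn as nn

# 80 -> 256 -> 256 -> 10 fully connected policy network, as described above
net = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(1, 80)
for _ in range(20):
    start = time.time()
    net(x)  # single small forward pass on CPU
    print("forward pass took %.1f ms" % ((time.time() - start) * 1e3))

# torch.set_num_threads(1)  # per the report, this works around the slowdown
```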
And here's an IPython stack trace from randomly exiting that code while it's running:
System Info
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: TITAN X (Pascal)
Nvidia driver version: 390.77
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy (1.15.3)
[pip3] torch (1.0.0a0+c029c83)
[pip3] torchvision (0.2.1)
[conda] pytorch 0.4.0 py36_cuda8.0.61_cudnn7.1.2_1 pytorch
[conda] pytorch-cpu 0.4.1 py36_cpu_1 pytorch
[conda] torchfile 0.1.0
[conda] torchvision-cpu 0.2.1 py36_1 pytorch