Pytorch crashes while training a simple MNIST classification problem #37336
Comments
The error code points to a memory access violation. |
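For context, the exit code in the report is the signed 32-bit rendering of a Windows NTSTATUS value. A minimal sketch of the conversion follows; the helper name `to_ntstatus` is hypothetical and only illustrates the arithmetic.

```python
def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit process exit code as an unsigned hex NTSTATUS string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# Exit code taken from the bug report.
print(to_ntstatus(-1073741819))  # 0xC0000005
```

0xC0000005 is STATUS_ACCESS_VIOLATION, which matches the value shown in parentheses in the report.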
Thanks for your answer. Before I try this, how do you explain that the same MNIST classification in LibTorch runs without problems in the same environment? I would like to understand what is happening before making changes to the system. Thx. |
Just reinstalled a previous version of CUDA: |
Have you copied all the libtorch DLLs to the location of the target executable? |
That is inference only, right? You're exercising more code in your crash case, since you're hitting the autograd engine |
To ezyang: Thx, in libtorch (Cpp version of the MNIST classification) both training (hitting the autograd engine) and inference do work fine. The autograd issue occurs only in Python. |
To peterjc123: Thx for the suggestion. All static and dynamic libs seem to be present. To be sure, I deleted the torch and torchvision directories in venv\Lib\site-packages\ and reinstalled the latest version via pip, but the result is the same. What I do not understand is that in the Python environment some DLLs are labelled with a "92" tag (version 9.2, I suppose), while in the libtorch distribution I use in Cpp (which works fine) the same DLLs are tagged with "10". I thought the Python version relied on the libtorch libraries, but perhaps I'm wrong. To clarify, the libtorch version I use in Cpp is different from the version in Python. In Python it is the latest stable version downloaded from the PyTorch site. In Cpp I first tried the latest version available on the PyTorch site, but also ran into problems because the program was not able to detect the GPU; after downgrading to an older 1.4.0 version (which I got from someone), everything works fine. I tried to copy these DLLs into the Python version but did not manage to make it work. |
Because you installed the CUDA 9.2 variant of the package.
Well, this depends on what
You need an additional linker flag
Well you have CUDA 10.2 installed locally, so you should use the CUDA 10.2 variant of LibTorch. Otherwise, you have different variants of CUDA in one executable, which absolutely won't work. Overall suggestion:
|
@peterjc123: I followed your steps, without success. With the pip install link you gave me (the same one I used initially to install PyTorch) the same problem occurs, and in the site-packages/torch/lib directory some DLLs were still tagged with 92. So I uninstalled torch again and tried with: Now everything works fine; training both the MNIST example (as a test) and my own project based on a ResNet architecture is no longer a problem. So for me this item can be closed, but I wonder whether there is an issue with the 'stable' link on the site, as it clearly installs torch+cu92. |
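One way to see which CUDA variant a wheel was built for is to read the local label of its version string, such as the `1.5.0+cu92` reported in this issue. The sketch below is only illustrative: the helper name `cuda_tag` is hypothetical, and it assumes the label follows the `+cuNN` pattern seen in this thread.

```python
import re
from typing import Optional


def cuda_tag(version: str) -> Optional[str]:
    """Return the CUDA tag from a version's local label (e.g. 'cu92'), or None if absent."""
    match = re.search(r"\+(cu\d+)$", version)
    return match.group(1) if match else None


# Version strings as they appear in this thread.
print(cuda_tag("1.5.0+cu92"))  # cu92
print(cuda_tag("1.5.0"))       # None
```

On an actual install, the string to pass would be `torch.__version__`; `torch.version.cuda` reports the CUDA version the build was compiled against.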
@Niohori So would you please try |
Reproduced locally.
C:\Users\peter>pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.5.0
Downloading https://download.pytorch.org/whl/cu102/torch-1.5.0-cp37-cp37m-win_amd64.whl (899.1MB)
| | 61kB 75kB/s eta 3:18:16
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
C:\Users\peter>pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.0
Downloading https://download.pytorch.org/whl/cu92/torch-1.5.0%2Bcu92-cp37-cp37m-win_amd64.whl (693.1MB)
| | 204kB 211kB/s eta 0:54:42
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
Would you please fix the install commands? cc @seemethere @soumith |
Interesting, I wonder why it would actually do that. Is there some special rule about 3 equals signs vs 2 equals signs? |
It seems two equal signs refer to a prefix match and three equal signs refer to a strict match. https://www.python.org/dev/peps/pep-0440/#arbitrary-equality |
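To make the difference concrete: under PEP 440, a `==` specifier without a local label ignores the candidate's local label (the `+cu92` part), while `===` is a plain string comparison. The sketch below is a deliberately simplified model of those two rules, not pip's implementation; the function names are hypothetical, and version normalization and wildcards are ignored.

```python
def matches_version(candidate: str, spec: str) -> bool:
    """Simplified `==` rule: drop the candidate's local label when the spec has none."""
    if "+" not in spec:
        candidate = candidate.split("+", 1)[0]
    return candidate == spec


def matches_arbitrary(candidate: str, spec: str) -> bool:
    """Simplified `===` rule: case-insensitive string equality, nothing else."""
    return candidate.lower() == spec.lower()


# The two wheel versions seen in the pip logs of this thread.
for candidate in ("1.5.0", "1.5.0+cu92"):
    print(candidate, matches_version(candidate, "1.5.0"), matches_arbitrary(candidate, "1.5.0"))
# 1.5.0 True True
# 1.5.0+cu92 True False
```

Under this model `torch==1.5.0` accepts both wheels, so the `+cu92` build remains a candidate, whereas `torch===1.5.0` accepts only the wheel whose version is exactly `1.5.0`. This is consistent with the two download URLs in the logs.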
reverted to three |
OK, thx for the support: it works |
🐛 Bug
Attempting to train a basic MNIST classifier on a GPU results in a runtime exception:
"Process finished with exit code -1073741819 (0xC0000005)"
Making inferences with the model runs without any problem.
Debugging ends in autograd/__init__.py, line 98, at Variable._execution_engine.run_backward(args), where the system crashes.
Monitoring the memory of the RTX 2080 Ti shows that GPU memory is not accessed during training, whereas it is during inference.
To Reproduce
To be sure the error did not result from my own code, I used the example from https://github.com/pytorch/examples/tree/master/mnist
Expected behavior
No crashes
Environment
PyTorch version: 1.5.0+cu92
Is debug build: No
CUDA used to build PyTorch: 9.2
OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: version 3.16.0
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.4.3
[pip3] numpy==1.16.2
[pip3] torch==1.5.0+cu92
[pip3] torchvision==0.6.0+cu92
[conda] Could not collect
PS C:\Users\Bufo> nvidia-smi.exe
Mon Apr 27 07:56:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87 Driver Version: 445.87 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===================================================|
| 0 GeForce RTX 208... WDDM | 00000000:1C:00.0 On | N/A |
| 0% 36C P8 11W / 300W | 469MiB / 11264MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
Additional context
cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @peterjc123 @nbcsm @guyang3532