
Pytorch crashes while training a simple MNIST classification problem #37336

Closed
Niohori opened this issue Apr 27, 2020 · 15 comments
Labels
module: autograd Related to torch.autograd, and the autograd engine in general module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't module: windows Windows support for PyTorch triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@Niohori

Niohori commented Apr 27, 2020

🐛 Bug

Attempting to train a basic MNIST classifier on the GPU results in a hard crash:
"Process finished with exit code -1073741819 (0xC0000005)"
Running inference with the model works without any problem.
Debugging ends in autograd/__init__.py, line 98, at Variable._execution_engine.run_backward(args), where the process crashes.
Tracing the memory of the RTX 2080 Ti shows that the GPU memory is not accessed during training (whereas it is during inference).

To Reproduce

To rule out an error in my own code, I used the example from https://github.com/pytorch/examples/tree/master/mnist
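
For reference, the crash always happens in the backward pass. A minimal sketch of the training step involved (condensed from the linked example; the two-layer Net and the random batch standing in for a DataLoader batch are placeholders, not the original code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.flatten(1)))
        return F.log_softmax(self.fc2(x), dim=1)

device = torch.device("cuda")
model = Net().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(64, 1, 28, 28, device=device)      # stand-in for one DataLoader batch
target = torch.randint(0, 10, (64,), device=device)

output = model(data)
loss = F.nll_loss(output, target)
loss.backward()        # the process dies here with exit code 0xC0000005
optimizer.step()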

Expected behavior

No crashes

Environment

PyTorch version: 1.5.0+cu92
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: version 3.16.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.4.3
[pip3] numpy==1.16.2
[pip3] torch==1.5.0+cu92
[pip3] torchvision==0.6.0+cu92
[conda] Could not collect

PS C:\Users\Bufo> nvidia-smi.exe
Mon Apr 27 07:56:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87       Driver Version: 445.87       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  WDDM | 00000000:1C:00.0  On |                  N/A |
|  0%   36C    P8    11W / 300W |    469MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

Additional context

cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @peterjc123 @nbcsm @guyang3532

@ptrblck
Collaborator

ptrblck commented Apr 27, 2020

The error code points to a memory access violation.
Since you are using an RTX GPU, could you install the binaries built with CUDA 10.1 or 10.2 and rerun the code, please?
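
One quick way to confirm which CUDA variant of the wheel is actually installed (independent of the CUDA toolkit present on the machine) is a check along these lines, using standard torch attributes:

import torch

print(torch.__version__)                 # e.g. 1.5.0+cu92 means the CUDA 9.2 build of the wheel
print(torch.version.cuda)                # CUDA version the binaries were compiled against
print(torch.cuda.is_available())
print(torch.backends.cudnn.version())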

@Niohori
Author

Niohori commented Apr 27, 2020

Thanks for your answer. Before giving this a try, how do you explain that the same MNIST classification in LibTorch runs without problems in the same environment? I'd like to understand what is happening before making changes to the system. Thanks.

@Niohori
Author

Niohori commented Apr 27, 2020

Just reinstalled a previous version of CUDA:
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.105
Still crashing with the same error code. Definitely something else is going on ...

@mrshenli mrshenli added module: autograd Related to torch.autograd, and the autograd engine in general module: regression It used to work, and now it doesn't module: crash Problem manifests as a hard crash, as opposed to a RuntimeError triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module high priority labels Apr 27, 2020
@peterjc123
Collaborator

Have you copied all the libtorch DLLs to the location of the target executable?

@ezyang ezyang added the module: windows Windows support for PyTorch label Apr 27, 2020
@ezyang
Contributor

ezyang commented Apr 27, 2020

Thanks for your answer. Before giving this a try, how do you explain that the same MNIST classification in LibTorch runs without problems in the same environment?

That is inference only, right? You're exercising more code in your crash case, since you're hitting the autograd engine.

@Niohori
Author

Niohori commented Apr 27, 2020

To ezyang: Thanks, in LibTorch (the C++ version of the MNIST classification) both training (which hits the autograd engine) and inference work fine. The autograd issue occurs only in Python.

@Niohori
Author

Niohori commented Apr 27, 2020

To peterjc123: Thanks for the suggestion. All static and dynamic libs seemed to be present. To be sure, I deleted the torch and torchvision directories in venv\Lib\site-packages\, reinstalled the latest version via pip, but the problem stays the same.

What I do not understand is that in the Python environment some DLLs are labelled with a '92' tag (version 9.2, I suppose), while in the LibTorch distribution I use in C++ (which works fine) the same DLLs are tagged with '10'. I thought the Python version relied on the LibTorch libraries, but perhaps I'm wrong.

To clarify, the LibTorch version I use in C++ is different from the version in Python. In Python it is the latest stable version downloaded from the PyTorch site. In C++ I first tried the latest version available on the PyTorch site, but ran into problems there as well because the program was not able to detect the GPU; downgrading to an older 1.4.0 version (which I got from someone) made everything work fine. I tried to copy these DLLs into the Python installation but did not manage to make it work.

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

@Niohori

What I do not understand is that in the Python environment some DLLs are labelled with a '92' tag (version 9.2, I suppose), while in the LibTorch distribution I use in C++ (which works fine) the same DLLs are tagged with '10'.

Because you installed the CUDA 9.2 variant of the package.

I thought the Python version relied on the LibTorch libraries, but perhaps I'm wrong.

Well, this depends on which LibTorch you are referring to. PyTorch relies on the C++ extension _C.pyd, which in turn relies on the DLLs in its lib directory (e.g. venv\Lib\site-packages\torch\lib). If you mean the one you downloaded yourself, then the answer is no.
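
If in doubt, the DLLs shipped by the installed wheel can be listed directly; a small sketch (the path is derived from torch.__file__ rather than hard-coding the venv location):

import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for name in sorted(os.listdir(lib_dir)):
    if name.lower().endswith(".dll"):
        print(name)   # names carrying a '92' tag (as observed above) indicate the CUDA 9.2 variant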

In C++ I first tried the latest version available on the PyTorch site, but ran into problems there as well because the program was not able to detect the GPU

You need an additional linker flag -INCLUDE:?warp_size@cuda@at@@YAHXZ. Also make sure you have the latest GPU driver.

I tried to copy these DLLs into the Python installation but did not manage to make it work.

Well, you have CUDA 10.2 installed locally, so you should use the CUDA 10.2 variant of LibTorch. Otherwise, you end up with different CUDA variants in one executable, which simply won't work.

Overall suggestion:

  1. Replace cu92 with cu102 variant for both LibTorch and PyTorch

    PyTorch:

    pip uninstall torch
    pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html

    LibTorch:

    Download here for C++ (Release version):
    https://download.pytorch.org/libtorch/cu102/libtorch-win-shared-with-deps-1.5.0.zip
    
    Download here for C++ (Debug version):
    https://download.pytorch.org/libtorch/cu102/libtorch-win-shared-with-deps-debug-1.5.0.zip
    
  2. Make sure your GPU driver is installed and up to date

  3. Build the project with CMake rather than setting it up manually. Otherwise, you'll need to pass -INCLUDE:?warp_size@cuda@at@@YAHXZ as an additional linker flag in 1.5.0.

    cmake -DCMAKE_PREFIX_PATH="[absolute path to libtorch]" ..
    cmake --build . --config [Release or Debug] 
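
After reinstalling, a small sanity check that exercises the autograd engine on the GPU should run to completion without the 0xC0000005 exit code (just a sketch):

import torch

assert torch.cuda.is_available()
print(torch.__version__, torch.version.cuda)   # expect a cu102 build, e.g. 1.5.0 / 10.2

x = torch.randn(8, 4, device="cuda", requires_grad=True)
loss = (x * x).sum()
loss.backward()                                # goes through Variable._execution_engine.run_backward
print(x.grad.abs().sum().item())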
    

@Niohori
Author

Niohori commented Apr 28, 2020

@peterjc123: I followed your steps, without success. With the pip install command you gave me (which is the same one I used initially to install PyTorch) the same problem occurs; looking in the site-packages/torch/lib directory, some DLLs were still tagged with 92. So I uninstalled torch again and tried:
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html

Now everything works fine; training both the MNIST example (as a test) and my own project based on a ResNet architecture is no longer a problem.

So for me this issue can be closed, but I am wondering whether there is an issue with the 'stable' command on the site, as it clearly installs torch+cu92.

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

@Niohori So would you please try pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html?

@peterjc123
Collaborator

Reproduced locally.

C:\Users\peter>pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.5.0
  Downloading https://download.pytorch.org/whl/cu102/torch-1.5.0-cp37-cp37m-win_amd64.whl (899.1MB)
     |                                | 61kB 75kB/s eta 3:18:16
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\Users\peter>pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.0
  Downloading https://download.pytorch.org/whl/cu92/torch-1.5.0%2Bcu92-cp37-cp37m-win_amd64.whl (693.1MB)
     |                                | 204kB 211kB/s eta 0:54:42
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Would you please fix the install commands? cc @seemethere @soumith

@seemethere
Member

Interesting, I wonder why it would actually do that.

Is there some special rule about 3 equals signs vs 2 equals signs?

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

It seems that == (version matching) ignores the local version label (the +cuXX part) when the specifier itself has none, so ==1.5.0 matches every 1.5.0+<variant> build, while === is an arbitrary (strict string) equality match. https://www.python.org/dev/peps/pep-0440/#arbitrary-equality
And since local version labels compare as strings, 1.5.0+cu92 > 1.5.0+cu102 > 1.5.0+cu101 > 1.5.0+cpu > 1.5.0, so the cu92 wheel gets picked when specifying ==1.5.0.
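
This ordering can be reproduced with the packaging module (the library pip uses internally for version handling); a quick illustration, assuming packaging is importable:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

versions = ["1.5.0+cu92", "1.5.0+cu102", "1.5.0+cu101", "1.5.0+cpu", "1.5.0"]

# Local version labels compare as strings, so +cu92 sorts above +cu102:
print(sorted(versions, key=Version, reverse=True))

# ==1.5.0 ignores the local label, so every variant above is a candidate
# and pip picks the highest-sorting one (the cu92 wheel):
print(list(SpecifierSet("==1.5.0").filter(versions)))

# ===1.5.0 is a strict string match and only matches the plain "1.5.0" entry:
print(list(SpecifierSet("===1.5.0").filter(versions)))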

@soumith
Member

soumith commented Apr 28, 2020

reverted to three ===: pytorch/pytorch.github.io#372

@Niohori
Author

Niohori commented Apr 28, 2020

@Niohori So would you please try pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html?

OK, thanks for the support: it works.
