
Pytorch crashes while training a simple MNIST classification problem #37336

Closed
Niohori opened this issue Apr 27, 2020 · 15 comments
Labels
module: autograd Related to torch.autograd, and the autograd engine in general module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't module: windows Windows support for PyTorch triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@Niohori

Niohori commented Apr 27, 2020

🐛 Bug

Attempting to train a basic MNIST classifier on the GPU results in a hard crash:
"Process finished with exit code -1073741819 (0xC0000005)"
Running inference with the model works without any problem.
Debugging ends in autograd/__init__.py, line 98, at Variable._execution_engine.run_backward(args), where the process crashes.
Tracing the memory of the RTX 2080 Ti shows that the GPU memory is not accessed during training (whereas it is during inference).

To Reproduce

To rule out an error in my own code, I used the example from https://github.com/pytorch/examples/tree/master/mnist
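
For reference, the crash always happens in the backward pass. A minimal sketch of the training step involved (condensed from the linked example; the two-layer Net and the random batch standing in for a DataLoader batch are placeholders, not the original code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.flatten(1)))
        return F.log_softmax(self.fc2(x), dim=1)

device = torch.device("cuda")
model = Net().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(64, 1, 28, 28, device=device)      # stand-in for one DataLoader batch
target = torch.randint(0, 10, (64,), device=device)

output = model(data)
loss = F.nll_loss(output, target)
loss.backward()        # the process dies here with exit code 0xC0000005
optimizer.step()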

Expected behavior

No crashes

Environment

PyTorch version: 1.5.0+cu92
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: version 3.16.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.4.3
[pip3] numpy==1.16.2
[pip3] torch==1.5.0+cu92
[pip3] torchvision==0.6.0+cu92
[conda] Could not collect

PS C:\Users\Bufo> nvidia-smi.exe
Mon Apr 27 07:56:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87       Driver Version: 445.87       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  WDDM | 00000000:1C:00.0  On |                  N/A |
|  0%   36C    P8    11W / 300W |    469MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

Additional context

cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @peterjc123 @nbcsm @guyang3532

@ptrblck
Collaborator

ptrblck commented Apr 27, 2020

The error code points to a memory access violation.
Since you are using an RTX GPU, could you install the binaries built with CUDA 10.1 or 10.2 and rerun the code, please?
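
One quick way to confirm which CUDA variant of the wheel is actually installed (independent of the CUDA toolkit present on the machine) is a check along these lines, using standard torch attributes:

import torch

print(torch.__version__)                 # e.g. 1.5.0+cu92 means the CUDA 9.2 build of the wheel
print(torch.version.cuda)                # CUDA version the binaries were compiled against
print(torch.cuda.is_available())
print(torch.backends.cudnn.version())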

@Niohori
Author

Niohori commented Apr 27, 2020

Thanks for your answer. Before giving this a try, how do you explain that the same MNIST classification in LibTorch runs without problems in the same environment? I'd like to understand what is happening before making changes to the system. Thanks.

@Niohori
Author

Niohori commented Apr 27, 2020

Just reinstalled a previous version of CUDA:
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.105
Still crashing with the same error code. Definitely something else is going on ...

@mrshenli mrshenli added module: autograd Related to torch.autograd, and the autograd engine in general module: regression It used to work, and now it doesn't module: crash Problem manifests as a hard crash, as opposed to a RuntimeError triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module high priority labels Apr 27, 2020
@peterjc123
Collaborator

Have you copied all the libtorch DLLs to the location of the target executable?

@ezyang ezyang added the module: windows Windows support for PyTorch label Apr 27, 2020
@ezyang
Contributor

ezyang commented Apr 27, 2020

Thanks for your answer. Before giving this a try, how do you explain that the same MNIST classification in LibTorch runs without problems in the same environment?

That is inference only, right? You're exercising more code in your crash case, since you're hitting the autograd engine.

@Niohori
Author

Niohori commented Apr 27, 2020

To ezyang: Thanks, in LibTorch (the C++ version of the MNIST classification) both training (which hits the autograd engine) and inference work fine. The autograd issue occurs only in Python.

@Niohori
Author

Niohori commented Apr 27, 2020

To peterjc123: Thanks for the suggestion. All static and dynamic libs seemed to be present. To be sure, I deleted the torch and torchvision directories in venv\Lib\site-packages\, reinstalled the latest version via pip, but the problem stays the same.

What I do not understand is that in the Python environment some DLLs are labelled with a '92' tag (version 9.2, I suppose), while in the LibTorch distribution I use in C++ (which works fine) the same DLLs are tagged with '10'. I thought the Python version relied on the LibTorch libraries, but perhaps I'm wrong.

To clarify, the LibTorch version I use in C++ is different from the version in Python. In Python it is the latest stable version downloaded from the PyTorch site. In C++ I first tried the latest version available on the PyTorch site, but ran into problems there as well because the program was not able to detect the GPU; downgrading to an older 1.4.0 version (which I got from someone) made everything work fine. I tried to copy these DLLs into the Python installation but did not manage to make it work.

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

@Niohori

What I do not understand is that in the Python environment some DLLs are labelled with a '92' tag (version 9.2, I suppose), while in the LibTorch distribution I use in C++ (which works fine) the same DLLs are tagged with '10'.

Because you installed the CUDA 9.2 variant of the package.

I thought the Python version relied on the LibTorch libraries, but perhaps I'm wrong.

Well, this depends on which LibTorch you are referring to. PyTorch relies on the C++ extension _C.pyd, which in turn relies on the DLLs in its lib directory (e.g. venv\Lib\site-packages\torch\lib). If you mean the one you downloaded yourself, then the answer is no.
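
If in doubt, the DLLs shipped by the installed wheel can be listed directly; a small sketch (the path is derived from torch.__file__ rather than hard-coding the venv location):

import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for name in sorted(os.listdir(lib_dir)):
    if name.lower().endswith(".dll"):
        print(name)   # names carrying a '92' tag (as observed above) indicate the CUDA 9.2 variant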

In C++ I first tried the latest version available on the PyTorch site, but ran into problems there as well because the program was not able to detect the GPU

You need an additional linker flag -INCLUDE:?warp_size@cuda@at@@YAHXZ. Also make sure you have the latest GPU driver.

I tried to copy these DLLs into the Python installation but did not manage to make it work.

Well, you have CUDA 10.2 installed locally, so you should use the CUDA 10.2 variant of LibTorch. Otherwise, you end up with different CUDA variants in one executable, which simply won't work.

Overall suggestion:

  1. Replace cu92 with cu102 variant for both LibTorch and PyTorch

    PyTorch:

    pip uninstall torch
    pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html

    LibTorch:

    Download here for C++ (Release version):
    https://download.pytorch.org/libtorch/cu102/libtorch-win-shared-with-deps-1.5.0.zip
    
    Download here for C++ (Debug version):
    https://download.pytorch.org/libtorch/cu102/libtorch-win-shared-with-deps-debug-1.5.0.zip
    
  2. Make sure your GPU driver is installed and up to date

  3. Build the project with CMake rather than setting it up manually. Otherwise, you'll need to pass -INCLUDE:?warp_size@cuda@at@@YAHXZ as an additional linker flag in 1.5.0.

    cmake -DCMAKE_PREFIX_PATH="[absolute path to libtorch]" ..
    cmake --build . --config [Release or Debug] 
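
After reinstalling, a small sanity check that exercises the autograd engine on the GPU should run to completion without the 0xC0000005 exit code (just a sketch):

import torch

assert torch.cuda.is_available()
print(torch.__version__, torch.version.cuda)   # expect a cu102 build, e.g. 1.5.0 / 10.2

x = torch.randn(8, 4, device="cuda", requires_grad=True)
loss = (x * x).sum()
loss.backward()                                # goes through Variable._execution_engine.run_backward
print(x.grad.abs().sum().item())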
    

@Niohori
Author

Niohori commented Apr 28, 2020

@peterjc123: I followed your steps, without success. With the pip install command you gave me (which is the same one I used initially to install PyTorch) the same problem occurs; looking in the site-packages/torch/lib directory, some DLLs were still tagged with 92. So I uninstalled torch again and tried:
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html

Now everything works fine; training both the MNIST example (as a test) and my own project based on a ResNet architecture is no longer a problem.

So for me this issue can be closed, but I am wondering whether there is an issue with the 'stable' command on the site, as it clearly installs torch+cu92.

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

@Niohori So would you please try pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html?

@peterjc123
Collaborator

Reproduced locally.

C:\Users\peter>pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.5.0
  Downloading https://download.pytorch.org/whl/cu102/torch-1.5.0-cp37-cp37m-win_amd64.whl (899.1MB)
     |                                | 61kB 75kB/s eta 3:18:16
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\Users\peter>pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.0
  Downloading https://download.pytorch.org/whl/cu92/torch-1.5.0%2Bcu92-cp37-cp37m-win_amd64.whl (693.1MB)
     |                                | 204kB 211kB/s eta 0:54:42
ERROR: Operation cancelled by user
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Would you please fix the install commands? cc @seemethere @soumith

@seemethere
Member

Interesting, I wonder why it would actually do that.

Is there some special rule about 3 equals signs vs 2 equals signs?

@peterjc123
Collaborator

peterjc123 commented Apr 28, 2020

It seems that == (version matching) ignores the local version label (the +cuXX part) when the specifier itself has none, so ==1.5.0 matches every 1.5.0+<variant> build, while === is an arbitrary (strict string) equality match. https://www.python.org/dev/peps/pep-0440/#arbitrary-equality
And since local version labels compare as strings, 1.5.0+cu92 > 1.5.0+cu102 > 1.5.0+cu101 > 1.5.0+cpu > 1.5.0, so the cu92 wheel gets picked when specifying ==1.5.0.
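
This ordering can be reproduced with the packaging module (the library pip uses internally for version handling); a quick illustration, assuming packaging is importable:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

versions = ["1.5.0+cu92", "1.5.0+cu102", "1.5.0+cu101", "1.5.0+cpu", "1.5.0"]

# Local version labels compare as strings, so +cu92 sorts above +cu102:
print(sorted(versions, key=Version, reverse=True))

# ==1.5.0 ignores the local label, so every variant above is a candidate
# and pip picks the highest-sorting one (the cu92 wheel):
print(list(SpecifierSet("==1.5.0").filter(versions)))

# ===1.5.0 is a strict string match and only matches the plain "1.5.0" entry:
print(list(SpecifierSet("===1.5.0").filter(versions)))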

@soumith
Member

soumith commented Apr 28, 2020

reverted to three ===: pytorch/pytorch.github.io#372

@Niohori
Author

Niohori commented Apr 28, 2020

@Niohori So would you please try pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html?

OK, thanks for the support: it works.
