This repository has been archived by the owner on Jul 18, 2022. It is now read-only.

Error "illegal memory access was encountered" during U-Net training #1

Closed
davidkh1 opened this issue Aug 1, 2016 · 15 comments

Comments

@davidkh1

davidkh1 commented Aug 1, 2016

Thanks so much for sharing your code! I'm trying to run it from the start, but have a problem during the training phase. I'd appreciate your support in finding the root cause.

The command I run to train the U-Net (paths are adjusted to the defaults):
$ th main.lua
produces this error log:

Setting up data loader using data/train.h5  
Data loader setup done! 
...
Epoch : 1, Learning Rate : 1.00000  
THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCStorage.c:147

Environment: Ubuntu 14.04, Titan X, CUDA 7.5, cuDNN v.5

Possible root causes:

  1. I tried to temporarily remove the SpatialMaxPooling module, following this discussion: https://groups.google.com/forum/m/#!msg/torch7/Ru-I6vP2ql0/s2vOsKoVBgAJ
     Eventually I simplified the NN to contain no modules at all, but the problem persists, so SpatialMaxPooling is not the culprit.
  2. I think the dataset I created in HDF5 format may have some problems. I'll try to check its correctness. If you know how to check correctness, please advise; a rough sketch of what I plan to try is below the list.
  3. I recently switched to cuDNN v.5. Could this version be problematic?
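For reference, a minimal sketch of how I might inspect the HDF5 file from Torch (this assumes the torch-hdf5 package; '/some_dataset' is just a placeholder for one of the actual dataset names in the file):

require 'hdf5'

-- Open the training file read-only (path matches the default in main.lua)
local f = hdf5.open('data/train.h5', 'r')

-- '/some_dataset' is a placeholder: substitute one of the dataset names
-- that hdfview/h5stat shows for this file
local x = f:read('/some_dataset'):all()

print(x:size())          -- dimensions of the dataset
print(x:min(), x:max())  -- value range, to spot obvious garbage
f:close()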

Thanks!

@chsasank
Contributor

chsasank commented Aug 1, 2016

> I recently switched to cuDNN v.5. Could this version be problematic?

Did you also update the cudnn torch package after upgrading?

@davidkh1
Author

davidkh1 commented Aug 1, 2016

Yes, I even reinstalled Torch from scratch.
I'm going to verify the integrity of the PNG image files, and then the HDF5 file I've created. If there are no problems, I'll downgrade to cuDNN v.4 and reinstall Torch with cudnn.

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I've verified the integrity of the PNG image files with pngcheck; both the train and mask images were OK.
To verify the HDF5 input file I use hdfview and h5stat. Does it look correct?
[hdfview screenshot of train.h5]

$ h5stat train.h5

Filename: train.h5
File information
    # of unique groups: 1
    # of unique datasets: 94
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 94
...
Summary of file space information:
  File metadata: 35888 bytes
  Raw data: 10981488000 bytes
  Unaccounted space: 3888 bytes
Total space: 10981527776 bytes

How much memory does U-Net training use?

@shubhamjain0594
Contributor

With the default configuration (batch size 64) it takes around 6 GB of GPU memory, and around 8 GB of system RAM.
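If it helps, here is a minimal sketch for checking free GPU memory from inside Torch (assuming cutorch is installed; getMemoryUsage reports bytes for the given device):

require 'cutorch'

-- Bytes free / total on the currently selected GPU
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU memory: %.2f / %.2f GB free',
                    free / 1024^3, total / 1024^3))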

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I have enough memory on my GPU (12 GB). I tried a batch size of 1, but got the same problem.

@shubhamjain0594
Contributor

require 'nn'
require 'cunn'

-- Minimal repro: a single SpatialSoftMax layer, moved to the GPU
softOutCalc = nn.Sequential():add(nn.SpatialSoftMax())
softOutCalc = softOutCalc:cuda()

-- Random batch of 8 two-channel 80x80 inputs, also on the GPU
ips = torch.rand(8,2,80,80)
ips = ips:cuda()

-- Forward pass: check whether SpatialSoftMax alone triggers the error
softOutCalc:forward(ips)

Can you check if this piece of code works?

@davidkh1
Author

davidkh1 commented Aug 1, 2016

The same error!
th> softOutCalc:forward(ips)

THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: /home/david/torch/install/share/lua/5.1/torch/Tensor.lua:201: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c:81

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I've run test.sh from the Torch installation, and some tests FAILED.

 33/154 frac2 ........................................................... [PASS]
 34/154 trace ........................................................... [WAIT]{
  input : FloatTensor - size: 487x402
}
 34/154 trace ........................................................... [FAIL]

....

 96/152 SpatialDilatedConvolution_backward_single ....................... [FAIL]

Googling for this.
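In case it narrows things down, a minimal sketch for re-running just the failing tests from an interactive session (this assumes cutorch.test and nn.testcuda accept individual test names, which may vary by version):

require 'cutorch'
require 'cunn'

-- Re-run only the two tests that failed in test.sh
cutorch.test('trace')
nn.testcuda('SpatialDilatedConvolution_backward_single')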

@davidkh1
Author

davidkh1 commented Aug 1, 2016

These 2 failed tests seem unimportant.

@shubhamjain0594
Contributor

torch/cunn#292 - I had a similar problem. The issue is with the installation of Torch and cunn and their libraries. It got solved when we upgraded all the drivers. You can reopen that issue and let's see if we get any support.

@davidkh1
Author

davidkh1 commented Aug 1, 2016

Could you tell me which driver version you use, please?

@preethamsp

preethamsp commented Aug 2, 2016

Our setup: Ubuntu 14.04, TITAN X, CUDA 7.5, cuDNN v5, and NVIDIA driver version 361.45.11.

@davidkh1
Author

davidkh1 commented Aug 2, 2016

I use exactly the same setup, except an older driver (v352.93). Going to update. Thanks for your significant assistance in finding the root cause of the problem. Looking forward to running the U-Net training.

chsasank added a commit that referenced this issue Aug 2, 2016
@chsasank
Contributor

chsasank commented Aug 2, 2016

I also got the same error, even with the latest driver. We have reopened the cunn issue.

I have made a quick fix/hack to get it working. Can you check if it works now?

@davidkh1
Author

davidkh1 commented Aug 2, 2016

Your fix helped, even without the Nvidia driver update. Thanks!

@chsasank chsasank closed this as completed Aug 2, 2016