This repository has been archived by the owner on Jul 18, 2022. It is now read-only.

Error "illegal memory access was encountered" during U-Net training #1

Closed
davidkh1 opened this issue Aug 1, 2016 · 15 comments

Comments

@davidkh1

davidkh1 commented Aug 1, 2016

Thanks so much for sharing your code! I'm trying to run it from the start, but have a problem during the training phase. I'd appreciate your support in finding the root cause.

The command I run to train the U-Net (paths are adjusted to the defaults):
$ th main.lua
produces this error log:

Setting up data loader using data/train.h5  
Data loader setup done! 
...
Epoch : 1, Learning Rate : 1.00000  
THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCStorage.c:147

Environment: Ubuntu 14.04, Titan X, CUDA 7.5, cuDNN v.5

Possible root causes:

  1. I tried to temporarily remove the SpatialMaxPooling module, following this discussion: https://groups.google.com/forum/m/#!msg/torch7/Ru-I6vP2ql0/s2vOsKoVBgAJ
     Eventually I simplified the NN to contain no modules at all, but the problem persists, so SpatialMaxPooling is not the culprit.
  2. I think the dataset I created in HDF5 format may have some problems. I'll try to check its correctness. If you know how to check correctness, please advise; a rough sketch of what I plan to try is below the list.
  3. I recently switched to cuDNN v.5. Could this version be problematic?
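For reference, a minimal sketch of how I might inspect the HDF5 file from Torch (this assumes the torch-hdf5 package; '/some_dataset' is just a placeholder for one of the actual dataset names in the file):

require 'hdf5'

-- Open the training file read-only (path matches the default in main.lua)
local f = hdf5.open('data/train.h5', 'r')

-- '/some_dataset' is a placeholder: substitute one of the dataset names
-- that hdfview/h5stat shows for this file
local x = f:read('/some_dataset'):all()

print(x:size())          -- dimensions of the dataset
print(x:min(), x:max())  -- value range, to spot obvious garbage
f:close()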

Thanks!

@chsasank
Contributor

chsasank commented Aug 1, 2016

> I recently switched to cuDNN v.5. Could this version be problematic?

Did you also update the cudnn torch package after upgrading?

@davidkh1
Author

davidkh1 commented Aug 1, 2016

Yes, I even reinstalled Torch from scratch.
I'm going to verify the integrity of the PNG image files, and then the HDF5 file I've created. If there are no problems, I'll downgrade to cuDNN v.4 and reinstall Torch with cudnn.

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I've verified the integrity of the PNG image files with pngcheck; both the train and mask images were OK.
To verify the HDF5 input file I use hdfview and h5stat. Does it look correct?
[hdfview screenshot of train.h5]

$ h5stat train.h5

Filename: train.h5
File information
    # of unique groups: 1
    # of unique datasets: 94
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 94
...
Summary of file space information:
  File metadata: 35888 bytes
  Raw data: 10981488000 bytes
  Unaccounted space: 3888 bytes
Total space: 10981527776 bytes

How much memory does U-Net training use?

@shubhamjain0594
Contributor

With the default configuration (batch size 64) it takes around 6 GB of GPU memory, and around 8 GB of system RAM.
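If it helps, here is a minimal sketch for checking free GPU memory from inside Torch (assuming cutorch is installed; getMemoryUsage reports bytes for the given device):

require 'cutorch'

-- Bytes free / total on the currently selected GPU
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU memory: %.2f / %.2f GB free',
                    free / 1024^3, total / 1024^3))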

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I have enough memory on my GPU (12 GB). I tried a batch size of 1, but got the same problem.

@shubhamjain0594
Contributor

require 'nn'
require 'cunn'

-- Minimal repro: a single SpatialSoftMax layer, moved to the GPU
softOutCalc = nn.Sequential():add(nn.SpatialSoftMax())
softOutCalc = softOutCalc:cuda()

-- Random batch of 8 two-channel 80x80 inputs, also on the GPU
ips = torch.rand(8,2,80,80)
ips = ips:cuda()

-- Forward pass: check whether SpatialSoftMax alone triggers the error
softOutCalc:forward(ips)

Can you check if this piece of code works?

@davidkh1
Author

davidkh1 commented Aug 1, 2016

The same error!
th> softOutCalc:forward(ips)

THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: /home/david/torch/install/share/lua/5.1/torch/Tensor.lua:201: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c:81

@davidkh1
Author

davidkh1 commented Aug 1, 2016

I've run test.sh from the Torch installation, and some tests FAILED.

 33/154 frac2 ........................................................... [PASS]
 34/154 trace ........................................................... [WAIT]{
  input : FloatTensor - size: 487x402
}
 34/154 trace ........................................................... [FAIL]

....

 96/152 SpatialDilatedConvolution_backward_single ....................... [FAIL]

Googling for this.
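In case it narrows things down, a minimal sketch for re-running just the failing tests from an interactive session (this assumes cutorch.test and nn.testcuda accept individual test names, which may vary by version):

require 'cutorch'
require 'cunn'

-- Re-run only the two tests that failed in test.sh
cutorch.test('trace')
nn.testcuda('SpatialDilatedConvolution_backward_single')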

@davidkh1
Author

davidkh1 commented Aug 1, 2016

These 2 failed tests seem unimportant.

@shubhamjain0594
Contributor

torch/cunn#292 - I had a similar problem. The issue is with the installation of Torch and cunn and their libraries. It got solved when we upgraded all the drivers. You can reopen that issue and let's see if we get any support.

@davidkh1
Author

davidkh1 commented Aug 1, 2016

Could you tell me which driver version you use, please?

@preethamsp

preethamsp commented Aug 2, 2016

Our setup: Ubuntu 14.04, TITAN X, CUDA 7.5, cuDNN v5, and NVIDIA driver version 361.45.11.

@davidkh1
Author

davidkh1 commented Aug 2, 2016

I use exactly the same setup, except an older driver (v352.93). Going to update. Thanks for your significant assistance in finding the root cause of the problem. Looking forward to running the U-Net training.

chsasank added a commit that referenced this issue Aug 2, 2016
@chsasank
Contributor

chsasank commented Aug 2, 2016

I also got the same error, even with the latest driver. We have reopened the cunn issue.

I have made a quick fix/hack to get it working. Can you check if it works now?

@davidkh1
Author

davidkh1 commented Aug 2, 2016

Your fix helped, even without the Nvidia driver update. Thanks!

@chsasank chsasank closed this as completed Aug 2, 2016