RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #83

NeuralBricolage · 2021-05-04T12:47:58Z

Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/home/helena/CUT/models/cut_model.py", line 108, in data_dependent_initialize
    self.compute_G_loss().backward() # calculate graidents for G
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Hello, I'm aware that this issue has already been brought up and that the suggestion was to downgrade to PyTorch 1.4, which I'm trying to avoid since I'm on CUDA 11.
What I find interesting, though, is that CycleGAN training works just fine with the same setup (CUDA 11.1, PyTorch 1.8) and on the same dataset.
Any suggestions on how to debug this are welcome.
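
One general starting point for debugging, offered as a sketch and not something reported as tried in this thread: CUDA kernels run asynchronously, so the Python traceback can point at a later call than the one that actually faulted. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 makes kernel launches synchronous, which usually moves the traceback onto the offending operation. The snippet assumes it is placed at the very top of the entry script (train.py here), before torch is imported.

```python
import os

# Assumption: this runs before any CUDA work starts; the safest place is
# the top of the entry script, ahead of "import torch".
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable has been set
# ... then run the usual training code ...
```

Synchronous launches slow training down noticeably, so this is best left on only while hunting for the failing call.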

layer19 · 2021-05-14T15:26:27Z

Got exactly the same problem on PyTorch 1.8 and CUDA 11.1 (trying to run the default FastCUT training from the example). Downgrading to PyTorch 1.4 and CUDA 9.2 doesn't help and leads to:

root@d63a8e8c3efe:/usr/src/CUT# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

root@d63a8e8c3efe:/usr/src/CUT# CUDA_VISIBLE_DEVICES=0 python3 train.py  --gpu_ids 0 --dataroot ./datasets/grumpifycat --name grumpifycat_FastCUT --CUT_mode FastCUT --verbose --num_threads 0
----------------- Options ---------------
                 CUT_mode: FastCUT                              [default: CUT]
               batch_size: 1                             
                    beta1: 0.5                           
                    beta2: 0.999                         
          checkpoints_dir: ./checkpoints                 
           continue_train: False                         
                crop_size: 256                           
                 dataroot: ./datasets/grumpifycat               [default: placeholder]
             dataset_mode: unaligned                     
                direction: AtoB                          
              display_env: main                          
             display_freq: 400                           
               display_id: None                          
            display_ncols: 4                             
             display_port: 8097                          
           display_server: http://localhost              
          display_winsize: 256                           
               easy_label: experiment_name               
                    epoch: latest                        
              epoch_count: 1                             
          evaluation_freq: 5000                          
        flip_equivariance: True                          
                 gan_mode: lsgan                         
                  gpu_ids: 0                             
                init_gain: 0.02                          
                init_type: xavier                        
                 input_nc: 3                             
                  isTrain: True                                 [default: None]
               lambda_GAN: 1.0                           
               lambda_NCE: 10.0                          
                load_size: 286                           
                       lr: 0.0002                        
           lr_decay_iters: 50                            
                lr_policy: linear                        
         max_dataset_size: inf                           
                    model: cut                           
                 n_epochs: 150                           
           n_epochs_decay: 50                            
               n_layers_D: 3                             
                     name: grumpifycat_FastCUT                  [default: experiment_name]
                    nce_T: 0.07                          
                  nce_idt: False                         
nce_includes_all_negatives_from_minibatch: False                         
               nce_layers: 0,4,8,12,16                   
                      ndf: 64                            
                     netD: basic                         
                     netF: mlp_sample                    
                  netF_nc: 256                           
                     netG: resnet_9blocks                
                      ngf: 64                            
             no_antialias: False                         
          no_antialias_up: False                         
               no_dropout: True                          
                  no_flip: False                         
                  no_html: False                         
                    normD: instance                      
                    normG: instance                      
              num_patches: 256                           
              num_threads: 0                                    [default: 4]
                output_nc: 3                             
                    phase: train                         
                pool_size: 0                             
               preprocess: resize_and_crop               
          pretrained_name: None                          
               print_freq: 100                           
         random_scale_max: 3.0                           
             save_by_iter: False                         
          save_epoch_freq: 5                             
         save_latest_freq: 5000                          
           serial_batches: False                         
stylegan2_G_num_downsampling: 1                             
                   suffix:                               
         update_html_freq: 1000                          
                  verbose: True                                 [default: False]
----------------- End -------------------
dataset [UnalignedDataset] was created
model [CUTModel] was created
The number of training images = 214
Setting up a new session...
create web directory ./checkpoints/grumpifycat_FastCUT/web...
Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/usr/src/CUT/models/cut_model.py", line 105, in data_dependent_initialize
    self.forward()                     # compute fake images: G(A)
  File "/usr/src/CUT/models/cut_model.py", line 154, in forward
    self.fake = self.netG(self.real)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/src/CUT/models/networks.py", line 1006, in forward
    fake = self.model(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

benz725 · 2021-05-19T03:13:10Z

I'm getting the same error.
However, my GPU is an A100, for which a CUDA version of 11.0 or above is recommended, so I cannot downgrade CUDA.
How will the author solve this problem?

dashu233 · 2021-06-14T13:56:39Z

I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len>30000, which causes this bug.

JoshonSmith · 2021-07-02T13:42:01Z

This works!
Environment: PyTorch 1.8, CUDA 11.
Thanks!

xinwangxinwang · 2021-07-19T18:16:59Z

Yes, it works. In models/networks.py (line 565), replace
patch_id = torch.randperm(feat_reshape.shape[1], device=feats[0].device)
with
patch_id = np.random.permutation(feat_reshape.shape[1])

Thank you!
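
The substitution described above can be sketched in isolation as follows. This is a minimal illustration and not the repository's exact code: the helper name sample_patch_ids and the example sizes are hypothetical, and only NumPy is used so the snippet runs without a GPU. The idea is that the permutation is generated on the CPU, so the CUDA randperm path that raised the merge_sort error is never exercised.

```python
import numpy as np


def sample_patch_ids(num_locations, num_patches):
    """Return up to num_patches distinct indices in [0, num_locations).

    Hypothetical helper: the shuffle happens on the host with NumPy
    instead of calling torch.randperm on the CUDA device.
    """
    patch_id = np.random.permutation(num_locations)
    return patch_id[:min(num_patches, num_locations)]


# Illustrative sizes only; in the thread's context num_locations would be
# feat_reshape.shape[1] and num_patches the --num_patches option (256).
ids = sample_patch_ids(num_locations=65536, num_patches=256)

# Assumption: to index a GPU tensor, the result would then be converted,
# e.g. torch.tensor(ids, dtype=torch.long, device=feat_reshape.device)
```

The cost is one small host-to-device copy of the index array per call, which should be minor next to the forward pass.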

taesungp · 2021-10-27T19:50:48Z

Thank you for the feedback and solution. I made the suggested change and pushed the code.

ErikValle · 2023-03-03T10:34:57Z

The issue has reappeared, although the previously mentioned patch has been applied. I used the environment.yml to set up a conda environment. Any suggestions?

taesungp closed this as completed Oct 27, 2021

GaloisWang mentioned this issue Mar 31, 2022

RuntimeError： cudaErrorIllegalAddress xinge008/Cylinder3D#77

Closed

Chongjie-Si mentioned this issue Aug 10, 2023

merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered terminate called after throwing an instance of 'c10::Error' lizhaoliu-Lec/CPCM#1

Closed
