Error resuming Diffusion-InsGen #6

Closed
GilesBathgate opened this issue Oct 17, 2022 · 9 comments

GilesBathgate commented Oct 17, 2022

  File "train.py", line 613, in <module>                                                            
    main() # pylint: disable=no-value-for-parameter                                                                                                                                                     
  File "diffusion-gan/diffusion-stylegan2/.env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__                                                                  
    return self.main(*args, **kwargs)                                                               
  File "diffusion-gan/diffusion-stylegan2/.env/lib/python3.8/site-packages/click/core.py", line 1055, in main                                                                      
    rv = self.invoke(ctx)                                                                           
  File "diffusion-gan/diffusion-stylegan2/.env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke                                                                    
    return ctx.invoke(self.callback, **ctx.params)                                                  
  File "diffusion-gan/diffusion-stylegan2/.env/lib/python3.8/site-packages/click/core.py", line 760, in invoke                                                                     
    return __callback(*args, **kwargs)                                                              
  File "diffusion-gan/diffusion-stylegan2/.env/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func                                                              
    return f(get_current_context(), *args, **kwargs)                                                
  File "train.py", line 606, in main                                                                
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)                                             
  File "train.py", line 432, in subprocess_fn                                                       
    training_loop.training_loop(rank=rank, **args)                                                  
  File "diffusion-gan/diffusion-insgen/training/training_loop.py", line 193, in training_loop                                                                                      
    misc.copy_params_and_buffers(resume_data[name], module, require_all=False)                      
KeyError: 'D_ema'  

I think the fix is just to add:

--- a/diffusion-insgen/training/training_loop.py
+++ b/diffusion-insgen/training/training_loop.py
@@ -417,21 +417,21 @@ def training_loop(
         # Save network snapshot.
         snapshot_pkl = None
         snapshot_data = None
         if (network_snapshot_ticks is not None) and (done or cur_tick % network_snapshot_ticks == 0):
             snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
-            for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe)]:
+            for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead), ('augment_pipe', augment_pipe)]:

But then the following error occurs:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Furthermore, this issue also appears to be present in the upstream version of InsGen.
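
For context, the RuntimeError is generic PyTorch behavior. A minimal, hypothetical reproduction (not code from this repo): an in-place copy into a leaf tensor that requires grad, outside torch.no_grad(), raises exactly this error.

import torch

# Hypothetical minimal reproduction, not repo code.
# A Parameter is a leaf tensor with requires_grad=True by default.
w = torch.nn.Parameter(torch.zeros(3))

# Raises: RuntimeError: a leaf Variable that requires grad is being
# used in an in-place operation.
w.copy_(torch.ones(3))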

Zhendong-Wang (Owner) commented

Hi @GilesBathgate,

Your fix is correct. The subsequent resuming error originates from InsGen: genforce/insgen#6. I tried to fix it but didn't work it out. I guess the error comes from DHead and GHead, but I don't know where they perform an in-place operation. We'll need to wait for the InsGen authors to solve this, lol...

Zhendong-Wang (Owner) commented

Thanks a lot! Will make the change.

GilesBathgate (Author) commented Nov 15, 2022

The fix should probably be this:

--- a/diffusion-insgen/training/training_loop.py
+++ b/diffusion-insgen/training/training_loop.py
@@ -154,22 +154,22 @@ def training_loop(
 
     # Construct networks.
     if rank == 0:
         print('Constructing networks...')
     common_kwargs = dict(c_dim=training_set.label_dim, img_resolution=training_set.resolution, img_channels=training_set.num_channels)
     G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(device) # subclass of torch.nn.Module
     D = dnnlib.util.construct_class_by_name(**D_kwargs, **common_kwargs).train().requires_grad_(False).to(device) # subclass of torch.nn.Module
     G_ema = copy.deepcopy(G).eval()
 
     # Construct contrastive heads.
-    DHead = dnnlib.util.construct_class_by_name(**DHead_kwargs).train().to(device) if DHead_kwargs is not None else None
-    GHead = dnnlib.util.construct_class_by_name(**GHead_kwargs).train().to(device) if GHead_kwargs is not None else None
+    DHead = dnnlib.util.construct_class_by_name(**DHead_kwargs).train().requires_grad_(False).to(device) if DHead_kwargs is not None else None
+    GHead = dnnlib.util.construct_class_by_name(**GHead_kwargs).train().requires_grad_(False).to(device) if GHead_kwargs is not None else None
     D_ema = copy.deepcopy(D).eval()
 
     # Setup augmentation.


@@ -221,6 +224,8 @@ def training_loop(
             ddp_modules[name] = module
 
     # Distribute Heads across GPUs.
+    DHead.requires_grad_(True)
+    GHead.requires_grad_(True)
     if rank == 0:
         print(f'Distributing Contrastive Heads across {num_gpus} GPUS...')
     if num_gpus > 1:

This seems to fit the intent of the original StyleGAN code better.
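
To illustrate that intent with a standalone sketch (illustrative names, not repo code): parameters stay frozen while snapshot weights are copied in, and grads are switched back on only once training setup begins.

import copy
import torch

# Stand-in for DHead/GHead: constructed frozen, like G and D above.
head = torch.nn.Linear(8, 4).train().requires_grad_(False)
snapshot = copy.deepcopy(head)  # stand-in for resume_data['DHead']

# With requires_grad disabled, in-place copy_ into the leaf weights is legal;
# this mirrors what misc.copy_params_and_buffers does per tensor on resume.
for dst, src in zip(head.parameters(), snapshot.parameters()):
    dst.copy_(src.detach())

# Re-enable grads only afterwards, just before DDP setup and training.
head.requires_grad_(True)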

Zhendong-Wang (Owner) commented Nov 15, 2022

@GilesBathgate Really appreciate your investigation here 💯. I will test the code and update accordingly.

Zhendong-Wang (Owner) commented Mar 27, 2023

This fix seems not to work when saving checkpoints. Do you know what the possible reason could be, @GilesBathgate?

Distributing across 2 GPUs...
Distributing Contrastive Heads across 2 GPUS...
Setting up training phases...
Setting up contrastive training phases...
Exporting sample images...
Initializing logs...
Skipping tfevents export: No module named 'tensorboard'
Training for 25000 kimg...

tick 0     kimg 0.1      time 18s          sec/tick 5.9     sec/kimg 92.60   maintenance 11.7   cpumem 4.10   gpumem 11.92  augment 0.000 T 10.0
Traceback (most recent call last):
  File "train.py", line 603, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 598, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/zdwang/.conda/envs/difgan/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/zdwang/Research/Diffusion-GAN/diffusion-insgen/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/zdwang/Research/Diffusion-GAN/diffusion-insgen/training/training_loop.py", line 432, in training_loop
    misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
  File "/home/zdwang/Research/Diffusion-GAN/diffusion-insgen/torch_utils/misc.py", line 180, in check_ddp_consistency
    assert (nan_to_num(tensor) == nan_to_num(other)).all(), fullname
AssertionError: DistributedDataParallel.module.mlp.0.weight
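
For reference, here is a rough, simplified sketch of what the failing check does (based on the StyleGAN2-ADA-style torch_utils/misc.py this repo builds on): it broadcasts each tensor from rank 0 and asserts every other rank holds identical values, so the assertion means DHead's first MLP weight has diverged between the two GPUs.

import re
import torch

# Simplified sketch of misc.check_ddp_consistency; the real version also
# covers buffers and maps NaNs before comparing. Requires torch.distributed
# to be initialized.
def check_ddp_consistency(module, ignore_regex=None):
    for name, tensor in module.named_parameters():
        fullname = type(module).__name__ + '.' + name
        if ignore_regex is not None and re.fullmatch(ignore_regex, fullname):
            continue
        other = tensor.detach().clone()
        torch.distributed.broadcast(tensor=other, src=0)  # rank 0's values
        assert (tensor.detach() == other).all(), fullname  # fails on divergence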

GilesBathgate (Author) commented
I only have 1 GPU, so I wasn't running in distributed mode; perhaps that's why. I had to make another patch just to support a single GPU.

Zhendong-Wang (Owner) commented Mar 28, 2023

Thanks @GilesBathgate! I remember you had another fix, which found an in-place operation in InsGen's contrastive head. That fix was deleted (I don't know why...). Do you mind sharing it again so I can try that one? I can't find where it is, lol. Thanks again!

GilesBathgate (Author) commented Mar 28, 2023

I proposed a change here:

tensor.copy_(src_tensors[name].detach()).requires_grad_(tensor.requires_grad)

Essentially, disable grad before copying, then re-enable it afterwards. However, I prefer the fix above, which should have the same effect: grad should not be enabled before misc.copy_params_and_buffers is called, as is the case for the other modules.

I don't think either of these fixes will solve your error in misc.check_ddp_consistency.
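
Both workarounds amount to the same thing: make the in-place copy invisible to autograd. A minimal, hypothetical illustration (not repo code):

import torch

dst = torch.nn.Parameter(torch.zeros(3))  # leaf, requires_grad=True
src = torch.ones(3)

# (a) Suspend autograd around the in-place copy:
with torch.no_grad():
    dst.copy_(src)

# (b) Toggle requires_grad around the copy, as in the one-liner above:
dst.requires_grad_(False)
dst.copy_(src)
dst.requires_grad_(True)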

GilesBathgate (Author) commented Mar 29, 2023

--- a/diffusion-insgen/torch_utils/misc.py
+++ b/diffusion-insgen/torch_utils/misc.py
@@ -150,7 +150,10 @@ def copy_params_and_buffers(src_module, dst_module, require_all=False):
     for name, tensor in named_params_and_buffers(dst_module):
         assert (name in src_tensors) or (not require_all)
         if name in src_tensors:
-            tensor.copy_(src_tensors[name].detach()).requires_grad_(tensor.requires_grad)
+            requires_grad = tensor.requires_grad
+            with torch.no_grad():
+                tensor.copy_(src_tensors[name].detach())
+            tensor.requires_grad_(requires_grad)
