How to properly deal with 'Offset mean more than 100' Warning #91
Also, one of the suggestions seems to be using a lower learning rate for the PCD module -- can you please share the code to do that? And what is an appropriate learning rate for the PCD module?
First of all, we find that: 1) we can train 'M' models from scratch without such offset warnings, i.e., the training is stable; 2) however, if we train 'L' models, training is very fragile and the offset warnings appear occasionally. If the offset mean is larger than 100, it means the offsets in the DCN are wrongly predicted (too-large offsets are meaningless), and the performance of these models is also poor. We think the reason is that when we train a large model with DCN, the offsets in the DCN are more fragile.

In the competition, we trained such large models from smaller models (from C=64 models to C=128 models, and then to B=40 models). Even with such a training scheme, we still encountered wrong/too-large offsets. We do not have a nice solution; we just stop training and resume from the nearest normal model, where a "normal" model means its offsets are not too large. The training procedures in the competition were complex, and we do not remember the concrete steps. We now provide the training schemes for the 'M' models.

We are developing more stable and efficient models, but the work is still in progress.
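Since the warning is triggered by the mean magnitude of the predicted DCN offsets, one way to catch instability early is to watch that statistic during training. Below is a minimal PyTorch sketch using forward hooks; the module-name pattern `offset_conv` and the threshold of 100 are assumptions for illustration, not the repository's exact code.

```python
# Minimal sketch: log a warning when DCN offset magnitudes blow up.
# Assumption: offset-predicting convs have 'offset_conv' in their module name.
import torch

OFFSET_WARN_THRESHOLD = 100.0  # the threshold mentioned in the warning

def make_offset_hook(name):
    def hook(module, inputs, output):
        # `output` is the predicted offset tensor of an offset conv
        offset_absmean = torch.mean(torch.abs(output)).item()
        if offset_absmean > OFFSET_WARN_THRESHOLD:
            print(f'[warn] {name}: offset abs-mean {offset_absmean:.1f} > '
                  f'{OFFSET_WARN_THRESHOLD}; consider resuming from the last '
                  f'normal checkpoint with a smaller learning rate.')
    return hook

def register_offset_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if 'offset_conv' in name:
            handles.append(module.register_forward_hook(make_offset_hook(name)))
    return handles
```

Registering the hooks once after building the model is enough; they fire on every forward pass, so the first bad iteration is visible immediately rather than only at validation time.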
Thank you! This clears my doubts.
For me it is becoming more and more frequent as the training progresses -- the training seems to be very unstable.
For large models, the unstable offset phenomenon is indeed very frequent. 1) Start from the most recent normal model. 2) Try a smaller learning rate. (Sometimes a too-large restart learning rate can also lead to this problem; you can use a smaller learning rate for restarts, e.g. by setting smaller restart weights in the training config.)
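As a concrete illustration of the smaller-learning-rate suggestion (and of the per-module learning rate asked about earlier in this thread), here is a minimal PyTorch sketch that puts the PCD alignment parameters into their own optimizer group with a reduced learning rate. The attribute name `pcd_align`, the base learning rate, and the multiplier are assumptions; adapt them to your model and config.

```python
# Minimal sketch: a lower learning rate for the PCD alignment module via
# optimizer parameter groups. Names and values are illustrative assumptions.
import torch

def build_optimizer(model, base_lr=4e-4, pcd_lr_mult=0.1):
    pcd_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # assumption: PCD alignment parameters contain 'pcd_align' in their name
        (pcd_params if 'pcd_align' in name else other_params).append(param)
    return torch.optim.Adam([
        {'params': other_params, 'lr': base_lr},
        {'params': pcd_params, 'lr': base_lr * pcd_lr_mult},  # smaller LR for PCD
    ], betas=(0.9, 0.99))
```

Keeping the two groups separate also lets a cosine-restart scheduler scale them together while preserving the ratio between them.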
Hello, I am trying to train this chain: C64B10woTSA -> C128B10woTSA -> C128B40woTSA -> C128B40wTSA. I encounter this kind of error at the second stage.
I am trying to change nf from 64 to 128, with strict_load set to false. Am I doing everything right? From searching online, it seems it is not possible to use load_state_dict() to load an nf=64 model into an nf=128 model. Here is the config file that produces the error:
The pretrained model G is the same, but with nf=64.
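For what it's worth, plain load_state_dict() indeed cannot reshape tensors, even with strict=False, since shape-mismatched keys must be skipped entirely. One workaround is to copy the overlapping slice of each tensor by hand. The sketch below is an assumption about how such a transfer could be done, not the authors' actual script; `transfer_partial_weights` is a hypothetical helper.

```python
# Minimal sketch: initialize an nf=128 model from an nf=64 checkpoint by
# copying the overlapping slice of each matching tensor.
import torch

def transfer_partial_weights(small_ckpt_path, large_model):
    # assumption: the checkpoint stores a raw state_dict
    small_sd = torch.load(small_ckpt_path, map_location='cpu')
    large_sd = large_model.state_dict()
    for key, small_t in small_sd.items():
        if key not in large_sd:
            continue  # e.g. layers that only exist in one of the two models
        large_t = large_sd[key]
        if small_t.shape == large_t.shape:
            large_sd[key] = small_t
        else:
            # copy the top-left sub-block (the first 64 channels of each dim)
            slices = tuple(slice(0, min(s, l))
                           for s, l in zip(small_t.shape, large_t.shape))
            large_sd[key][slices] = small_t[slices]
    large_model.load_state_dict(large_sd)
```

The remaining (uncopied) channels keep their random initialization, so a warm-up phase with a reduced learning rate may still be needed before the larger model stabilizes.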
Hi Xinntao,
Can you please comment on how to specifically deal with the offset warning in the PCD alignment module? I am trying to train an 'L' model, and I still have ambiguities after piecing together the details from issues #16 and #22. I have put together the workflow required to train an 'L' model below; can you please comment on its correctness?