How to properly deal with 'Offset mean more than 100' Warning #91

Closed
TouqeerAhmad opened this issue Sep 3, 2019 · 7 comments

@TouqeerAhmad

Hi Xinntao,

Can you please comment on how to specifically deal with the offset warning in the PCD alignment module? I am trying to train an 'L' model and still have some ambiguities after piecing together the details from issues #16 and #22. I have put together the workflow required to train an 'L' model below; could you please comment on its correctness?

  1. The original 'M' and 'L' models have 10 and 40 RBs respectively in the reconstruction module, and the numbers of channels also differ, i.e. 64 and 128 respectively.
  2. By your suggestion of initializing a deeper model with a shallower one, are you referring to using an 'M' model (wo TSA) with 128 feature maps and not with 64? -- as the latter would not work.
  3. Once the 'M' model (wo TSA) with 128 feature maps is trained, we can initialize the 'L' model (wo TSA) from it, where each group of 10 RBs (out of 40) in the 'L' model is initialized with the 10 RBs from the 'M' model (see the sketch after this list).
  4. Then the 'L' model (wo TSA) can be trained; subsequently we can add the TSA module -- train only the TSA module first and then the full network, as suggested in the yml file for the 'M' model.
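
To make step 3 concrete, here is a minimal sketch, assuming the checkpoint is a plain state_dict and the reconstruction RBs live under keys of the form recon_trunk.<index>.* (both the key pattern and the file names below are assumptions; adjust them to the actual architecture):

import torch

# Hedged sketch: replicate the 10 reconstruction RBs of a trained C128B10 (wo TSA)
# checkpoint into the 40 RBs of a C128B40 (wo TSA) model.
small = torch.load('C128B10_woTSA.pth', map_location='cpu')  # hypothetical path, assumed to be a raw state_dict
large_init = dict(small)  # shared modules (first conv, PCD, upconv, ...) keep the small model's weights

prefix = 'recon_trunk.'
for key, value in small.items():
    if key.startswith(prefix):
        idx_str, rest = key[len(prefix):].split('.', 1)
        idx = int(idx_str)
        # copy RB i of the small model into RBs i, i+10, i+20, i+30 of the large model
        for rep in range(1, 4):
            large_init[f'{prefix}{idx + 10 * rep}.{rest}'] = value.clone()

torch.save(large_init, 'C128B40_woTSA_init.pth')  # point pretrain_model_G at this file
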
@TouqeerAhmad (Author)

Also, one of the suggestions seems to be using a lower learning rate for the PCD module -- could you please share the code to do that? And what is an appropriate learning rate for the PCD module?

@xinntao (Owner) commented Sep 6, 2019

First of all, we find that 1) we can train 'M' models from scratch without such offset warnings; that is, the training is stable. 2) However, if we train 'L' models, training is very fragile and the offset warnings appear occasionally. If the offset is larger than 100, it means that the offsets in the DCN are wrongly predicted (too-large offsets are meaningless). The performance of these models is also poor. We think the reason is that when we train a large model with DCN, the offsets in the DCN are more fragile.
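
For reference, the warning itself comes from a simple check on the magnitude of the predicted DCN offsets. Below is a minimal sketch of such a check, assuming the offsets are available as a tensor right after the offset conv in the PCD alignment module (an illustration, not the repository's exact code):

import logging
import torch

logger = logging.getLogger('base')

def check_offset(offset, threshold=100.0):
    """Warn when the mean absolute DCN offset exceeds the threshold."""
    # Offsets far larger than the feature map are meaningless, so a mean above
    # ~100 usually indicates that the PCD alignment has gone off the rails.
    offset_mean = torch.mean(torch.abs(offset))
    if offset_mean > threshold:
        logger.warning('Offset mean is {:.1f}, larger than {:.0f}.'.format(offset_mean.item(), threshold))

# Usage inside the deformable-conv wrapper's forward(), right after the offsets
# are predicted (the attribute name here is hypothetical):
#   offset = self.conv_offset(feat)
#   check_offset(offset)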

In the competition, we trained such large models from smaller models (from C=64 models to C=128 models and then to B=40 models). Even with such training schemes, we still encountered wrong/too-large offsets. We do not have a nice solution; we just stop the run and resume from the nearest normal model. Here, a normal model means one whose offsets are normal and not too large.

The training procedures in the competition were complex, and actually we do not remember the concrete steps. We now provide the training schemes for the "M" models.
We want to provide a simple and effective way to reproduce the "L" models.
I think I need another two or three weeks to explore such ways.

  1. Right.
  2. In the competition, we used a strategy, but we do not remember the concrete steps.
    Based on experience and past experiments, C64B10woTSA -> C128B10woTSA -> C128B40woTSA -> C128B40wTSA seems a reasonable path, but the final performance of this path is unknown. I will also conduct experiments in the coming weeks. If I have results, I will update this issue.
  3. We used half the learning rate for the PCD module during the competition, but I wonder whether this is necessary, because in the 'M' model training it is not. (A sketch of setting a separate learning rate for the PCD module is given after this list.)
  4. If you encounter the offset warning, just stop the run and resume from the nearest normal model.
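
A minimal sketch of point 3, assuming the PCD alignment parameters can be identified by a name substring such as 'pcd_align' (the substring, optimizer choice, and factor below are assumptions, not the exact competition settings):

import torch

def build_optimizer(net, base_lr=4e-4, pcd_lr_factor=0.5):
    """Put PCD-alignment parameters in their own group with a smaller lr."""
    pcd_params, normal_params = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue
        (pcd_params if 'pcd_align' in name else normal_params).append(param)
    return torch.optim.Adam(
        [{'params': normal_params, 'lr': base_lr},
         {'params': pcd_params, 'lr': base_lr * pcd_lr_factor}],
        betas=(0.9, 0.99))

The same grouping works with any optimizer, and a learning-rate scheduler then anneals each parameter group from its own base lr.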

We are developing more stable and efficient models, but the work is still in progress.

@TouqeerAhmad (Author)

Thank you! This clears up my doubts.

@TouqeerAhmad (Author)

For me it is becoming more and more frequent as the training progresses -- it seems to be very unstable.

@xinntao (Owner) commented Sep 16, 2019

For large models, the unstable offset phenomenon is indeed very frequent. 1) Start from the most recent normal model. 2) You may try to use a smaller learning rate. (Sometimes a too-large restart learning rate can also lead to this problem; you can use a smaller learning rate for the restarts by setting restart_weights.)
We do not have an effective way to prevent it.
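
A rough illustration of the restart_weights suggestion, assuming the cosine-restart scheduler resets each base learning rate to base_lr multiplied by the matching restart weight at the restart step (the exact scheduler behaviour is not reproduced here):

def lr_at_period_start(base_lr, restarts, restart_weights, step):
    """Learning rate at the start of the cosine period that contains `step`."""
    lr = base_lr
    for restart_step, weight in zip(restarts, restart_weights):
        if step >= restart_step:
            lr = base_lr * weight
    return lr

# With lr_G = 4e-4, restarts = [150000, 300000, 450000] and
# restart_weights = [0.5, 0.5, 0.5] instead of [1, 1, 1], every restarted
# period begins at 2e-4 rather than 4e-4.
print(lr_at_period_start(4e-4, [150000, 300000, 450000], [0.5, 0.5, 0.5], 160000))  # 0.0002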

@DenisDiachkov

Hello, I am trying to train this chain: C64B10woTSA -> C128B10woTSA -> C128B40woTSA -> C128B40wTSA. I encounter this kind of error at the second stage:

File "train.py", line 311, in <module>
    main()
  File "train.py", line 130, in main
    model = create_model(opt)
  File "/data/denis/EDVR-master/codes/models/__init__.py", line 17, in create_model
    m = M(opt)
  File "/data/denis/EDVR-master/codes/models/Video_base_model.py", line 33, in __init__
    self.load()
  File "/data/denis/EDVR-master/codes/models/Video_base_model.py", line 163, in load
    self.load_network(load_path_G, self.netG, self.opt['path']['strict_load'])
  File "/data/denis/EDVR-master/codes/models/base_model.py", line 94, in load_network
    network.load_state_dict(load_net_clean, strict=strict)
  File "/data/anaconda3/envs/dwx815999/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
Traceback (most recent call last):
  File "train.py", line 311, in <module>
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EDVR:
size mismatch for upconv1.bias: copying a param with shape torch.Size([256])
 from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for upconv2.weight: copying a param with shape torch.Size([256, 64, 3, 3]) from
checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).

I am trying to change nf from 64 to 128, and strict_load is false. Am I doing everything right? I searched for this on the internet, and it seems it is not possible to use load_state_dict() to load an nf=64 model into an nf=128 model.

Here is the config file which produces the error:

#### general settings
name: C128RB10woTSA_001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S
use_tb_logger: false
model: video_base
distortion: sr
scale: 4
gpu_ids: [0,1,2,3,4,5]

#### datasets
datasets:
  train:
    name: REDS
    mode: REDS
    interval_list: [1]
    random_reverse: false
    border_mode: false
    dataroot_GT: ../datasets/REDS/old_gt #../datasets/REDS/train_sharp_wval.lmdb
    dataroot_LQ: ../datasets/REDS/old_sblur #../datasets/REDS/train_sharp_bicubic_wval.lmdb
    cache_keys: ~

    N_frames: 5
    use_shuffle: true
    n_workers: 3  # per GPU
    batch_size: 120
    GT_size: 128
    LQ_size: 128
    use_flip: true
    use_rot: true
    color: RGB

#### network structures
network_G:
  which_model_G: EDVR
  nf: 128
  nframes: 5
  groups: 8
  front_RBs: 5
  back_RBs: 10
  predeblur: false
  HR_in: true
  w_TSA: false

#### path
path:
  pretrain_model_G: /data/denis/EDVR-master/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S_archived_200606-082253/models/150000_G.pth
  strict_load: false
  resume_state: ~
  #/data/denis/EDVR-master/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S/training_state/150000.state

#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 4e-4
  lr_scheme: CosineAnnealingLR_Restart
  beta1: 0.9
  beta2: 0.99
  niter: 600000
  warmup_iter: -1  # -1: no warm up
  T_period: [150000, 150000, 150000, 150000]
  restarts: [150000, 300000, 450000]
  restart_weights: [1, 1, 1]
  eta_min: !!float 1e-7

  pixel_criterion: cb
  pixel_weight: 1.0
  val_freq: !!float 5e3

  manual_seed: 0

#### logger
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 5e3

The pretrained model G is the same but with nf=64.
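
One possible workaround, sketched here under the assumption that the checkpoint is a plain state_dict: strict=False only ignores missing or unexpected keys and still raises on shape mismatches, so the mismatching entries have to be filtered out before calling load_state_dict. Note that this merely skips the mismatching parameters (they keep their fresh initialization); it does not transfer 64-channel weights into 128-channel layers.

import torch

def load_matching(network, checkpoint_path):
    """Load only the checkpoint entries whose names and shapes match the model."""
    loaded = torch.load(checkpoint_path, map_location='cpu')  # assumed to be a raw state_dict
    model_state = network.state_dict()
    filtered = {k: v for k, v in loaded.items()
                if k in model_state and v.shape == model_state[k].shape}
    skipped = sorted(set(loaded) - set(filtered))
    print('Skipped (name/shape mismatch):', skipped)
    network.load_state_dict(filtered, strict=False)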
