Bug when resuming training from a checkpoint #17

Closed
SimKarras opened this issue Jul 11, 2021 · 8 comments

SimKarras commented Jul 11, 2021

When I tried to resume training from a checkpoint, I changed the following entries in the .yml file:

# path
path:
  pretrain_network_g: experiments/train_GFPGANv1_512/models/net_g_490000.pth
  param_key_g: params_ema
  strict_load_g: ~
  pretrain_network_d: experiments/train_GFPGANv1_512/models/net_d_490000.pth
  pretrain_network_d_left_eye: experiments/train_GFPGANv1_512/models/net_d_left_eye_490000.pth
  pretrain_network_d_right_eye: experiments/train_GFPGANv1_512/models/net_d_right_eye_490000.pth
  pretrain_network_d_mouth: experiments/train_GFPGANv1_512/models/net_d_mouth_490000.pth
  pretrain_network_identity: experiments/pretrained_models/arcface_resnet18.pth
  # resume
  resume_state: experiments/train_GFPGANv1_512/training_states/490000.state
  ignore_resume_networks: ['network_identity']

I did not modify the pretrain_network_identity entry.
But then I got this error:

FileNotFoundError: [Errno 2] No such file or directory: 'GFPGAN/experiments/train_GFPGANv1_512/models/net_identity_490000.pth'

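I was completely confused...
Looking at the full configuration printed at the start of the log, pretrain_network_identity had already been changed by then: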

2021-07-11 22:21:11,000 INFO: Loading ResNetArcFace model from GFPGAN/experiments/train_GFPGANv1_512/models/net_identity_490000.pth.

What...?

@xinntao
Member

xinntao commented Jul 12, 2021

@JiaweiShiCV This is a bug in basicsr. You can fix it by updating basicsr (v1.3.3.5).

The root cause is this: XPixelGroup/BasicSR@4a96712
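For readers who hit the same error, the symptom can be illustrated with a minimal sketch. This is not BasicSR's actual code: `resolve_resume_paths` and its arguments are hypothetical names used only for illustration. It assumes that resuming redirects every `pretrain_network_*` entry to the checkpoint saved at the resumed iteration, and that networks listed in `ignore_resume_networks` are meant to keep their original path. The error reported above is what one would expect if that skip were not applied to `network_identity`.

```python
import os.path as osp


def resolve_resume_paths(path_opt, network_names, models_dir, resume_iter):
    """Return a copy of path_opt with pretrain paths redirected to resumed checkpoints.

    Hypothetical helper for illustration only, not the BasicSR implementation.
    """
    ignored = set(path_opt.get('ignore_resume_networks') or [])
    resolved = dict(path_opt)
    for network in network_names:
        if network in ignored:
            # Keep the user-supplied path, e.g. a fixed pretrained ArcFace model.
            continue
        basename = network.replace('network_', '', 1)
        resolved[f'pretrain_{network}'] = osp.join(
            models_dir, f'net_{basename}_{resume_iter}.pth')
    return resolved


path_opt = {
    'pretrain_network_g': 'experiments/train_GFPGANv1_512/models/net_g_490000.pth',
    'pretrain_network_identity': 'experiments/pretrained_models/arcface_resnet18.pth',
    'ignore_resume_networks': ['network_identity'],
}
resolved = resolve_resume_paths(
    path_opt,
    network_names=['network_g', 'network_identity'],
    models_dir='experiments/train_GFPGANv1_512/models',
    resume_iter=490000,
)
print(resolved['pretrain_network_identity'])
```

With the ignore list honored, the identity network keeps pointing at `arcface_resnet18.pth`. Without it, the path would become `net_identity_490000.pth`, a file that the training run never saved.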

SimKarras reopened this Jul 12, 2021

@SimKarras
Author

SimKarras commented Jul 12, 2021

@xinntao After updating with pip install basicsr --upgrade, processing images fails with this error:

(BasicSR) ➜  GFPGAN git:(master) ✗ python inference_gfpgan_full.py --model_path experiments/pretrained_models/G8/net_g_480000.pth --test_path inputs/whole_imgs --paste_back
Processing 112.jpg ...
Traceback (most recent call last):
  File "inference_gfpgan_full.py", line 129, in <module>
    restoration(
  File "inference_gfpgan_full.py", line 52, in restoration
    output = gfpgan(cropped_face_t, return_rgb=False)[0]
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/文档/GFPGAN/archs/gfpganv1_arch.py", line 348, in forward
    feat = self.conv_body_first(x)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 85, in forward
    return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 89, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 59, in forward
    out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined

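Then I tried uninstalling basicsr and reinstalling it with the environment variable set:
BASICSR_EXT=True pip install basicsr
The same error still occurs...
For now I have switched back to 1.3.3.4.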
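A `NameError` on `fused_act_ext`, rather than an `ImportError`, suggests that the compiled extension was never imported in the installed package. One way to check this before running inference or training is sketched below. The module path `basicsr.ops.fused_act.fused_act_ext` is an assumption inferred from the traceback, and `extension_available` is a hypothetical helper, not part of basicsr.

```python
import importlib.util


def extension_available(module_name):
    """Return True if the import system can locate module_name.

    Parent packages are imported as a side effect of the lookup.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # A parent package is missing or failed to import.
        return False


if __name__ == '__main__':
    # Assumed module path, inferred from the traceback; adjust if it differs.
    name = 'basicsr.ops.fused_act.fused_act_ext'
    print(name, 'found' if extension_available(name) else 'NOT found')
```

If the check reports that the module is not found, the extension was most likely not built for that installation, which would be consistent with the error above.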

@SimKarras
Author

With the new version (1.3.3.5), compiling StyleGAN's fused_act_ext fails, so training cannot start.

@xinntao
Member

xinntao commented Jul 12, 2021

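The code related to this was not changed in this version.

You can build from a git clone instead, which makes it easier to pinpoint the problem:

  1. Uninstall the existing basicsr first.
  2. git clone https://github.com/xinntao/BasicSR.git
  3. Enter the basicsr directory and build with BASICSR_EXT=True python setup.py develop

If you run into problems, please paste the output. 1.3.3.5 should not have any effect on this =-=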

@SimKarras
Author

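@xinntao haha, I just tried it on two machines. For both inference and training, 1.3.3.5 raises NameError: name 'fused_act_ext' is not defined. After switching to 1.3.3.4 everything works as before; on 1.3.3.4 the only thing that fails is resuming training from a checkpoint.

The error from multi-GPU training on 1.3.3.5 (the same as for inference):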

Traceback (most recent call last):
  File "train.py", line 10, in <module>
    train_pipeline(root_path)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/train.py", line 166, in train_pipeline
    model.optimize_parameters(current_iter)
  File "/home/shijiawei/data-vol-1/GFPGAN/models/gfpgan_model.py", line 307, in optimize_parameters
    self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 684, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shijiawei/data-vol-1/GFPGAN/archs/gfpganv1_arch.py", line 348, in forward
    feat = self.conv_body_first(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 85, in forward
    return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 89, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 59, in forward
    out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined

@SimKarras
Author

@xinntao Building it the way you described above seems to have fixed it...

@xinntao
Member

xinntao commented Jul 12, 2021

OK, maybe the earlier uninstall was not clean.

Or the build under BASICSR_EXT=True pip install basicsr went wrong. You can inspect the output with BASICSR_EXT=True pip -vvv install basicsr.
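To check whether an earlier uninstall left a stale copy behind, one option is to list every location on the import path that contains a `basicsr` directory. The sketch below is only an illustration: `find_package_copies` is a hypothetical helper, and it only detects plain directories, not zipped or egg-linked installs.

```python
import os
import sys


def find_package_copies(package, search_paths=None):
    """Return every directory on search_paths that contains `package`."""
    paths = sys.path if search_paths is None else search_paths
    hits = []
    for entry in paths:
        # An empty sys.path entry stands for the current working directory.
        candidate = os.path.join(entry or os.getcwd(), package)
        if os.path.isdir(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == '__main__':
    for location in find_package_copies('basicsr'):
        print(location)
```

More than one printed location would suggest that two copies can shadow each other, in which case the copy listed first is the one Python imports.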

@SimKarras
Author

> OK, maybe the earlier uninstall was not clean.
>
> Or the build under BASICSR_EXT=True pip install basicsr went wrong. You can inspect the output with BASICSR_EXT=True pip -vvv install basicsr.

OK, thx!

xinntao closed this as completed Jul 12, 2021