
error when running train.py #6

Closed
qingzi02010 opened this issue Dec 27, 2019 · 19 comments

@qingzi02010

Line 135 in train.py:
fake, latent = generator.module([test_in], return_latents=True)

When I run train.py, it fails at generator.module with:
AttributeError: 'Generator' object has no attribute 'module'
My PyTorch version is 1.3.1. Do you have any idea what causes this? Thank you so much!

@onion-liu
Contributor

onion-liu commented Dec 27, 2019

It seems you are running the code on a single GPU. You should remove .module, which only exists when the generator is wrapped in DistributedDataParallel.
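
A common pattern for keeping one code path that works both with and without DistributedDataParallel is to resolve the wrapped module once. This is only a sketch; unwrap is a hypothetical helper, not part of this repo:

from torch.nn.parallel import DistributedDataParallel

def unwrap(model):
    # DistributedDataParallel keeps the original network in .module;
    # a plain nn.Module has no such attribute.
    return model.module if isinstance(model, DistributedDataParallel) else model

# usage in train.py (sketch):
#     fake, latent = unwrap(generator)([test_in], return_latents=True)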

@qingzi02010
Author

Yes, you are right, I am only debugging the code on a single GPU. So should I run on distributed GPUs to get a correct result?

@rosinality
Owner

If you want to test, you can run on a single GPU with 71463b9.


@qingzi02010
Author

1) Hi, whether running on a single GPU or on distributed GPUs, the printed loss values (d, g, r1, path, and mean path) are all nan.
2) Debugging on a single GPU with requires_grad(generator, False), fake_img, _ = generator(noise1) produces a fake_img whose first values fake_img[0:4] are 0, 0, -inf, -inf.
3) With requires_grad(generator, True), fake_img, _ = generator(noise1) produces a fake_img whose fake_img[0:4] is -inf, -inf, -inf, -inf.

Do you have any idea about that? Thank you very much for your help!

@rosinality
Owner

I suspect the loss has exploded after some iterations. Could you tell me your batch size? Maybe you can use a lower learning rate and retry.
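
For reference, "a lower learning rate" amounts to something like the following when the generator optimizer is constructed. This is a sketch only; the actual optimizer arguments and defaults in train.py may differ:

import torch

# `generator` stands in for the model built in train.py; pick a learning
# rate smaller than whatever default is currently used.
g_optim = torch.optim.Adam(generator.parameters(), lr=0.001, betas=(0.0, 0.99))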

@qingzi02010
Author

On a single GPU the batch size is 4. The problem occurs the first time the generator's forward function is called.

@rosinality
Owner

rosinality commented Dec 30, 2019

Could you try this?

import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')
x = torch.randn(4, 512).to('cuda')

print(g([x]))

If the problem really occurs even at the first forward pass, then maybe there are some numerical errors in the generator implementation, but currently I don't know what it could be.
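
To make the nan/inf check explicit, the same snippet can be extended with torch.isfinite. A sketch, using the same Generator signature as above and the (image, latent) return pair seen elsewhere in this thread:

import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')
x = torch.randn(4, 512).to('cuda')

# The generator returns (image, latent); only the image is checked here.
out, _ = g([x])
print(out.min().item(), out.max().item(), torch.isfinite(out).all().item())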

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

path[0].backward()

If I comment out this line, the values generated by the generator are normal.

@rosinality
Owner

rosinality commented Dec 30, 2019

Hmm, I don't know why, as it should not update the generator itself.

Now I suspect that it is related to path length regularization, as commenting out path[0].backward() causes the gradients calculated for path length regularization to be ignored.

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

path = g_path_regularize(fake, latent, 0)
path[0].backward()

The function g_path_regularize includes the operation

grad, = autograd.grad(outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True)

fake_img is generated by the generator, so backward will update the generator, right?
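
For context, a path length penalty of this shape can be sketched as follows, following the autograd.grad call quoted above. This is an illustration of the technique, not necessarily the repo's exact g_path_regularize:

import math
import torch
from torch import autograd

def path_length_penalty(fake_img, latents, mean_path_length, decay=0.01):
    # Random projection of the image, scaled by image resolution.
    noise = torch.randn_like(fake_img) / math.sqrt(fake_img.shape[2] * fake_img.shape[3])
    # Gradient of the projected image w.r.t. the latents; create_graph=True keeps
    # this differentiable, so backward() on the penalty produces second-order
    # gradients that flow through the generator.
    grad, = autograd.grad(outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True)
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    penalty = (path_lengths - path_mean).pow(2).mean()
    return penalty, path_mean.detach(), path_lengths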

@rosinality
Owner

It will calculate gradients, but optimizer.step() is omitted, so it will not update the parameters of the generator. generator.zero_grad() will be called later, so the grad buffers will be cleaned.
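
A small self-contained check of that point, using a toy nn.Linear rather than the actual generator:

import torch
from torch import nn

layer = nn.Linear(4, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

before = layer.weight.detach().clone()
layer(torch.randn(2, 4)).sum().backward()

# backward() only fills the .grad buffers; without opt.step() the weights stay the same.
print(torch.equal(before, layer.weight.detach()))        # True
print(layer.weight.grad.abs().sum().item() > 0)          # True: gradients were accumulated

# zero_grad() then clears the buffers (or sets them to None, depending on the PyTorch version).
opt.zero_grad()
print(layer.weight.grad is None or layer.weight.grad.abs().sum().item() == 0)  # True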

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

if args.distributed:
    fake, latent = generator.module([test_in], return_latents=True)  # for distributed GPUs
else:
    fake, latent = generator([test_in], return_latents=True)

# dump one value per parameter before the backward pass
# for name, param in generator.named_parameters():
#     if param.ndim == 1:
#         aa = str(param[0].cpu().detach().numpy())
#     elif param.ndim == 2:
#         aa = str(param[0][0].cpu().detach().numpy())
#     else:
#         aa = 'param.ndim > 2'
#     print(name, aa)
#     with open('generator_name_param_0.txt', 'a') as f:
#         f.write(name + '####' + aa + '\n')

path = g_path_regularize(fake, latent, 0)
path[0].backward()

# dump one value per parameter after the backward pass
# for name, param in generator.named_parameters():
#     if param.ndim == 1:
#         aa = str(param[0].cpu().detach().numpy())
#     elif param.ndim == 2:
#         aa = str(param[0][0].cpu().detach().numpy())
#     else:
#         aa = 'param.ndim > 2'
#     with open('generator_name_param_1.txt', 'a') as f:
#         f.write(name + '####' + aa + '\n')

fake_test, __ = generator([test_in], return_latents=True)

Yes, I have checked that the parameters of the generator are not updated. But when I check like this, fake is a normal value while fake_test is inf. It is so strange. Can you reproduce that on a single GPU?
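
One way to narrow this down is to scan the generator's parameters and gradient buffers for non-finite values right after the backward call. A sketch; report_nonfinite is a hypothetical helper and generator refers to the model in the snippet above:

import torch

def report_nonfinite(module, tag=''):
    # Print every parameter or accumulated gradient that contains nan/inf.
    for name, param in module.named_parameters():
        if not torch.isfinite(param).all():
            print(tag, 'param', name, 'contains nan/inf')
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(tag, 'grad', name, 'contains nan/inf')

# usage (sketch):
#     report_nonfinite(generator, tag='after path[0].backward()')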

@rosinality
Owner

I tested it, but in my case it works without problems.

@qingzi02010
Author

qingzi02010 commented Dec 31, 2019

if __name__ == '__main__':
    device = 'cuda'
    parser = argparse.ArgumentParser()
    parser.add_argument('--size', type=int, default=256)
    parser.add_argument('--channel_multiplier', type=int, default=2)
    args = parser.parse_args()
    args.latent = 512
    args.n_mlp = 8
    generator = Generator(args.size, args.latent, args.n_mlp, channel_multiplier=args.channel_multiplier).to(device)

    test_in = torch.randn(1, args.latent, device=device)
    fake, latent = generator([test_in], return_latents=True)

    path = g_path_regularize(fake, latent, 0)
    path[0].backward()

    fake_test, __ = generator([test_in], return_latents=True)

This is the main function in train.py, and it is very simple. The same error occurs: fake is normal and fake_test is inf, while if path[0].backward() is commented out, fake_test is normal.
I am so confused about this problem; it is very strange, and I do not know how to solve it. I did not use any data, so the problem seems to have no relation to the data.

@rosinality
Owner

Again, in my case it works without a problem. Maybe there are some problems in the custom kernels... Anyway, could you print the output tensors during the forward calculation? For example, add print(out) in the forward function of the generator.
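
An alternative to editing the forward function is to register forward hooks that report the output range of every submodule, so the first layer producing inf/nan can be located. A sketch; attach_output_printers is a hypothetical helper:

import torch

def attach_output_printers(module):
    def hook(mod, inputs, output):
        out = output[0] if isinstance(output, (tuple, list)) else output
        if torch.is_tensor(out):
            print(type(mod).__name__, out.min().item(), out.max().item())

    # register the hook on every submodule and return the handles
    return [m.register_forward_hook(hook) for m in module.modules()]

# usage (sketch):
#     handles = attach_output_printers(generator)
#     generator([test_in])
#     for h in handles:
#         h.remove()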

@qingzi02010
Author

Sorry, maybe it was the CUDA and cuDNN setup in my environment that caused the abnormal behavior. After I reset the CUDA and cuDNN configuration, the training process is normal. Thank you so much for your support!

@CrossLee1

CrossLee1 commented Apr 1, 2021

@qingzi02010 What are your CUDA and cuDNN versions? I also encounter the loss-becomes-nan problem. Thanks.

@qingzi02010
Author

@CrossLee1 My CUDA version is 10.0.130 and my cuDNN version is 7.5.0.
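
For anyone comparing environments, the versions PyTorch itself reports can be printed directly:

import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version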
