
error when running train.py #6

Closed
qingzi02010 opened this issue Dec 27, 2019 · 19 comments

@qingzi02010

Line 135 in train.py:
fake, latent = generator.module([test_in], return_latents=True)

When I run train.py, it fails at generator.module with:
AttributeError: 'Generator' object has no attribute 'module'
My PyTorch version is 1.3.1. Do you have any idea what causes this? Thank you so much!

@onion-liu
Contributor

onion-liu commented Dec 27, 2019

It seems you are running the code on a single GPU. You should remove .module, which only exists when the generator is wrapped in DistributedDataParallel.
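
A common pattern for keeping one code path that works both with and without DistributedDataParallel is to resolve the wrapped module once. This is only a sketch; unwrap is a hypothetical helper, not part of this repo:

from torch.nn.parallel import DistributedDataParallel

def unwrap(model):
    # DistributedDataParallel keeps the original network in .module;
    # a plain nn.Module has no such attribute.
    return model.module if isinstance(model, DistributedDataParallel) else model

# usage in train.py (sketch):
#     fake, latent = unwrap(generator)([test_in], return_latents=True)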

@qingzi02010
Author

Yes, you are right, I am only debugging the code on a single GPU. So should I run on distributed GPUs to get a correct result?

@rosinality
Owner

If you want to test, you can run on a single GPU with 71463b9.


@qingzi02010
Author

1) Hi, whether running on a single GPU or on distributed GPUs, the printed loss values (d, g, r1, path, and mean path) are all nan.
2) Debugging on a single GPU with requires_grad(generator, False), fake_img, _ = generator(noise1) produces a fake_img whose first values fake_img[0:4] are 0, 0, -inf, -inf.
3) With requires_grad(generator, True), fake_img, _ = generator(noise1) produces a fake_img whose fake_img[0:4] is -inf, -inf, -inf, -inf.

Do you have any idea about that? Thank you very much for your help!

@rosinality
Owner

I suspect the loss has exploded after some iterations. Could you tell me your batch size? Maybe you can use a lower learning rate and retry.
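
For reference, "a lower learning rate" amounts to something like the following when the generator optimizer is constructed. This is a sketch only; the actual optimizer arguments and defaults in train.py may differ:

import torch

# `generator` stands in for the model built in train.py; pick a learning
# rate smaller than whatever default is currently used.
g_optim = torch.optim.Adam(generator.parameters(), lr=0.001, betas=(0.0, 0.99))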

@qingzi02010
Author

On a single GPU the batch size is 4. The problem occurs the first time the generator's forward function is called.

@rosinality
Owner

rosinality commented Dec 30, 2019

Could you try this?

import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')
x = torch.randn(4, 512).to('cuda')

print(g([x]))

If the problem really occurs even at the first forward pass, then maybe there are some numerical errors in the generator implementation, but currently I don't know what it could be.
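
To make the nan/inf check explicit, the same snippet can be extended with torch.isfinite. A sketch, using the same Generator signature as above and the (image, latent) return pair seen elsewhere in this thread:

import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')
x = torch.randn(4, 512).to('cuda')

# The generator returns (image, latent); only the image is checked here.
out, _ = g([x])
print(out.min().item(), out.max().item(), torch.isfinite(out).all().item())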

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

path[0].backward()

If I comment out this line, the values generated by the generator are normal.

@rosinality
Owner

rosinality commented Dec 30, 2019

Hmm, I don't know why, as it should not update the generator itself.

Now I suspect that it is related to path length regularization, as commenting out path[0].backward() causes the gradients calculated for path length regularization to be ignored.

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

path = g_path_regularize(fake, latent, 0)
path[0].backward()

The function g_path_regularize includes the operation

grad, = autograd.grad(outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True)

fake_img is generated by the generator, so backward will update the generator, right?
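
For context, a path length penalty of this shape can be sketched as follows, following the autograd.grad call quoted above. This is an illustration of the technique, not necessarily the repo's exact g_path_regularize:

import math
import torch
from torch import autograd

def path_length_penalty(fake_img, latents, mean_path_length, decay=0.01):
    # Random projection of the image, scaled by image resolution.
    noise = torch.randn_like(fake_img) / math.sqrt(fake_img.shape[2] * fake_img.shape[3])
    # Gradient of the projected image w.r.t. the latents; create_graph=True keeps
    # this differentiable, so backward() on the penalty produces second-order
    # gradients that flow through the generator.
    grad, = autograd.grad(outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True)
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    penalty = (path_lengths - path_mean).pow(2).mean()
    return penalty, path_mean.detach(), path_lengths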

@rosinality
Owner

It will calculate gradients, but optimizer.step() is omitted, so it will not update the parameters of the generator. generator.zero_grad() will be called later, so the grad buffers will be cleaned.
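
A small self-contained check of that point, using a toy nn.Linear rather than the actual generator:

import torch
from torch import nn

layer = nn.Linear(4, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

before = layer.weight.detach().clone()
layer(torch.randn(2, 4)).sum().backward()

# backward() only fills the .grad buffers; without opt.step() the weights stay the same.
print(torch.equal(before, layer.weight.detach()))        # True
print(layer.weight.grad.abs().sum().item() > 0)          # True: gradients were accumulated

# zero_grad() then clears the buffers (or sets them to None, depending on the PyTorch version).
opt.zero_grad()
print(layer.weight.grad is None or layer.weight.grad.abs().sum().item() == 0)  # True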

@qingzi02010
Author

qingzi02010 commented Dec 30, 2019

if args.distributed:
    fake, latent = generator.module([test_in], return_latents=True)  # for distributed GPUs
else:
    fake, latent = generator([test_in], return_latents=True)

# dump one value per parameter before the backward pass
# for name, param in generator.named_parameters():
#     if param.ndim == 1:
#         aa = str(param[0].cpu().detach().numpy())
#     elif param.ndim == 2:
#         aa = str(param[0][0].cpu().detach().numpy())
#     else:
#         aa = 'param.ndim > 2'
#     print(name, aa)
#     with open('generator_name_param_0.txt', 'a') as f:
#         f.write(name + '####' + aa + '\n')

path = g_path_regularize(fake, latent, 0)
path[0].backward()

# dump one value per parameter after the backward pass
# for name, param in generator.named_parameters():
#     if param.ndim == 1:
#         aa = str(param[0].cpu().detach().numpy())
#     elif param.ndim == 2:
#         aa = str(param[0][0].cpu().detach().numpy())
#     else:
#         aa = 'param.ndim > 2'
#     with open('generator_name_param_1.txt', 'a') as f:
#         f.write(name + '####' + aa + '\n')

fake_test, __ = generator([test_in], return_latents=True)

Yes, I have checked that the parameters of the generator are not updated. But when I check like this, fake is a normal value while fake_test is inf. It is so strange. Can you reproduce that on a single GPU?
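
One way to narrow this down is to scan the generator's parameters and gradient buffers for non-finite values right after the backward call. A sketch; report_nonfinite is a hypothetical helper and generator refers to the model in the snippet above:

import torch

def report_nonfinite(module, tag=''):
    # Print every parameter or accumulated gradient that contains nan/inf.
    for name, param in module.named_parameters():
        if not torch.isfinite(param).all():
            print(tag, 'param', name, 'contains nan/inf')
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(tag, 'grad', name, 'contains nan/inf')

# usage (sketch):
#     report_nonfinite(generator, tag='after path[0].backward()')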

@rosinality
Owner

I tested it, but in my case it works without problems.

@qingzi02010
Author

qingzi02010 commented Dec 31, 2019

if __name__ == '__main__':
    device = 'cuda'
    parser = argparse.ArgumentParser()
    parser.add_argument('--size', type=int, default=256)
    parser.add_argument('--channel_multiplier', type=int, default=2)
    args = parser.parse_args()
    args.latent = 512
    args.n_mlp = 8
    generator = Generator(args.size, args.latent, args.n_mlp, channel_multiplier=args.channel_multiplier).to(device)

    test_in = torch.randn(1, args.latent, device=device)
    fake, latent = generator([test_in], return_latents=True)

    path = g_path_regularize(fake, latent, 0)
    path[0].backward()

    fake_test, __ = generator([test_in], return_latents=True)

This is the main function in train.py, and it is very simple. The same error occurs: fake is normal and fake_test is inf, while if path[0].backward() is commented out, fake_test is normal.
I am so confused about this problem; it is very strange, and I do not know how to solve it. I did not use any data, so the problem seems to have no relation to the data.

@rosinality
Owner

Again, in my case it works without a problem. Maybe there are some problems in the custom kernels... Anyway, could you print the output tensors during the forward calculation? For example, add print(out) in the forward function of the generator.
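
An alternative to editing the forward function is to register forward hooks that report the output range of every submodule, so the first layer producing inf/nan can be located. A sketch; attach_output_printers is a hypothetical helper:

import torch

def attach_output_printers(module):
    def hook(mod, inputs, output):
        out = output[0] if isinstance(output, (tuple, list)) else output
        if torch.is_tensor(out):
            print(type(mod).__name__, out.min().item(), out.max().item())

    # register the hook on every submodule and return the handles
    return [m.register_forward_hook(hook) for m in module.modules()]

# usage (sketch):
#     handles = attach_output_printers(generator)
#     generator([test_in])
#     for h in handles:
#         h.remove()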

@qingzi02010
Author

Sorry, maybe it was the CUDA and cuDNN setup in my environment that caused the abnormal behavior. After I reset the CUDA and cuDNN configuration, the training process is normal. Thank you so much for your support!

@CrossLee1

CrossLee1 commented Apr 1, 2021

@qingzi02010 What are your CUDA and cuDNN versions? I also encounter the loss-becomes-nan problem. Thanks.

@qingzi02010
Author

@CrossLee1 My CUDA version is 10.0.130 and my cuDNN version is 7.5.0.
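
For anyone comparing environments, the versions PyTorch itself reports can be printed directly:

import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version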
