error when running train.py #6
Line 135 in train.py:

```python
fake, latent = generator.module([test_in], return_latents=True)
```

When I run train.py, it fails with:

```
AttributeError: 'Generator' object has no attribute 'module'
```

My PyTorch version is 1.3.1. Do you have any idea about that? Thank you so much!
Comments
It seems you ran the code on a single GPU; you should remove `.module`.
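A minimal sketch of that fix, assuming the training script only wraps the models in `DistributedDataParallel` when running distributed (the `args.distributed` / `args.local_rank` flags and the `g_module` name are illustrative):

```python
import torch

# Only wrap in DistributedDataParallel when actually running distributed;
# otherwise the raw Generator has no `.module` attribute.
if args.distributed:
    generator = torch.nn.parallel.DistributedDataParallel(
        generator, device_ids=[args.local_rank]
    )
    g_module = generator.module  # unwrapped model for sampling / checkpoints
else:
    g_module = generator  # single GPU: use the model directly

fake, latent = g_module([test_in], return_latents=True)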
Yes, you are right, I only debugged the code on a single GPU. So should I run on distributed GPUs to get correct results?
If you want to test, you can run on a single GPU with 71463b9.
Hi, whether running on a single GPU or on distributed GPUs, the printed values (`d: {d_loss_val:.4f}; g: {g_loss_val:.4f}; r1: {r1_val:.4f}; path: {path_loss_val:.4f}; mean path: {mean_path_length_avg:.4f}`) are all nan. Do you have any idea about that? Thank you very much for your help!
I suspect the loss has exploded after some iterations. Could I know your batch size? Maybe you can use a lower learning rate and retry.
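When losses go to nan, it can help to catch the first non-finite value; here is a small sketch using PyTorch's built-in anomaly detection plus a finiteness guard (the `d_loss` name stands in for whichever loss the training loop computes):

```python
import torch

# Raise an error at the exact backward op that produced nan/inf
# (slow; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

# Guard the training step: stop at the first batch that goes non-finite.
if not torch.isfinite(d_loss):
    raise RuntimeError(f'non-finite d_loss: {d_loss.item()}')
```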
On a single GPU, the batch size is 4. The problem occurs the first time the generator forward function is called.
Could you try this?

```python
import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')
x = torch.randn(4, 512).to('cuda')
print(g([x]))
```

If the problem really occurs even at the first forward pass, then maybe there are some numerical errors in the generator implementation. But currently I don't know what it could be.
If I comment out this line (the path regularization backward), the values generated by the generator are normal.
Hmm, I don't know why, as it should not update the generator itself. Now I suspect that it is related to path length regularization, since commenting out path[0].backward() causes the gradients calculated for path length regularization to be ignored.
It will calculate gradients, but optimizer.step() is omitted, so it will not update the parameters of the generator. Since generator.zero_grad() will be called later, the grad buffers will be cleared.
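A quick sketch of that point: `backward()` only fills the `.grad` buffers; the parameters change only on `optimizer.step()`, and `zero_grad()` clears the buffers again (toy model for illustration):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
before = [p.detach().clone() for p in model.parameters()]

loss = model(torch.randn(2, 4)).sum()
loss.backward()  # gradients are accumulated into p.grad ...

unchanged = all(torch.equal(b, p) for b, p in zip(before, model.parameters()))
print(unchanged)  # True: no optimizer.step(), so parameters did not move

model.zero_grad()  # ... and the grad buffers are cleared again
```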
Yes, I have checked that the parameters of the generator are not updated. But I checked it like this:
I tested it, but in my case it works without a problem.
This is the main function in train.py, and it is quite simple. The same error occurred: `fake` is normal and `fake_test` is inf; if `path.backward()` is commented out, `fake_test` is normal.
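For reference, a hypothetical reconstruction of that scenario (not the user's actual code; the penalty here follows the usual StyleGAN2 path length formulation, simplified to a plain mean):

```python
import torch
from model import Generator

g = Generator(256, 512, 8).to('cuda')

z = torch.randn(4, 512, device='cuda')
fake, latents = g([z], return_latents=True)
print(torch.isfinite(fake).all())  # 'fake' is reported as normal

# Path-length-style penalty: gradient of a noise-weighted output sum
# with respect to the broadcast latents.
noise = torch.randn_like(fake) / (fake.shape[2] * fake.shape[3]) ** 0.5
grad, = torch.autograd.grad((fake * noise).sum(), latents, create_graph=True)
path_lengths = grad.pow(2).sum(2).mean(1).sqrt()
path_lengths.mean().backward()  # commenting this out reportedly fixes the next call

fake_test, _ = g([torch.randn(4, 512, device='cuda')])
print(torch.isfinite(fake_test).all())  # inf in the user's environment
```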
Again, in my case it works without a problem. Maybe there are some problems in the custom kernels... Anyway, could you print the output tensors during the forward calculations? For example, add print(out) in the forward function of the generator.
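One way to do that without editing every forward method is a forward hook that flags the first submodule producing a non-finite output (a debugging sketch, assuming `g` is the generator instance from the earlier snippet):

```python
import torch

def check_finite(name):
    def hook(module, inputs, output):
        # Submodules may return a tensor or a tuple of tensors.
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for o in outs:
            if torch.is_tensor(o) and not torch.isfinite(o).all():
                print(f'non-finite output in {name}: {module}')
    return hook

for name, module in g.named_modules():
    module.register_forward_hook(check_finite(name))
```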
Sorry, maybe it was the CUDA and cuDNN setup in my environment that caused the abnormal behavior. After I reset the CUDA and cuDNN configuration, the training process is normal. Thank you so much for your support!
@qingzi02010 What are your CUDA and cuDNN versions? I also encounter the loss nan problem. Thanks
@CrossLee1 My CUDA version is 10.0.130, and my cuDNN version is 7.5.0.
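For anyone comparing environments, the versions PyTorch was built against can be printed directly (a small sketch):

```python
import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version, e.g. 7500 for 7.5.0
```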