The gradient of odeint_adjoint is zero with multiple GPUs #119
Thanks for reporting this! Went down a rabbit hole, but I found the source of the problem. It's due to nn.DataParallel's handling of parameters. I'd think torchdyn's handling of parameters would have the same problem in this regard, though. I'll add a fix for this soon.
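A minimal, hypothetical sketch of the failure mode described above (this is illustrative, not torchdiffeq's actual code): nn.DataParallel replicates a module onto each device, and the replica's parameters are re-created tensors rather than the original leaf parameters. If the adjoint pass then asks autograd for gradients with respect to the original parameters while the forward pass only touched disconnected copies, autograd finds no path and returns nothing:

```python
import torch

# Original leaf parameter, as registered on the outer module.
w = torch.randn(3, requires_grad=True)

# A replica-style copy, cut off from the autograd graph of `w`
# (standing in for how DataParallel re-creates parameters per device).
w_copy = w.detach().requires_grad_(True)

# The forward pass uses only the copy.
loss = (w_copy ** 2).sum()

# Asking for gradients w.r.t. the ORIGINAL parameter yields None:
# the graph never touched `w`.
grads = torch.autograd.grad(loss, [w], allow_unused=True)
print(grads[0])  # None
```

Without `allow_unused=True`, the same call would raise an error instead; with it, the missing gradient silently becomes `None` (or zero, if later filled in), which matches the symptom reported in this issue.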
Thank you! Your software is really brilliant!
Should be fixed with commit d58887f. Got ffjord running. You can install the latest version using
Let me know if it still doesn't work for you.
I ran the command and still cannot get ffjord running.

I copied the latest torchdiffeq folder directly into ffjord and uninstalled the system-wide torchdiffeq, but ffjord's train_cnf.py still gives zero gradients with multiple GPUs.
Can you install using
and try again? I haven't updated the version yet.
Oh, I was testing on PyTorch 1.6! Can you try updating? If not, I'll take another look tomorrow with 1.5.
PyTorch 1.6 works with your solution. You are amazing! Thank you!
I found that using exactly the same code, I got the following results:
My PyTorch version is 1.5.0, torchdiffeq version is 0.1.0, CUDA version is 10.0.130, and Python version is 3.7.7.
I noticed that in your implementation of the adjoint method, you run the odeint call under torch.no_grad, while torchdyn does not.
This is your code:
This is their code: (https://github.com/DiffEqML/torchdyn/blob/master/torchdyn/sensitivity/adjoint.py)
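A minimal sketch of the pattern being compared here (this is illustrative, not the actual torchdiffeq or torchdyn code): in an adjoint-style implementation, the forward solve runs inside torch.no_grad so that no computation graph is stored, and a custom autograd.Function supplies the gradient itself in backward. The toy function below doubles its input the same way:

```python
import torch

class Double(torch.autograd.Function):
    """Toy stand-in for an adjoint-style op: graph-free forward,
    hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # The "solve" runs without building an autograd graph,
        # which is what keeps adjoint methods memory-efficient.
        with torch.no_grad():
            y = x * 2.0
        return y

    @staticmethod
    def backward(ctx, grad_out):
        # Gradient supplied analytically (d(2x)/dx = 2) instead of
        # by replaying a stored graph.
        return grad_out * 2.0

x = torch.tensor([3.0], requires_grad=True)
y = Double.apply(x)
y.sum().backward()
print(x.grad)  # tensor([2.])
```

Running the forward pass under no_grad is correct as long as backward reconstructs the gradient itself; the catch, as discussed above, is that the gradients must be requested with respect to the right parameter tensors.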
Also, I found that your FFJORD code works with a single GPU but fails with multiple GPUs:
The running command is: