assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors. #80809
Comments
I get this problem as well.
import tianshou, gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.4.8 0.21.0 1.12.0+cu113 1.20.1 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] win32
I applied the following setting to all of my optimizers, and they now train normally: optim.param_groups[0]['capturable'] = True
What is the underlying problem, and does this setting have any effect on training?
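The setting quoted above touches only the first param group. A minimal sketch that applies the same flag to every group is shown here; `set_capturable` is a hypothetical helper name, and the stand-in optimizer object is only there so the snippet runs without a GPU or a PyTorch install. With a real `torch.optim.Adam` instance you would pass the optimizer itself.

```python
from types import SimpleNamespace


def set_capturable(optimizer, value=True):
    """Set the 'capturable' flag on every param group of an optimizer.

    Hypothetical helper: it only relies on `optimizer.param_groups`
    being a list of dicts, as it is for torch.optim optimizers.
    Returns the number of param groups that were updated.
    """
    for group in optimizer.param_groups:
        group["capturable"] = value
    return len(optimizer.param_groups)


# Stand-in for a real optimizer, used only to demonstrate the call.
fake_optim = SimpleNamespace(param_groups=[{"lr": 1e-3}, {"lr": 1e-4}])
updated = set_capturable(fake_optim)
print(updated, fake_optim.param_groups)
```

Whether forcing this flag has side effects on training is exactly what the thread goes on to discuss, so treat it as a stopgap rather than a fix.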
Hi, I am also facing the same issue when I try to load a checkpoint and resume model training on the latest PyTorch (1.12). It seems to be related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. Currently two workarounds:
I'm wondering whether enforcing |
I'm also wondering whether forcing capturable=True would have unwanted side effects. I will also return to torch 1.11. Thank you for your answer.
I'm also having this same error with pytorch=1.12 and needed to downgrade to pytorch=1.11.
Thanks guys, I successfully resolved this!
I also had this issue. My workaround was to comment out lines 202-204 in pytorch_lightning.trainer.connectors.checkpoint_connector.py.
To find the file, you can do the following (inside a Jupyter notebook):
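One way to locate a module's source file, as a sketch rather than the commenter's original snippet: `module_source_path` is a hypothetical helper built on the standard library's `importlib.util.find_spec`. The demo uses a standard-library module so it runs anywhere; the dotted name from the comment above is shown in a comment and may differ between pytorch_lightning versions.

```python
import importlib.util


def module_source_path(module_name):
    """Return the file path that a dotted module name resolves to.

    Hypothetical helper. Raises ImportError if the module cannot be
    found or has no file on disk (for example, built-in modules).
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ImportError(f"cannot locate source for {module_name!r}")
    return spec.origin


# Demo with a standard-library module.
print(module_source_path("json.decoder"))

# With pytorch_lightning installed, the module named in the comment
# above would be looked up as:
# module_source_path("pytorch_lightning.trainer.connectors.checkpoint_connector")
```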
Another option is to manually load the checkpoint without the optimizers. For example, to load just the saved model weights, you could do:
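A rough sketch of that idea, not the commenter's original code: pull only the model weights out of an already-loaded checkpoint dict and ignore the optimizer entries. `model_weights_only` is a hypothetical helper, and the key names `state_dict` and `optimizer_states` are assumptions about the checkpoint layout; inspect `checkpoint.keys()` on your own file first.

```python
def model_weights_only(checkpoint, weights_key="state_dict"):
    """Return only the model weights from a loaded checkpoint dict.

    Hypothetical helper. `weights_key` is an assumed key name;
    optimizer state stored under other keys is simply left untouched
    and never handed to an optimizer.
    """
    if weights_key not in checkpoint:
        raise KeyError(
            f"{weights_key!r} not found; available keys: {sorted(checkpoint)}"
        )
    return checkpoint[weights_key]


# Plain-dict stand-in for a checkpoint, so the sketch runs without torch.
fake_checkpoint = {
    "state_dict": {"layer.weight": [0.1, 0.2]},
    "optimizer_states": [{"state": {}, "param_groups": []}],
}
weights = model_weights_only(fake_checkpoint)
print(sorted(weights))

# With PyTorch, usage would look roughly like this (path is hypothetical):
# checkpoint = torch.load("path/to/checkpoint.ckpt", map_location="cpu")
# model.load_state_dict(model_weights_only(checkpoint))
```

Loading weights this way discards the optimizer's momentum and step counters, so training restarts with a fresh optimizer.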
Personally, I feel like this issue should remain open. I think this is an inconsistency between stable pytorch versions and I would appreciate being able to run my code base on future pytorch versions.
Hi, We're sorry to have introduced this regression. We will fix that in the upcoming minor release for 1.12.1
Finish fixing #80809 Pull Request resolved: #80881 Approved by: https://github.com/jbschlosser
I know that this is closed, but I've encountered this issue multiple times on a few 'colab-a-like's and this post is the first that comes up. For anyone in the future, I want to mention that instead of setting capturable = True, you can instead call .cpu() on the tensors with key "step" in the state dictionary. In my case, I found this cobbled together bit of code to be sufficient:
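A minimal sketch of that approach, not the poster's original snippet: `move_steps_to_cpu` is a hypothetical helper, and it assumes it is applied to the optimizer's live `state` mapping after `optimizer.load_state_dict(...)` has run. A small stand-in tensor class keeps the demo runnable without PyTorch.

```python
def move_steps_to_cpu(optimizer_state):
    """Move every per-parameter 'step' entry to the CPU, in place.

    Hypothetical helper. `optimizer_state` is expected to look like
    `optimizer.state`: a mapping from parameter to a dict of state
    entries. Returns how many 'step' entries were moved.
    """
    moved = 0
    for param_state in optimizer_state.values():
        step = param_state.get("step")
        # Only tensor-like values expose .cpu(); plain numbers are skipped.
        if hasattr(step, "cpu"):
            param_state["step"] = step.cpu()
            moved += 1
    return moved


class FakeTensor:
    """Stand-in for torch.Tensor exposing only what the helper needs."""

    def __init__(self, device):
        self.device = device

    def cpu(self):
        return FakeTensor("cpu")


state = {"p0": {"step": FakeTensor("cuda:0")}, "p1": {"step": 3}}
print(move_steps_to_cpu(state), state["p0"]["step"].device)

# With PyTorch, usage would look roughly like this (key name is hypothetical):
# optimizer.load_state_dict(checkpoint["optimizer"])
# move_steps_to_cpu(optimizer.state)
```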
@albanD Could you explain what is |
Hi, this is meant to be used in conjunction with CUDA graphs. In particular, all ops must happen on the GPU for a CUDA graph to be able to "capture" all of them.
I was training an ESRGAN, and my solution after a kernel timeout was to reload the model states and downgrade PyTorch to 1.11 with CUDA 11.3. If you are using the CUDA 11.6 binaries with PyTorch 1.12.0, run the following at the command prompt:
File "/root/autodl-tmp/DietNeRF-master/dietnerf/run_nerf.py", line 5, in |
It works with the torch 1.12.1 release. Thank you.
Hi all, without adding It is really puzzling. Any idea what's happening? I'm on torch 1.13.0+cu117, and I also tried torch 2.0.0+cu117; both give the same problem. The optimizer was trained on a machine with torch 1.10.0; is this the root cause? It's really difficult for me to install torch 1.10.0 on my current machine.
Resuming from a checkpoint in torch==1.12.0 is broken; this was fixed in torch==1.12.1. This workaround allows loading checkpoints with version 1.12.0 as well. In pytorch/pytorch#80809 a 10% slowdown was reported, which I did not observe.
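To apply a workaround only where it is needed, a script could check the installed version first. The sketch below is an illustration, not part of any fix in the thread: `is_affected_torch_version` is a hypothetical helper, and it treats only 1.12.0 builds as affected, following the statement above.

```python
import re


def is_affected_torch_version(version):
    """Return True only for 1.12.0 builds such as '1.12.0+cu113'.

    Hypothetical helper. It compares the leading major.minor.patch
    numbers and ignores any local suffix like '+cu113'.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) == (1, 12, 0)


for candidate in ["1.11.0+cu113", "1.12.0+cu113", "1.12.1"]:
    print(candidate, is_affected_torch_version(candidate))

# With PyTorch installed, the string to test would be torch.__version__.
```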
🐛 Describe the bug
Hi, congratulations on your amazing work.
When I try to continue training my model by loading checkpoint.py, with all of my GPUs working perfectly fine, I get this:
Versions