
adam optim ERROR: If capturable=False, state_steps should not be CUDA tensors. #681

Closed
jaried opened this issue Jul 3, 2022 · 3 comments
Labels
question Further information is requested

Comments

jaried commented Jul 3, 2022

  • I have marked all applicable categories:
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
import tianshou, gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.4.8 0.21.0 1.12.0+cu113 1.20.1 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] win32

SAC training was working normally before. Recently, after loading the saved policy weights and optimizer state, the very first learning update fails with the error below; a standalone sketch that reproduces the assert follows the traceback. How can I solve this?

Epoch #1:   0%|                                                                                                                                                                                 | 23/57600 [00:02<1:37:11,  9.87it/s]
('If capturable=False, state_steps should not be CUDA tensors.',)
===========
Traceback (most recent call last):
  File "D:\Tony\Documents\yunpan\invest\2022\Quant\code\factor\myfactor.py", line 684, in wrapper
    func(*args, **kw)
  File "tianshou_if.py", line 378, in sac_with_il_if
    result = offpolicy_trainer(
  File "D:\Anaconda3\lib\site-packages\tianshou\trainer\offpolicy.py", line 129, in offpolicy_trainer
    return OffpolicyTrainer(*args, **kwargs).run()
  File "D:\Anaconda3\lib\site-packages\tianshou\trainer\base.py", line 425, in run
    deque(self, maxlen=0)  # feed the entire iterator into a zero-length deque
  File "D:\Anaconda3\lib\site-packages\tianshou\trainer\base.py", line 282, in __next__
    self.policy_update_fn(data, result)
  File "D:\Anaconda3\lib\site-packages\tianshou\trainer\offpolicy.py", line 118, in policy_update_fn
    losses = self.policy.update(self.batch_size, self.train_collector.buffer)
  File "D:\Anaconda3\lib\site-packages\tianshou\policy\base.py", line 277, in update
    result = self.learn(batch, **kwargs)
  File "D:\Anaconda3\lib\site-packages\tianshou\policy\modelfree\sac.py", line 149, in learn
    td1, critic1_loss = self._mse_optimizer(
  File "D:\Anaconda3\lib\site-packages\tianshou\policy\modelfree\ddpg.py", line 158, in _mse_optimizer
    optimizer.step()
  File "D:\Anaconda3\lib\site-packages\torch\optim\optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "D:\Anaconda3\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\Anaconda3\lib\site-packages\torch\optim\adam.py", line 157, in step
    adam(params_with_grad,
  File "D:\Anaconda3\lib\site-packages\torch\optim\adam.py", line 213, in adam
    func(params,
  File "D:\Anaconda3\lib\site-packages\torch\optim\adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
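
For reference, here is a minimal standalone sketch (not my actual training script, just an assumed reproduction on torch 1.12.0) of the load pattern that seems to trigger the assert: load_state_dict() casts the Adam "step" counters to the parameters' device (CUDA), and the capturable=False code path then rejects them:

    import torch
    from torch import nn

    # hypothetical toy model and optimizer, standing in for the SAC networks
    net = nn.Linear(4, 2).cuda()
    optim = torch.optim.Adam(net.parameters(), lr=1e-3)

    # one update so the optimizer has state, then save a checkpoint
    net(torch.randn(8, 4, device="cuda")).sum().backward()
    optim.step()
    torch.save({"model": net.state_dict(), "optim": optim.state_dict()}, "ckpt.pth")

    # resume: load_state_dict() moves the float "step" tensors to the
    # parameters' device, so they end up on CUDA
    ckpt = torch.load("ckpt.pth", map_location="cuda")
    net.load_state_dict(ckpt["model"])
    optim.load_state_dict(ckpt["optim"])

    net(torch.randn(8, 4, device="cuda")).sum().backward()
    optim.step()  # AssertionError: If capturable=False, state_steps should not be CUDA tensors.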

jaried commented Jul 3, 2022

Before executing offpolicy_trainer, I added the following code:

        actor_optim.param_groups[0]['capturable'] = True
        alpha_optim.param_groups[0]['capturable'] = True
        critic1_optim.param_groups[0]['capturable'] = True
        critic2_optim.param_groups[0]['capturable'] = True

With that change it runs. What is the reason?
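
For completeness, a sketch of the same workaround applied to every param group (my snippet above only touches group 0), assuming the four optimizers have already been restored with load_state_dict():

    # set the flag on all param groups of every optimizer, not just group 0
    for optim in (actor_optim, critic1_optim, critic2_optim, alpha_optim):
        for group in optim.param_groups:
            group["capturable"] = True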


jaried commented Jul 3, 2022

This issue says it is caused by pressing Ctrl+C to end training:
babysor/MockingBird#631

But even with an earlier checkpoint file, saved before the problem appeared, the same error is reported. Why is that?


jaried commented Jul 4, 2022

pytorch/pytorch#80809

Someone said this:

Hi, I am also facing the same issue when I try to load the checkpoint and resume model training on the latest pytorch (1.12).

It seems to be related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. Currently there are two workarounds:

  1. forcing capturable = True after loading the checkpoint (as suggested above) optim.param_groups[0]['capturable'] = True . This seems to slow down the model training by approx. 10% (YMMV depending on the setup).
  2. Reverting pytorch back to previous versions (I have been using 1.11.0).

I'm wondering whether enforcing capturable = True may incur unwanted side effects.

I'm also wondering whether forcing capturable=True would have unwanted side effects. For now I will also go back to torch 1.11.
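
If forcing capturable=True turns out to be a problem, another workaround I have seen suggested (untested here, so only a sketch) is to move the Adam "step" counters back to the CPU after loading, which is what the capturable=False assert expects:

    # move the "step" state tensors back to the CPU after load_state_dict()
    for optim in (actor_optim, critic1_optim, critic2_optim, alpha_optim):
        for state in optim.state.values():
            if "step" in state and torch.is_tensor(state["step"]):
                state["step"] = state["step"].cpu()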

@jaried jaried closed this as completed Jul 4, 2022
@Trinkle23897 Trinkle23897 added the question Further information is requested label Jul 4, 2022