
assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors. #80809

Closed
yqi19 opened this issue Jul 3, 2022 · 15 comments


@yqi19

yqi19 commented Jul 3, 2022

🐛 Describe the bug

Hi, congratulations on your amazing work.
When I try to resume training by loading a checkpoint, even though my GPUs are all working fine, I get this:

2022-07-03 06:06:18 - LOGS    - Exception occurred that interrupted the training. If capturable=False, state_steps should not be CUDA tensors.
If capturable=False, state_steps should not be CUDA tensors.

Traceback (most recent call last):                                                                           
  File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 682, in run
    train_loss, train_ckpt_metric = self.train_epoch(epoch)
  File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 353, in train_epoch
    self.gradient_scalar.step(optimizer=self.optimizer)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285, in _may
be_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorat
e_context
    return func(*args, **kwargs)
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
    adamw(params_with_grad,
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
    func(params,
  File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tenso
r_adamw
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."

Versions

PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.12 (main, Jun  1 2022, 11:38:51)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.4.0-210-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA TITAN Xp
GPU 1: NVIDIA TITAN Xp
GPU 2: NVIDIA TITAN Xp
GPU 3: NVIDIA TITAN Xp

Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.0
[pip3] pytorchvideo==0.1.5
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] numpy                     1.23.0                   pypi_0    pypi
[conda] pytorchvideo              0.1.5                    pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi
[conda] torchvision               0.13.0                   pypi_0    pypi
@jaried

jaried commented Jul 3, 2022

I'm also getting this problem.

import tianshou, gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.4.8 0.21.0 1.12.0+cu113 1.20.1 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] win32

see:
thu-ml/tianshou#681

I set all my optimizers to the following, and they train normally. But what is the underlying problem? Does this setting have any effect on training?

optim.param_groups[0]['capturable'] = True
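
For reference, that line only sets the flag on the first param group. Here is a minimal sketch (the checkpoint path and dictionary key are placeholder assumptions) that applies it to every group after restoring the optimizer state:

import torch

model = torch.nn.Linear(10, 2).cuda()
optim = torch.optim.AdamW(model.parameters())

# Restore the optimizer state from a checkpoint (path and key are hypothetical).
checkpoint = torch.load('checkpoint.pt')
optim.load_state_dict(checkpoint['optimizer'])

# Set capturable on every param group, not just group 0, so the
# capturable=False assert cannot fire for any of them.
for group in optim.param_groups:
    group['capturable'] = True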

@L0SG
Contributor

L0SG commented Jul 4, 2022

Hi, I'm also facing the same issue when I try to load a checkpoint and resume model training on the latest PyTorch (1.12).

It seems to be related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. There are currently two workarounds:

  1. Forcing capturable = True after loading the checkpoint (as suggested above): optim.param_groups[0]['capturable'] = True. This seems to slow down model training by approx. 10% (YMMV depending on the setup).

  2. Reverting PyTorch back to a previous version (I have been using 1.11.0).

I'm wondering whether enforcing capturable = True may incur unwanted side effects.

@jaried

jaried commented Jul 4, 2022

> Hi, I'm also facing the same issue when I try to load a checkpoint and resume model training on the latest PyTorch (1.12).
>
> It seems to be related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. There are currently two workarounds:
>
>   1. Forcing capturable = True after loading the checkpoint (as suggested above): optim.param_groups[0]['capturable'] = True. This seems to slow down model training by approx. 10% (YMMV depending on the setup).
>   2. Reverting PyTorch back to a previous version (I have been using 1.11.0).
>
> I'm wondering whether enforcing capturable = True may incur unwanted side effects.

I'm also wondering whether forcing capturable=True would have unwanted side effects. I will also go back to torch 1.11. Thank you for your answer.

@amrosado

amrosado commented Jul 4, 2022

I'm also having this same error with pytorch=1.12 and needed to downgrade to pytorch=1.11.

@yqi19
Author

yqi19 commented Jul 4, 2022

> Hi, I'm also facing the same issue when I try to load a checkpoint and resume model training on the latest PyTorch (1.12).
>
> It seems to be related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. There are currently two workarounds:
>
>   1. Forcing capturable = True after loading the checkpoint (as suggested above): optim.param_groups[0]['capturable'] = True. This seems to slow down model training by approx. 10% (YMMV depending on the setup).
>   2. Reverting PyTorch back to a previous version (I have been using 1.11.0).
>
> I'm wondering whether enforcing capturable = True may incur unwanted side effects.

Thanks guys, I successfully resolved this!

@yqi19 yqi19 closed this as completed Jul 4, 2022
@yqi19 yqi19 reopened this Jul 4, 2022
@yqi19 yqi19 closed this as completed Jul 4, 2022
@1lint

1lint commented Jul 4, 2022

I also had this issue; my workaround was to comment out lines 202-204 in pytorch_lightning/trainer/connectors/checkpoint_connector.py:

# if self.trainer.state.fn == TrainerFn.FITTING:
#     # restore optimizers and schedulers state
#     self.restore_optimizers_and_schedulers()

To find the file, you can run the following (inside a Jupyter notebook):

import pytorch_lightning.trainer.connectors.checkpoint_connector as module_to_edit
!code {module_to_edit.__file__}

Another option is to manually load the checkpoint without the optimizer state. For example, to load just the saved model weights:

import torch

checkpoint = torch.load('/path/to/last.ckpt')
lightning_module.load_state_dict(checkpoint['state_dict'])

@amrosado

amrosado commented Jul 4, 2022

Personally, I feel this issue should remain open. I think it is an inconsistency between stable PyTorch versions, and I would appreciate being able to run my code base on future PyTorch versions.

@albanD
Collaborator

albanD commented Jul 5, 2022

Hi,

We're sorry to have introduced this regression. We will fix it in the upcoming 1.12.1 minor release.
If you want the fix earlier, you can follow the official instructions to get the nightly build of PyTorch!

pytorchmergebot pushed a commit that referenced this issue Jul 6, 2022
facebook-github-bot pushed a commit that referenced this issue Jul 8, 2022
Summary:
Finish fixing #80809

Pull Request resolved: #80881
Approved by: https://github.com/jbschlosser

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/9d20af50608b146fe1c3296210a05cd8e4c60af2

Reviewed By: mehtanirav

Differential Revision: D37687409

Pulled By: albanD

fbshipit-source-id: 4b899f76cbcb582cded8649e1166df90e73d78e9
@Xact-sniper

Xact-sniper commented Jul 16, 2022

I know this is closed, but I've encountered this issue multiple times on a few Colab-alikes, and this post is the first result that comes up. For anyone finding this in the future: instead of setting capturable = True, you can instead call .cpu() on the tensors with key "step" in the optimizer's state dictionary.

In my case, I found this cobbled-together bit of code to be sufficient:

import torch

def nested_dict_iter(dict_obj, indent=0):
    # Recursively pretty-print a (possibly nested) state dict, moving any
    # "step" counter tensors back to the CPU along the way.
    for key, value in dict_obj.items():
        if isinstance(value, dict):
            print(' ' * indent, key, ':', '{')
            nested_dict_iter(value, indent + 4)
            print(' ' * indent, '}')
        elif isinstance(value, list):
            nested_dict_iter(dict(zip(['list_' + str(i) for i in range(len(value))], value)), indent + 4)
        else:
            #############
            # relevant portion
            if isinstance(key, str) and 'step' in key and torch.is_tensor(value):
                try:
                    tst = value.cpu()
                    assert torch.all(tst == value)
                    dict_obj[key] = tst  # only replace once the CPU copy is verified
                except Exception:
                    pass
            print(' ' * indent, key, ':', value)

def iter_nested_dict(dict_obj):
    print('{')
    nested_dict_iter(dict_obj, 4)
    print('}')
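
A more targeted variant of the same idea (a sketch added for illustration, not verbatim from the thread): walk the optimizer's per-parameter state directly and move only the "step" counters back to the CPU.

import torch

def steps_to_cpu(optimizer):
    # optimizer.state maps each parameter to its per-parameter state dict;
    # for capturable=False, only the "step" counters must live on the CPU.
    for param_state in optimizer.state.values():
        step = param_state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            param_state['step'] = step.cpu()

Call steps_to_cpu(optim) once after optim.load_state_dict(...) and before the first optim.step().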

@franchesoni
franchesoni commented

> Hi,
>
> We're sorry to have introduced this regression. We will fix it in the upcoming 1.12.1 minor release. If you want the fix earlier, you can follow the official instructions to get the nightly build of PyTorch!

@albanD Could you explain what capturable is, what side effects it has (if any), and when to use it (or not)?

@albanD
Collaborator

albanD commented Jul 20, 2022

Hi,

This is meant to be used in conjunction with CUDA graphs. In particular, all ops must happen on the GPU for a CUDA graph to be able to "capture" all of them.
Passing the capturable flag ensures that this is the case, so that you can capture a whole forward/backward/optimizer step in a single CUDA graph.
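
To make that concrete, here is a minimal sketch of capturing a full training step in a single CUDA graph (the model, shapes, and warm-up count are placeholder assumptions; the pattern follows the torch.cuda.graph API):

import torch

model = torch.nn.Linear(16, 1).cuda()
opt = torch.optim.AdamW(model.parameters(), capturable=True)  # optimizer state stays on the GPU

static_input = torch.randn(8, 16, device='cuda')

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one whole forward/backward/optimizer step into a single graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    loss = model(static_input).sum()
    loss.backward()
    opt.step()

# To train: copy new data into static_input, then g.replay() re-runs the captured step.

Without capturable=True, the optimizer's step counters live on the CPU, so the step could not be captured as GPU-only work.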

atalman pushed a commit to atalman/pytorch that referenced this issue Jul 21, 2022
atalman added a commit that referenced this issue Jul 21, 2022
Finish fixing #80809
Pull Request resolved: #80881
Approved by: https://github.com/jbschlosser

Co-authored-by: albanD <desmaison.alban@gmail.com>
@cliffordkleinsr

cliffordkleinsr commented Aug 2, 2022

I was training an ESRGAN, and my solution after a kernel timeout was to reload the model state and downgrade PyTorch to 1.11 with cu113.
If you are using Colab, run:
!pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

If you are using the CUDA 11.6 binaries with PyTorch 1.12.0, then in a command prompt run:

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116

@dongsiwen
dongsiwen commented

File "/root/autodl-tmp/DietNeRF-master/dietnerf/run_nerf.py", line 5, in <module>
    import clip_utils
ModuleNotFoundError: No module named 'clip_utils'

@zhilyzhang
zhilyzhang commented

> Hi,
>
> We're sorry to have introduced this regression. We will fix it in the upcoming 1.12.1 minor release. If you want the fix earlier, you can follow the official instructions to get the nightly build of PyTorch!

It works with the 1.12.1 release. Thank you.

@linminhtoo
linminhtoo commented

Hi all, without adding optim.param_groups[0]['capturable'] = True, I get the "If capturable=True" assertion error, and when I add this line I get the "If capturable=False" assertion error.

It is really puzzling. Any idea what's happening? I'm on torch 1.13.0+cu117 and I tried torch 2.0.0+cu117; both give the same problem. The optimizer was trained on a machine with torch 1.10.0; could this be the root cause? It's really difficult for me to install torch 1.10.0 on my current machine.
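
One way to narrow this down (a debugging sketch, not a fix): print where each "step" tensor lives and how capturable is set per param group, so you can see which assert should fire.

import torch

def inspect_optimizer(optim):
    # Report the capturable flag for every param group...
    for i, group in enumerate(optim.param_groups):
        print(f"group {i}: capturable={group.get('capturable', False)}")
    # ...and the device of every per-parameter "step" counter.
    for state in optim.state.values():
        step = state.get('step')
        if torch.is_tensor(step):
            print(f"step tensor on device: {step.device}")

If the devices turn out to be mixed (some step tensors on CPU, some on CUDA), either assert can fire depending on the parameter, which could explain seeing both errors.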

alexanderwerning added a commit to fgnt/padertorch that referenced this issue Nov 7, 2023
Resuming from a checkpoint in torch==1.12.0 is broken; this was fixed in torch==1.12.1. This workaround allows loading checkpoints with version 1.12.0 as well. In pytorch/pytorch#80809 a 10% slowdown was reported, which I did not observe.
alexanderwerning added a commit to alexanderwerning/padertorch that referenced this issue Feb 12, 2024