About 3090 #11

Closed
xuchunyu123 opened this issue Jun 16, 2022 · 4 comments

Comments

@xuchunyu123

Hello, looking through the other issues I saw that you mentioned your environment uses a 3090, but as far as I can tell the 3090 only supports CUDA 11 and above. How did you solve this?

@wuhaixu2016
Collaborator

You can change the CUDA toolkit version.

@elisejiuqizhang

elisejiuqizhang commented Aug 4, 2022

Hey,

Bravo to the authors @wuhaixu2016 @Jiehui-Xu, great work. I'd just like to add one minor comment that might be relevant here (not exactly an "issue", though): with the following modification, I believe this implementation can run on torch 1.5 or higher.

I saw that README.md specifies the implementation uses PyTorch 1.4, which by default comes with CUDA 10.1.

My GPU is also an NVIDIA GeForce RTX 3090 (queried on the command line with nvidia-smi --query-gpu=name --format=csv,noheader) and my CUDA version is 11.5, so I decided to go with torch 1.7 with CUDA 11.
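
(As a quick sanity check of such a setup, and purely as a generic sketch rather than anything from this repo: the snippet below prints which CUDA build the installed torch wheel was compiled against and confirms the 3090, compute capability 8.6, is visible.)

import torch

print(torch.__version__)         # e.g. 1.7.1
print(torch.version.cuda)        # CUDA version the wheel was built against, e.g. 11.0
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. GeForce RTX 3090
    print(torch.cuda.get_device_capability(0))  # (8, 6) for a 3090; needs a CUDA 11.x build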

In my environment, the error initially raised seemed to come from solver.py, specifically from the minimax training step, so it looked like something was going wrong in autograd. The error message was roughly the following (with torch.autograd.set_detect_anomaly(True) enabled to get a more detailed message):

File "main.py", line 54, in <module>
main(config)
File "main.py", line 23, in main
solver.train()
File "/usr/local/data/elisejzh/Projects/Anomaly-Transformer/solver.py", line 191, in train
loss2.backward()
File "/usr/local/data/elisejzh/anaconda3/envs/AnoTrans/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/data/elisejzh/anaconda3/envs/AnoTrans/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 55]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
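
For anyone who wants a self-contained reproduction outside this repo, a toy sketch of the same failure pattern (made-up model and losses, not the actual solver.py code) looks roughly like this on torch >= 1.5:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked layers: the second layer's weight is needed to backprop
# gradients into the first layer, so modifying it in place between the
# two backward passes breaks the retained graph.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(8, 4))
loss1 = out.mean()
loss2 = out.pow(2).mean()

loss1.backward(retain_graph=True)
opt.step()        # updates the weights in place ...
loss2.backward()  # ... which the retained graph still needs -> RuntimeError on torch >= 1.5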

I did a bit of searching in the pytorch forum and found the following explanation:
Link to the first discussion: https://discuss.pytorch.org/t/training-with-two-optimizers-fails-in-torch-1-5-0-works-in-1-1-0/84194
Link to a detailed explanation within the above thread: pytorch/pytorch#39141 (comment)

So basically, in answer to @xuchunyu123's question (which I believe was also raised in a previous issue by @Alex-seven regarding "gradient computation"): if you want to use a higher version of PyTorch (1.5 or above) with CUDA support, the fix seems to be as simple as reordering the minimax lines in train() of solver.py so that the two XXX.step() calls come after the XXX.backward() calls, like this:

loss1.backward(retain_graph=True)
loss2.backward()
self.optimizer.step()
self.optimizer.step()

Then it works even if your environment uses a higher torch version.

The rationale, as explained in the forum, seems to be that all torch versions up to and including 1.4 were not exactly computing the correct gradient in this situation, which was fixed in later versions (1.5 and higher).

In earlier versions, putting step() before backward() would still run, but step() may modify parameters that backward() needs intact for the gradient computation, so the resulting gradients could in fact be wrong.

They fixed this in later versions, so if you keep the code organized the way that worked for 1.4 and earlier (i.e., step() before the gradient computation in backward()), autograd now flags the variable that was modified in place. So just make sure step() is called after the gradients have been computed by backward().
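
If it helps to see what autograd is actually checking: the error's "is at version 2; expected version 1" refers to a per-tensor version counter that every in-place op bumps (exposed as the underscore-prefixed, internal attribute _version). A tiny illustration, not tied to this repo:

import torch

w = torch.ones(3, requires_grad=True)
print(w._version)   # 0
with torch.no_grad():
    w.add_(1.0)     # an in-place update, which is what optimizer.step() does under the hood
print(w._version)   # 1 -> backward() compares this against the version recorded
                    #      when the tensor was saved for the graph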

I was also wondering whether the authors would be willing to look into this and test it with a higher torch version compatible with your GPU's CUDA version, since manually force-installing an earlier torch such as 1.4 against a much newer cudatoolkit (e.g., 11.X) could raise further compatibility issues.

Cheers,
Elise

wuhaixu2016 reopened this Aug 5, 2022
@wuhaixu2016
Collaborator

@elisejiuqizhang
Thanks for your wonderful comment. I will test this asap.

@wuhaixu2016
Collaborator

wuhaixu2016 commented Aug 6, 2022

Hi, I have updated the code in solver.py. Now, it works well in torch>=1.4.0 environments.
Concretely, the updated code is as follows:

loss1.backward(retain_graph=True)
loss2.backward()
self.optimizer.step()

Just calling step() once is fine. Thanks a lot for your help @elisejiuqizhang.
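
(For readers wondering why a single step() is enough: successive backward() calls accumulate into each parameter's .grad, so by the time step() runs, .grad already holds the sum of both losses' gradients. A toy illustration, unrelated to the repo's actual tensors:)

import torch

w = torch.ones(2, requires_grad=True)
x = torch.tensor([1.0, 2.0])

out = w * x               # shared intermediate, standing in for the model output
loss1 = out.sum()         # d loss1 / d w = x             = [1., 2.]
loss2 = out.pow(2).sum()  # d loss2 / d w = 2 * w * x**2  = [2., 8.]

loss1.backward(retain_graph=True)
loss2.backward()
print(w.grad)             # tensor([ 3., 10.]) -- the two gradients summed,
                          # so one optimizer.step() applies the combined update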
