About 3090 #11

Closed
xuchunyu123 opened this issue Jun 16, 2022 · 4 comments

Comments

@xuchunyu123

Hello, looking through the other issues I saw that you mentioned your environment uses a 3090, but as far as I can tell the 3090 only supports CUDA 11 and above. How did you solve this?

@wuhaixu2016
Collaborator

You can change the CUDA toolkit version.

@elisejiuqizhang

elisejiuqizhang commented Aug 4, 2022

Hey,

Bravo to the authors @wuhaixu2016 @Jiehui-Xu, great work. I'd just like to add one minor comment that might be relevant here (not exactly an "issue", though): with the following modification, I believe this implementation can run on torch 1.5 or higher.

I saw that README.md specifies the implementation uses PyTorch 1.4, which by default comes with CUDA 10.1.

My GPU is also an NVIDIA GeForce RTX 3090 (queried on the command line with nvidia-smi --query-gpu=name --format=csv,noheader) and my CUDA version is 11.5, so I decided to go with torch 1.7 with CUDA 11.
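
(As a quick sanity check of such a setup, and purely as a generic sketch rather than anything from this repo: the snippet below prints which CUDA build the installed torch wheel was compiled against and confirms the 3090, compute capability 8.6, is visible.)

import torch

print(torch.__version__)         # e.g. 1.7.1
print(torch.version.cuda)        # CUDA version the wheel was built against, e.g. 11.0
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. GeForce RTX 3090
    print(torch.cuda.get_device_capability(0))  # (8, 6) for a 3090; needs a CUDA 11.x build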

In my environment, the error initially raised seemed to come from solver.py, specifically from the minimax training step, so it looked like something was going wrong in autograd. The error message was roughly the following (with torch.autograd.set_detect_anomaly(True) enabled to get a more detailed message):

File "main.py", line 54, in <module>
main(config)
File "main.py", line 23, in main
solver.train()
File "/usr/local/data/elisejzh/Projects/Anomaly-Transformer/solver.py", line 191, in train
loss2.backward()
File "/usr/local/data/elisejzh/anaconda3/envs/AnoTrans/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/data/elisejzh/anaconda3/envs/AnoTrans/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 55]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
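
For anyone who wants a self-contained reproduction outside this repo, a toy sketch of the same failure pattern (made-up model and losses, not the actual solver.py code) looks roughly like this on torch >= 1.5:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked layers: the second layer's weight is needed to backprop
# gradients into the first layer, so modifying it in place between the
# two backward passes breaks the retained graph.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(8, 4))
loss1 = out.mean()
loss2 = out.pow(2).mean()

loss1.backward(retain_graph=True)
opt.step()        # updates the weights in place ...
loss2.backward()  # ... which the retained graph still needs -> RuntimeError on torch >= 1.5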

I did a bit of searching in the pytorch forum and found the following explanation:
Link to the first discussion: https://discuss.pytorch.org/t/training-with-two-optimizers-fails-in-torch-1-5-0-works-in-1-1-0/84194
Link to a detailed explanation within the above thread: pytorch/pytorch#39141 (comment)

So basically, in answer to @xuchunyu123's question (which I believe was also raised in a previous issue by @Alex-seven regarding "gradient computation"): if you want to use a higher version of PyTorch (1.5 or above) with CUDA support, the fix seems to be as simple as reordering the minimax lines in train() of solver.py so that the two XXX.step() calls come after the XXX.backward() calls, like this:

loss1.backward(retain_graph=True)
loss2.backward()
self.optimizer.step()
self.optimizer.step()

Then it works even if your environment uses a higher torch version.

The rationale, as explained in the forum, seems to be that all torch versions up to and including 1.4 were not exactly computing the correct gradient in this situation, which was fixed in later versions (1.5 and higher).

In earlier versions, putting step() before backward() would still run, but step() may modify parameters that backward() needs intact for the gradient computation, so the resulting gradients could in fact be wrong.

They fixed this in later versions, so if you keep the code organized the way that worked for 1.4 and earlier (i.e., step() before the gradient computation in backward()), autograd now flags the variable that was modified in place. So just make sure step() is called after the gradients have been computed by backward().
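
If it helps to see what autograd is actually checking: the error's "is at version 2; expected version 1" refers to a per-tensor version counter that every in-place op bumps (exposed as the underscore-prefixed, internal attribute _version). A tiny illustration, not tied to this repo:

import torch

w = torch.ones(3, requires_grad=True)
print(w._version)   # 0
with torch.no_grad():
    w.add_(1.0)     # an in-place update, which is what optimizer.step() does under the hood
print(w._version)   # 1 -> backward() compares this against the version recorded
                    #      when the tensor was saved for the graph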

I was also wondering whether the authors would be willing to look into this and test it with a higher torch version compatible with your GPU's CUDA version, since manually force-installing an earlier torch such as 1.4 against a much newer cudatoolkit (e.g., 11.X) could raise further compatibility issues.

Cheers,
Elise

wuhaixu2016 reopened this Aug 5, 2022
@wuhaixu2016
Collaborator

@elisejiuqizhang
Thanks for your wonderful comment. I will test this asap.

@wuhaixu2016
Collaborator

wuhaixu2016 commented Aug 6, 2022

Hi, I have updated the code in solver.py. Now, it works well in torch>=1.4.0 environments.
Concretely, the updated code is as follows:

loss1.backward(retain_graph=True)
loss2.backward()
self.optimizer.step()

Just calling step() once is fine. Thanks a lot for your help @elisejiuqizhang.
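
(For readers wondering why a single step() is enough: successive backward() calls accumulate into each parameter's .grad, so by the time step() runs, .grad already holds the sum of both losses' gradients. A toy illustration, unrelated to the repo's actual tensors:)

import torch

w = torch.ones(2, requires_grad=True)
x = torch.tensor([1.0, 2.0])

out = w * x               # shared intermediate, standing in for the model output
loss1 = out.sum()         # d loss1 / d w = x             = [1., 2.]
loss2 = out.pow(2).sum()  # d loss2 / d w = 2 * w * x**2  = [2., 8.]

loss1.backward(retain_graph=True)
loss2.backward()
print(w.grad)             # tensor([ 3., 10.]) -- the two gradients summed,
                          # so one optimizer.step() applies the combined update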
