Cannot access data pointer of Tensor that doesn't have storage #2652
Comments
Hi, a PyTorch/XLA tensor does not have storage, and XLA itself will not try to access one. This error likely means we fell back to the default kernel for one of the ops, and that implementation tried to access the tensor's storage. @ailzhang Could you take a look?
@tmabraham My guess is that PyTorch changed one of its default kernels for some op so that it now accesses the tensor's data pointer. We have seen this kind of error come up before, and we usually ask the PyTorch folks to fix the default kernel so it does not access storage, since it should not assume every backend has storage.
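The underlying failure mode is easy to demonstrate directly. A minimal sketch, assuming a working torch_xla install (this reproduces the error message itself, not the reporter's exact training failure):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.randn(2, 2, device=device)

# XLA tensors are lazy IR nodes with no CPU/CUDA storage behind them, so any
# kernel (or user code) that asks for a raw data pointer fails:
t.data_ptr()  # RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
```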
Hi @tmabraham, would you mind providing a minimal repro script? That would help us locate the bug more quickly. Thanks a lot!
@ailzhang Upon further investigation, I think this bug is specific to EfficientNet (at least the one implemented in timm). Here is a Kaggle Kernel where I replaced the Colab example with timm's EfficientNet model. Since this is a different scenario from the seq2seq model that the original poster had issues with, should I open a new issue?
@ailzhang Just wanted to follow up regarding whether I should make a separate issue for the EfficientNet bug.
@rabeehk I ran into this problem when I trained an ASR model: there was a CUDA op that is not supported in XLA, so when execution reached that op, it could not access the tensor's data pointer. What I did was move the tensor to the CUDA device before this op and move the op's output back to the XLA device, and that fixed the problem.
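A sketch of that workaround, with hypothetical names (`op` stands in for whichever op lacks an XLA lowering; transfers are routed through host memory):

```python
import torch
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()

def run_on_cuda(op, x):
    # Materialize the lazy XLA tensor on the host, run the unsupported op
    # on the GPU, then hand the result back to the XLA device.
    out = op(x.cpu().to("cuda"))
    return out.cpu().to(xla_device)

# Usage, e.g.: y = run_on_cuda(torch.nn.functional.relu, x)
```

Note that each round trip synchronizes and copies the tensor, so this is a stopgap rather than a performance fix.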
Hi, could you tell me which operation specifically? Thanks.
@ailzhang Sorry for repeatedly tagging you, but I am hoping this issue can get resolved, because I would love to be able to train EfficientNet models from the timm package with PyTorch/XLA.
Hi @tmabraham, sorry for the late reply! I took a look, and pytorch/pytorch#49439 should fix the issue. It'll take some time to review and land the fix, but I'll let you know once it's ready for you to try out a new nightly!
Sounds great, thanks for the update!
@tmabraham The fix landed yesterday; I believe today's nightly should work. Would you mind giving it a try? Thanks!
@ailzhang Thanks for letting me know! It looks like it's running now, but the loss and accuracy are way off, and it runs quite slowly. Maybe that should be a different issue, though.
@tmabraham Yeah, feel free to open a new issue with the perf report! I'm going to close this issue for now since it's fixed. Thanks for the report!
Hi,
I am running code on PyTorch/XLA 1.7 with Python 3.7, and I am getting the following error. It happens on the line that computes the loss; the same code runs fine on GPU. To give more context, I am using a seq2seq model from the huggingface repo, but I modified their code to add adapter layers, then set requires_grad = False on all of the model's parameters and kept only some adapter layers as trainable parameters to finetune. Thank you for the help.
I am happy to provide the code to reproduce the error. To explain more: I define a seq2seq model with adapter layers, and in the main training loop I set requires_grad to False on all parameters except the ones inside the adapters. I then call model.to(device) and compute the loss, so the adapter parameters are the only ones that can receive gradients; a minimal sketch of this setup follows below.
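This sketch uses a hypothetical stand-in model (the reporter's actual code was not included in this thread); it only illustrates the freeze-everything-but-the-adapters pattern described above:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Hypothetical stand-in for the modified huggingface seq2seq model:
# a frozen base layer plus a trainable adapter layer.
class AdapterModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.adapter = nn.Linear(dim, dim)

    def forward(self, x):
        return self.adapter(self.base(x))

model = AdapterModel()
# Freeze everything except the adapter parameters.
for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name

device = xm.xla_device()
model = model.to(device)

x = torch.randn(4, 16, device=device)
loss = model(x).pow(2).mean()
loss.backward()  # only the adapter parameters receive gradients
```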
Thank you.