Don't mutate tensor stride in place in cudnn conv #126786
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126786
Note: Links to docs will display an error until the doc builds have been completed.
✅ No Failures as of commit 0734941 with merge base 7e166e8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 5109bc89093fb2e0dca22675133c3ffbf9ee8633 Pull Request resolved: #126786
Thanks!
Does this need to also be fixed for legacy API use-cases that create TensorDescriptors by passing Tensors directly and calling .stride() on them? e.g.,
void TensorDescriptor::set(const at::Tensor &t, at::MemoryFormat memory_format, size_t pad) {
What are the callsites of those APIs? Do they invoke fixSizeOneDimStride?
We don't compile RNNs in torch.compile - would prefer a separate issue/fix for the other legacy callsites.
@eellison Yes, I think the overloads of fixSizeOneDimStride do (pytorch/aten/src/ATen/cudnn/Descriptors.h, line 169 in b40fb2d). The callsites are IIRC RNN, Conv v7, and maybe other legacy API places like batchnorm?
Fix for #126241. Within the cudnn convolution, we were updating the strides of the tensor in place to disambiguate size-1 dims between contiguous and channels-last tensors. Instead of mutating the tensor's strides, just use a temporary; inside cudnn the strides are then copied: https://github.com/NVIDIA/cudnn-frontend/blob/d7ccb5b3c47b4de709604cce463ad66b775b7812/include/cudnn_frontend_Tensor.h#L201-L203. [ghstack-poisoned]
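The pattern in the fix can be illustrated with a simplified, contiguous-only sketch. The function name below is hypothetical; the real helper (fixSizeOneDimStride in aten/src/ATen/cudnn/Descriptors.h) also handles channels-last. The key point is that the disambiguated strides are written into a temporary vector, leaving the tensor's own strides untouched:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified sketch of the stride fixup for the contiguous
// case only. For each size-1 dimension, the (ambiguous) stride is replaced
// with the stride a contiguous tensor would have there, i.e. the product of
// the sizes to its right. Returns a new vector instead of mutating the
// input, mirroring the temporary-copy approach of this PR.
std::vector<int64_t> fixed_strides_for_cudnn(const std::vector<int64_t>& sizes,
                                             const std::vector<int64_t>& strides) {
  std::vector<int64_t> out = strides;  // temporary copy; caller's strides untouched
  int64_t z = 1;  // running product of sizes to the right of dim d
  for (int d = static_cast<int>(sizes.size()) - 1; d >= 0; --d) {
    if (sizes[d] == 1) {
      out[d] = z;  // size-1 dim: use the stride of a contiguous layout
    } else {
      z *= sizes[d];
    }
  }
  return out;
}
```

Since the cudnn-frontend tensor builder copies the stride array internally (see the link above), the temporary's lifetime only needs to cover the descriptor-construction call.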
ghstack-source-id: 3de7eb31768a05fe2e9686b1b2c183fa8aee31c6 Pull Request resolved: #126786
ghstack-source-id: bf7ad7fcf19506b9387511d895e4ce412903a385 Pull Request resolved: #126786
This is fine as a bandaid, but it would be better to refactor the call sites.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Fix for #126241.
Within the cudnn convolution, we were updating the strides of the tensor in place to disambiguate size-1 dims between contiguous and channels-last tensors. Instead of mutating the tensor's strides, just use a temporary; inside cudnn the strides are then copied: https://github.com/NVIDIA/cudnn-frontend/blob/d7ccb5b3c47b4de709604cce463ad66b775b7812/include/cudnn_frontend_Tensor.h#L201-L203.