Add batching rule for torch.clone(tensor, torch.contiguous_format) #47365
Conversation
I wanted to avoid defining vmap behavior over contiguous_format for as long as possible. This is potentially ambiguous; consider the following:

```
>>> x = torch.randn(3, B0, 5)
>>> y = vmap(lambda x: x.clone(torch.contiguous_format), in_dims=1, out_dims=1)(x)
>>> y[:,0].is_contiguous()  # ??
```

There are two possible ways to interpret this operation (if we choose to allow it to succeed):

1. Each per-sample becomes contiguous, so y[:,0] is contiguous.
2. The output of vmap is contiguous (so y is contiguous, but y[:,0] is not).

(1) makes more sense because vmap operates on a per-sample level. It also combines well with the vmap fallback:

- There are places in the codebase where we perform .contiguous() and then pass the result to an operator `op` that only accepts contiguous inputs.
- If we vmap over such code and don't have a batching rule implemented for `op`, then we want the per-samples to be contiguous so that when `op` goes through the vmap fallback, it receives contiguous per-samples.

(1) is the approach we've selected for this PR.

Motivation
----------
To vmap over CopySlices, we have to vmap over a clone(contiguous_format) call:
https://github.com/pytorch/pytorch/blob/e4bc785dd57b15ae091eb8e8ca71a604da9b3fb2/torch/csrc/autograd/functions/tensor.cpp#L93

Alternatives
------------
- Implementing (2) is difficult in the current design because vmap is allowed to move batch dimensions to the front of the tensor. We would need some global information about the in_dims and out_dims passed to vmap.
- We could also error out if someone calls clone(contiguous_format) and the batch dims are not at the front. This would resolve the ambiguity at the cost of limiting what vmap can do.

Future Work
-----------
- Add the behavior of contiguous_format to a "vmap gotchas" page.
- Implement is_contiguous and Tensor.contiguous() with the same semantics. Those currently error out.

Test Plan
---------
- New tests

Differential Revision: [D24741683](https://our.internmc.facebook.com/intern/diff/D24741683)
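To make interpretation (1) concrete, here is a minimal sketch, assuming a PyTorch build where `torch.vmap` handles `clone(memory_format=torch.contiguous_format)` on batched inputs (the prototype this PR extends, or the later functorch-based `torch.vmap`). The loop-based reference is introduced purely for illustration; it mimics what the vmap fallback does per sample:

```python
import torch

B0 = 7
x = torch.randn(3, B0, 5)

# Batched clone under interpretation (1): each per-sample becomes contiguous.
y = torch.vmap(
    lambda t: t.clone(memory_format=torch.contiguous_format),
    in_dims=1, out_dims=1,
)(x)
assert y[:, 0].is_contiguous()  # holds under interpretation (1)

# Loop-based reference: clone each per-sample individually, the way the
# vmap fallback would, then restack along the batch dimension.
expected = torch.stack(
    [x[:, i].clone(memory_format=torch.contiguous_format) for i in range(B0)],
    dim=1,
)
assert torch.equal(y, expected)  # values agree with the per-sample loop
```

Note that the stacked reference is one dense tensor, so its own slice `expected[:, 0]` is not contiguous; only values are compared against it, while the contiguity assertion on `y[:, 0]` encodes interpretation (1).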
SGTM
…_format)" I wanted to avoid defining vmap behavior over contiguous_format for as long as possible. This is potentially ambiguous, consider the following: ``` >>> x = torch.randn(3, B0, 5) >>> y = vmap(lambda x: x.clone(torch.contiguous_format), in_dims=1, out_dims=1)(x) >>> y[:,0].is_contiguous() # ?? ``` There are two possible ways to interpret this operation (if we choose to allow it to succeed): 1. Each per-sample becomes contiguous, so y[:,0] is contiguous. 2. The output of vmap is contiguous (so y is contiguous, but y[:,0] is not) (1) makes more sense because vmap operates on a per-sample level. This makes sense when combined with the vmap fallback: - there are places in the codebase where we perform .contiguous() and then pass the result to an operator `op` that only accepts contiguous inputs. - If we vmap over such code and don't have a batching rule implemented for `op`, then we want the per-samples to be contiguous so that when `op` goes through the vmap fallback, it receives contiguous per-samples. (1) is the approach we've selected for this PR. Motivation ---------- To vmap over CopySlices, we have to vmap over a clone(contiguous_format) call: https://github.com/pytorch/pytorch/blob/e4bc785dd57b15ae091eb8e8ca71a604da9b3fb2/torch/csrc/autograd/functions/tensor.cpp#L93 Alternatives ------------ - Implementing (2) is difficult in the current design because vmap is allowed to move batch dimensions to the front of the tensor. We would need some global information about the in_dims and out_dims passed to vmap. - We could also error out if someone calls clone(contiguous_format) and the batch dims are not at the front. This would resolve the ambiguity at the cost of limiting what vmap can do. Future Work ----------- - Add to a "vmap gotchas" page the behavior of contiguous_format. - Implement is_contiguous, Tensor.contiguous() with the same semantics. Those currently error out. Test Plan --------- - new tests [ghstack-poisoned]
Thank you for the PR descriptions, they are super clear.
I'm trying to think if there is any justification for why (2) is right and drawing a blank. vmap is all about making local per-sample decisions which can then be lifted into a global batched variant. So I feel like you are obligated to preserve the local invariant that …
I agree, I think I am overthinking the problem |
…_format)" I wanted to avoid defining vmap behavior over contiguous_format for as long as possible. This is potentially ambiguous, consider the following: ``` >>> x = torch.randn(3, B0, 5) >>> y = vmap(lambda x: x.clone(torch.contiguous_format), in_dims=1, out_dims=1)(x) >>> y[:,0].is_contiguous() # ?? ``` There are two possible ways to interpret this operation (if we choose to allow it to succeed): 1. Each per-sample becomes contiguous, so y[:,0] is contiguous. 2. The output of vmap is contiguous (so y is contiguous, but y[:,0] is not) (1) makes more sense because vmap operates on a per-sample level. This makes sense when combined with the vmap fallback: - there are places in the codebase where we perform .contiguous() and then pass the result to an operator `op` that only accepts contiguous inputs. - If we vmap over such code and don't have a batching rule implemented for `op`, then we want the per-samples to be contiguous so that when `op` goes through the vmap fallback, it receives contiguous per-samples. (1) is the approach we've selected for this PR. Motivation ---------- To vmap over CopySlices, we have to vmap over a clone(contiguous_format) call: https://github.com/pytorch/pytorch/blob/e4bc785dd57b15ae091eb8e8ca71a604da9b3fb2/torch/csrc/autograd/functions/tensor.cpp#L93 Alternatives ------------ - Implementing (2) is difficult in the current design because vmap is allowed to move batch dimensions to the front of the tensor. We would need some global information about the in_dims and out_dims passed to vmap. - We could also error out if someone calls clone(contiguous_format) and the batch dims are not at the front. This would resolve the ambiguity at the cost of limiting what vmap can do. Future Work ----------- - Add to a "vmap gotchas" page the behavior of contiguous_format. - Implement is_contiguous, Tensor.contiguous() with the same semantics. Those currently error out. Test Plan --------- - new tests Differential Revision: [D24741683](https://our.internmc.facebook.com/intern/diff/D24741683) [ghstack-poisoned]
…_format)" I wanted to avoid defining vmap behavior over contiguous_format for as long as possible. This is potentially ambiguous, consider the following: ``` >>> x = torch.randn(3, B0, 5) >>> y = vmap(lambda x: x.clone(torch.contiguous_format), in_dims=1, out_dims=1)(x) >>> y[:,0].is_contiguous() # ?? ``` There are two possible ways to interpret this operation (if we choose to allow it to succeed): 1. Each per-sample becomes contiguous, so y[:,0] is contiguous. 2. The output of vmap is contiguous (so y is contiguous, but y[:,0] is not) (1) makes more sense because vmap operates on a per-sample level. This makes sense when combined with the vmap fallback: - there are places in the codebase where we perform .contiguous() and then pass the result to an operator `op` that only accepts contiguous inputs. - If we vmap over such code and don't have a batching rule implemented for `op`, then we want the per-samples to be contiguous so that when `op` goes through the vmap fallback, it receives contiguous per-samples. (1) is the approach we've selected for this PR. Motivation ---------- To vmap over CopySlices, we have to vmap over a clone(contiguous_format) call: https://github.com/pytorch/pytorch/blob/e4bc785dd57b15ae091eb8e8ca71a604da9b3fb2/torch/csrc/autograd/functions/tensor.cpp#L93 Alternatives ------------ - Implementing (2) is difficult in the current design because vmap is allowed to move batch dimensions to the front of the tensor. We would need some global information about the in_dims and out_dims passed to vmap. - We could also error out if someone calls clone(contiguous_format) and the batch dims are not at the front. This would resolve the ambiguity at the cost of limiting what vmap can do. Future Work ----------- - Add to a "vmap gotchas" page the behavior of contiguous_format. - Implement is_contiguous, Tensor.contiguous() with the same semantics. Those currently error out. Test Plan --------- - new tests Differential Revision: [D24741683](https://our.internmc.facebook.com/intern/diff/D24741683) [ghstack-poisoned]
Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as: - Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples. - With the cached strides, we use TensorImpl::refresh_contiguous() to compute if the tensor is contiguous or not. - is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates. Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank dependent (e.g., channels_last tensor must have 4 dimensions) which makes this a bit tricky. Test Plan: - new tests [ghstack-poisoned]
Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as: - Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples. - With the cached strides, we use TensorImpl::refresh_contiguous() to compute if the tensor is contiguous or not. - is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates. Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank dependent (e.g., channels_last tensor must have 4 dimensions) which makes this a bit tricky. Test Plan: - new tests Differential Revision: [D24840975](https://our.internmc.facebook.com/intern/diff/D24840975) [ghstack-poisoned]
Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as: - Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples. - With the cached strides, we use TensorImpl::refresh_contiguous() to compute if the tensor is contiguous or not. - is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates. Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank dependent (e.g., channels_last tensor must have 4 dimensions) which makes this a bit tricky. Test Plan: - new tests Differential Revision: [D24840975](https://our.internmc.facebook.com/intern/diff/D24840975) [ghstack-poisoned]
Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as: - Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples. - With the cached strides, we use TensorImpl::refresh_contiguous() to compute if the tensor is contiguous or not. - is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates. Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank dependent (e.g., channels_last tensor must have 4 dimensions) which makes this a bit tricky. Test Plan: - new tests Differential Revision: [D24840975](https://our.internmc.facebook.com/intern/diff/D24840975) [ghstack-poisoned]
Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as: - Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples. - With the cached strides, we use TensorImpl::refresh_contiguous() to compute if the tensor is contiguous or not. - is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates. Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank dependent (e.g., channels_last tensor must have 4 dimensions) which makes this a bit tricky. Test Plan: - new tests Differential Revision: [D24840975](https://our.internmc.facebook.com/intern/diff/D24840975) [ghstack-poisoned]
Summary: Pull Request resolved: #47621

Followup to #47365. is_contiguous on BatchedTensorImpl is implemented as follows:

- Whenever one creates a BatchedTensorImpl, we cache the strides of the per-examples, just like how we cache the sizes of the per-examples.
- With the cached strides, we use TensorImpl::refresh_contiguous() to compute whether the tensor is contiguous.
- is_contiguous checks the `is_contiguous_` flag that refresh_contiguous() populates.

Both contiguous and is_contiguous only support torch.contiguous_format. I'm not sure what the semantics should be for other memory formats; they are also rank-dependent (e.g., a channels_last tensor must have 4 dimensions), which makes this a bit tricky.

Test Plan:
- new tests

Reviewed By: Chillee, anjali411

Differential Revision: D24840975

Pulled By: zou3519

fbshipit-source-id: 4d86dbf11e2eec45f3f08300ae3f2d79615bb99d
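For intuition, here is a hedged Python sketch of the row-major contiguity rule that TensorImpl::refresh_contiguous() applies to the cached per-example sizes and strides. This is a simplified standalone re-implementation for illustration only (it skips edge cases such as zero-numel tensors), not the actual C++ code:

```python
from typing import Sequence

def is_row_major_contiguous(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """True if (sizes, strides) describe a torch.contiguous_format layout.

    Walking dims from innermost to outermost, each stride must equal the
    product of the sizes of all inner dims. Size-1 dims impose no constraint.
    """
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size == 1:
            continue  # a size-1 dim may carry any stride
        if stride != expected:
            return False
        expected *= size
    return True

# Per-example view of a batched tensor: contiguous per-example layout passes,
# while a strided slice of the physical tensor (with a gap) does not.
assert is_row_major_contiguous([3, 5], [5, 1])
assert not is_row_major_contiguous([3, 5], [35, 1])
```

With the cached per-example strides in hand, is_contiguous() on a BatchedTensorImpl reduces to reading the flag this kind of check populates, rather than recomputing anything per call.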