as_strided batching rule #47224
Conversation
This PR adds a batching rule for as_strided. `as_strided` is a really weird operation and I hope that users don't use it very much.

Motivation
----------

The motivation for adding a batching rule for as_strided is batched gradient computation. AsStridedBackward appears in PyTorch when handling view+in-place operations, and it calls `as_strided` on a fresh tensor with storage_offset equal to 0. We would like to be able to vmap through the backward graph of view+in-place operations for batched gradient computation, especially because internally we have a number of functions that are implemented as a view+in-place.

Alternatives
------------

If we think that as_strided is too crazy to have a batching rule, we could either:
- have a flag that controls the autograd view+in-place behavior, or
- require that the input tensor's storage offset be equal to 0, to make it easier to reason about.

I think the batching rule makes sense, so I didn't pursue the alternatives.

The batching rule
-----------------

```
y = vmap(lambda x: x.as_strided(sizes, strides, offset))(xs)
```

The result of the above should be "equivalent" to:
- assuming that each x has storage offset equal to xs.storage_offset() (call that S), and
- calling as_strided with (sizes, strides, offset + x[i].storage_offset() - S) on each x.

More concretely, this returns a view on `xs` such that each y[i] has:
- sizes: `sizes`
- strides: `strides`
- storage_offset: offset + i * xs.stride(batch_dim)

Why the behavior can be weird
-----------------------------

The behavior of the batching rule may differ from actually running as_strided in a for-loop, because `as_strided` takes `offset` as an "absolute offset" into the underlying storage. As an example, consider

```
>>> x = torch.tensor([0., 1., 2., 3., 4.])
>>> z = [x[i].as_strided([1], [1], 1) for i in range(5)]
```

Each z[i] is actually the same view on x (z[i] == torch.tensor([1.]))! However, we consider the above list comprehension to be a user error: a user should have written the following if they wanted to use as_strided in a per-sample way:

```
>>> z = [x[i].as_strided([1], [1], 1 + x[i].storage_offset()) for i in range(4)]
```

Test Plan
---------

- Added some tests that compare vmap+as_strided to vmap+(the equivalent operator)
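As a hedged, concrete check of the description above (example tensors made up here, not part of the PR's tests), the batched result can be built by hand by prepending the batch dim, i.e. xs.as_strided([B] + sizes, [S] + strides, offset), and compared against the per-sample offsets:

```python
import torch

# Batch of B = 3 samples; xs.stride(0) == 4 plays the role of the batch stride S.
xs = torch.arange(12.).reshape(3, 4)
sizes, strides, offset = [2], [1], 1

# Equivalent of the batching rule's output: prepend the batch dim.
batched = xs.as_strided([3] + sizes, [xs.stride(0)] + strides, offset)

for i in range(xs.size(0)):
    # Per the description, y[i] has storage_offset == offset + i * xs.stride(0).
    per_sample = xs.as_strided(sizes, strides, offset + i * xs.stride(0))
    assert torch.equal(batched[i], per_sample)
```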
In your example above, you should have
I am not sure what the general structure of the vmap doc will be. Should we plan to add a note there about as_strided?
```
physical_strides.insert(
    physical_strides.end(),
    physical_view.tensor().strides().begin(),
    physical_view.tensor().strides().begin() + num_batch_dims);
```
There is no way the two successive calls to physical_view.tensor().strides() would return different objects, right?
There is no way the two successive calls to physical_view.tensor().strides() would return different objects, right?

They do return different IntArrayRef objects, which raises the question of how the iterators interact. I will change this to use a single strides() call to avoid potential bugs.
```
// These memory locations are exactly the same as what we got for [[A]],
// so the xs.as_strided([B] + sizes, [S] + strides, offset) is valid.
//
// [[B]] Hand-wavy proof of Claim 1:
```
nit: `[[B]]` here is a bit confusing with the `[B]` size used for the sizes. Maybe numbers or `*` would be better?
```
// This means that (sizes, strides, offset + xs[i].offset() - xs.offset()) satisfies:
//   offset + xs[i].offset() - xs.offset() + 1 + \sum_j (sizes[j] - 1) * strides[j]
//   <= xs[i].offset() + 1 + \sum_j (xs[i].size(j) - 1) * xs[i].stride(j)
// (the largest-index memory location of xs[i].as_strided(...) must be \leq
```
nit: You can mention that, under the assumption we're making, the lower bound (the lowest-indexed memory location) is trivially within the storage.
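For reference, one way to spell out that lower bound (my reading of the nit, not text from the PR): since PyTorch strides and storage offsets are non-negative, the smallest memory location touched by xs[i].as_strided(sizes, strides, offset + xs[i].offset() - xs.offset()) is the storage offset itself, and with xs[i].offset() = xs.offset() + i * S,

```
offset + xs[i].offset() - xs.offset() = offset + i * S >= offset >= 0
```

so the lowest-indexed location trivially stays inside the storage.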
```
// Furthermore, let's say that as a part of being "valid" this as_strided call
// does not return a result that can index memory not indexable by xs[i].
//
// Assume that there's only one batch dim and it is at the front of the
```
Is that assumption "without loss of generality"? Or does it change when the batch dimension is not at the front, or when there is more than one?
This is without loss of generality.

- If the batch dim is not at the front, we can move it to the front and then go through the contents here (we do not assume a particular stride for xs or the batch dim here).
- For multiple batch dims, this works out similarly. It's just a bit annoying to write: if there are K batch dimensions then we need K indices I_1, ..., I_K, and `offset + S * i` becomes `offset + \sum_{j=1}^{K} S_j * I_j` (a concrete two-batch-dim check is sketched below).
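To make that concrete, here is a hedged two-batch-dim sketch (example shapes made up, not code from the PR) comparing the batched view against per-sample as_strided calls:

```python
import torch

# Two batch dims of sizes 2 and 3; S1, S2 play the roles of the batch strides.
xs = torch.arange(24.).reshape(2, 3, 4)
sizes, strides, offset = [2], [1], 1
S1, S2 = xs.stride(0), xs.stride(1)

batched = xs.as_strided([2, 3] + sizes, [S1, S2] + strides, offset)
for i in range(2):
    for j in range(3):
        # Per the comment above: storage offset is offset + S1*i + S2*j per (i, j) sample.
        per_sample = xs.as_strided(sizes, strides, offset + S1 * i + S2 * j)
        assert torch.equal(batched[i, j], per_sample)
```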
```
//
// Let's say we have xs[i].as_strided(sizes, strides, offset + xs[i].offset() - xs.offset()).
// Furthermore, let's say that as a part of being "valid" this as_strided call
// does not return a result that can index memory not indexable by xs[i].
```
Should we add some sanity checks in the batching rule to reduce this risk? Like making sure that the original as_strided wasn't indexing outside of the input?
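A hedged sketch of what such a sanity check could look like (hypothetical helper, not code from this PR), mirroring the upper-bound inequality from the comment above:

```python
import torch

def assert_within_bounds(x, sizes, strides, storage_offset):
    # Largest memory location the proposed view would touch; PyTorch strides
    # are non-negative, so the maximum is at the last element of each dim.
    view_max = storage_offset + sum((sz - 1) * st for sz, st in zip(sizes, strides))
    # Largest memory location indexable by x itself.
    x_max = x.storage_offset() + sum((sz - 1) * st
                                     for sz, st in zip(x.size(), x.stride()))
    assert view_max <= x_max, "as_strided would index memory outside the input"

# Example: a view of the first row of a 3x4 tensor stays in bounds.
x = torch.arange(12.).reshape(3, 4)
assert_within_bounds(x[0], [2], [1], 1)
```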
I needed to move this PR in the stack. ghstack doesn't like it when PRs are moved, so it turned into a brand-new PR: #47364