
Conversation

zou3519
Contributor

@zou3519 zou3519 commented Oct 21, 2020

Stack from ghstack:


Summary
-------

This adds batched gradient support (i.e., the ability to vmap through the
gradient formulas) for Tensor.max(), Tensor.min(), and Tensor.median(),
all of which use evenly_distribute_backward as their backward formula.

Previously, the plan was to register incompatible gradient formulas as
backward operators (see #44052). However, it turns out that we can just use
`new_zeros` to work around the incompatibility in some gradient formulas
(see the next section for discussion).
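
To give a sense of what this enables, here is a minimal usage sketch (my own
illustration, not code from this PR; it assumes the torch.vmap prototype API):
computing gradients of Tensor.max for a whole batch of cotangents with one
vmap call instead of a Python loop.

```
import torch

x = torch.randn(5, requires_grad=True)

def max_grad(v):
    # v is a single cotangent for the scalar output x.max()
    return torch.autograd.grad(x.max(), x, v)[0]

vs = torch.randn(3)                       # a batch of 3 cotangents
batched_grads = torch.vmap(max_grad)(vs)  # shape (3, 5)
```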

Context: the vmap+inplace problem
---------------------------------

Many backward functions are incompatible with BatchedTensor because they
use in-place operations. Sometimes we can allow the in-place operations,
but other times we can't. For example, consider select_backward:

```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes,
                       int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```
and consider the following code:

```
x = torch.randn(5, requires_grad=True)
def select_grad(v):
  return torch.autograd.grad(x[0], x, v)

vs = torch.randn(B0)  # B0: the batch size being vmapped over
batched_grads = vmap(select_grad)(vs)
```

For the batched gradient use case, grad is a BatchedTensor whose
physical version has shape (B0,). However, select_backward creates a
grad_input of shape (5,) with no batch dimension and then tries to copy
grad into a 0-dim slice of it, which has no room for the batch dimension.
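
To make the shape problem concrete, here is what vmap(select_grad)(vs) should
compute, written as an explicit per-example loop (a reference-semantics sketch
continuing the snippet above, not code from this PR):

```
expected = torch.stack([torch.autograd.grad(x[0], x, v)[0] for v in vs])
# expected.shape == (B0, 5): one (5,)-shaped gradient per cotangent in vs.
# But select_backward's at::zeros(input_sizes) only ever allocates a single
# (5,)-shaped buffer, so there is nowhere for the B0 batch dimension to go.
```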

Up until now, the proposal to handle this has been to register these
backward formulas as operators so that vmap doesn’t actually see the
`copy_` calls (see #44052). However, it turns out we can actually just
use `new_zeros` to construct a new Tensor that has the same
"batched-ness" as grad:
```
auto grad_input = grad.new_zeros(input_sizes);
grad_input.select(dim, index).copy_(grad);
```
We should use this for simple backward functions. For more complicated
backward functions where this solution doesn't work, we should register
those as operators.
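
A side note on why `new_zeros` does the right thing here: the sizes come from
the call site, while everything else, including dtype, device, and (for a
BatchedTensor) the batch dims that vmap tracks, is inherited from the tensor
it is called on. A plain, non-vmap illustration of the basic semantics (my own
example, not from this PR):

```
import torch

grad = torch.randn(3, dtype=torch.float64)
buf = grad.new_zeros(5)
print(buf.shape, buf.dtype)  # torch.Size([5]) torch.float64
# Sizes come from the argument; dtype/device come from `grad`. Under vmap, a
# BatchedTensor's "batched-ness" is inherited the same way, which is what lets
# the subsequent in-place copy_ see a correctly-batched buffer.
```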

Alternatives
------------
Option 2: Register `evenly_distribute_backward` as an operator and have the
vmap fallback run it in a loop (see the loop sketch after this list).
- This requires more LOC changes.
- Furthermore, we'd have to write an efficient batching rule for
`evenly_distribute_backward` in the future.
- If we use `new_zeros` instead, we don't need to write an efficient
batching rule for `evenly_distribute_backward` as long as the
constituents of `evenly_distribute_backward` have efficient batching rules.

Option 3: Have factory functions behave differently when they are called
inside vmap.
- For example, `at::zeros(3, 5)` could return a Tensor of shape
`(B0, B1, 3, 5)` if we are vmapping over two dimensions of sizes B0 and B1.
This requires maintaining some global and/or thread-local state about
the sizes of the dims being vmapped over, which can be tricky.

And more...
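
For reference, the loop sketch mentioned under Option 2: the vmap fallback
would, roughly, unbind the batched argument along its batch dimension, run the
unbatched operation once per example, and re-stack the results. This is an
illustrative Python rendering with a hypothetical helper name, not the actual
C++ fallback:

```
import torch

def fallback_loop(op, batched_arg, *other_args):
    # Run `op` once per example along dim 0 of `batched_arg`, then re-stack.
    return torch.stack([op(example, *other_args)
                        for example in batched_arg.unbind(0)])

# e.g. fallback_loop(torch.neg, torch.randn(4, 5)).shape == (4, 5)
```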

Future
------
- I will undo some of the work I've done in the past to move backward
functions to being operators (#44052, #44408). The simpler backward
functions (like select_backward) can just use `Tensor.new_zeros`.
I apologize for the thrashing.
- Include a NOTE about the vmap+inplace problem somewhere in the
codebase. I don't have a good idea of where to put it at the moment.

Test Plan
---------
- New tests

Differential Revision: [D24456781](https://our.internmc.facebook.com/intern/diff/D24456781)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Oct 21, 2020
@zou3519 zou3519 requested review from ezyang and removed request for apaszke October 21, 2020 19:53
@dr-ci

dr-ci bot commented Oct 21, 2020

💊 CI failures summary and remediations

As of commit cbf8760 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



Collaborator

@albanD albanD left a comment

Looks good!

```
  } else {
    auto mask = value.isnan().item<bool>() ? input.isnan() : input == value;
-   return at::zeros_like(input).masked_fill_(mask, grad / mask.sum());
+   return grad.new_zeros(input.sizes(), input.options()).masked_fill_(mask, grad / mask.sum());
```
Collaborator

It makes sense that the batch info is taken from grad and the other sizes from input.
I think it is worth mentioning in the "vmap gotchas" section (if you have one) that the new_* functions behave this way.

Contributor Author

There isn't a vmap gotcha section anywhere, but I'll make a note of that in the future.

zou3519 added a commit that referenced this pull request Oct 23, 2020
@codecov

codecov bot commented Oct 23, 2020

Codecov Report

Merging #46674 into gh/zou3519/316/base will increase coverage by 0.00%.
The diff coverage is n/a.

```
@@                 Coverage Diff                  @@
##           gh/zou3519/316/base   #46674   +/-   ##
====================================================
  Coverage                68.98%   68.98%
====================================================
  Files                      433      433
  Lines                    55921    55921
====================================================
+ Hits                     38578    38579    +1
+ Misses                   17343    17342    -1
```

@facebook-github-bot
Contributor

@zou3519 merged this pull request in 74d8108.

@facebook-github-bot facebook-github-bot deleted the gh/zou3519/316/head branch October 27, 2020 14:17