
JIT: Eliminate SumToSize by using Optional Lists #18697

Closed
t-vi wants to merge 30 commits

Conversation

@t-vi (Collaborator) commented Apr 1, 2019

This PR eliminates unneeded grad_sum_to_size calls and in particular speeds up the LSTM backward by allowing better fusion.

It consists of two parts:

  • In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
  • The specialization of Optional arguments (#18407) allows us to then eliminate _grad_sum_to_size(t, None) in the peephole optimization step.

Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.

I'm testing that different broadcasting situations lead to different graphs.

I didn't move all symbolic_script _grad_sum_to_size uses to the new logic, but it might be better to do that incrementally anyway.
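
To make the two parts above concrete, here is a minimal Python sketch (my illustration, not code from the PR): the forward records a size for `_grad_sum_to_size` only when broadcasting actually changed the shape of that input, and a call whose size argument is `None` is an identity that a peephole-style pass can simply drop.

```python
from typing import List, Optional

import torch


def sum_to_size_arg(out: torch.Tensor, inp: torch.Tensor) -> Optional[List[int]]:
    # Record the target size for the backward's _grad_sum_to_size only if
    # broadcasting changed the shape relative to this input; otherwise None.
    if out.shape == inp.shape:
        return None
    return list(inp.shape)


def grad_sum_to_size(grad: torch.Tensor, size: Optional[List[int]]) -> torch.Tensor:
    # When size is None, the call is the identity -- exactly the case the
    # peephole pass can remove from the backward graph.
    if size is None:
        return grad
    return grad.sum_to_size(*size)


a = torch.randn(4, 3)
b = torch.randn(1, 3)       # broadcasts over dim 0
out = a + b
print(sum_to_size_arg(out, a))   # None   -> _grad_sum_to_size(t, None) is a no-op
print(sum_to_size_arg(out, b))   # [1, 3] -> the backward really has to reduce
```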

t-vi added 6 commits Mar 24, 2019
In pytorch#18360, we used undefined Tensor (aka AutogradZeroTensor),
but this can be error-prone when the type or value is compared
to None, e.g. as seen when combined with the (not yet landed)

For this to work, we must allow None to be passed to functions
taking Tensor?.
This PR is a proposed alternative to pytorch#18120 and would achieve
very similar fusion (in particular for LSTM backward).

It consists of three parts:
- Specialize Non-Tensor Optional inputs to graphs to be
  either of NoneType or of the elementType.
  This needs the graph spec to be different for the two cases.
- In AutoDiff, record broadcasting sizes only if the
  broadcast output size is different from the input size,
  otherwise record None.
- The specialization allows us to eliminate
  _grad_sum_to_size(t, None) in the peephole optimization
  step.

Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion
group. The trick here is that we can specialize on the runtime
information from the forward.

I label this WIP because I didn't integrate tests yet and because
I didn't move all symbolic_script _grad_sum_to_size to the new
logic.

However, it would be great to have some discussion, given that
some implementation details may raise eyebrows.
@ngimel (Contributor) commented Apr 2, 2019

It would be nice to add a test, e.g. run a scripted function for some inputs, then test that .graph_for for inputs with a different broadcasting pattern errors out.
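
A rough sketch of such a test (my own, hypothetical; it leans on the internal `.graph_for` helper of scripted functions and on the executor specializing graphs per broadcasting pattern):

```python
import torch


@torch.jit.script
def broadcast_mul(x, y):
    return x * y


a = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, 3, requires_grad=True)   # same shape: no broadcasting
c = torch.randn(1, 3, requires_grad=True)   # broadcasts over dim 0

# Warm up so the executor specializes the differentiable graphs.
for _ in range(3):
    broadcast_mul(a, b).sum().backward()
    broadcast_mul(a, c).sum().backward()
    a.grad = b.grad = c.grad = None

# Different broadcasting patterns should yield different specialized graphs;
# only the broadcasting case should still need a grad_sum_to_size.
assert str(broadcast_mul.graph_for(a, b)) != str(broadcast_mul.graph_for(a, c))
```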

@apaszke (Contributor) commented Apr 2, 2019

It would also be great if we could get some perf numbers to ensure that we don't slow down forward.

@t-vi (Collaborator, author) commented Apr 2, 2019

Unfortunately we do seem to slow down the forward significantly (4%).

@apaszke (Contributor) commented Apr 3, 2019

Ok, that should be fixable, because it's not like we're doing significantly more work.

@t-vi (Collaborator, author) commented Apr 4, 2019

@ngimel pointed out a problem with this: the forward pass seems to instantiate intermediate results where previously it didn't. Thanks!

So after fixing this by passing the sizes instead, the forward performance is back and I could get rid of the bogus autodiff additions. But it's a bit of a mess that we now have broadcasting handling in autodiff and in the graph_fuser that doesn't really interact (e.g. prim::BroadcastSizes could also tell me whether broadcasting happened).
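
As a hedged illustration of "passing the sizes" (my sketch, not the PR's code; `torch.broadcast_shapes` merely stands in for what the JIT derives from graph values): the decision can be made from the input size lists alone, so the forward never has to keep a broadcast intermediate alive just to ask for its shape.

```python
from typing import List, Optional

import torch


def sum_to_size_arg_from_sizes(self_size: List[int],
                               other_size: List[int]) -> Optional[List[int]]:
    # Work purely on the (cheap) size lists: the broadcast output shape is
    # derived from the input shapes, so no intermediate tensor is needed.
    out_size = list(torch.broadcast_shapes(tuple(self_size), tuple(other_size)))
    return None if out_size == self_size else self_size


print(sum_to_size_arg_from_sizes([4, 3], [4, 3]))  # None: nothing to sum in backward
print(sum_to_size_arg_from_sizes([1, 3], [4, 3]))  # [1, 3]: backward must reduce
```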

Hm. And merging the optional_None PR branch was a medium-quality idea, because now we don't get the diff separately. The files changed relative to that are:

a/aten/src/ATen/core/interned_strings.h
a/torch/csrc/jit/autodiff.cpp
a/torch/csrc/jit/passes/graph_fuser.cpp
a/torch/csrc/jit/passes/peephole.cpp (the changes relating to _grad_sum_to_size)
a/torch/csrc/jit/register_prim_ops.cpp
a/torch/csrc/jit/symbolic_script.cpp
a/torch/csrc/jit/symbolic_variable.h

facebook-github-bot added a commit that referenced this issue May 6, 2019
…uting graph (#18407)

Summary:
This patch specializes `Optional[Tensor]` graph inputs to either a `DimensionedTensorType` (if a Tensor is passed) or `NoneType`. Other `Optional[T]` are specialized to `T` or `None`.

- For unwrapping (checked and unchecked) we need to keep the output type, as IR code that follows unwrapping may not work with NoneType (just as it doesn't deal with Optional). While it would not be hit during execution, it would run against the (legitimate) assumptions of the analysis passes.
- Function lookup currently will not match NoneType when it expects Optional (I'm not entirely sure why this doesn't lead to unhappiness currently, but hey). I amend this at the level of the function matching code (`operator.cpp`), but see Adam's comments. We would run into trouble if we needed to select between functions whose signatures only differ in Optional types with different subtypes, but we would have the same problem when calling them directly, so I would think this is OK.

- It would enable throwing away branches we can't hit. This also reduces the "blockiness" of the graph, so it may be easier to apply optimizations (e.g. fuse things in `if t is None: ...` and outside the `if`).
- Arguments passed into `Optional[Tensor]` arguments will get shape information, which is very handy.
- It gets rid of the problem that tensors passed into Optional arguments get requires_grad set erroneously (#18270), though that also affects lists, which aren't fixed here.
- `Optional[List[int]]` is needed for #18697.

- We're changing typing in a more subtle way than with the `TensorType` -> `DimensionedTensorType` change.
- In particular, specializing to NoneType loses the Type information captured in the `OptionalType` element type.
Pull Request resolved: #18407

Reviewed By: zdevito

Differential Revision: D15216808

Pulled By: eellison

fbshipit-source-id: 01f1a7643deaf4962c3f55eff2070d54b0e54b69
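
A minimal TorchScript sketch (my illustration, not part of the patch) of what the `Optional[Tensor]` specialization enables: once the input is specialized to `NoneType` or to a concrete tensor type, only one branch of the `None` check can be hit, so the other can be thrown away and the surviving code sees proper shape information.

```python
from typing import Optional

import torch


@torch.jit.script
def add_maybe(x: torch.Tensor, bias: Optional[torch.Tensor]) -> torch.Tensor:
    # With bias specialized to NoneType only the first branch survives;
    # with a concrete tensor type only the second one does.
    if bias is None:
        return x
    return x + bias


x = torch.randn(4, 3)
print(add_maybe(x, None).shape)            # graph specialized for the None case
print(add_maybe(x, torch.randn(3)).shape)  # graph specialized for the tensor case
```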
@t-vi (Collaborator, author) commented May 7, 2019

Seems like this needs a rather thorough rebase; I'll try to look into it soonish.

@t-vi (Collaborator, author) commented May 8, 2019

The failures should go away after #20269, but I still need to do the tests.
I think #19988 provided what was needed.

@t-vi (Collaborator, author) commented May 9, 2019

@pytorchbot rebase this please

@t-vi (Collaborator, author) commented May 9, 2019

Numbers for the default benchmark config for me:

This PR:

    name        avg_fwd  std_fwd   avg_bwd  std_bwd
    jit         8.667    0.1126    12.91    0.2302
    jit_premul  7.204    0.0088    10       0.07453
    cudnn       6.182    0.04415   7.533    0.09015

Master:

    name        avg_fwd  std_fwd   avg_bwd  std_bwd
    jit         8.68     0.05878   14.88    0.1656
    jit_premul  7.239    0.008265  12.13    0.03549
    cudnn       6.197    0.03018   7.536    0.07714

@t-vi changed the title from "[WIP] JIT: Elmiminate SumToSize by using Optional Lists" to "JIT: Elmiminate SumToSize by using Optional Lists" on May 9, 2019
@t-vi (Collaborator, author) commented May 10, 2019

So unless wild things happen in the CI (that aren't already happening on master), I think this might be good.
🚀

Thank you for all your help on this and the related patches.

@wanchaol (Contributor) left a comment

looks nice! I think it's good to go, only have some nits :)

Review threads (resolved) on: torch/csrc/jit/autodiff.cpp, torch/csrc/jit/passes/peephole.cpp, torch/csrc/jit/passes/graph_fuser.cpp
@wanchaol (Contributor) commented May 22, 2019

Hmm, it seems like the test failures are legit.

@t-vi changed the title from "JIT: Elmiminate SumToSize by using Optional Lists" to "JIT: Eliminate SumToSize by using Optional Lists" on May 23, 2019
@t-vi (Collaborator, author) commented May 23, 2019

So the pytorch_linux_xenial_cuda9_cudnn7_py2_test failure is in "my test"; I'm trying to output the graph to see what's going on.
I don't have as many ideas about the other failures.

@wanchaol (Contributor) commented May 23, 2019

@pytorchbot rebase this please

@facebook-github-bot (Contributor) left a comment

@wanchaol has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this issue May 24, 2019
Summary:
This PR eliminates unneeded grad_sum_to_size calls and in particular speeds up the LSTM backward by allowing better fusion.

It consists of two parts:
- In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
- The specialization of Optional arguments (#18407) allows us to then eliminate `_grad_sum_to_size(t, None)` in the peephole optimization step.

Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.

I'm testing that different broadcasting situations lead to different graphs.

I didn't move all symbolic_script _grad_sum_to_size to the new logic, but it might be better to do this incrementally, anyway.
Pull Request resolved: pytorch/pytorch#18697

Differential Revision: D15482076

Pulled By: wanchaol

fbshipit-source-id: 7f89367e35b8729910077c95c02bccefc8678afb
@facebook-github-bot (Contributor) commented May 24, 2019

@wanchaol merged this pull request in 17941f9.
