[WIP] JIT: Eliminate SumToSize by using Optional Lists #18697

Open
wants to merge 18 commits into base: master
Conversation

4 participants
@t-vi (Collaborator) commented Apr 1, 2019

This PR is a proposed alternative to #18120 and would achieve very similar fusion (in particular for LSTM backward).

It consists of three parts:

  • Specialize Non-Tensor Optional inputs to graphs to be either of NoneType or of the elementType.
    This needs the ArgumentSpec to be different for the two cases (so far we only differentiate on Tensor inputs).
  • In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
  • The specialization allows us to then eliminate _grad_sum_to_size(t, None) in the peephole optimization step.

Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.
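As a rough TorchScript-level sketch of the semantics involved (the names here are illustrative only, not the actual internal ops): the backward's sum-to-size step only does work when a size was recorded, so a `None` size can be folded away once the Optional input is specialized to NoneType.

```python
import torch
from typing import List, Optional

@torch.jit.script
def grad_sum_to_size_sketch(grad: torch.Tensor, size: Optional[List[int]]) -> torch.Tensor:
    # `size is None` means the forward did not broadcast this input, so no
    # reduction is needed; after specializing on NoneType this branch folds away.
    if size is None:
        return grad
    return grad.sum_to_size(size)
```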

I label this WIP because I didn't integrate tests yet and because I didn't move all _grad_sum_to_size uses in symbolic_script to the new logic.

However, it would be great to have some discussion about implementation details, as they raised some eyebrows in #18407 (which proposed specialization for Optional[Tensor]) and its predecessor.

t-vi added some commits Mar 24, 2019

Specialize Optional (Tensor) to None when executing graph
In #18360, we used undefined Tensor (aka AutogradZeroTensor),
but this can be error-prone when the type or value is compared
to None, e.g. as seen when combined with the (not yet landed)

For this to work, we must allow None passed to functions
taking Tensor?.
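A minimal TorchScript sketch of the calling pattern this requires (the function and names below are illustrative only):

```python
import torch
from typing import Optional

@torch.jit.script
def maybe_add(x: torch.Tensor, bias: Optional[torch.Tensor]) -> torch.Tensor:
    # The executor can specialize this graph on whether `bias` is None at run
    # time instead of substituting an undefined Tensor.
    if bias is None:
        return x
    return x + bias

out = maybe_add(torch.randn(3), None)  # a plain None passed to a Tensor? parameter
```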
[WIP] JIT: Eliminate SumToSize by using Optional Lists
This PR is a proposed alternative to #18120 and would achieve
very similar fusion (in particular for LSTM backward).

It consists of three parts:
- Specialize Non-Tensor Optional inputs to graphs to be
  either of NoneType or of the elementType.
  This needs the graph spec to be different for the two cases.
- In AutoDiff, record broadcasting sizes only if the
  broadcast output size is different from the input size,
  otherwise record None.
- The specialization allows us to eliminate
  _grad_sum_to_size(t, None) in the peephole optimization
  step.

Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion
group. The trick here is that we can specialize on the runtime
information from the forward.

I label this WIP because I didn't integrate tests yet and because
I didn't move all _grad_sum_to_size uses in symbolic_script to the
new logic.

However, it would be great to have some discussion, given that
some implementation details raised eyebrows.

t-vi added some commits Apr 1, 2019

@ngimel (Contributor) commented Apr 2, 2019

It would be nice to add a test, e.g. run a scripted function for some inputs, then check that .graph_for for inputs with a different broadcasting pattern errors out.
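One possible shape of such a test, as a rough sketch under my reading of the suggestion (whether the call for the other pattern should error out or simply yield a different specialization depends on the final implementation):

```python
import torch

@torch.jit.script
def broadcast_add(x, y):
    return x + y

a = torch.randn(4, 4, requires_grad=True)
b_row = torch.randn(1, 4, requires_grad=True)    # broadcasts along dim 0
b_full = torch.randn(4, 4, requires_grad=True)   # same size, no broadcasting

broadcast_add(a, b_row).sum().backward()         # compile/differentiate for one pattern
print(broadcast_add.graph_for(a, b_row))         # specialized graph for the broadcasting case
print(broadcast_add.graph_for(a, b_full))        # inputs with a different broadcasting pattern
```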

@apaszke (Member) commented Apr 2, 2019

It would also be great if we could get some perf numbers to ensure that we don't slow down forward.
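For reference, a simple way to get rough forward-only numbers (a sketch, not the benchmark actually used here) is to time a scripted function after a few warm-up runs:

```python
import timeit
import torch

@torch.jit.script
def cell(x, w, b):
    return torch.tanh(x @ w + b)

x, w, b = torch.randn(64, 256), torch.randn(256, 256), torch.randn(1, 256)

for _ in range(10):              # warm-up so specialization/fusion has already happened
    cell(x, w, b)

t = timeit.timeit(lambda: cell(x, w, b), number=1000)
print(f"forward: {t / 1000 * 1e6:.1f} us/iter")
```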

@t-vi (Collaborator, Author) commented Apr 2, 2019

Unfortunately we do seem to slow down the forward significantly (4%).

@apaszke (Member) commented Apr 3, 2019

Ok, that should be fixable, because it's not like we're doing significantly more work.

@t-vi (Collaborator, Author) commented Apr 4, 2019

@ngimel pointed out a problem with this: the forward pass seems to instantiate intermediate results where it previously didn't. Thanks!

So after fixing this by passing the sizes instead, the forward performance is back and I could get rid of the bogus autodiff additions. But in a way, it's a bit of a mess that we now have broadcasting handling in both autodiff and the graph_fuser that doesn't really interact (e.g. prim::BroadcastSizes could also tell me whether broadcasting happened).
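As a Python-level sketch of what an op like the _size_if_not_same mentioned further down might compute (illustrative only, under the assumption that it compares the input size against the broadcast output size):

```python
import torch
from typing import List, Optional

@torch.jit.script
def size_if_not_same_sketch(input_size: List[int], broadcast_size: List[int]) -> Optional[List[int]]:
    # Returning None signals "no reduction needed", so the corresponding
    # _grad_sum_to_size in the backward becomes a removable no-op.
    if input_size == broadcast_size:
        return None
    return input_size
```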

Hm. And merging the optional_None PR branch was a medium-quality idea, because now we don't get the diff separately. The files changed relative to that are:

a/aten/src/ATen/core/interned_strings.h
a/torch/csrc/jit/autodiff.cpp
a/torch/csrc/jit/passes/graph_fuser.cpp
a/torch/csrc/jit/passes/peephole.cpp (the changes relating to _grad_sum_to_size)
a/torch/csrc/jit/register_prim_ops.cpp
a/torch/csrc/jit/symbolic_script.cpp
a/torch/csrc/jit/symbolic_variable.h
@@ -861,6 +862,8 @@ struct GraphFuser {
// The output of producer_for_chunk_node could have been used in some
// aten::size operators, so we need to clean those up as well (we simply
// broadcast all its tensor inputs).
// We need to insert these earlier in the graph.
WithInsertPoint guard2(producer_for_chunk_node);

@t-vi (Author, Collaborator) commented Apr 4, 2019

So here I had to change the insert point, because the current code assumed the sizes were not used between producer_for_chunk_node and the new bchunk, and the _size_if_not_same ops sat just there.
