Fast path binary ops in fake tensor #94047

ezyang · 2023-02-03T14:17:37Z

Stack from ghstack (oldest at bottom):

-> Fast path binary ops in fake tensor #94047

Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.

Before:

cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010

After:

cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
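
To put the Before/After figures in relative terms, here is a small sketch that only re-derives ratios from the numbers printed above. The variable and function names are illustrative and are not part of the benchmark harness.

```python
# Figures copied verbatim from the Before/After output above.
before = {
    "entire_frame_compile": 53.97591,
    "backend_compile": 33.60832,
    "FakeTensorMode.__torch_dispatch__": 89985,
}
after = {
    "entire_frame_compile": 40.18931,
    "backend_compile": 25.28828,
    "FakeTensorMode.__torch_dispatch__": 69478,
}
fast_path_attempts = 4399  # the "attempt fast" counter in the After run


def relative_drop(old, new):
    """Fraction by which `new` is lower than `old`."""
    return (old - new) / old


for key in before:
    print(f"{key}: {relative_drop(before[key], after[key]):.1%} lower")

# Mode-level dispatch calls avoided per fast-path attempt.
dispatch_saved = (
    before["FakeTensorMode.__torch_dispatch__"]
    - after["FakeTensorMode.__torch_dispatch__"]
)
saved_per_attempt = dispatch_saved / fast_path_attempts
print(f"dispatch calls avoided per fast-path attempt: {saved_per_attempt:.2f}")
```

On these numbers, frame compile time drops by roughly a quarter, and each fast-path attempt avoids between four and five `FakeTensorMode.__torch_dispatch__` calls.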

My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#

This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:

diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
     def __torch_dispatch__(self, func, types, args=(), kwargs=None):
         kwargs = kwargs if kwargs else {}
 
+        with no_dispatch():
+            if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+                return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
         if func == torch.ops.prim.device.default:
             assert len(args) == 1 and isinstance(args[0], FakeTensor)
             if args[0].fake_mode.in_kernel_invocation:

I am still leaving about 5s of trace time improvement on the table (3s of which is attributable to not yet handling relu.)

The implementation here is based off of #93118 but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:

* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path if (1) at least one input tensor has a shape that is exactly the output size, and (2) all the tensors are contiguous (or if all the tensors are channels last).
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.
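
The eligibility check described above can be sketched in plain Python over `(shape, strides)` pairs. This is an illustration and not the code in this PR: the helper names are hypothetical, the channels-last branch and wrapped-number handling are omitted, and the contiguity test is a simplified version of the usual rule (strides of size-1 dimensions are ignored).

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]
Operand = Tuple[Shape, Tuple[int, ...]]  # (shape, strides)


def broadcast_shape(*shapes: Shape) -> Shape:
    """Right-aligned broadcasting of shapes, as for elementwise ops."""
    ndim = max(len(s) for s in shapes)
    out = [1] * ndim
    for shape in shapes:
        padded = (1,) * (ndim - len(shape)) + tuple(shape)
        for i, size in enumerate(padded):
            if size == 1:
                continue
            if out[i] != 1 and out[i] != size:
                raise ValueError(f"shapes {shapes} are not broadcastable")
            out[i] = size
    return tuple(out)


def is_contiguous(shape: Shape, strides: Sequence[int]) -> bool:
    """Row-major contiguity; size-1 dimensions may have any stride."""
    if 0 in shape:
        return True  # empty tensors are treated as contiguous
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue
        if stride != expected:
            return False
        expected *= size
    return True


def can_take_fast_path(operands: Sequence[Operand]) -> bool:
    """(1) some operand already has the output shape, (2) all are contiguous."""
    out_shape = broadcast_shape(*(shape for shape, _ in operands))
    has_full_size_operand = any(tuple(shape) == out_shape for shape, _ in operands)
    all_contiguous = all(is_contiguous(shape, strides) for shape, strides in operands)
    return has_full_size_operand and all_contiguous
```

For example, a contiguous `(2, 3)` tensor combined with a `(3,)` tensor qualifies, while `(2, 1)` combined with `(1, 3)` does not, because neither input already has the `(2, 3)` output shape.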

Some evidence that this heuristic is correct is in this gist: https://gist.github.com/ezyang/b22fa7b72b7349137211d8dc7041f758

I exhaustively test all dim=3 tensors with sizes [1, 2] and show that we get the same significant strides between PrimTorch and the new algorithm. There ARE differences between this algorithm and PrimTorch, but in those cases this algorithm agrees with TensorIterator where PrimTorch is wrong (sample case: size=(1, 1, 2), stride=(1, 1, 1), stride=(1, 1, 1)).
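
A minimal sketch of what comparing "significant strides" could look like. It rests on two assumptions that the text above does not spell out: that a stride is significant only when its dimension has size greater than 1, and that "sizes [1, 2]" means every dimension is either 1 or 2. It does not reproduce the gist; the function names are hypothetical.

```python
import itertools


def significant_strides(shape, strides):
    """Strides of dimensions with more than one element.

    The stride of a size-1 dimension never affects which memory is
    addressed, so two layouts that differ only there are equivalent.
    """
    return tuple(st for size, st in zip(shape, strides) if size > 1)


def same_layout(shape, strides_a, strides_b):
    """Compare two stride tuples, ignoring size-1 dimensions."""
    return significant_strides(shape, strides_a) == significant_strides(shape, strides_b)


# Every 3-dimensional shape whose sizes are drawn from {1, 2}.
shapes = list(itertools.product([1, 2], repeat=3))

for shape in shapes:
    print(shape, "significant dims:", sum(1 for size in shape if size > 1))
```

Under this definition, strides `(1, 1, 1)` and `(2, 2, 1)` describe the same layout for a tensor of size `(1, 1, 2)`, since only the last dimension is significant.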

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

pytorch-bot · 2023-02-03T14:17:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94047

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 79 Pending

As of commit df2e401:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

albanD

Could you give more details on the kind of gains we get from this? This does add significant complexity.

ezyang · 2023-02-03T14:26:24Z

PR description updated!

Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.

Before:

```
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

After:

```
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#

This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:

```
diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
     def __torch_dispatch__(self, func, types, args=(), kwargs=None):
         kwargs = kwargs if kwargs else {}
 
+        with no_dispatch():
+            if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+                return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
         if func == torch.ops.prim.device.default:
             assert len(args) == 1 and isinstance(args[0], FakeTensor)
             if args[0].fake_mode.in_kernel_invocation:
```

I am still leaving about 10s of trace time improvement on the table (5s of which is attributable to not yet handling relu.)

The implementation here is based off of #93118 but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:

* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path through if at least one of the input operands matches the broadcasted shape exactly (the idea being that we will probably use that tensor's layout.) I am pretty sure this is not sound, but I need to check tests to see how unsound it is.
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.

I intend to verify whether or not the new algorithm is correct using Z3.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 5ad698daf07101c437eb7868736c7f77c1b2e4f7
Pull Request resolved: #94047

eellison

Have you tested short circuiting while reusing more of the methods from prims or elsewhere for this? It would be nice to cut down on some of the duplication. I think a lot of the speedup would still be applicable.

eellison · 2023-02-03T17:54:48Z

torch/_subclasses/fake_tensor.py

+        try:
+            return self.dispatch(func, types, args, kwargs)
+        except TypeError:
+            log.exception("fake tensor raised TypeError")
+            raise


What are these changes for?

ezyang · 2023-02-03

When I was working on this PR I sometimes messed up my short circuit logic and triggered a TypeError. This TypeError was silently swallowed. Now I get a log for it.
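
The pattern in the snippet above (log, then re-raise) can be shown in isolation. This is a generic sketch using the standard `logging` module; `dispatch_with_logging` is a hypothetical wrapper for illustration and is not a function in `fake_tensor.py`.

```python
import logging

log = logging.getLogger(__name__)


def dispatch_with_logging(dispatch, *args, **kwargs):
    """Call `dispatch`, logging any TypeError with its traceback before re-raising."""
    try:
        return dispatch(*args, **kwargs)
    except TypeError:
        # log.exception records the message at ERROR level together with
        # the active traceback, so the failure is visible even if a caller
        # further up the stack later discards the exception.
        log.exception("fake tensor raised TypeError")
        raise
```

Because the exception is re-raised unchanged, callers see the same behavior as before; the only difference is the extra log record.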

eellison · 2023-02-03T17:55:57Z

torch/_subclasses/fake_tensor.py

+            )
+        if is_contiguous:
+            # do contiguous
+            count_label("fast is_contiguous")


imo it's a little strange to have this on by default and only have telemetry on such a small part of the model

ezyang · 2023-02-03

TBH we should do more telemetry. I'm happy to remove this but this was also very useful for understanding perf characteristics here.

eellison · 2023-02-03T18:00:04Z

torch/utils/_stats.py


-simple_call_counter = collections.OrderedDict()
+simple_call_counter: OrderedDict[str, int] = collections.OrderedDict()


Why do we need OrderedDict? Isn't dict already ordered?

ezyang · 2023-02-03

ask @voznesenskym

voznesenskym · 2023-02-07

I like preserving key order for things like this.
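
For context on the question above: since Python 3.7 the built-in `dict` is guaranteed to preserve insertion order, so both types keep counter labels in first-seen order. The sketch below shows where the two still differ. It uses only the standard library and is not taken from `torch/utils/_stats.py`.

```python
from collections import OrderedDict

labels = ["attempt fast", "fast is_contiguous"]

plain = {}
ordered = OrderedDict()
for label in labels:
    plain[label] = 0
    ordered[label] = 0

# Both iterate in insertion order.
same_iteration_order = list(plain) == list(ordered) == labels

# Difference 1: equality between two OrderedDicts is order-sensitive,
# while equality between plain dicts is not.
a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])
ordered_equal = a == b            # False: same items, different order
plain_equal = dict(a) == dict(b)  # True: order is ignored

# Difference 2: OrderedDict can reorder in place.
a.move_to_end("x")
reordered_keys = list(a)          # ["y", "x"]
```

For a simple label counter that is only appended to and printed, either type gives the same key order.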

ezyang · 2023-02-03T21:44:37Z

Now updated with some evidence the heuristic is OK; see the bottom of the PR description.

ezyang · 2023-02-04T12:24:37Z

Here are the most improved models with this change:

and the least improved:

eca_halonext26ts is also interesting: a long running model that isn't helped much

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 1c56d3d7930c3cf970742ea069f17941c2e18f5f
Pull Request resolved: #94047

albanD

Debug code needs to go but sounds ok otherwise.

albanD · 2023-02-06T15:04:37Z

torch/utils/_stats.py

+
+def count_label(label):
+    prev = simple_call_counter.setdefault(label, 0)
+    simple_call_counter[label] = prev + 1


You could use a defaultdict and just += 1 here.

ezyang · 2023-02-06

IDK why @voznesenskym didn't make this a defaultdict; I was minimizing changes here.

Sure, that's fine.
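
The `defaultdict` variant suggested in this thread might look like the sketch below. It is a hypothetical rewrite for illustration, not the code that was merged (the thread settles on leaving `count_label` as is).

```python
from collections import defaultdict

# Hypothetical replacement for the OrderedDict-based counter.
# defaultdict is a dict subclass, so it also preserves insertion order.
simple_call_counter = defaultdict(int)


def count_label(label):
    """Increment the counter for `label`, starting from 0 on first use."""
    simple_call_counter[label] += 1


count_label("attempt fast")
count_label("attempt fast")
count_label("fast is_contiguous")
print(dict(simple_call_counter))
```

One trade-off to keep in mind: merely reading a missing key from a `defaultdict` (for example `simple_call_counter["typo"]`) inserts it with value 0, whereas the `setdefault` version only creates keys inside `count_label`.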

albanD · 2023-02-06T15:15:11Z

btw the debug code I said needs to go is the one raised by Elias above that you said you'll remove before merging.

ezyang · 2023-02-06T16:29:56Z

OK, to be clear, @eellison do you want it removed? I would prefer it to stay but if someone says "please remove" I will remove.

albanD · 2023-02-06T16:47:38Z

I think the worse error message is worth fixing.
The others are all metrics collection and logging. So fine.

eellison

I'm fine with it I guess but if we get more telemetry we should make them not on by default. inductor has various loggings but I don't think any of them are on by default.

I don't particularly care in either direction. Alban's concern about readability is true but we can cross that bridge when we get to it.

eellison · 2023-02-06T18:08:35Z

torch/_subclasses/fake_tensor.py

+    return tuple(expandedSizes)
+
+
+def make_fast_binary_impl(slow_ref):


would it be worth moving this to fake_utils ? idk

ezyang · 2023-02-07

It only needs to be used here, so let's keep it here. If we want, we could make a separate module for "fake tensor op implementations".

eellison · 2023-02-06T18:13:42Z

torch/_subclasses/fake_tensor.py

+            with mode:
+                return slow_ref(*args, **kwargs)
+
+        count_label("attempt fast")


Would we even need the count_labels if these were factored out into functions? How slow are Python profilers? Or maybe they're just annoying to use.

ezyang · 2023-02-07

Yeah, it's a combination of slowness (for non-sampling profilers) and also annoyance (the function call will be lost in a sea of other function calls.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 07fb7ed8733856f090a02cacb25000a27d08145d
Pull Request resolved: #94047

ezyang · 2023-02-07T14:49:59Z

> I'm fine with it I guess but if we get more telemetry we should make them not on by default. inductor has various loggings but I don't think any of them are on by default.

The telemetry goes into some counters which don't get printed by default. That's the same as how dynamo collects counter telemetry too.

ezyang · 2023-02-07T14:53:03Z

@pytorchbot merge

pytorchmergebot · 2023-02-07T14:56:41Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2023-02-07T15:06:49Z

Merge failed

Reason: 1 mandatory check(s) failed (Rule superuser). The first few are:

Lint / lintrunner

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

ezyang · 2023-02-07T15:15:27Z

@pytorchbot merge

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 982f51e9932173556f9f3aee7825beca8527d953
Pull Request resolved: #94047

pytorchmergebot · 2023-02-07T15:17:07Z

Merge started

Fastpath binary ops

218ef4a

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

github-actions bot requested review from albanD, antoniojkim, bdhirsh, Chillee, jbschlosser, miladm, SherlockNoMad, voznesenskym and wconstab February 3, 2023 14:17

github-actions bot added the ciflow/inductor label Feb 3, 2023

albanD reviewed Feb 3, 2023

ezyang requested a review from ngimel February 3, 2023 14:26

ezyang changed the title from "Fastpath binary ops" to "Fast path binary ops in fake tensor" Feb 3, 2023

ezyang added a commit that referenced this pull request Feb 3, 2023

Fastpath binary ops

a6a482e

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 5ad698daf07101c437eb7868736c7f77c1b2e4f7
Pull Request resolved: #94047

ezyang added the release notes: composability and topic: not user facing labels Feb 3, 2023

eellison reviewed Feb 3, 2023

ezyang added the ciflow/trunk label Feb 3, 2023

ezyang requested review from albanD and eellison February 4, 2023 01:21

ezyang added a commit that referenced this pull request Feb 4, 2023

Fastpath binary ops

ccdd878

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 1c56d3d7930c3cf970742ea069f17941c2e18f5f
Pull Request resolved: #94047

albanD reviewed Feb 6, 2023

ezyang requested a review from albanD February 6, 2023 15:13

eellison approved these changes Feb 6, 2023

ezyang added a commit that referenced this pull request Feb 7, 2023

Fastpath binary ops

7e44729

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 07fb7ed8733856f090a02cacb25000a27d08145d
Pull Request resolved: #94047

ezyang added a commit that referenced this pull request Feb 7, 2023

Fastpath binary ops

8176086

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 982f51e9932173556f9f3aee7825beca8527d953
Pull Request resolved: #94047

pytorchmergebot added the Merged label Feb 7, 2023

pytorchmergebot closed this in d690a59 Feb 7, 2023

facebook-github-bot deleted the gh/ezyang/1778/head branch June 8, 2023 16:47
