[WIP] [RFC] add shape_preserving notions to decomps for fake_tensor specific short circuiting #93118

voznesenskym · 2023-01-27T00:55:09Z

hf_Reformer: from frame:74, backend:25 to frame:30, backend:5
mobilenet_v3_large: from frame:34, backend:26 to frame:18, backend:11

pytorch-bot · 2023-01-27T00:55:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93118

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 38 Failures

As of commit 39000f9:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2023-01-27T00:55:52Z

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

For more information, see https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

voznesenskym · 2023-01-27T01:07:10Z

torch/_refs/__init__.py

@@ -366,7 +368,7 @@ def __maybe_broadcast(x, shape):


 # Utilities should come BEFORE this import
-from torch._decomp import register_decomposition
+from torch._decomp import register_decomposition, shape_preserving_default


@Chillee prefers that this entire logical set of things live in fake_tensor. I am not opposed, but I am open to other designs as well. Interested in feedback.

ngimel · 2023-01-27T00:57:54Z

torch/_decomp/__init__.py

+    t = next((x for x in args if isinstance(x, torch.Tensor)), None)
+    if t is not None:
+        return FakeTensor(
+            t.fake_mode, torch.empty(t.shape, device="meta"), device=t.device


I'm pretty sure you need to return empty_strided here, because decompositions are expected to propagate strides
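
To make the stride concern concrete, here is a minimal pure-Python sketch (no PyTorch needed). `contiguous_strides` is a hypothetical helper that computes the row-major strides a fresh allocation of a given shape would get, which is what allocating from the shape alone amounts to; the channels-last strides stand in for an input whose layout the shortcut would silently drop.

```python
def contiguous_strides(shape):
    """Row-major strides, i.e. what a fresh allocation of `shape` would use."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= max(size, 1)
    return tuple(reversed(strides))


shape = (2, 3, 4, 5)            # N, C, H, W
channels_last = (60, 1, 15, 3)  # strides of an NHWC-laid-out input of that shape

fresh = contiguous_strides(shape)
print(fresh)                    # (60, 20, 5, 1)
print(fresh == channels_last)   # False: the input's layout was not propagated
```

Passing the input's own strides along (for example through `torch.empty_strided`) keeps the two in agreement.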

ngimel · 2023-01-27T00:59:47Z

torch/_refs/__init__.py


        return _ref

    return inner


 # Add has its own implementation because it has an alpha argument
-@register_decomposition(aten.add)
+@register_decomposition(aten.add, shape_preserving=shape_preserving_default)


This is wrong (it will output the shape of the first arg, and add output can have nothing to do with the shape of the first arg)
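
A small sketch of why the first argument's shape is not enough, assuming standard broadcasting rules (align from the trailing dimension, size-1 dimensions stretch). `broadcast_shape` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes, aligning from the trailing dimension."""
    out = []
    for i in range(1, max(len(a), len(b)) + 1):
        x = a[-i] if i <= len(a) else 1
        y = b[-i] if i <= len(b) else 1
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


first, second = (3, 1), (2, 1, 4)
print(broadcast_shape(first, second))  # (2, 3, 4), which matches neither input
```

The same helper folds over any number of operands, which is the situation for ternary ops.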

ngimel · 2023-01-27T01:00:07Z

torch/_refs/__init__.py

@@ -1170,6 +1180,7 @@ def float_power(
 @_make_elementwise_binary_reference(
    type_promotion_kind=utils.ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT,
    supports_two_python_scalars=True,
+    shape_preserving=shape_preserving_default,


same, binary ops cannot be shortcut this way

ngimel · 2023-01-27T01:10:39Z

torch/_refs/__init__.py

+        return args[0]
+    if isinstance(args[0], (int, float, SymInt, SymFloat)):
+        return args[1]
+    if args[0].shape == args[1].shape:


that's closer to what's needed for binary outs, but still not quite (0d cpu + 0d cuda or 0d cuda + 0d cpu would both produce cuda output)
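
A simplified model of the device rule described above, as an assumption-laden sketch: operands are represented as hypothetical `(ndim, device)` pairs, a 0-d CPU tensor is treated like a scalar that does not decide the output device, and device-mismatch errors between non-scalar operands are deliberately not modeled.

```python
def result_device(operands):
    """operands: list of (ndim, device) pairs. Simplified, illustrative model."""
    for ndim, device in operands:
        is_cpu_scalar = ndim == 0 and device == "cpu"
        if not is_cpu_scalar:
            # The first operand that is not a 0-d CPU tensor decides the device.
            return device
    return "cpu"


print(result_device([(0, "cpu"), (0, "cuda")]))  # cuda
print(result_device([(0, "cuda"), (0, "cpu")]))  # cuda
```

Returning either input unchanged would get one of these two orderings wrong.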

ngimel · 2023-01-27T01:11:01Z

torch/_refs/__init__.py

@@ -1574,7 +1600,7 @@ def rsub(
 # TODO: add docstring
 # TODO: consider refactoring this with add impl
 # sub has its own implementation because it has an alpha argument
-@register_decomposition(aten.sub)
+@register_decomposition(aten.sub, shape_preserving=shape_preserving_default)


again, default doesn't work for binary ops

ngimel · 2023-01-27T01:11:40Z

torch/_refs/__init__.py

-ceil_ = _make_inplace(ceil)
-clamp_ = _make_inplace(clamp)
+ceil_ = _make_inplace(ceil, shape_preserving=shape_preserving_default)
+clamp_ = _make_inplace(clamp, shape_preserving=shape_preserving_default)


clamp is a ternary op; its output shape is the combination of all three inputs

ngimel · 2023-01-27T01:14:10Z

torch/_refs/__init__.py

 clamp_min_ = _make_inplace(clamp_min)
 clamp_max_ = _make_inplace(clamp_max)
 conj_physical_ = _make_inplace(conj_physical)
 copysign_ = _make_inplace(copysign)
-cos_ = _make_inplace(cos)
+cos_ = _make_inplace(cos, shape_preserving=shape_preserving_default)


no unfortunately, because if input dtype is integral, output dtype will be float
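
The dtype issue can be sketched without PyTorch. This is a simplified model that uses dtype names as strings and assumes the default floating dtype is `float32`; `float_op_result_dtype` and the `PROMOTED_TO_FLOAT` set are hypothetical names used only for illustration.

```python
# Input dtypes that an int-to-float op such as cos or sin promotes.
PROMOTED_TO_FLOAT = {"bool", "uint8", "int8", "int16", "int32", "int64"}


def float_op_result_dtype(input_dtype, default_float="float32"):
    """Output dtype of an elementwise op that always computes in floating point."""
    if input_dtype in PROMOTED_TO_FLOAT:
        return default_float
    return input_dtype


print(float_op_result_dtype("int64"))    # float32
print(float_op_result_dtype("float16"))  # float16
```

A shortcut that copies the input's metadata wholesale would report `int64` in the first case.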

ngimel · 2023-01-27T01:14:30Z

torch/_refs/__init__.py

@@ -5265,12 +5291,12 @@ def exponential(self, rate=1, generator=None):
 sgn_ = _make_inplace(sgn)
 sigmoid_ = _make_inplace(sigmoid)
 sign_ = _make_inplace(sign)
-sin_ = _make_inplace(sin)
+sin_ = _make_inplace(sin, shape_preserving=shape_preserving_default)


same as cos for dtype

ezyang · 2023-01-27T03:01:54Z

Longer pr description plzzzz

ezyang · 2023-01-27T03:04:24Z

As @ngimel has already pointed out, this PR is not sound. Will need to discuss a real strategy here

ngimel · 2023-01-27T03:35:36Z

It can be implemented in a sound way, but if we want to (semi)automatically enable shortcuts, we need to do it more reliably. Unfortunately, as #93073 shows, OpInfo testing for refs/decompositions is not complete, so we can't rely on it to weed out all subtleties.

eellison · 2023-01-27T05:01:20Z

torch/_subclasses/fake_tensor.py

@@ -773,6 +773,15 @@ def __init__(
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs if kwargs else {}

+        from torch._decomp import decomposition_table
+
+        if func in decomposition_table:


this shouldn't be before the constants handling, because we might need to invalidate constants

It's just jammed in there for demonstration purposes; it probably needs to be called a good deal lower!

voznesenskym · 2023-01-27T06:01:33Z

As @ngimel has already pointed out, this PR is not sound. Will need to discuss a real strategy here

yeah for sure, this is just to get the conversation started - I don't really intend to land this

voznesenskym · 2023-01-27T06:03:20Z

Reiterating again, this is not to land so much as to have a place to discuss this.

So I think there's 2 parts here worth discussing:

Individual short circuit impls for ops / types of ops (unary, binary (broadcasting), dtype changing, etc)
Where we do register short circuits (alongside decomps? In a separate top level decomp table? within fake_tensor.py?)

ezyang · 2023-01-27T17:23:24Z

I am mostly interested in the shortcut implementation. I think the way I would structure the experiments here is I would first start by doing a trivial but obviously sound condition for running the shortcut, and see how much benefit we get from that before moving on.

The trivial shortcut validity condition is this:

The sizes match exactly
The strides match exactly
The dtypes match exactly
The devices match exactly
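
The four conditions above can be written down as a single predicate. This is a sketch over a hypothetical `TensorMeta` record rather than real fake tensors; the field names are assumptions chosen to mirror the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    size: tuple
    stride: tuple
    dtype: str
    device: str


def shortcut_is_trivially_valid(a, b):
    """True only when every piece of metadata matches exactly."""
    return (
        a.size == b.size
        and a.stride == b.stride
        and a.dtype == b.dtype
        and a.device == b.device
    )


x = TensorMeta((2, 3), (3, 1), "float32", "cuda")
y = TensorMeta((2, 3), (3, 1), "float32", "cuda")
z = TensorMeta((2, 3), (1, 2), "float32", "cuda")  # same size, transposed layout

print(shortcut_is_trivially_valid(x, y))  # True
print(shortcut_is_trivially_valid(x, z))  # False
```

When the predicate holds, the output metadata can be taken to equal either input's, so nothing needs to be recomputed.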

Also, your really good perf numbers were when we skipped allocating a tensor. I am also curious what your perf is now that you are calling torch.empty; we suspected that this path needs to be optimized; one start is an optimized torch.empty_like which copies metadata from a pre-existing tensor so we don't need to recompute, e.g., contiguity, etc.

ezyang · 2023-01-27T17:24:01Z

For structuring the shortcuts, I think there's a decent chance that we can just put the shortcutting in the PrimTorch meta function itself, and that might be good enough.

ngimel · 2023-01-27T17:43:44Z

empty_like is allowed to copy metadata only if the model is non-overlapping-and-dense. But it's interesting how much is saved by skipping allocation compared to skipping tedious slow-path metadata computation. Another idea is to have pointwise meta prop function written in a way we can @lru_cache its results, then chances are it doesn't have to be too fast, as we are calling it with the same strides/shapes/devices all the time.
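
A rough sketch of the caching idea, assuming all metadata is plain hashable Python values (static ints, strings). `pointwise_meta` is a hypothetical stand-in: the body just echoes the first operand's metadata where the real slow-path computation would go.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def pointwise_meta(op_name, operand_metas):
    """operand_metas: tuple of (shape, stride, dtype, device) tuples, all hashable."""
    # Placeholder for the slow metadata computation; this stand-in simply
    # echoes the first operand's metadata.
    return operand_metas[0]


meta = ((2, 3), (3, 1), "float32", "cuda")
pointwise_meta("add", (meta, meta))
pointwise_meta("add", (meta, meta))  # identical key: served from the cache

info = pointwise_meta.cache_info()
print(info.hits, info.misses)  # 1 1
```

The cache key is the op name plus the operand metadata, so repeated calls with the same strides/shapes/devices cost one dictionary lookup.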

voznesenskym · 2023-01-27T18:22:25Z

empty_like is allowed to copy metadata only if the model is non-overlapping-and-dense. But it's interesting how much is saved by skipping allocation compared to skipping tedious slow-path metadata computation. Another idea is to have pointwise meta prop function written in a way we can @lru_cache its results, then chances are it doesn't have to be too fast, as we are calling it with the same strides/shapes/devices all the time.

This was a working idea for a while, but this line of inquiry was shelved because we did not want to hash SymInts (the hash would be op + args). We could reopen it by hooking in with a pre-check, something like:

from functools import lru_cache

import torch

@lru_cache(None)
def _short_circuit_tensor_only(op, *args, **kwargs):
    # probably call the decomp and meta func tables
    ...

def short_circuit(op, *args, **kwargs):
    # Only take the cached path when every positional arg is a tensor,
    # so no SymInt ever needs to be hashed.
    if all(isinstance(x, torch.Tensor) for x in args):
        return _short_circuit_tensor_only(op, *args, **kwargs)
    return None

ezyang · 2023-01-29T20:33:32Z

still needs dtype

voznesenskym · 2023-02-01T02:47:17Z

@ngimel give it another scan? Not done yet.

ezyang · 2023-02-01T18:36:18Z

torch/_subclasses/fake_tensor.py

@@ -771,6 +846,10 @@ def __init__(
        self.shape_env = shape_env

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
+        # Gross circular reference hack
+        # import torch._prims as prims
+        skipped_prims = {torch._prims.mul}


What's this for?

ezyang · 2023-02-01T18:36:58Z

torch/_subclasses/fake_tensor.py

+def _short_circuit_unary_op(a):
+    if isinstance(a, FakeTensor):
+        return FakeTensor(
+            a.fake_mode, torch.empty(a.shape, device="meta"), device=a.device


this is not correct, need to preserve memory layout

ezyang · 2023-02-01T18:42:20Z

I'm taking over this

ngimel · 2023-02-01T18:48:30Z

torch/_subclasses/fake_tensor.py

+        return None
+
+    # Easy case - both match
+    safe_both_match = (a_broadcast.shape == b_broadcast.shape) or (


this should always be true, it's the final broadcast shape?

ngimel · 2023-02-01T18:49:46Z

torch/_subclasses/fake_tensor.py

+
+def _short_circuit_binary_broadcasting_op(a, b):
+    if isinstance(a, (int, float, SymInt, SymFloat)):
+        return b


not correct, if a is float/SymFloat and b is an integer tensor, result will be float
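
A simplified model of the scalar case, using dtype names as strings and assuming the default floating dtype is `float32`. The category table and `scalar_tensor_result_dtype` are hypothetical and cover only bool, int and float; complex numbers are left out.

```python
# Promotion category of each tensor dtype: bool < integer < floating point.
CATEGORY = {
    "bool": 0,
    "int32": 1, "int64": 1,
    "float16": 2, "float32": 2, "float64": 2,
}


def scalar_tensor_result_dtype(scalar, tensor_dtype, default_float="float32"):
    """Result dtype of `tensor <op> python_scalar` under this simplified model."""
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(scalar, bool):
        scalar_cat, scalar_dtype = 0, "bool"
    elif isinstance(scalar, int):
        scalar_cat, scalar_dtype = 1, "int64"
    elif isinstance(scalar, float):
        scalar_cat, scalar_dtype = 2, default_float
    else:
        raise TypeError(f"unsupported scalar type: {type(scalar).__name__}")
    # A scalar only changes the dtype when it is in a higher category.
    if scalar_cat > CATEGORY[tensor_dtype]:
        return scalar_dtype
    return tensor_dtype


print(scalar_tensor_result_dtype(2.5, "int32"))  # float32
print(scalar_tensor_result_dtype(2, "int32"))    # int32
print(scalar_tensor_result_dtype(2, "float16"))  # float16
```

Only in the last two cases would returning the tensor operand unchanged give the right dtype.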

eellison · 2023-02-01T20:00:25Z

torch/_subclasses/fake_tensor.py

@@ -726,6 +727,80 @@ def merge_devices(t):
    __torch_function__ = torch._C._disabled_torch_function_impl


+def _short_circuit_alloc_op(fake_mode, a, b, **kwargs):


nit: rename to short_circuit empty_strided ? this is the only place it is being called right now, and it's easier to reason about this way

eellison · 2023-02-01T20:02:32Z

torch/_subclasses/fake_tensor.py

+
+short_circuit_binary_ops = {
+    aten.add.Tensor,
+    aten.add_.Tensor,


we're missing additional correctness conditions if we use in-place ops: for add_(a, b), b has to broadcast/type promote to a; a cannot broadcast/type promote
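
The shape half of that in-place condition can be sketched as a one-directional broadcast check. `expandable_to` is a hypothetical helper; the dtype half (the promoted result must be storable in `a`'s dtype) is not modeled here.

```python
def expandable_to(src, dst):
    """True if shape `src` broadcasts to exactly `dst`, leaving `dst` unchanged."""
    if len(src) > len(dst):
        return False
    for s, d in zip(reversed(src), reversed(dst)):
        if s != d and s != 1:
            return False
    return True


a_shape, b_shape = (3, 4), (1, 4)
print(expandable_to(b_shape, a_shape))  # True:  add_(a, b) keeps a's shape
print(expandable_to(a_shape, b_shape))  # False: add_(b, a) would need b to grow
```

An in-place shortcut would need this check to pass with the written-to tensor as `dst` before it can return that tensor's metadata.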

eellison · 2023-02-01T20:04:09Z

torch/_subclasses/fake_tensor.py

+
+    return FakeTensor(
+        a_broadcast.fake_mode,
+        torch.empty(a_broadcast.shape, device="meta"),


strides here too, + dtype, + aliasing if we're doing add_

Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.

Before:

```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

After:

```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#

This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:

```
diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
     def __torch_dispatch__(self, func, types, args=(), kwargs=None):
         kwargs = kwargs if kwargs else {}
 
+        with no_dispatch():
+            if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+                return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
         if func == torch.ops.prim.device.default:
             assert len(args) == 1 and isinstance(args[0], FakeTensor)
             if args[0].fake_mode.in_kernel_invocation:
```

I am still leaving about 10s of trace time improvement on the table (5s of which is attributable to not yet handling relu.)

The implementation here is based off of #93118 but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:

* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path through if at least one of the input operands matches the broadcasted shape exactly (the idea being that we will probably use that tensor's layout.) I am pretty sure this is not sound, but I need to check tests to see how unsound it is.
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.

I intend to verify whether or not the new algorithm is correct using Z3.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]


Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.

Before:

```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

After:

```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#

This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:

```
diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
     def __torch_dispatch__(self, func, types, args=(), kwargs=None):
         kwargs = kwargs if kwargs else {}
 
+        with no_dispatch():
+            if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+                return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
         if func == torch.ops.prim.device.default:
             assert len(args) == 1 and isinstance(args[0], FakeTensor)
             if args[0].fake_mode.in_kernel_invocation:
```

I am still leaving about 5s of trace time improvement on the table (3s of which is attributable to not yet handling relu.)

The implementation here is based off of #93118 but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:

* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path if (1) at least one input tensor has a shape that is exactly the output size, and (2) all the tensors are contiguous (or if all the tensors are channels last).
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.

Some evidence that this heuristic is correct is here in: https://gist.github.com/ezyang/b22fa7b72b7349137211d8dc7041f758 I exhaustively test all dim=3 tensors with sizes [1, 2] and show that we get the same significant strides between PrimTorch and the new algorithm. In fact, there ARE differences between this algorithm and PrimTorch, but in fact this algorithm agrees with TensorIterator where PrimTorch is wrong (sample case: size=(1, 1, 2), stride=(1, 1, 1), stride=(1, 1, 1))

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #94047
Approved by: https://github.com/eellison
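
The two-part fast-path condition from the final commit message can be sketched in pure Python. This is an illustration, not the code that landed: operands are hypothetical `(shape, stride)` pairs, only the all-contiguous branch is modeled (the all-channels-last alternative is omitted), and wrapped numbers are ignored.

```python
def is_contiguous(shape, stride):
    """Row-major contiguity check; strides of size-1 dimensions are ignored."""
    if 0 in shape:
        return True  # empty tensors are treated as contiguous
    expected = 1
    for size, st in zip(reversed(shape), reversed(stride)):
        if size != 1 and st != expected:
            return False
        expected *= size
    return True


def fast_path_ok(out_shape, operands):
    """operands: list of (shape, stride) pairs; out_shape: the broadcasted shape."""
    some_input_matches = any(shape == out_shape for shape, _ in operands)
    all_contiguous = all(is_contiguous(shape, stride) for shape, stride in operands)
    return some_input_matches and all_contiguous


full = ((2, 3, 4), (12, 4, 1))
row = ((4,), (1,))
permuted = ((2, 3, 4), (1, 2, 6))

print(fast_path_ok((2, 3, 4), [full, row]))       # True
print(fast_path_ok((2, 3, 4), [permuted, row]))   # False: non-contiguous operand
print(fast_path_ok((3, 4), [((3, 1), (1, 1)), ((1, 4), (4, 1))]))  # False: no input has the output shape
```

When the check fails, the caller would fall back to the regular slow path rather than guess a layout.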

github-actions · 2023-04-02T20:33:38Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the ciflow/inductor label Jan 27, 2023

github-actions bot requested review from albanD, antoniojkim, bdhirsh, Chillee, ezyang, jbschlosser, miladm, SherlockNoMad and wconstab January 27, 2023 00:55

voznesenskym commented Jan 27, 2023


ngimel reviewed Jan 27, 2023


eellison reviewed Jan 27, 2023


voznesenskym added 2 commits January 27, 2023 08:23

Alternate impl

6adbaf3

lint

ab49d39

voznesenskym force-pushed the voz/fixes_alloc branch from ff18ac2 to ab49d39 Compare January 27, 2023 08:26

wip

565d8b6

A little more wip

41e1128

albanD removed their request for review January 30, 2023 18:20

voznesenskym added 2 commits January 31, 2023 22:25

wip

d85f438

A little fixing

39000f9

ezyang reviewed Feb 1, 2023


ngimel reviewed Feb 1, 2023


eellison reviewed Feb 1, 2023


ezyang mentioned this pull request Feb 3, 2023

Fast path binary ops in fake tensor #94047

Closed

github-actions bot added the Stale label Apr 2, 2023

ezyang closed this Apr 3, 2023
