Permutation extended #76563
Conversation
Extended permutation support in integration (see more details on pytorch#1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitations at this time.

The idea in this implementation is the same as in our existing code, which is to permute input/output tensors outside of codegen. For a simplified binary op scenario `output = binaryOp(input0, input1)`:

1. In the simple case where `input0` and `input1` come with the same rank & permutation order, the output preserves that permutation;
2. Where `input0` and `input1` come with different ranks but **compatible** permutations, the tensor with the higher rank dictates the permutation of the output;
3. Where `input0` and `input1` come with different ranks and **incompatible** permutations, permutation propagation fails and the output tensor will be contiguous.

By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.

Some concrete examples (note that we comply with eager propagation in cases 1-3, but diverge in behavior for cases 4 and 5; a setup sketch for running these snippets follows the examples):

1. different rank & same permutation

```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1])        # stride (1, wc, c)
out = scripted_add(t0, t1)                                 # stride (hwc, 1, wc, c), preserving memory format of t0
```

2. different rank & compatible permutation

```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda()                           # stride (hw, w, 1)
out = scripted_add(t0, t1)                                 # stride (hwc, 1, wc, c), preserving memory format of t0
```

3. different rank & compatible permutation with broadcasting

```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1)     # stride (1, 1, 1)
out = scripted_add(t0, t1)                                 # stride (hwc, 1, wc, c), preserving memory format of t0
```

4. different rank & incompatible permutation

```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda()                              # stride (w, 1)
jit_out = scripted_add(t0, t1)
# stride (hwc, 1, wc, c)
# stride (hwc, wc, c, 1)   # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1)
# stride (hwc, 1, wc, c)
# stride (hwc, 1, wc, c)   # TI preserves memory format of LHS operand
```

5. different rank & incompatible permutation

```
t0 = torch.randn(c, h, w).cuda()                           # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1)
# stride (hwc, 1, wc, c)
# stride (hwc, 1, wc, c)   # nvfuser preserves memory format of the highest rank tensor
eager_out = eager_add(t0, t1)
# stride (hwc, 1, wc, c)
# stride (hwc, hw, w, 1)   # TensorIterator preserves memory format of LHS operand
```
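The snippets above reference `scripted_add` and `eager_add` without defining them. Below is a minimal sketch of a plausible setup; the helper definitions, the shape values, and the warm-up loop are assumptions for illustration, not part of the PR (depending on the build, the nvfuser backend may also need to be enabled explicitly).

```python
# Hypothetical setup for the examples above (assumed, not taken from the PR).
import torch

def eager_add(a, b):
    # eager mode: TensorIterator picks the output memory format
    return a + b

# scripting routes the same function through TorchScript, where the
# nvfuser integration extended by this PR can pick it up
scripted_add = torch.jit.script(eager_add)

b_, h, w, c = 2, 8, 8, 4  # arbitrary sizes, for illustration only
t0 = torch.randn(b_, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda()                            # stride (hw, w, 1)

# warm-up so the profiling executor can hand the graph to the fuser
for _ in range(3):
    jit_out = scripted_add(t0, t1)

# case 2 (compatible permutation): both paths should keep t0's memory format
print(jit_out.stride())
print(eager_add(t0, t1).stride())
```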
Cherry-picked from csarofeen#1614
LGTM
@@ -158,20 +174,110 @@ struct MemoryFormat {
   // storing stride_order in `permuted_order` for a simpler life, so we don't
   // have to decode `permutation_` when we want to apply/restore permutation_.
   permuted_order_ = stride_order;
-  bool has_permutation_ = false;
+  bool has_permutation = false;
   for (const auto i : c10::irange(rank)) {
     permutation_ = permutation_ * 10 + stride_order[i];
What if someone calls `setPermutation` twice? With `permutation_` as a class member this will lead to weird results.
🤕 I'll reset `permutation_` to 0.
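For context on the reviewer's concern, here is a minimal Python sketch of why the accumulator needs that reset; the class and method names are only illustrative stand-ins for the C++ `MemoryFormat::setPermutation` shown in the diff above.

```python
# Illustrative sketch only: mimics the digit-encoding loop from the diff above
# to show why calling the setter twice without a reset gives a corrupted value.
class MemoryFormatSketch:
    def __init__(self):
        self.permutation_ = 0

    def set_permutation(self, stride_order):
        self.permutation_ = 0  # the fix: reset before re-encoding
        for axis in stride_order:
            self.permutation_ = self.permutation_ * 10 + axis

fmt = MemoryFormatSketch()
fmt.set_permutation([0, 3, 1, 2])
print(fmt.permutation_)   # 312 (the leading 0 vanishes in the integer encoding)
fmt.set_permutation([0, 3, 1, 2])
print(fmt.permutation_)   # still 312; without the reset it would be 3120312
```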
Bumping for review 🙇
Bumping up for review again 🙇
@pytorchbot merge this
Hey @jjsjann123. |
Summary: same as the PR description above.
Pull Request resolved: #76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d23619b030444e2a77daab6aaa60988b765ba471
Reviewed By: malfet
Differential Revision: D36101858
Pulled By: malfet
fbshipit-source-id: 17662c68d7f1b448d72b270d6cfa6b8aea463df6