[nnc] Support thread level parallelism in fused kernels #63386
Conversation
CI failures summary (Dr. CI, as of commit 82d2274): 4 new failures recognized by patterns; these failures do not appear to be due to upstream breakages.
@bertmaher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
torch/csrc/jit/tensorexpr/kernel.cpp (Outdated)
    for (int64_t i = loops.size(); i > 0; i--) {
      auto const& loop = loops[i - 1];
      if (auto stop = to<IntImm>(loop->stop())) {
        grainSize *= stop->value();
This assumes the loops are normalized at this point. While that is probably true in most cases, I'm not sure it is guaranteed.
I don't think anything's likely to un-normalize the loops before this point, but maybe I should simplify (and do stop-start) just to be safe. I guess the worst that happens is we miss a parallelization opportunity.
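As a reference, here is a minimal sketch of the "simplify and do stop - start" variant discussed above. This is not the PR's diff; it assumes the tensorexpr IR helpers (`ExprPtr`, `alloc<Sub>`, `IRSimplifier::simplify`, `to<IntImm>`) behave as in the codebase at the time:

```cpp
// Sketch only (assumes the torch::jit::tensorexpr headers, e.g.
// <torch/csrc/jit/tensorexpr/ir_simplifier.h>, and its namespace).
//
// Accumulate the extents of the outer loops without assuming they are
// normalized, i.e. use (stop - start) instead of stop.
int64_t grainSize = 1;
for (int64_t i = loops.size(); i > 0; i--) {
  auto const& loop = loops[i - 1];
  ExprPtr extent =
      IRSimplifier::simplify(alloc<Sub>(loop->stop(), loop->start()));
  if (auto e = to<IntImm>(extent)) {
    grainSize *= e->value();
  }
  // What to do when the extent is not a compile-time constant is left
  // open here, matching the excerpt above.
}
```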
    template <typename Bufs>
    static void parallelizeOuterLoops(LoopNest& l, Bufs&& bufs) {
      for (auto const& buf : bufs) {
        auto loops = l.getLoopStmtsFor(buf);
Since this function is called after fuseAllLoops, it is possible that multiple buffers belong to the same loopnest, so we could be repeating this loop multiple times for the same loopnest. I understand that this may not be incorrect at this point, but it could lead to bugs in the future.
IMO, we shouldn't be looking at output buffers and their loopnests. Instead, we should just take the root_stmt in the given LoopNest and apply parallelization to all loopnests in that stmt. Wdyt?
That seems OK to me, sure.
I guess where things could get a little weird is if multiple buffers are updated at different levels of the loopnest, e.g.:

    for i:
      y1[] = ...
      for j:
        y2[] = ...

The current approach sort of gives each buffer an "independent" chance to affect the loop parallelization. That doesn't seem terrible, tbh.
Actually, the more I think about it, is there really any advantage to starting with the root stmt and working down? From my POV it just makes the code a lot more complicated; the way things work now, I just get a nice vector of loops leading to a buffer and try to flatten them. If it's not flattenable, it simply fails and I give up.
Although, maybe it's not too much work to build up my own vector starting from the root. Idk.
If you are okay with having the same set of loops handled here for different bufs, then I have no objections to it.
Personally, I felt starting from the root_stmt might be better. We might need another API to extract all loops in the root_stmt, so maybe we can do this in the future.
If I understand correctly, going through all buffers would not miss parallelism opportunities like the following:

    for i
      for j1
        y1 = ... [data dependence exists between iterations]
      for j2
        y2 = ... [no data dependence between iterations]

i+j1 cannot be parallelized because there's a data dependence between iterations for y1, but i+j2 can be parallelized and we should not miss it. If this is what we are trying to do here, I guess we need a distribute transformation before flatten; flatten currently only handles perfectly nested loops.
Yeah the approach here will definitely miss opportunities where some nested loops are parallelizable. It's really kind of a best-effort thing to get simple elementwise fusions right, not a general solution to parallelism.
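For illustration, here is a hedged sketch of the distribute-before-flatten idea from the comment above. It is not part of this PR, and it assumes `LoopNest::distributeLoop`, `LoopNest::flatten`, `NodeFinder` (from tensorexpr/analysis.h), and `For::set_parallel` behave as in the tensorexpr API of the time; `loops` is the vector returned by `l.getLoopStmtsFor(buf)` in the snippet above.

```cpp
// Sketch only: turn an imperfect nest such as
//   for i { for j1 { y1 = ... }  for j2 { y2 = ... } }
// into two perfect nests by distributing the outer loop, then try to
// flatten each resulting nest on its own.
std::vector<ForPtr> distributed = LoopNest::distributeLoop(loops.front());
for (auto const& outer : distributed) {
  // Collect the loops of this (now perfect) nest, outermost first.
  std::vector<ForPtr> nest = {outer};
  for (auto const& inner : NodeFinder<For>::find(outer->body())) {
    nest.push_back(inner);
  }
  ForPtr flattened = nullptr;
  if (LoopNest::flatten(nest, &flattened) && flattened) {
    // Whether the flattened loop is actually safe to run in parallel
    // (no loop-carried dependence) still has to be decided separately.
    flattened->set_parallel();
  }
}
```

This way the nest carrying a loop-carried dependence can simply be left serial without blocking parallelization of the other nest.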
torch/csrc/jit/tensorexpr/kernel.cpp (Outdated)
      continue;
    }
    // Try to flatten the outer loops and parallelize them if successful.
    For* flattened = nullptr;
Nit: ForPtr please ;)
LGTM
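For context, a hedged sketch of how the flatten-and-parallelize step might read with the `ForPtr` spelling from the nit. The single-loop shortcut and the `set_parallel` call are assumptions about the surrounding code, not the exact diff:

```cpp
// Sketch only: try to flatten the outer loops leading to this buffer and
// mark the result as parallel; if flattening fails, give up on this buffer.
ForPtr flattened = nullptr;
if (loops.size() == 1) {
  flattened = loops[0];
} else {
  LoopNest::flatten(loops, &flattened);
}
if (flattened) {
  // Ask the LLVM codegen to emit this loop as a parallel dispatch.
  flattened->set_parallel();
}
```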
          callee(index, packed_data);
        }
      });
    } catch (...) {
Kinda curious about this place: why not terminate if there's an exception? I guess executing the remaining stmts would ultimately lead to wrong results?
Interesting point... if an exception happens here things are really screwed up, because we don't know how to unwind past llvm-generated frames. But no exceptions should be possible here, since we're just parallel-dispatching to our own kernel, which doesn't throw exceptions. So I was mainly putting the try-catch here to ensure that the compiler knew that it wouldn't need to unwind this frame.
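To make that concrete, here is a hedged sketch of the dispatch shim being discussed: the kernel-side callee never throws, and the `catch (...)` exists so that no exception can try to unwind through LLVM-generated frames. Function and type names are illustrative, not necessarily the PR's exact code:

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// The LLVM-compiled kernel exposes a plain C entry point that processes a
// single outer-loop index given a pointer to its packed arguments.
using ParallelCallee = void (*)(int64_t index, int8_t* packed_data);

// Sketch of the dispatch shim: fan the flattened outer loop out over the
// ATen thread pool. It is declared noexcept and swallows any exception so
// the compiler never has to unwind through LLVM-generated frames.
void DispatchParallel(
    int8_t* func,
    int64_t start,
    int64_t stop,
    int8_t* packed_data) noexcept {
  try {
    auto callee = reinterpret_cast<ParallelCallee>(func);
    at::parallel_for(start, stop, 1, [&](int64_t begin, int64_t end) {
      for (int64_t index = begin; index < end; index++) {
        callee(index, packed_data);
      }
    });
  } catch (...) {
    // The callee is our own generated kernel and does not throw; if
    // something does throw inside parallel_for, drop it here rather than
    // let it propagate into frames that cannot be unwound.
  }
}
```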
@bertmaher merged this pull request in d6d86ef.
This pull request has been reverted by 37d60c0.
Stack from ghstack:
Differential Revision: D30360382