Fix the performance issue that the for-loop before ExternallCall #86516

EikanWang · 2022-10-08T01:33:17Z

Currently, NNC only parallelizes the loop statements that produce the graph outputs, so the logic can bypass other loop statements that could also be parallelized. Take the following example, and suppose the output of `ExternalCall` is also an output of the NNC fusion group. The current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallelize the `ExternalCall` and bypasses `stmt1` and `stmt2`.

```c++
stmt1: For:
stmt2:   For:
stmt3: ExternalCall
```

Pull Request resolved: #85056
Approved by: https://github.com/frank-wei, https://github.com/bertmaher

pytorch-bot · 2022-10-08T01:33:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86516

✅ No Failures

As of commit c111172:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

atalman

internal facing changes only, adding ciflow/trunk

atalman

Please provide some details if this fix falls into Critical Issues:
silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks

EikanWang · 2022-10-13T07:21:29Z

@atalman, this PR fixes a critical performance regression.

@malfet: This does not sound like a bugfix to me, and it's not a regression either, but rather a feature work.

We planned three features for NNC: Channels Last, BF16, and Post-op fusion. Post-op fusion applies to channels-last only, and part of its design is still open, so that work will be suspended. However, some post-op fusion PRs (#77157 and #84038) have already landed. Because their implementation is built on top of ExternalCall, they trigger the NNC ExternalCall performance issue much more frequently than in 1.12. And since post-op fusion is for channels-last only, the issue impacts the NNC channels-last feature.

EikanWang · 2022-10-14T00:05:37Z

@atalman , please let me know if you have any other comments. cc @bertmaher

EikanWang · 2022-10-17T07:15:02Z

@atalman , may I know if you have any comments on this PR?

EikanWang · 2022-10-17T08:45:11Z

@atalman , this is a regression. Should I label it as a regression?

atalman · 2022-10-18T00:31:24Z

@EikanWang could you please explain the details of the regression?

EikanWang · 2022-10-18T00:50:04Z

> @EikanWang could you please explain the details of regression?

Sure. Originally, no statement that precedes an ExternalCall could be parallelized, so NNC performed much worse than the aten operator. Take the following pseudo-code as an example.

stmt1: For:
stmt2:   For:
stmt3: ExternalCall

Because stmt1 and stmt2 could not be parallelized, performance was worse. If we exclude the ExternalCall, stmt1 and stmt2 are parallelized and the performance recovers.
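To make the difference concrete, here is a minimal sketch on a toy IR. It is not the real NNC code: `Stmt`, `Kind`, `makeExample`, `parallelizeOutputsOnly`, `parallelizeAllTopLevelLoops`, and the buffer names `tmp` and `out` are all hypothetical names made up for illustration, and whatever legality checks the real pass performs are omitted. It only shows why a selection keyed on graph outputs skips stmt1, while a selection over every outermost loop picks it up.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for the IR; these are NOT the real NNC classes.
enum class Kind { For, ExternalCall };

struct Stmt {
  Kind kind;
  std::string name;        // label from the example, e.g. "stmt1"
  std::string writes;      // buffer this statement produces (hypothetical names)
  std::vector<Stmt> body;  // nested statements of a For
  bool parallel = false;   // set when the loop is chosen for parallelization
};

// stmt1: For:            -> writes the intermediate buffer "tmp"
// stmt2:   For:
// stmt3: ExternalCall    -> writes "out", the output of the fusion group
std::vector<Stmt> makeExample() {
  Stmt stmt2{Kind::For, "stmt2", "tmp", {}};
  Stmt stmt1{Kind::For, "stmt1", "tmp", {stmt2}};
  Stmt stmt3{Kind::ExternalCall, "stmt3", "out", {}};
  return {stmt1, stmt3};
}

// Old selection as described in this thread: only statements that write a
// graph output are candidates, and an ExternalCall has no loop to mark.
std::vector<std::string> parallelizeOutputsOnly(
    std::vector<Stmt>& root, const std::vector<std::string>& outputs) {
  std::vector<std::string> marked;
  for (Stmt& s : root) {
    const bool writesOutput =
        std::find(outputs.begin(), outputs.end(), s.writes) != outputs.end();
    if (writesOutput && s.kind == Kind::For) {
      s.parallel = true;
      marked.push_back(s.name);
    }
  }
  return marked;
}

// Selection the fix argues for (an assumption about its shape): every
// outermost For is a candidate, whether or not it writes a graph output.
// Nested loops are left alone in this sketch.
std::vector<std::string> parallelizeAllTopLevelLoops(std::vector<Stmt>& root) {
  std::vector<std::string> marked;
  for (Stmt& s : root) {
    if (s.kind == Kind::For) {
      s.parallel = true;
      marked.push_back(s.name);
    }
  }
  return marked;
}

// Usage: run both selections on the example from this thread.
void demo() {
  std::vector<Stmt> before = makeExample();
  assert(parallelizeOutputsOnly(before, {"out"}).empty());  // stmt1 is bypassed

  std::vector<Stmt> after = makeExample();
  std::vector<std::string> marked = parallelizeAllTopLevelLoops(after);
  assert(marked.size() == 1 && marked[0] == "stmt1");  // stmt1 is picked up
}
```

In the output-only selection the sole candidate is stmt3, which is a call rather than a loop, so nothing ends up parallel. Walking all top-level statements instead marks stmt1 and leaves the ExternalCall untouched.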

As I mentioned before, we optimized NNC by fusing convolution with some of its post-operators via ExternalCall when the layout is channels-last. So we always pull the channels-last convolution into the NNC fusion group and use ExternalCall to redirect to the external kernel (#77157 and #84038).

This optimization introduced a side effect: the issue is triggered more frequently, because convolutions are widely used in CNN models and so the NNC fusion groups for those models contain many ExternalCalls. We even observed a 30% performance regression for some torchvision models.

This PR is to fix the performance regression.

pytorch-bot bot added the release notes: jit release notes category label Oct 8, 2022

EikanWang mentioned this pull request Oct 8, 2022

[v.1.13.0] Release Tracker #86312

Closed

pytorchbot added the open source label Oct 8, 2022

atalman approved these changes Oct 12, 2022

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 12, 2022

atalman requested changes Oct 12, 2022

EikanWang requested a review from atalman October 14, 2022 00:04

malfet approved these changes Oct 18, 2022

malfet merged commit f89a762 into pytorch:release/1.13 Oct 18, 2022

malfet mentioned this pull request Nov 23, 2022

Add c10:: namespace in front of optional #89605

Closed
