
Fix the performance issue that the for-loop before ExternalCall could not be parallelized #86516

Merged
merged 1 commit into pytorch:release/1.13 on Oct 18, 2022

Conversation

EikanWang
Collaborator

Currently, NNC only parallelizes the loop statements of the graph outputs. This logic can bypass some loop statements that could be parallelized. Take the example below, and suppose the output of the `ExternalCall` is also the output of the NNC fusion group. The current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallelize the `ExternalCall` and bypasses `stmt1` and `stmt2`.

```c++
stmt1: For:
stmt2:   For:
stmt3: ExternalCall
```
Pull Request resolved: #85056
Approved by: https://github.com/frank-wei, https://github.com/bertmaher
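A minimal, hypothetical sketch of the idea in plain Python (not NNC's actual C++ pass; the `Stmt` mini-IR and helper names below are invented for illustration): the old logic only inspected the statement producing the graph output, while the fix considers every top-level loop statement, so `stmt1`/`stmt2` become candidates even when an `ExternalCall` produces the output.

```python
# Hypothetical mini-IR standing in for NNC's statements (illustration only).
class Stmt:
    def __init__(self, kind, body=()):
        self.kind, self.body = kind, list(body)

stmt2 = Stmt("For")                 # inner loop
stmt1 = Stmt("For", body=[stmt2])   # outermost loop nest
stmt3 = Stmt("ExternalCall")        # produces the fusion group's output
block = [stmt1, stmt3]

def old_candidates(block, output_stmt):
    # Old logic: only the statement producing the graph output is considered,
    # so a loop nest that merely precedes the output is bypassed.
    return [output_stmt] if output_stmt.kind == "For" else []

def new_candidates(block, output_stmt):
    # Fixed logic: every top-level For statement is a parallelization
    # candidate, even those that run before an ExternalCall.
    return [s for s in block if s.kind == "For"]

print(old_candidates(block, stmt3))  # [] -> stmt1 and stmt2 were bypassed
print(new_candidates(block, stmt3))  # [stmt1] -> the loop nest is parallelized
```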

@pytorch-bot

pytorch-bot bot commented Oct 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86516

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c111172:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Oct 8, 2022
Contributor

@atalman atalman left a comment


internal facing changes only, adding ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 12, 2022
Contributor

@atalman atalman left a comment


Please provide some details if this fix falls into Critical Issues:
silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks

@EikanWang
Collaborator Author

EikanWang commented Oct 13, 2022

@atalman, this PR fixes a critical performance regression.

@malfet: This does not sound like a bugfix to me, and it's not a regression either, but rather a feature work.

We planned three features for NNC: Channels Last, BF16, and post-op fusion. The post-op fusion targets channels-last only, and some design questions are still open, so it will be suspended. But some post-op fusion PRs (#77157 and #84038) have already landed. Since that implementation is built on top of `ExternalCall`, it triggers the NNC `ExternalCall` performance issue more frequently than 1.12 did. And because post-op fusion is channels-last only, the issue impacts the NNC channels-last feature.

@EikanWang
Collaborator Author

@atalman , please let me know if you have any other comments. cc @bertmaher

@EikanWang
Collaborator Author

@atalman , may I know if you have any comments on this PR?

@EikanWang
Collaborator Author

@atalman , this is a regression. Should I label it as a regression?

@atalman
Contributor

atalman commented Oct 18, 2022

@EikanWang, could you please explain the details of the regression?

@EikanWang
Collaborator Author

EikanWang commented Oct 18, 2022

@EikanWang, could you please explain the details of the regression?

Sure. Originally, any statement before an `ExternalCall` could not be parallelized, so NNC's performance was much worse than the corresponding ATen operator's. Take the following pseudo-code as an example.

```c++
stmt1: For:
stmt2:   For:
stmt3: ExternalCall
```

`stmt1` and `stmt2` could not be parallelized, so the performance was worse. But if we excluded the `ExternalCall`, `stmt1` and `stmt2` would be parallelized and the performance would come back.

As I mentioned before, we optimized NNC by fusing convolution with some of its post operators via `ExternalCall` when the layout is channels-last. So we always pull the channels-last convolution into the NNC fusion group and use `ExternalCall` to redirect to the external kernel (#77157 and #84038).

This optimization introduced a side effect: the issue is triggered more frequently, because convolution is widely used in CNN models, so their NNC fusion groups contain many `ExternalCall`s. We observed as much as a 30% performance regression for some torchvision models.
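As a hedged illustration in plain PyTorch (not the NNC internals), this is the kind of channels-last convolution plus post-op pattern that the post-op fusion PRs redirect to an external kernel, one `ExternalCall` per convolution, which is why CNN fusion groups accumulate many of them:

```python
import torch

# A channels-last convolution followed by a post-op (ReLU): the pattern that
# NNC's post-op fusion pulls into the fusion group and dispatches via
# ExternalCall. A typical CNN repeats this block many times, so its fusion
# groups contain many ExternalCalls.
x = torch.randn(1, 3, 8, 8).to(memory_format=torch.channels_last)
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
y = torch.relu(conv(x))

print(y.shape)  # torch.Size([1, 16, 8, 8])
```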

This PR is to fix the performance regression.

@malfet malfet merged commit f89a762 into pytorch:release/1.13 Oct 18, 2022
Labels
ciflow/trunk Trigger trunk jobs on your pull request open source release notes: jit release notes category