Fix the performance issue that the for-loop before ExternalCall #86516
…d not be parallelized. (#85056)

Currently, NNC only parallelizes the loop statements of the graph outputs, so the logic can bypass other loop statements that could also be parallelized. Take the following example and suppose the output of `ExternalCall` is also the output of the NNC fusion group. The current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallelize the `ExternalCall` and bypasses `stmt1` and `stmt2`.

```c++
stmt1: For:
stmt2: For:
stmt3: ExternalCall
```

Pull Request resolved: #85056
Approved by: https://github.com/frank-wei, https://github.com/bertmaher
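The gap described above can be illustrated with a small, self-contained sketch. It does not use NNC's real classes: `Stmt`, `Kind`, `loopsWritingOutputs`, `allLoops`, and `exampleGroup` are hypothetical names invented for illustration, and the buffer names `tmp1`, `tmp2`, and `out` are assumptions. The first function models a selection that only looks at statements writing a graph output; the second is an assumed alternative that considers every top-level loop, and it is not necessarily how this PR implements the fix.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of a fusion group; these are not NNC's real classes.
enum class Kind { For, ExternalCall };

struct Stmt {
  std::string name;    // label such as "stmt1"
  Kind kind;           // a loop, or an opaque external call
  std::string writes;  // name of the buffer this statement produces (assumed)
};

// Models the limitation described above: only statements that write a graph
// output are candidates, and only loops can actually be parallelized.
std::vector<std::string> loopsWritingOutputs(
    const std::vector<Stmt>& stmts, const std::set<std::string>& outputs) {
  assert(!outputs.empty() && "a fusion group has at least one output");
  std::vector<std::string> picked;
  for (const Stmt& s : stmts) {
    if (outputs.count(s.writes) != 0 && s.kind == Kind::For) {
      picked.push_back(s.name);
    }
  }
  return picked;
}

// Assumed alternative for illustration: treat every top-level loop as a
// candidate, whether or not it writes a graph output.
std::vector<std::string> allLoops(const std::vector<Stmt>& stmts) {
  std::vector<std::string> picked;
  for (const Stmt& s : stmts) {
    if (s.kind == Kind::For) {
      picked.push_back(s.name);
    }
  }
  return picked;
}

// The three-statement example from the description: two loops produce
// intermediate buffers, and the external call produces the group output.
std::vector<Stmt> exampleGroup() {
  return std::vector<Stmt>{
      Stmt{"stmt1", Kind::For, "tmp1"},
      Stmt{"stmt2", Kind::For, "tmp2"},
      Stmt{"stmt3", Kind::ExternalCall, "out"},
  };
}
```

With `out` as the only graph output, `loopsWritingOutputs` selects nothing, because the only statement writing `out` is the external call, while `allLoops` selects `stmt1` and `stmt2`.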
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/86516.
Note: Links to docs will display an error until the docs builds have been completed.
No failures as of commit c111172. This comment was automatically generated by Dr. CI and updates every 15 minutes.
internal facing changes only, adding ciflow/trunk
Please provide some details if this fix falls into Critical Issues:
silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks
@atalman, this PR fixes a critical performance regression.
We planned three features for NNC: channels last, BF16, and post-op fusion. Post-op fusion is for channels last only, and some design questions are still open, so it will be suspended. However, some post-op fusion PRs (77157 and 84038) have already landed. Because that implementation is built on top of ExternalCall, it triggers the NNC ExternalCall performance issue more frequently than in 1.12. And since post-op fusion applies only to channels last, the issue impacts the NNC channels-last feature.
@atalman, please let me know if you have any other comments. cc @bertmaher
@atalman, may I know if you have any comments on this PR?
@atalman, this is a regression. Should I label it as a regression?
@EikanWang could you please explain the details of the regression?
Sure. Originally, any […]
stmt1: For:
stmt2: For:
stmt3: ExternalCall
The […]
As I mentioned before, we optimized NNC by fusing […]
This optimization introduced a side effect: the issue is triggered more frequently because there are many […]
This PR is to fix the performance regression.