-
Notifications
You must be signed in to change notification settings - Fork 25.2k
improve multi-core performance of qupsample_nearest2d #69600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands: # ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow For more information, please take a look at the CI Flow Wiki. |
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 00d8214 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
[ghstack-poisoned]
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Hi @jerryzh168 We have a few int8-related PRs targeting 1.11. And this is one of them. Could you prioritize to review these PRs? Thanks. |
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Updates of this PROverall ideas of #69600 and #69601 are the same: improves Both The original kernel on nchw layout is sequential, loops on an order of {H, W, NC} which lead to non-contiguous memory access on {NC} (usually we need to make sure the most inner loop is contiguous). But the benefit is that each index on the output feature map plane only calculates its corresponding input feature map window only once. If we simply loops on the order of {NC, H, W}, the input index calculate would be duplicated by NC times, but the memory access is contiguous. This PR did a tradeoff: pre-calculate input index on {W} dimension so that the index calculation would not be overwhelming, parallel is done on {NC, H}, for normal width the pre-calculated index should be L1 reside. On nhwc layout, the original kernel is sequential, this PR parallels on {N, H, W} and vectorization is done on {C}. Benchmarking
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual sockets, 20 cores per socket. 1.a) single core run on short tag (nchw)The first few cases in the short tag is too small (non-contiguous access issue is alleviated)
1.b) single socket run on short tag (nchw)Again, the first few cases in the short tag is too small...
2.a) single core run on long tag (nchw)the single core performance improves since the non-contiguous access is fixed.
2.b) single socket run on long tag (nchw)
3.a) single core run (nchw v.s. nhwc)qupsample favors nhwc over nchw since it can be vectorized on nhwc.
3.b) single socket run (nchw v.s. nhwc)
|
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
Differential Revision: [D33353153](https://our.internmc.facebook.com/intern/diff/D33353153) [ghstack-poisoned]
@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@pytorchbot merge this (Initiating merge automatically since Phabricator Diff has merged) |
Summary: Pull Request resolved: #69600 Differential Revision: D33353153 D33353153 Test Plan: Imported from OSS Reviewed By: jerryzh168 Pulled By: dzdang fbshipit-source-id: 15cb72d043b371f251dc3f2e03e6cb0243c6922c
Stack from ghstack:
Differential Revision: D33353153