
add channels last for AdaptiveAvgPool2d #48916

Closed (1 commit)

Conversation

@mingfeima (Collaborator) commented Dec 7, 2020

Stack from ghstack:

- optimize adaptive average pool2d forward path
- optimize adaptive average pool2d backward path
- remove unused headers
- rename the header; adaptive max pooling will be added in a future patch
- loosen adaptive_pool2d test on nhwc to cover both CUDA and CPU devices
- assorted minor changes

Differential Revision: D25399469

@mingfeima (Collaborator, Author) commented Dec 7, 2020

This PR replaces #42104.

Updates:

  1. Add support for the ChannelsLast memory format on CPU; this path is vectorized along the channels dimension.
  2. Move the contiguous path to native/cpu.

adaptive_avg_pool2d has a fast path for the contiguous memory format when output_size is 1x1; this patch does not change that. A similar fast path for channels last would require a reshape that makes it less performant, and since the generic adaptive_avg_pool2d_kernel on channels last already outperforms that fast path, I skipped implementing a 1x1 fast path for channels last.
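From the user's side, the channels-last path described above is reached through the public PyTorch API. A minimal sketch (tensor sizes here are small stand-ins, not the benchmark shapes):

```python
import torch

# Convert an NCHW tensor to the channels_last (NHWC) memory layout.
x = torch.randn(2, 8, 7, 7).to(memory_format=torch.channels_last)

pool = torch.nn.AdaptiveAvgPool2d((2, 2))
y = pool(x)                      # shape: [2, 8, 2, 2]

# With a 1x1 output, adaptive average pooling is just a global mean over H, W.
y1 = torch.nn.AdaptiveAvgPool2d(1)(x)
ref = x.mean(dim=(2, 3), keepdim=True)
print(torch.allclose(y1, ref))   # True
```

The 1x1 equivalence is what makes the contiguous fast path possible; on channels last the generic kernel handles that case directly.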

Results:

Machine: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 2 sockets x 20 cores.
Bench: use this script to reproduce: ./run.sh adaptive_avg_pool2d.py.
Input sizes: [1, 2048, 7, 7] and [128, 2048, 7, 7] (shapes used in ResNet50).
Output sizes: [1, 1] and [2, 2].
Both single thread (1 core) and single socket (20 cores) were tested.

Code base: before: 96aaa311, after: 86bf0cc7.

Time per iteration (unit: ms), the lower the better.

| #cores | input_size        | output_size | before (contiguous) | after (contiguous) | after (channels_last) | cl/contig |
|--------|-------------------|-------------|---------------------|--------------------|-----------------------|-----------|
| 20     | [1, 2048, 7, 7]   | [2, 2]      | 0.025               | 0.021              | 0.020                 | 1.05      |
| 20     | [128, 2048, 7, 7] | [2, 2]      | 1.973               | 1.021              | 0.373                 | 2.74      |
| 20     | [1, 2048, 7, 7]   | [1, 1]      | 0.033               | 0.032              | 0.022                 | 1.45      |
| 20     | [128, 2048, 7, 7] | [1, 1]      | 0.440               | 0.451              | 0.311                 | 1.45      |
| 1      | [1, 2048, 7, 7]   | [2, 2]      | 0.220               | 0.146              | 0.034                 | 4.29      |
| 1      | [128, 2048, 7, 7] | [2, 2]      | 26.806              | 16.949             | 5.309                 | 3.19      |
| 1      | [1, 2048, 7, 7]   | [1, 1]      | 0.037               | 0.037              | 0.016                 | 2.31      |
| 1      | [128, 2048, 7, 7] | [1, 1]      | 4.649               | 4.605              | 3.626                 | 1.27      |
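The run.sh / adaptive_avg_pool2d.py harness that produced these numbers is not reproduced here; the following is a minimal, assumption-based timing sketch of the same comparison (warm-up and iteration counts are my choices, not the author's):

```python
import time
import torch

def bench_ms(x, output_size, iters=20, warmup=5):
    """Return average time per adaptive_avg_pool2d call in milliseconds."""
    pool = torch.nn.AdaptiveAvgPool2d(output_size)
    for _ in range(warmup):          # warm caches / allocator before timing
        pool(x)
    start = time.perf_counter()
    for _ in range(iters):
        pool(x)
    return (time.perf_counter() - start) / iters * 1e3

x = torch.randn(128, 2048, 7, 7)                 # contiguous NCHW layout
x_cl = x.to(memory_format=torch.channels_last)   # NHWC layout

contig_ms = bench_ms(x, (2, 2))
cl_ms = bench_ms(x_cl, (2, 2))
print(f"contiguous: {contig_ms:.3f} ms  channels_last: {cl_ms:.3f} ms")
```

Absolute numbers will vary with core count and thread settings (e.g. torch.set_num_threads), so only the contiguous vs channels_last ratio is meaningful to compare.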

@codecov bot commented Dec 7, 2020

Codecov Report

Merging #48916 (06a3207) into gh/mingfeima/2/base (e429d05) will decrease coverage by 0.00%.
The diff coverage is 94.11%.

@@                   Coverage Diff                   @@
##           gh/mingfeima/2/base   #48916      +/-   ##
=======================================================
- Coverage                80.76%   80.76%   -0.01%     
=======================================================
  Files                     1867     1869       +2     
  Lines                   201584   201542      -42     
=======================================================
- Hits                    162817   162782      -35     
+ Misses                   38767    38760       -7     

@facebook-github-bot (Contributor) commented: @VitalyFedyunin merged this pull request in 690eaf9.
