
BatchNorm channels last memory format optimization on CPU #46234

Closed
mingfeima wants to merge 2 commits into master from channels_last/batch_norm

Conversation

mingfeima
Collaborator

BatchNorm channels last memory format optimization on CPU

@dr-ci

dr-ci bot commented Oct 13, 2020

💊 CI failures summary and remediations

As of commit 6b3ffd4 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/build.sh 
Auto-merging .circleci/docker/build.sh 
CONFLICT (add/add): Merge conflict in .circleci/config.yml 
Auto-merging .circleci/config.yml 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch"

Same merge conflicts in the .circleci files as in build (1/4) above.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/4)

Step: "(Optional) Merge target branch"

Same merge conflicts in the .circleci files as in build (1/4) above.

See CircleCI build binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build (4/4)

Step: "Checkout pytorch/builder repo" (full log | diagnosis details | 🔁 rerun)

fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878
+ git submodule update --init --recursive 
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 
+ sleep 4 
+ git submodule update --init --recursive 
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 
+ sleep 8 
+ git submodule update --init --recursive 
Warning: Permanently added the RSA host key for IP address '140.82.113.3' to the list of known hosts.  
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 26 times.

@mingfeima force-pushed the channels_last/batch_norm branch 2 times, most recently from bbb3a24 to 725856e on October 16, 2020, 02:04
@codecov

codecov bot commented Oct 16, 2020

Codecov Report

Merging #46234 (725856e) into master (996f444) will increase coverage by 7.56%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #46234       +/-   ##
===========================================
+ Coverage   60.82%   68.38%    +7.56%     
===========================================
  Files        2751      411     -2340     
  Lines      254434    53950   -200484     
===========================================
- Hits       154748    36894   -117854     
+ Misses      99686    17056    -82630     
| Impacted Files | Coverage | Δ |
|---|---|---|
| torch/nn/utils/memory_format.py | 25.00% <0.00%> | -75.00% ⬇️ |
| torch/backends/cudnn/rnn.py | 0.00% <0.00%> | -54.55% ⬇️ |
| torch/multiprocessing/_atfork.py | 61.53% <0.00%> | -30.77% ⬇️ |
| torch/cuda/random.py | 22.97% <0.00%> | -28.38% ⬇️ |
| torch/backends/cudnn/__init__.py | 59.74% <0.00%> | -27.28% ⬇️ |
| torch/cuda/__init__.py | 55.17% <0.00%> | -23.38% ⬇️ |
| torch/__init__.py | 65.25% <0.00%> | -18.15% ⬇️ |
| torch/nn/intrinsic/modules/fused.py | 84.00% <0.00%> | -16.00% ⬇️ |
| torch/utils/hooks.py | 83.33% <0.00%> | -13.89% ⬇️ |
| torch/quantization/quantization_mappings.py | 85.10% <0.00%> | -12.77% ⬇️ |
| ... and 2448 more | | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 996f444...6b3ffd4. Read the comment docs.

@facebook-github-bot
Contributor

Hi @mingfeima!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

parallel inference contiguous path

parallel inference channels last path

add dim apply

optimize update stats

add channels last support for backward

Revert "add channels last support for backward"

This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.

Revert "optimize update stats"

This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.

Revert "add dim apply"

This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.

bug fix
@mingfeima
Collaborator Author

  • The BatchNorm training forward path is parallelized, but the inference path is not parallelized at the moment. This explains why the inference path is slower than training forward on multiple cores in batchnorm_M1_N256_K3136_cpu_bwdall.
  • Moving the channels last impl to /cpu doesn't bring any performance improvement (gcc should be able to vectorize the simple a * x + b pattern pretty well), but I still did so since the overall code structure is cleaner. For example, only one version of batch_norm_cpu_inference_collect_linear_and_constant_terms needs to be kept (see the sketch after this list).
  • I noticed the original impl has special treatment for image_size=1; it is vectorized in this patch, so the K1 cases see some improvement.
  • Since the vectorization pattern is handled separately for image_size=1 and channels=1, I added new tests to cover these scenarios.
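
To make the "collect linear and constant terms" idea concrete: the per-channel scale and shift are folded ahead of time so that the per-element work is a plain a * x + b. A minimal Python sketch of that idea (illustrative only, not the actual C++ kernel referenced above):

```python
import torch

def batchnorm_inference_reference(x, weight, bias, running_mean, running_var, eps=1e-5):
    # Fold the per-channel scale and shift into two vectors ("linear and
    # constant terms"), so the per-element work is just alpha * x + beta.
    alpha = weight / torch.sqrt(running_var + eps)   # linear term, shape [C]
    beta = bias - running_mean * alpha               # constant term, shape [C]
    # Broadcast over N, H, W. For a channels_last input the innermost memory
    # dimension is C, so this maps to a contiguous alpha * x + beta loop.
    return x * alpha.view(1, -1, 1, 1) + beta.view(1, -1, 1, 1)

# Quick check against nn.BatchNorm2d in eval (inference) mode.
bn = torch.nn.BatchNorm2d(8).eval()
x = torch.randn(2, 8, 4, 4)
ref = batchnorm_inference_reference(x, bn.weight, bn.bias, bn.running_mean, bn.running_var, bn.eps)
assert torch.allclose(ref, bn(x), atol=1e-6)
```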

Updated performance numbers with the PyTorch internal operator benchmark, measured on a Xeon(R) Gold 6248 CPU with 20 cores per socket at 2.5 GHz. 1C refers to a single-core run and 20C to a single-socket run. jemalloc and numactl are applied to reduce fluctuation in the test results.

| Name | Tag | before (1C) | before (20C) | after (1C) | after (20C) |
|---|---|---|---|---|---|
| batchnorm_M1_N256_K3136_cpu | Short | 248.093 | 252.302 | 255.109 | 22.736 |
| batchnorm_M1_N256_K3136_cpu_bwdall | Short | 3605.876 | 404.381 | 3620.332 | 401.902 |
| batchnorm_M1_N256_K3136_cpu_bwd1 | Short | 3600.249 | 403.502 | 3573.064 | 403.201 |
| batchnorm_M1_N8192_K1_cpu | Long | 26.128 | 25.874 | 24.17 | 23.457 |
| batchnorm_M1_N2048_K1_cpu | Long | 10.377 | 10.455 | 9.845 | 9.978 |
| batchnorm_M128_N8192_K1_cpu | Long | 665.283 | 670.573 | 375.329 | 53.614 |
| batchnorm_M128_N2048_K1_cpu | Long | 177.345 | 184.846 | 98.43 | 22.127 |
| batchnorm_M1_N8192_K1_cpu_bwdall | Long | 16337.803 | 5278.381 | 15562.81 | 5200.735 |
| batchnorm_M1_N8192_K1_cpu_bwd1 | Long | 16343.511 | 5712.935 | 15526.72 | 4797.94 |
| batchnorm_M1_N2048_K1_cpu_bwdall | Long | 4098.062 | 1397.278 | 3894.704 | 1289.964 |
| batchnorm_M1_N2048_K1_cpu_bwd1 | Long | 4088.836 | 1297.217 | 3902.353 | 1390.161 |
| batchnorm_M128_N8192_K1_cpu_bwdall | Long | 36745.255 | 5309.559 | 34362.76 | 5134.023 |
| batchnorm_M128_N8192_K1_cpu_bwd1 | Long | 36715.947 | 5316.02 | 34458 | 5102.021 |
| batchnorm_M128_N2048_K1_cpu_bwdall | Long | 7132.537 | 1307.126 | 7074.514 | 1284.771 |
| batchnorm_M128_N2048_K1_cpu_bwd1 | Long | 7065.365 | 1361.857 | 6962.995 | 1343.29 |
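
For reference, the shapes in the table can also be timed outside the internal operator benchmark with a simple loop. A hedged sketch below, assuming M is the batch size, N the channel count, and K the flattened spatial size (e.g. 3136 = 56 × 56); this is an illustration, not the benchmark harness used for the numbers above:

```python
import time
import torch

# Hypothetical standalone version of the batchnorm_M1_N256_K3136_cpu case.
M, N, H, W = 1, 256, 56, 56
bn = torch.nn.BatchNorm2d(N).eval()

def bench(x, iters=200):
    with torch.no_grad():
        for _ in range(10):                  # warm-up
            bn(x)
        start = time.perf_counter()
        for _ in range(iters):
            bn(x)
    return (time.perf_counter() - start) / iters * 1e6  # microseconds per call

x_contig = torch.randn(M, N, H, W)
x_nhwc = x_contig.to(memory_format=torch.channels_last)
print(f"contiguous:    {bench(x_contig):.1f} us")
print(f"channels_last: {bench(x_nhwc):.1f} us")
```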

@mingfeima changed the title from "[WIP] BatchNorm channels last memory format optimization on CPU" to "BatchNorm channels last memory format optimization on CPU" on Nov 13, 2020
@ezyang added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Nov 13, 2020
Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Code looks good, but I see that you changed both the channels-first AND the channels-last implementations. Can you please make sure to benchmark both memory layouts?
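
One way to exercise and cross-check both memory layouts when benchmarking (an illustrative sketch, not part of this PR):

```python
import torch

x = torch.randn(8, 32, 28, 28)
x_nhwc = x.contiguous(memory_format=torch.channels_last)

bn = torch.nn.BatchNorm2d(32)

# The two layouts should agree numerically...
y_nchw = bn(x)
y_nhwc = bn(x_nhwc)
assert torch.allclose(y_nchw, y_nhwc, atol=1e-5)

# ...and checking the output memory format tells you which kernel path
# (channels-first vs. channels-last) was actually exercised.
print(y_nchw.is_contiguous())
print(y_nhwc.is_contiguous(memory_format=torch.channels_last))
```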

@VitalyFedyunin added the "module: memory format" label (memory format/layout related issues/changes: channels_last, nhwc) on Nov 16, 2020
Contributor

@VitalyFedyunin left a comment

I will approve to keep the process moving, but I will not land this until benchmarks are provided.

Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

With the channels last format, inference performance is OK, but training is not good enough yet: the current training implementation parallelizes over the channel dimension, which leaves each per-channel NHW plane non-contiguous in NHWC memory.
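
To illustrate the layout issue described above (an illustrative snippet, not code from this PR): in an NHWC-strided tensor the elements of one channel are spaced C apart in memory, so a per-channel reduction over N, H, W walks a strided, non-contiguous slice.

```python
import torch

N, C, H, W = 2, 16, 4, 4
x = torch.randn(N, C, H, W).contiguous(memory_format=torch.channels_last)

# Logical shape is still NCHW, but the memory order is NHWC:
print(x.stride())       # (C*H*W, 1, W*C, C) == (256, 1, 64, 16)

# A single channel's NHW "plane" is strided with step C in memory, which is
# what makes parallelizing the reduction over the channel dimension
# cache-unfriendly for channels last input.
plane = x[:, 0]         # shape (N, H, W)
print(plane.stride())   # (256, 64, 16)
print(plane.is_contiguous())  # False
```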

I need to keep improving this PR; once it is finished I will stack it with the other channels last pull requests.

@mingfeima
Collaborator Author

Please review the new PR: #48919

@mingfeima closed this on Dec 7, 2020
Labels: cla signed, module: memory format (channels_last, nhwc), open source, triaged

5 participants