
BatchNorm channels last memory format optimization on CPU #46234

Closed
mingfeima wants to merge 2 commits into master from channels_last/batch_norm

Conversation

mingfeima
Collaborator

BatchNorm channels last memory format optimization on CPU

@dr-ci

dr-ci bot commented Oct 13, 2020

💊 CI failures summary and remediations

As of commit 6b3ffd4 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/build.sh 
Auto-merging .circleci/docker/build.sh 
CONFLICT (add/add): Merge conflict in .circleci/config.yml 
Auto-merging .circleci/config.yml 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch"

Same merge conflicts in the .circleci files as in build (1/4) above.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/4)

Step: "(Optional) Merge target branch"

Same merge conflicts in the .circleci files as in build (1/4) above.

See CircleCI build binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build (4/4)

Step: "Checkout pytorch/builder repo" (full log | diagnosis details | 🔁 rerun)

fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878
+ git submodule update --init --recursive 
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 
+ sleep 4 
+ git submodule update --init --recursive 
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 
+ sleep 8 
+ git submodule update --init --recursive 
Warning: Permanently added the RSA host key for IP address '140.82.113.3' to the list of known hosts.  
fatal: reference is not a tree: cd5a9b73c3028d2496666201588111a8c8d84878 
Unable to checkout 'cd5a9b73c3028d2496666201588111a8c8d84878' in submodule path 'third_party/nccl/nccl' 

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 26 times.

@mingfeima force-pushed the channels_last/batch_norm branch 2 times, most recently from bbb3a24 to 725856e on October 16, 2020, 02:04
@codecov

codecov bot commented Oct 16, 2020

Codecov Report

Merging #46234 (725856e) into master (996f444) will increase coverage by 7.56%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #46234       +/-   ##
===========================================
+ Coverage   60.82%   68.38%    +7.56%     
===========================================
  Files        2751      411     -2340     
  Lines      254434    53950   -200484     
===========================================
- Hits       154748    36894   -117854     
+ Misses      99686    17056    -82630     
| Impacted Files | Coverage | Δ |
|---|---|---|
| torch/nn/utils/memory_format.py | 25.00% <0.00%> | -75.00% ⬇️ |
| torch/backends/cudnn/rnn.py | 0.00% <0.00%> | -54.55% ⬇️ |
| torch/multiprocessing/_atfork.py | 61.53% <0.00%> | -30.77% ⬇️ |
| torch/cuda/random.py | 22.97% <0.00%> | -28.38% ⬇️ |
| torch/backends/cudnn/__init__.py | 59.74% <0.00%> | -27.28% ⬇️ |
| torch/cuda/__init__.py | 55.17% <0.00%> | -23.38% ⬇️ |
| torch/__init__.py | 65.25% <0.00%> | -18.15% ⬇️ |
| torch/nn/intrinsic/modules/fused.py | 84.00% <0.00%> | -16.00% ⬇️ |
| torch/utils/hooks.py | 83.33% <0.00%> | -13.89% ⬇️ |
| torch/quantization/quantization_mappings.py | 85.10% <0.00%> | -12.77% ⬇️ |
| ... and 2448 more | | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 996f444...6b3ffd4. Read the comment docs.

@facebook-github-bot
Contributor

Hi @mingfeima!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

parallel inference contiguous path

parallel inference channels last path

add dim apply

optimize update stats

add channels last support for backward

Revert "add channels last support for backward"

This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.

Revert "optimize update stats"

This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.

Revert "add dim apply"

This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.

bug fix
@mingfeima
Collaborator Author

  • The BatchNorm training forward path is parallelized, but the inference path is not parallelized at the moment. This explains why the inference path is slower than training forward on multiple cores in batchnorm_M1_N256_K3136_cpu_bwdall.
  • Moving the channels last impl to /cpu doesn't bring any performance improvement (gcc should be able to vectorize the simple a * x + b pattern pretty well), but I still did so since the overall code structure is cleaner. For example, only one version of batch_norm_cpu_inference_collect_linear_and_constant_terms needs to be kept (see the sketch after this list).
  • I noticed the original impl has special treatment for image_size=1; it is vectorized in this patch, so the K1 cases see some improvement.
  • Since the vectorization pattern is handled separately for image_size=1 and channels=1, I added new tests to cover these scenarios.
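
To make the "collect linear and constant terms" idea concrete: the per-channel scale and shift are folded ahead of time so that the per-element work is a plain a * x + b. A minimal Python sketch of that idea (illustrative only, not the actual C++ kernel referenced above):

```python
import torch

def batchnorm_inference_reference(x, weight, bias, running_mean, running_var, eps=1e-5):
    # Fold the per-channel scale and shift into two vectors ("linear and
    # constant terms"), so the per-element work is just alpha * x + beta.
    alpha = weight / torch.sqrt(running_var + eps)   # linear term, shape [C]
    beta = bias - running_mean * alpha               # constant term, shape [C]
    # Broadcast over N, H, W. For a channels_last input the innermost memory
    # dimension is C, so this maps to a contiguous alpha * x + beta loop.
    return x * alpha.view(1, -1, 1, 1) + beta.view(1, -1, 1, 1)

# Quick check against nn.BatchNorm2d in eval (inference) mode.
bn = torch.nn.BatchNorm2d(8).eval()
x = torch.randn(2, 8, 4, 4)
ref = batchnorm_inference_reference(x, bn.weight, bn.bias, bn.running_mean, bn.running_var, bn.eps)
assert torch.allclose(ref, bn(x), atol=1e-6)
```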

Updated performance numbers with the PyTorch internal operator benchmark, measured on a Xeon(R) Gold 6248 CPU with 20 cores per socket at 2.5 GHz. 1C refers to a single-core run and 20C to a single-socket run. jemalloc and numactl are applied to reduce fluctuation in the test results.

| Name | Tag | before (1C) | before (20C) | after (1C) | after (20C) |
|---|---|---|---|---|---|
| batchnorm_M1_N256_K3136_cpu | Short | 248.093 | 252.302 | 255.109 | 22.736 |
| batchnorm_M1_N256_K3136_cpu_bwdall | Short | 3605.876 | 404.381 | 3620.332 | 401.902 |
| batchnorm_M1_N256_K3136_cpu_bwd1 | Short | 3600.249 | 403.502 | 3573.064 | 403.201 |
| batchnorm_M1_N8192_K1_cpu | Long | 26.128 | 25.874 | 24.17 | 23.457 |
| batchnorm_M1_N2048_K1_cpu | Long | 10.377 | 10.455 | 9.845 | 9.978 |
| batchnorm_M128_N8192_K1_cpu | Long | 665.283 | 670.573 | 375.329 | 53.614 |
| batchnorm_M128_N2048_K1_cpu | Long | 177.345 | 184.846 | 98.43 | 22.127 |
| batchnorm_M1_N8192_K1_cpu_bwdall | Long | 16337.803 | 5278.381 | 15562.81 | 5200.735 |
| batchnorm_M1_N8192_K1_cpu_bwd1 | Long | 16343.511 | 5712.935 | 15526.72 | 4797.94 |
| batchnorm_M1_N2048_K1_cpu_bwdall | Long | 4098.062 | 1397.278 | 3894.704 | 1289.964 |
| batchnorm_M1_N2048_K1_cpu_bwd1 | Long | 4088.836 | 1297.217 | 3902.353 | 1390.161 |
| batchnorm_M128_N8192_K1_cpu_bwdall | Long | 36745.255 | 5309.559 | 34362.76 | 5134.023 |
| batchnorm_M128_N8192_K1_cpu_bwd1 | Long | 36715.947 | 5316.02 | 34458 | 5102.021 |
| batchnorm_M128_N2048_K1_cpu_bwdall | Long | 7132.537 | 1307.126 | 7074.514 | 1284.771 |
| batchnorm_M128_N2048_K1_cpu_bwd1 | Long | 7065.365 | 1361.857 | 6962.995 | 1343.29 |
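
For reference, the shapes in the table can also be timed outside the internal operator benchmark with a simple loop. A hedged sketch below, assuming M is the batch size, N the channel count, and K the flattened spatial size (e.g. 3136 = 56 × 56); this is an illustration, not the benchmark harness used for the numbers above:

```python
import time
import torch

# Hypothetical standalone version of the batchnorm_M1_N256_K3136_cpu case.
M, N, H, W = 1, 256, 56, 56
bn = torch.nn.BatchNorm2d(N).eval()

def bench(x, iters=200):
    with torch.no_grad():
        for _ in range(10):                  # warm-up
            bn(x)
        start = time.perf_counter()
        for _ in range(iters):
            bn(x)
    return (time.perf_counter() - start) / iters * 1e6  # microseconds per call

x_contig = torch.randn(M, N, H, W)
x_nhwc = x_contig.to(memory_format=torch.channels_last)
print(f"contiguous:    {bench(x_contig):.1f} us")
print(f"channels_last: {bench(x_nhwc):.1f} us")
```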

@mingfeima changed the title from "[WIP] BatchNorm channels last memory format optimization on CPU" to "BatchNorm channels last memory format optimization on CPU" on Nov 13, 2020
@ezyang added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Nov 13, 2020
Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Code looks good, but I see that you changed both the channels-first AND the channels-last implementations. Can you please make sure to benchmark both memory layouts?
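
One way to exercise and cross-check both memory layouts when benchmarking (an illustrative sketch, not part of this PR):

```python
import torch

x = torch.randn(8, 32, 28, 28)
x_nhwc = x.contiguous(memory_format=torch.channels_last)

bn = torch.nn.BatchNorm2d(32)

# The two layouts should agree numerically...
y_nchw = bn(x)
y_nhwc = bn(x_nhwc)
assert torch.allclose(y_nchw, y_nhwc, atol=1e-5)

# ...and checking the output memory format tells you which kernel path
# (channels-first vs. channels-last) was actually exercised.
print(y_nchw.is_contiguous())
print(y_nhwc.is_contiguous(memory_format=torch.channels_last))
```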

@VitalyFedyunin added the "module: memory format" label (memory format/layout related issues/changes: channels_last, nhwc) on Nov 16, 2020
Contributor

@VitalyFedyunin left a comment

I will approve to keep the process moving, but I will not land this until benchmarks are provided.

Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

With the channels last format, inference performance is OK, but training is not good enough yet: the current training implementation parallelizes over the channel dimension, which leaves each per-channel NHW plane non-contiguous in NHWC memory.
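
To illustrate the layout issue described above (an illustrative snippet, not code from this PR): in an NHWC-strided tensor the elements of one channel are spaced C apart in memory, so a per-channel reduction over N, H, W walks a strided, non-contiguous slice.

```python
import torch

N, C, H, W = 2, 16, 4, 4
x = torch.randn(N, C, H, W).contiguous(memory_format=torch.channels_last)

# Logical shape is still NCHW, but the memory order is NHWC:
print(x.stride())       # (C*H*W, 1, W*C, C) == (256, 1, 64, 16)

# A single channel's NHW "plane" is strided with step C in memory, which is
# what makes parallelizing the reduction over the channel dimension
# cache-unfriendly for channels last input.
plane = x[:, 0]         # shape (N, H, W)
print(plane.stride())   # (256, 64, 16)
print(plane.is_contiguous())  # False
```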

I need to keep improving this PR; once it is finished I will stack it with the other channels last pull requests.

@mingfeima
Collaborator Author

Please review the new PR: #48919

@mingfeima closed this on Dec 7, 2020
Labels: cla signed, module: memory format (channels_last, nhwc), open source, triaged

5 participants