
Add AVX512 support in ATen & remove AVX support #56992

Closed
wants to merge 172 commits

Conversation

imaginary-person
Contributor

@imaginary-person imaginary-person commented Apr 27, 2021

Remaining Tasks

  • Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check whether CPU throttling causes issues with GPUs) and make graphs, including Roofline model plots (Intel Advisor can't generate them with libgomp, but can with Intel OpenMP).

Summary

  1. This draft PR produces binaries with 3 types of ATen kernels - default, AVX2, and AVX512. Using the environment variable ATEN_AVX512_256=TRUE also results in 3 types of kernels, but the compiler can then use 32 ymm registers for AVX2 code, instead of the default 16. ATen kernels for CPU_CAPABILITY_AVX have been removed. (A runtime-dispatch sketch follows this list.)

  2. nansum is not using an AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT do, whose respective accuracies aren't very good either (#59415). It was more convenient to disable AVX512 dispatch for all dtypes of nansum for now.

  3. On Windows, ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If --continue-through-failure is used, then test_compare_model_outputs_functional_static fails. But if this test is skipped, test_compare_model_outputs_conv_static fails, and if both of these tests are skipped, a third one fails. These failures are hard to debug right now without access to a Windows machine that has AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

  4. One test is currently being skipped -
    test_lstm in quantization.bc (#59098) - It fails only on Cascade Lake machines, irrespective of the ATEN_CPU_CAPABILITY used, because FBGEMM uses AVX512_VNNI on machines that support it. The value of reduce_range should be False on such machines.
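
For a rough picture of how the three kernel flavors are selected at run time, here is a minimal standalone sketch - not ATen's actual DispatchStub machinery. It assumes GCC/Clang's __builtin_cpu_supports; the kernel names are hypothetical, and only the ATEN_CPU_CAPABILITY environment variable is taken from the real project:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-ins for the default/AVX2/AVX512 builds of one ATen kernel. In the
// real build, the same kernel source is compiled once per capability, with
// different CPU_CAPABILITY_* macros and -m flags.
void add_default(const float* a, const float* b, float* out, long n) {
  for (long i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
void add_avx2(const float* a, const float* b, float* out, long n) {
  add_default(a, b, out, n);  // placeholder body
}
void add_avx512(const float* a, const float* b, float* out, long n) {
  add_default(a, b, out, n);  // placeholder body
}

using add_fn = void (*)(const float*, const float*, float*, long);

// Pick the widest ISA the CPU supports, honoring ATEN_CPU_CAPABILITY as a
// manual downgrade, the way the PyTorch tests use it.
add_fn pick_add_kernel() {
  const char* cap = std::getenv("ATEN_CPU_CAPABILITY");
  if (cap != nullptr && std::strcmp(cap, "default") == 0) return add_default;
  if (cap != nullptr && std::strcmp(cap, "avx2") == 0 &&
      __builtin_cpu_supports("avx2")) {
    return add_avx2;
  }
  if (__builtin_cpu_supports("avx512f")) return add_avx512;
  if (__builtin_cpu_supports("avx2")) return add_avx2;
  return add_default;
}

int main() {
  float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
  pick_add_kernel()(a, b, out, 4);
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
  return 0;
}
```

The ATEN_AVX512_256=TRUE build described in item 1 doesn't change this selection logic; it only lets the compiler emit AVX2-width code that can also touch registers ymm16-ymm31 (an AVX512VL capability).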

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.

Credits to @ezyang for proposing AVX512_256 - these kernels use AVX2 intrinsics but benefit from 32 ymm registers, instead of the 16 that AVX2 normally uses.
Credits to @limo1996 for the initial proposal, and for optimizing hsub_pd & hadd_pd, which don't have direct AVX512 equivalents and are used in some kernels (one possible emulation is sketched below). He also refactored vec/functional.h to remove duplicated code.
Credits to @quickwritereader for helping fix 4 failing complex multiplication & division tests.
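
For context, here is one way the missing horizontal operations can be emulated at 512-bit width - a sketch of the general technique, not necessarily this PR's exact code. Within each 128-bit lane, unpacklo gathers the even elements of a and b and unpackhi the odd ones, so adding (or subtracting) them reproduces _mm256_hadd_pd/_mm256_hsub_pd semantics, i.e. {a0+a1, b0+b1, a2+a3, b2+b3, ...}:

```cpp
// Build with -mavx512f.
#include <immintrin.h>

__m512d hadd_pd_512(__m512d a, __m512d b) {
  // {a0,b0,a2,b2,...} + {a1,b1,a3,b3,...} = {a0+a1, b0+b1, a2+a3, b2+b3, ...}
  return _mm512_add_pd(_mm512_unpacklo_pd(a, b), _mm512_unpackhi_pd(a, b));
}

__m512d hsub_pd_512(__m512d a, __m512d b) {
  // Same pairing, but {a0-a1, b0-b1, a2-a3, b2-b3, ...}
  return _mm512_sub_pd(_mm512_unpacklo_pd(a, b), _mm512_unpackhi_pd(a, b));
}
```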

Testing

  1. vec_test_all_types was modified to test basic AVX512 support, as tests already existed for AVX2 (a sketch of the style of kernel these tests exercise follows this list).
    Only one test had to be modified, as it was hardcoded for AVX2.
  2. pytorch_linux_bionic_py3_8_gcc9_coverage_test1 & pytorch_linux_bionic_py3_8_gcc9_coverage_test2 now use linux.2xlarge instances, which support AVX512. They were used for testing the AVX512 kernels, as those kernels are dispatched by default in both of these CI checks. Windows CI checks had already been using machines with AVX512 support.
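
For a sense of what these tests cover: kernels written against the unified at::vec namespace are written once and compiled separately per capability. Below is a minimal sketch, assuming the post-refactor <ATen/cpu/vec/vec.h> header path; the horizontal reduction is done with a plain store-and-sum loop rather than ATen's reduction helpers:

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Sums n floats. The same source builds into the DEFAULT, AVX2, or AVX512
// kernel; only the vector width changes with the CPU_CAPABILITY_* macro.
float vec_sum(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);                         // broadcast zero
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);    // unaligned load, vertical add
  }
  float buf[Vec::size()];                // Vec::size() is constexpr:
  acc.store(buf);                        // 8 floats under AVX2, 16 under AVX512
  float result = 0.0f;
  for (int k = 0; k < Vec::size(); ++k) result += buf[k];
  for (; i < n; ++i) result += data[i];  // scalar tail
  return result;
}
```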

Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the doubled vector width. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would probably not hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). If it does, ATEN_AVX512_256=TRUE can be used for building PyTorch, so that AVX2 kernels can use 32 ymm registers instead of the default 16. [FBGEMM uses AVX512_256 only on Xeon D processors](pytorch/FBGEMM#209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link above helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.

Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches, or in kernels that are more compute-heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors, can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed. A crude sketch of this comparison follows.
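
For illustration, here is a standalone sketch of that comparison in plain C++ (not PyTorch; the sizes and repetition counts are arbitrary choices). Built with -O3 -mavx2 or -O3 -mavx512f, the inner loop auto-vectorizes, and the gap between the cache-resident and DRAM-bound cases becomes visible:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Times an elementwise float add of two n-element tensors, averaged over reps.
double time_add(std::size_t n, int reps) {
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
  }
  auto t1 = std::chrono::steady_clock::now();
  volatile float sink = c[0] + c[n - 1];  // keep the stores observable
  (void)sink;
  return std::chrono::duration<double>(t1 - t0).count() / reps;
}

int main() {
  // 2^18 floats = 1 MB per tensor (fits in cache);
  // 2^24 floats = 64 MB per tensor (DRAM-bound).
  std::printf("1 MB add:  %.6f s/iter\n", time_add(std::size_t{1} << 18, 100));
  std::printf("64 MB add: %.6f s/iter\n", time_add(std::size_t{1} << 24, 20));
  return 0;
}
```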

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 27, 2021

💊 CI failures summary and remediations

As of commit 6a059f0 (more details on the Dr. CI page and at hud.pytorch.org/pr/56992):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


Preview docs built from this PR

This comment was automatically generated by Dr. CI.

@ezyang
Contributor

ezyang commented Apr 27, 2021

I didn't look at the PR contents, but the written plan of action seems reasonable.

@imaginary-person imaginary-person deleted the only_vec branch April 27, 2021 19:49
@imaginary-person imaginary-person restored the only_vec branch April 27, 2021 19:49

@imaginary-person imaginary-person changed the title from "[Proof of concept] A unified vec namespace" to "AVX512 support in a unified vec namespace" May 1, 2021
@imaginary-person imaginary-person changed the title from "AVX512 support in a unified vec namespace" to "AVX512 support in a unified vec namespace in ATen" May 1, 2021
@imaginary-person imaginary-person changed the title from "AVX512 support in a unified vec namespace in ATen" to "AVX512 support with a unified vec namespace in ATen" May 1, 2021

@imaginary-person imaginary-person changed the title from "AVX512 support with a unified vec namespace in ATen" to "AVX512 support in ATen with a unified vec namespace" May 1, 2021

@facebook-github-bot
Contributor

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Jul 12, 2021

btw next time submit this one with ghstack, makes it easier for me to preserve fbcode changes XD

@imaginary-person
Contributor Author

btw next time submit this one with ghstack, makes it easier for me to preserve fbcode changes XD

Sorry for the inconvenience, @ezyang!
I'll definitely do so in the future.

BTW, pytorch/benchmark is segfaulting due to some torchvision dependency issue on the machine on which I can disable Intel turbo mode (it runs fine on the machine on which I can't disable turbo mode), so I'll have to spend some more time on benchmarks 😞. I'll post the issue in its GitHub repo soon -

=================================================================== test session starts ===================================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /external/benchmark
plugins: benchmark-3.4.1, hypothesis-6.14.1
collecting ... Fatal Python error: Segmentation fault

Current thread 0x00007f7241d28740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1166 in create_module
  File "<frozen importlib._bootstrap>", line 556 in module_from_spec
  File "<frozen importlib._bootstrap>", line 657 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "/usr/local/lib/python3.8/dist-packages/PIL/ImageFont.py", line 48 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "/external/vision/torchvision/utils.py", line 7 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "/external/vision/torchvision/__init__.py", line 10 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/external/benchmark/torchbenchmark/models/Background_Matting/functions.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/external/benchmark/torchbenchmark/models/Background_Matting/__init__.py", line 11 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1014 in _gcd_import
  File "/usr/lib/python3.8/importlib/__init__.py", line 127 in import_module
  File "/external/benchmark/torchbenchmark/__init__.py", line 91 in list_models
  File "/external/benchmark/test.py", line 83 in _load_tests
  File "/external/benchmark/test.py", line 88 in <module>
  File "/usr/local/lib/python3.8/dist-packages/_pytest/assertion/rewrite.py", line 170 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1014 in _gcd_import
  File "/usr/lib/python3.8/importlib/__init__.py", line 127 in import_module
  File "/usr/local/lib/python3.8/dist-packages/_pytest/pathlib.py", line 524 in import_path
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 578 in _importtestmodule
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 500 in _getobj
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 291 in obj
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 516 in _inject_setup_module_fixture
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 503 in collect
  File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 341 in <lambda>
  File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 311 in from_call
  File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 341 in pytest_make_collect_report
  File "/usr/local/lib/python3.8/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.8/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 458 in collect_one_node
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 808 in genitems
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 634 in perform_collect
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 333 in pytest_collection
  File "/usr/local/lib/python3.8/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.8/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 322 in _main
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 269 in wrap_session
  File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/usr/local/lib/python3.8/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/lib/python3.8/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.8/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 162 in main
  File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

@imaginary-person
Contributor Author

imaginary-person commented Jul 13, 2021

Hello @ezyang, I opened a pytorch/benchmark issue.
I think it's related to some package dependency. The segfault occurs in the collect stage, before any tests are run.

BTW, the FB Internal tests finished so soon this time around. There are 31 warning_cancelled tests. Do they require more changes to internal FB code? Thanks!

@imaginary-person
Contributor Author

BTW, @ezyang, I can still use ghstack, if you'd like me to. Thanks!

@imaginary-person
Contributor Author

imaginary-person commented Jul 13, 2021

Hello @ezyang, thanks for your patience! Intel Advisor deadlocks with pytorch/benchmark, so I won't be able to create Roofline plots, and will just report the runtimes instead.

EDIT: The deadlocks are because of OpenBLAS threads used by numpy, so I built without numpy, and will make graphs.

peterbell10 added a commit that referenced this pull request Jul 15, 2021
…cpp"


This will make it simpler to support AVX512 which is upcoming in #56992, see #56992 (comment) for reference.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Jul 15, 2021
This will make it simpler to support AVX512 which is upcoming in #56992, see #56992 (comment) for reference.

[ghstack-poisoned]
@ezyang
Contributor

ezyang commented Jul 19, 2021

BTW, @ezyang, I can still use ghstack, if you'd like me to. Thanks!

I'd say if you do any more updates, upload a new diff with ghstack. Thanks!

peterbell10 added a commit that referenced this pull request Jul 19, 2021
…cpp"


This will make it simpler to support AVX512 which is upcoming in #56992, see #56992 (comment) for reference.

Differential Revision: [D29753536](https://our.internmc.facebook.com/intern/diff/D29753536)

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Jul 19, 2021
This will make it simpler to support AVX512 which is upcoming in #56992, see #56992 (comment) for reference.

Differential Revision: [D29753536](https://our.internmc.facebook.com/intern/diff/D29753536)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jul 20, 2021
Summary:
Pull Request resolved: #61483

This will make it simpler to support AVX512 which is upcoming in #56992, see #56992 (comment) for reference.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29753536

Pulled By: ngimel

fbshipit-source-id: 03ae66cdc01a3679c67214468e2bdf93b15c3bc2
@ezyang
Contributor

ezyang commented Jul 20, 2021

I'm disconnecting the diff from this PR so I can test my merge resolutions (do not merge here, I've already merged)

ezyang added a commit that referenced this pull request Jul 20, 2021

@imaginary-person
Contributor Author

imaginary-person commented Jul 20, 2021

I'm disconnecting the diff from this PR so I can test my merge resolutions (do not merge here, I've already merged)

Hello @ezyang, I'm a bit confused. Is this comment meant for other reviewers?

BTW, there were some merge-conflicts as nansum was optimized recently, so I resolved them.
I didn't use ghstack, as I think it requires distinct commits to be able to successfully build individually. Thanks!

@ezyang
Contributor

ezyang commented Jul 20, 2021

Hello @ezyang, I'm a bit confused. Is this comment meant for other reviewers?

No it's for you :P

I didn't use ghstack, as I think it requires distinct commits to be able to successfully build individually. Thanks!

There's an ancillary benefit to using ghstack: the machinery for importing ghstack diffs knows how to preserve fb-only changes, while the regular machinery does not. That's why I swapped it. The new PR is #61903 (which is a ghstack PR)

@ezyang
Contributor

ezyang commented Jul 20, 2021

If you can post updates to the other ghstack, that would be a great help. I'll manually reapply your most recent changes.

ezyang added a commit that referenced this pull request Jul 20, 2021
… remove AVX support"

ezyang added a commit that referenced this pull request Jul 20, 2021
…ort"

ezyang added a commit that referenced this pull request Jul 20, 2021
ghstack-source-id: 97ce82d770c53ee43143945bcf123ad6f6f0de6d
Pull Request resolved: #61903
facebook-github-bot pushed a commit that referenced this pull request Jul 22, 2021
Summary:
Pull Request resolved: #61903

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
@ezyang
Contributor

ezyang commented Aug 9, 2021

this got landed!!

@ezyang ezyang closed this Aug 9, 2021
Labels
cla signed · open source · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)