
test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on Intel Cascade Lake machines #59098

Closed
imaginary-person opened this issue May 27, 2021 · 11 comments
Labels
low priority We're unlikely to get around to doing this in the near future · oncall: quantization Quantization support in PyTorch · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@imaginary-person
Contributor

imaginary-person commented May 27, 2021

TLDR

This issue has been established to be hardware-dependent, as it can be reproduced with the current master branch's source code as well. Please refer to #59100. Debugging it could uncover latent issues with the Intel Cascade Lake series (per Occam's razor, perhaps gcc 9.3, the hypervisor, or the like doesn't support it well?).

🐛 Bug

Hello, in #56992, I'm trying to add AVX512 support to ATen.
test_lstm in quantization.bc.test_backward_compatibility.TestSerialization is failing on CI (please read Additional Context) for AVX512 kernels in the CI check pytorch_linux_bionic_py3_8_gcc9_coverage_test1.

However, if I clone the repo locally and build, I don't encounter this failure on machines with AVX512 support.
Even if I install the CI build artifact locally, I still don't encounter this failure. Please help me figure out the source of this error on CI. Thanks!

Reproduction steps (require a machine with AVX512 support)

  1. Either install the CI build artifact locally with,
    wget https://13766736-65600975-gh.circle-artifacts.com/0/home/circleci/project/dist/torch-1.9.0a0%2Bgitd895ec3-cp38-cp38-linux_x86_64.whl
    pip install torch-1.9.0a0+gitd895ec3-cp38-cp38-linux_x86_64.whl

    OR clone the repo of the PR locally and build it,
    git clone --recursive https://github.com/imaginary-person/pytorch-1 -b only_vec .
    USE_MKL=1 USE_MKLDNN=1 USE_LLVM=/usr/lib/llvm-10 python setup.py develop

  2. Run the test with,
    python test/test_quantization.py -v -k test_lstm
    Two tests run with this keyword.
    The test failing on CI is test_lstm of quantization.bc.test_backward_compatibility.TestSerialization.

The test fails with the following error:

File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/quantization/bc/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/var/lib/jenkins/workspace/test/quantization/bc/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).
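
For reference, the failing comparison boils down to loading a previously serialized expected output and comparing it against the module's output on the current machine. A minimal sketch of that check (paraphrased; the real logic lives in _test_op in test/quantization/bc/test_backward_compatibility.py, and the helper name and path below are illustrative):

```python
import torch

def check_backward_compat(qmodule, input_tensor, expected_path, atol=1e-5, rtol=1.3e-06):
    # 'expected' was generated on an earlier PyTorch version and serialized;
    # the test asserts that the current build still reproduces it within tolerance.
    expected = torch.load(expected_path)
    actual = qmodule(input_tensor)
    torch.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)
```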

Expected behavior

The test should pass on a machine with AVX512 support, as it does locally.

Failing Environment

This issue has been established to be hardware-dependent, as it can be reproduced with the current master branch's source code as well, which doesn't yet have AVX512-vectorized ATen kernels. Please refer to #59100.

Environments in which the test passes (locally, on a cloud platform with bare-metal hardware)

I tried to reproduce the issue on two different machines with Ubuntu 18.04 & Ubuntu 20.04, respectively.
Before testing, I uninstalled previous PyTorch versions.
Installing the CI build artifact took only a few seconds (which is expected, as it suggests that no rebuild took place).

After installing the CI pytorch_linux_bionic_py3_8_gcc9_coverage_build artifact,

  1. Ubuntu 18.04 machine (with bare-metal Intel Xeon Silver 4114 sockets, Skylake)
PyTorch version: 1.9.0a0+gitd895ec3

OS: Ubuntu 18.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-11ubuntu0~18.04.1) 9.3.0
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8 (64-bit runtime)
Python platform: Linux-4.15.0-137-generic-x86_64-with-glibc2.27

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0a0+gitd895ec3

  2. Ubuntu 20.04 machine (with bare-metal Intel Xeon Gold 6142 sockets, Skylake)
OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-67-generic-x86_64-with-glibc2.29

Additional context

Currently, Linux CI checks that run tests use large resource_class VMs, which, in turn, run on machines without AVX512 support. For pytorch_linux_bionic_py3_8_gcc9_coverage_test1 & pytorch_linux_bionic_py3_8_gcc9_coverage_test2, I temporarily modified the CircleCI config in that PR to use xlarge VMs, which run on machines with AVX512 support, so the AVX512 kernels can be tested. This can be verified from the CI logs: ATEN_CPU_CAPABILITY for these modified CI checks was avx512.
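
As a quick sanity check of which kernels a given machine dispatches to, something along these lines can be used (a sketch: ATEN_CPU_CAPABILITY is the real env var mentioned above, but it must be set before torch is imported, and the printout below only shows build-time configuration — the CI logs remain the authoritative source for the runtime capability):

```python
import os
# Cap ATen's dispatch level; accepted values include "default" and "avx2"
# (and "avx512" on builds that ship AVX512 kernels).
os.environ["ATEN_CPU_CAPABILITY"] = "avx2"

import torch
print(torch.__config__.show())  # build flags; the active capability appears in CI logs
```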

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel

@imaginary-person
Contributor Author

imaginary-person commented May 27, 2021

@supriyar @jerryzh168,

When #43209 was reported, this issue was irreproducible at your end with AVX2 vectorization, as it only occurred on some specific hardware (reportedly, Intel Cascade Lake machines).
All Linux CI checks that run tests use large resource_class VMs, which, in turn, run on older machines that don't support AVX512.

In #56992, I used xlarge resource_class VMs for testing AVX512 kernels in 2 CI checks.
I'm guessing these VMs are run on Intel Cascade Lake machines, on which this issue was occurring for AVX2 kernels when #43209 was reported.

If you'd like to debug this issue, you can trigger CI checks in #56992 & SSH to the CI machine running pytorch_linux_bionic_py3_8_gcc9_coverage_test1.
Better still, I've opened #59100 to reproduce this issue on AVX2 ATen kernels, with the current master branch's code.

Maybe Intel folks, such as @mingfeima, can also help out.

Thank you!

@imaginary-person imaginary-person changed the title test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on CI for AVX512 ATen kernels, but passes locally test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on xlarge resource-class for AVX512 ATen kernels May 27, 2021
@imaginary-person imaginary-person changed the title test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on xlarge resource-class for AVX512 ATen kernels test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on xlarge resource-class May 27, 2021
@imaginary-person
Contributor Author

imaginary-person commented May 27, 2021

Hello @supriyar, @jerryzh168 & @mingfeima, on the current master branch's source-code, this issue can be reproduced with an xlarge executor in #59100.

test_lstm (quantization.bc.test_backward_compatibility.TestSerialization) ... FAIL (0.046s)

Hence, it's been established that this issue is hardware-dependent.
While CircleCI uses AWS, GCP & Microsoft Azure have also procured many Intel Cascade Lake machines in the last 2 years, and they usually offer them as distinct instance classes. So debugging this issue is probably worthwhile, as many PyTorch users are likely already running on these machines.

cc @ezyang @malfet, as this test might have to be skipped on CI once AVX512 support is added to ATen, but a more pressing concern is the possibility of silent failures not currently covered by tests, which could lead to incorrect results on these machines.

Thank you!

@imaginary-person imaginary-person changed the title test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on xlarge resource-class test_lstm in quantization.bc.test_backward_compatibility.TestSerialization fails on Intel Cascade Lake machines May 27, 2021
@imaginary-person
Contributor Author

imaginary-person commented May 28, 2021

@jerryzh168 @supriyar @mingfeima @CaoZhongZ

The machine is definitely a Cascade Lake one: it reports Family 6 Model 85, which is Cascade Lake, and it also has the avx512_vnni CPU flag.
The hypervisor is KVM.

Thanks!
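
For anyone reproducing this, a quick way to confirm the same identification from /proc/cpuinfo (a sketch; the parsing is simplified and Linux-specific):

```python
import re

with open("/proc/cpuinfo") as f:
    first_core = f.read().split("\n\n")[0]  # fields for CPU 0

family = re.search(r"cpu family\s*:\s*(\d+)", first_core).group(1)
model = re.search(r"^model\s*:\s*(\d+)", first_core, re.M).group(1)
flags = re.search(r"^flags\s*:\s*(.*)", first_core, re.M).group(1).split()

# Family 6, Model 85 covers both Skylake-SP and Cascade Lake; the
# avx512_vnni flag is what distinguishes Cascade Lake from Skylake-SP.
print(family == "6" and model == "85" and "avx512_vnni" in flags)
```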

@imaginary-person

This comment has been minimized.

@imaginary-person

This comment has been minimized.

@ngimel ngimel added the oncall: quantization Quantization support in PyTorch label Jun 1, 2021
@github-actions github-actions bot added this to Need Triage in Quantization Triage Jun 1, 2021
@jerryzh168 jerryzh168 added triage review triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jun 1, 2021
@ezyang
Contributor

ezyang commented Jun 1, 2021

Hey @imaginary-person, two Qs for you.

First, what do you need on your end to unblock? Skipping the test as known broken on Cascade seems sufficient to shut up CI and let you continue to test on xlarge.

Second, any chance we can isolate this to a simple C program or assembly that gets misrun on Cascade Lake? I'd go about this by minimizing the test case: first tracing it down to a single operator from the test, and then spelunking into the operator implementation to pull out the specific code responsible for the computation that is turning out incorrectly.
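
In that spirit, a plausible first minimization step would be to run just the dynamically quantized LSTM op the test exercises on both a passing and a failing machine, and diff the outputs (a sketch: the input shape mirrors the test's input_size=[4, 4, 3], but the constructor arguments here are illustrative, not the test's exact module):

```python
import torch

torch.manual_seed(0)
# Dynamically quantized LSTM, the kind of module the BC test exercises.
mod = torch.nn.quantized.dynamic.LSTM(input_size=3, hidden_size=7, num_layers=1)
x = torch.randn(4, 4, 3)

out, _ = mod(x)
torch.save(out, "lstm_out.pt")  # compare this file across machines
```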

@imaginary-person
Contributor Author

Hello @ezyang,

  1. Thanks for confirming that it'd be okay for me to skip that test case.
    BTW, I was wondering if some silent correctness errors could also be happening with PyTorch built with gcc 9.3 on these machines,
    but that's a digression from implementing AVX512.

  2. Thank you! I can try to do so.
    BTW, this issue occurs with the AVX2 implementation as well, when run on a Cascade Lake machine.
    I don't have access to a Cascade Lake machine, so after finding the errant code, I can disassemble the libtorch_cpu.so present in the build artifact of a Cascade Lake CI machine & compare it against a disassembled libtorch_cpu.so built with the same gcc version on a machine on which this test passes.
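
A sketch of that comparison (the paths are illustrative; objdump -d is a standard binutils invocation, and in practice the diff would be scoped to the suspect symbol rather than the whole library):

```python
import subprocess

def disassemble(lib_path, out_path):
    # Dump the full disassembly of the shared library to a text file.
    with open(out_path, "w") as f:
        subprocess.run(["objdump", "-d", lib_path], stdout=f, check=True)

disassemble("ci_artifact/libtorch_cpu.so", "ci.asm")     # from the Cascade Lake CI build
disassemble("local_build/libtorch_cpu.so", "local.asm")  # built locally with the same gcc
subprocess.run(["diff", "ci.asm", "local.asm"])
```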

@imaginary-person
Contributor Author

imaginary-person commented Jun 20, 2021

@jerryzh168 @supriyar,

FBGEMM leverages the AVX512_VNNI instruction set if it's supported on a platform, and Cascade Lake supports it.
Although ATen currently isn't compiled with AVX512 support, FBGEMM is.
So the output of test_lstm on Cascade Lake machines differs slightly from that on machines not supporting AVX512_VNNI. Fortunately, it's not a compiler bug, so I'm closing this issue. Thanks!

[Screenshot: Cascade Lake output difference]
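
(For context, whether FBGEMM is in play on a given machine can be confirmed from the active quantized engine; a quick check using the standard backend API:)

```python
import torch

# FBGEMM is the x86 quantization backend; if it's listed here, quantized
# ops (including the dynamic LSTM above) run through it and can use
# AVX512_VNNI on CPUs that support the instruction set.
print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm']
print(torch.backends.quantized.engine)
```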

Quantization Triage automation moved this from Need Triage to Done Jun 20, 2021
@imaginary-person
Contributor Author

imaginary-person commented Jun 20, 2021

From the discussion in pytorch/FBGEMM#125, it seems that reduce_range should not be set to True if FBGEMM is compiled with AVX512_VNNI support, so I'm reopening the issue: this can be fixed, and the test need not be skipped on Cascade Lake CI machines.
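
To illustrate what reduce_range changes: with reduce_range=True, observers use only 7 of the 8 quantization bits, a workaround for potential overflow in AVX2's 16-bit multiply-accumulate path that AVX512_VNNI doesn't need. A hedged sketch with the observer API (module path as of PyTorch ~1.9; the effect on the computed scale is the point here):

```python
import torch
from torch.quantization.observer import MinMaxObserver

x = torch.randn(1000)

obs_full = MinMaxObserver(dtype=torch.quint8, reduce_range=False)
obs_reduced = MinMaxObserver(dtype=torch.quint8, reduce_range=True)
obs_full(x)      # observers record min/max when called on a tensor
obs_reduced(x)

# reduce_range=True halves the usable quant range (roughly doubling the
# scale), which is why outputs differ between VNNI and non-VNNI machines.
print(obs_full.calculate_qparams())
print(obs_reduced.calculate_qparams())
```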

Quantization Triage automation moved this from Done to In Progress Jun 20, 2021
ezyang added a commit that referenced this issue Jul 20, 2021
### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't generate them with libgomp, but can with Intel OpenMP).

### Summary

1. This draft PR produces binaries with 3 types of ATen kernels: default, AVX2, and AVX512. Using the environment variable `ATEN_AVX512_256=TRUE` also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using an AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows, ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both of these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped:
`test_lstm` in `quantization.bc` (#59098) - it fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.


Credits to @ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to @limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to @quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2. `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` now use `linux.2xlarge` instances, as those support AVX512. They were used for testing AVX512 kernels, since AVX512 kernels are used by default in both of these CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the doubled vector size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](pytorch/FBGEMM#209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.


### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches, or in kernels that are more compute-heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors, can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.
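
A rough way to see the cache effect described above (a sketch: the sizes are float32 tensors, thread count and machine obviously matter, and timeit numbers here are only directional):

```python
import timeit

import torch

small = torch.randn(256 * 1024)        # ~1 MB: fits in cache, vectorization helps
large = torch.randn(16 * 1024 * 1024)  # ~64 MB: memory-bound, little benefit

for name, t in [("1 MB", small), ("64 MB", large)]:
    secs = timeit.timeit(lambda: t + t, number=100)
    print(f"{name}: {secs / 100 * 1e3:.3f} ms per add")
```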

Original pull request: #56992

Differential Revision: [D29266289](https://our.internmc.facebook.com/intern/diff/D29266289/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29266289/)!

[ghstack-poisoned]
ezyang added a commit that referenced this issue Jul 20, 2021
ezyang added a commit that referenced this issue Jul 20, 2021
ezyang added a commit that referenced this issue Jul 20, 2021
ghstack-source-id: 97ce82d770c53ee43143945bcf123ad6f6f0de6d
Pull Request resolved: #61903
facebook-github-bot pushed a commit that referenced this issue Jul 22, 2021
Summary:
Pull Request resolved: #61903

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
@andrewor14 andrewor14 added low priority We're unlikely to get around to doing this in the near future and removed triage review labels Nov 17, 2023
@andrewor14
Contributor

@imaginary-person Looks like it's been 2 years. Is this still an issue with the latest master? If not, can we close this?

@imaginary-person
Contributor Author

Closing issue. Thanks!

Quantization Triage automation moved this from In Progress to Done Dec 4, 2023