
TestReductionsCPU.test_nansum_out_dtype_cpu is flaky, but its flakiness is currently hidden #59415

Closed
Tracked by #61417
imaginary-person opened this issue Jun 3, 2021 · 5 comments
Assignees
Labels
module: reductions; module: tests (Issues related to tests, not the torch.testing module); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@imaginary-person
Contributor

imaginary-person commented Jun 3, 2021

🐛 Bug

TestReductionsCPU.test_nansum_out_dtype_cpu is flaky, but its flakiness is currently hidden.

Currently, this test iterates over combinations of two supported dtypes (the input dtype and the output dtype):

def test_nansum_out_dtype(self, device):
    dtypes = list(torch.testing.get_all_int_dtypes() + torch.testing.get_all_fp_dtypes(include_bfloat16=False))
    for inp_dtype, out_dtype in combinations(dtypes, 2):

Printing those combinations yields a list whose first few elements are, in order:

[[torch.uint8, torch.int8], [torch.uint8, torch.int16], [torch.uint8, torch.int32],
       [torch.uint8, torch.int64], [torch.uint8, torch.float64], [torch.uint8, torch.float32], [torch.uint8, torch.float16]]

However, the test fails if [torch.uint8, torch.float16] is swapped with [torch.uint8, torch.float32], i.e.:

[[torch.uint8, torch.int8], [torch.uint8, torch.int16], [torch.uint8, torch.int32],
       [torch.uint8, torch.int64], [torch.uint8, torch.float64], [torch.uint8, torch.float16], [torch.uint8, torch.float32]]
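
For reference, the pair ordering can be inspected with a short sketch like the one below (the dtype list is written out explicitly here, on the assumption that it matches what the torch.testing helpers return, so that the order mirrors the list shown above):

```python
from itertools import combinations

import torch

# Assumed to mirror get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)
dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64,
          torch.float64, torch.float32, torch.float16]

# combinations() yields the (inp_dtype, out_dtype) pairs in a fixed order
# determined by the input list; the test visits them in exactly this order.
for inp_dtype, out_dtype in combinations(dtypes, 2):
    print(inp_dtype, out_dtype)
```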

Steps to reproduce the behavior:

  1. Replace lines 1113 & 1114 of test_nansum_out_dtype in test_reductions.py with
        for inp_dtype, out_dtype in [[torch.uint8, torch.int8], [torch.uint8, torch.int16], [torch.uint8, torch.int32],
              [torch.uint8, torch.int64], [torch.uint8, torch.float64], [torch.uint8, torch.float16], [torch.uint8, torch.float32]]:
  2. Run the test with the -v option. The test fails with the error:
Traceback (most recent call last):
  File "/pytorch/torch/testing/_internal/common_device_type.py", line 292, in instantiated_test
    result = test_fn(self, *args)
  File "test_reductions.py", line 1121, in test_nansum_out_dtype
    self.compare_with_numpy(torch_fn, np_fn, x, device=None, dtype=None)
  File "/pytorch/torch/testing/_internal/common_utils.py", line 1145, in compare_with_numpy
    self.assertEqual(np_result, torch_result, **kwargs)
  File "/pytorch/torch/testing/_internal/common_utils.py", line 1272, in assertEqual
    self.assertEqual(x, y.item(), atol=atol, rtol=rtol, msg=msg,
  File "/pytorch/torch/testing/_internal/common_utils.py", line 1403, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 4448.0 and 4444.0 gives a difference of 4.0, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 0.0057872!
  3. The above issue can also be reproduced with the following snippet of Python code:
import torch
import numpy
x = torch.tensor([[66, 63, 21, 84, 86, 90, 86, 21],
                  [81, 68, 22, 39, 43, 38, 24, 63],
                  [96, 98, 33, 85, 90, 87, 52, 99],
                  [56, 21, 95, 59, 85, 51, 16, 37],
                  [74, 55, 95, 97, 45, 82, 71, 33],
                  [88, 90, 91, 68, 31, 22, 53, 30],
                  [68, 22, 73, 82, 36, 56, 96, 20],
                  [71, 66, 82, 17, 97, 35, 20, 43],
                  [31, 81, 87, 90, 60, 49, 96, 91]], dtype=torch.uint8)
y = torch.nansum(x, dtype=torch.float16)
z = numpy.nansum(x.detach().cpu().numpy(), dtype=numpy.float16)
# y is 4444.0, but z is 4450.0
  4. Changing the order further can make this test pass, as can removing some entries from the list.

Expected behavior

This test should pass regardless of the order of the [inp_dtype, out_dtype] pairs.

Environment

Built from source on the current master branch.

PyTorch version: 1.10.0a0+git1aa14fc
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-67-generic-x86_64-with-glibc2.29

Versions of relevant libraries:
[pip3] numpy==1.20.3

Additional context

On CUDA, step 3 evaluates to 4448.0 with PyTorch 1.8.1+cu101 on Google Colab, as opposed to 4444.0 from a CPU build based on the current master branch (as well as from the v1.8.1 CPU implementation on Google Colab).
NumPy evaluates it to 4450.0.
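
As a rough illustration of why the half-precision path is so sensitive (a minimal sketch, not the actual PyTorch kernel): float16 has an 11-bit significand, so once a running total exceeds 2048 it can only move in steps of 2 (steps of 4 above 4096), and a naive left-to-right accumulation drifts away from the exact integer sum, which is 4449 for the tensor above.

```python
import numpy as np

# Same 72 values as the uint8 tensor in the snippet above.
vals = [66, 63, 21, 84, 86, 90, 86, 21,
        81, 68, 22, 39, 43, 38, 24, 63,
        96, 98, 33, 85, 90, 87, 52, 99,
        56, 21, 95, 59, 85, 51, 16, 37,
        74, 55, 95, 97, 45, 82, 71, 33,
        88, 90, 91, 68, 31, 22, 53, 30,
        68, 22, 73, 82, 36, 56, 96, 20,
        71, 66, 82, 17, 97, 35, 20, 43,
        31, 81, 87, 90, 60, 49, 96, 91]

exact = sum(vals)            # 4449, the exact integer sum

# Naive left-to-right accumulation in float16: every addition is rounded to the
# nearest representable half-precision value, so error accumulates once the
# running total passes 2048.
acc = np.float16(0)
for v in vals:
    acc = acc + np.float16(v)

print(exact, float(acc))     # the float16 total lands a few counts away from 4449
```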

cc @heitorschueroff @mruberry @VitalyFedyunin @walterddr

@mruberry mruberry added the module: reductions, triaged, and module: tests labels and removed the triaged label Jun 4, 2021
@mruberry
Collaborator

mruberry commented Jun 4, 2021

Thank you for the clear write-up and thorough reporting of this issue, @imaginary-person!

@imaginary-person
Contributor Author

imaginary-person commented Jun 4, 2021

Sorry @mruberry, I probably should've also mentioned that this issue only occurs if the output dtype is half.
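
For comparison, a quick check outside the test suite (a sketch; any small uint8 tensor will do) is consistent with that: with a float32 accumulator, sums of this magnitude are exactly representable, so PyTorch and NumPy agree regardless of summation order.

```python
import numpy
import torch

x = torch.randint(0, 100, (9, 8), dtype=torch.uint8)
y = torch.nansum(x, dtype=torch.float32).item()
z = float(numpy.nansum(x.numpy(), dtype=numpy.float32))
# Integer sums up to 2**24 are exact in float32, so the two results match;
# only the float16 output path shows a discrepancy.
assert y == z
```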


@imaginary-person
Contributor Author

imaginary-person commented Jun 30, 2021

@peterbell10,

Thanks for the great work in #60387!
Whenever you can, please advise as to whether porting nansum to a SumKernel.cpp-like implementation (or unifying them) would be non-trivial, as it would probably fix this issue. Moreover, the comments in ReduceOpsKernel.cpp (written by @kshitij12345, and suggested by @mruberry) state that nansum should have a SumKernel-like implementation:

// TODO: Implement `nansum` similar to the stable `sum`
// implementation in cpu/SumKernel.cpp

Thank you!

@peterbell10
Collaborator

Actually, I think the LoadPolicy customisation point I added in #60387 should make it fairly straightforward for nansum to share the cascade-sum code. I'll give that a shot now.

@peterbell10 peterbell10 self-assigned this Jun 30, 2021
peterbell10 added a commit that referenced this issue Jun 30, 2021
Fixes #59415

This implements nansum as a new `LoadPolicy` for the existing sum functions, so it uses the more accurate cascade-sum algorithm.

I've also expanded `test_nansum` to cover the four special cases of the sum
algorithm (inner/outer reduction; vectorized or scalar).
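
For intuition, cascade (pairwise) summation keeps partial sums small by recursively summing halves of the input, so rounding error grows roughly with log n instead of n. A minimal Python sketch of the idea follows (illustrative only; the actual SumKernel.cpp implementation is vectorized C++):

```python
import numpy as np

def cascade_sum(vals, acc=np.float16):
    """Pairwise/cascade summation: sum halves recursively so partial sums stay small."""
    n = len(vals)
    if n <= 8:                                   # small base case: plain sequential sum
        total = acc(0)
        for v in vals:
            total = acc(total + acc(v))
        return total
    mid = n // 2
    return acc(cascade_sum(vals[:mid], acc) + cascade_sum(vals[mid:], acc))

# Compared with a single left-to-right float16 loop, this typically lands much
# closer to the exact sum for inputs like the uint8 tensor in the issue above.
```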

Nansum performance comparison
-----------------------------
For float sums, contiguous reductions are as much as 10x faster and discontiguous sums are ~1.8x faster (more for small shapes due to TensorIterator overheads).

|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          74.9          |           2.02          |            75.6           |            6.41            |
|              | 1   |          8.24          |           1.8           |            8.28           |            5.24            |
|    100, 1000 | 0   |           134          |           7.55          |            130            |            43.2            |
|              | 1   |          70.5          |           7.01          |            71.5           |            40.6            |
|   1000, 1000 | 0   |           726          |           69.2          |            737            |             403            |
|              | 1   |           702          |           51.0          |            709            |             404            |
|  10000, 1000 | 0   |         15,300         |          2,470          |           18,200          |           10,400           |
|              | 1   |          7,200         |          1,160          |           7,470           |            4,440           |
| 100000, 1000 | 0   |         163,000        |          28,000         |          199,000          |           131,000          |
|              | 1   |         70,700         |          13,500         |           75,700          |           44,200           |

Sum performance comparison
-------------------------

For float sums, performance is unchanged to within measurement precision:
|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          1.92          |           2.01          |            4.2            |            4.49            |
|              | 1   |          1.68          |           1.68          |            2.79           |            2.75            |
|    100, 1000 | 0   |          6.52          |           7.07          |            26.9           |            27.3            |
|              | 1   |          5.91          |           5.66          |            16.8           |            16.9            |
|   1000, 1000 | 0   |          55.6          |           58.6          |            256            |             254            |
|              | 1   |          41.0          |           41.2          |            150            |             147            |
|  10000, 1000 | 0   |          1,370         |          1,650          |           8,070           |            8,020           |
|              | 1   |           908          |           845           |           3,100           |            2,980           |
| 100000, 1000 | 0   |         24,700         |          24,700         |           90,900          |           91,000           |
|              | 1   |         12,500         |          12,100         |           31,500          |           31,800           |

[ghstack-poisoned]
peterbell10 added a commit that referenced this issue Jul 1, 2021
peterbell10 added a commit that referenced this issue Jul 1, 2021
peterbell10 added a commit that referenced this issue Jul 9, 2021
peterbell10 added a commit that referenced this issue Jul 9, 2021
peterbell10 added a commit that referenced this issue Jul 15, 2021
peterbell10 added a commit that referenced this issue Jul 15, 2021
peterbell10 added a commit that referenced this issue Jul 19, 2021 (Differential Revision: [D29753523](https://our.internmc.facebook.com/intern/diff/D29753523))
peterbell10 added a commit that referenced this issue Jul 19, 2021 (Differential Revision: [D29753523](https://our.internmc.facebook.com/intern/diff/D29753523))
ezyang added a commit that referenced this issue Jul 20, 2021
### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, but can with Intel OpenMP).

### Summary

1. This draft PR produces binaries with 3 types of ATen kernels - default, AVX2, and AVX512. Using the environment variable `ATEN_AVX512_256=TRUE` also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using the AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows, ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[`test_lstm` in `quantization.bc`](#59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be used as `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.


Credits to @ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to @limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to @quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](pytorch/FBGEMM#209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.


### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches or in kernels that are more compute-heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors, can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.

Original pull request: #56992

Differential Revision: [D29266289](https://our.internmc.facebook.com/intern/diff/D29266289/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29266289/)!

[ghstack-poisoned]
ezyang added a commit that referenced this issue Jul 20, 2021
… remove AVX support"

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though, but with Intel OpenMP).

### Summary

1. This draft PR produces binaries with with 3 types of ATen kernels - default, AVX2, AVX512 . Using the environment variable `ATEN_AVX512_256=TRUE`  also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using AVX512 kernel right now, as it has poorer accuracy for Float16, than does AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows , ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[test_lstm` in `quantization.bc](#59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be used as `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.


Credits to @ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to @limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to @quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](pytorch/FBGEMM#209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.


### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with with small tensors that fit in caches or in kernels that are more compute heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.

Original pull request: #56992

Differential Revision: [D29266289](https://our.internmc.facebook.com/intern/diff/D29266289/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29266289/)!

[ghstack-poisoned]
ezyang added a commit that referenced this issue Jul 20, 2021
…ort"

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though, but with Intel OpenMP).

### Summary

1. This draft PR produces binaries with with 3 types of ATen kernels - default, AVX2, AVX512 . Using the environment variable `ATEN_AVX512_256=TRUE`  also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using AVX512 kernel right now, as it has poorer accuracy for Float16, than does AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows , ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[test_lstm` in `quantization.bc](#59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be used as `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.


Credits to @ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to @limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to @quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](pytorch/FBGEMM#209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the post linked above helps in understanding its significance. Cascade Lake & Ice Lake SP Xeon processors are said to fare even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.


### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches, or with kernels that are more compute-heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even better with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors, can be slower with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.
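
As a rough way to see this for oneself (a sketch only; the sizes are illustrative and the timings are not measurements from this PR), `torch.utils.benchmark` can compare a cache-friendly add against a memory-bound one:

```python
import torch
import torch.utils.benchmark as benchmark

def time_add(numel):
    # Allocate two float32 tensors of the given element count and time a + b.
    a = torch.randn(numel)
    b = torch.randn(numel)
    t = benchmark.Timer(stmt="a + b", globals={"a": a, "b": b}, num_threads=1)
    return t.blocked_autorange()

# ~1 MB per float32 tensor: fits in cache, so wider vectors should help.
print(time_add(256 * 1024))

# ~64 MB per float32 tensor: memory-bound, so wider vectors help far less.
print(time_add(16 * 1024 * 1024))
```

Running the same script under different `ATEN_CPU_CAPABILITY` settings (as in the earlier sketch) is one way to compare the kernel variants directly, though results will vary by machine.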

Original pull request: #56992

Differential Revision: [D29266289](https://our.internmc.facebook.com/intern/diff/D29266289/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29266289/)!

[ghstack-poisoned]
ezyang added a commit that referenced this issue Jul 20, 2021
ghstack-source-id: 97ce82d770c53ee43143945bcf123ad6f6f0de6d
Pull Request resolved: #61903
facebook-github-bot pushed a commit that referenced this issue Jul 22, 2021
Summary:
Pull Request resolved: #61903

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184