
torch.arange numerics are different after 1.7 update on CPU #47043

Closed
coquelin77 opened this issue Oct 29, 2020 · 30 comments
Labels: high priority, module: numerical-stability, module: regression, triaged


coquelin77 commented Oct 29, 2020

🐛 Bug

The floor operation is not working properly for the float64 dtype. Other dtypes were not tested.

To Reproduce

Steps to reproduce the behavior:

(base) python
Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.arange(-5.0, 5.0, 1.4)
>>> x
tensor([-5.0000, -3.6000, -2.2000, -0.8000,  0.6000,  2.0000,  3.4000,  4.8000])
>>> x.floor()
tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.])
>>> x = torch.arange(-5.0, 5.0, 1.4, dtype=torch.float64)
>>> x
tensor([-5.0000, -3.6000, -2.2000, -0.8000,  0.6000,  2.0000,  3.4000,  4.8000],
       dtype=torch.float64)
>>> x.floor()
tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], dtype=torch.float64)

Expected behavior

See the code snippet above: floor() of the float64 arange should match the float32 result, i.e. the element near 2 should floor to 2, not 1.

Environment

PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 9.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.10 (x86_64)
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.0
[pip3] torchvision==0.8.1
[conda] _tflow_select             2.3.0                       mkl  
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               9.2                           0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.2.0            py37h23d657b_0  
[conda] mkl_random                1.1.1            py37h0573a6f_0  
[conda] numpy                     1.19.2           py37h54aff64_0  
[conda] numpy-base                1.19.2           py37hfa32c7d_0  
[conda] pytorch                   1.7.0           py3.7_cuda9.2.148_cudnn7.6.3_0    pytorch
[conda] tensorflow                2.2.0           mkl_py37h6e9ce2d_0  
[conda] tensorflow-base           2.2.0           mkl_py37hd506778_0  
[conda] torchvision               0.8.1                 py37_cu92    pytorch

Additional context

Only tested on CPU.

cc @ezyang @gchanan @zou3519 @bdhirsh @heitorschueroff

coquelin77 (Author)

This is a machine epsilon error, I believe. Not sure why it ran clean on previous versions.


albanD commented Oct 29, 2020

You mean that the 2. returned by the arange is actually 2-eps?

Hi-pri to find the root cause.

albanD added the high priority, module: numerical-stability, module: regression, and triaged labels on Oct 29, 2020
coquelin77 (Author)

Yes, but only for this specific case. 1.4 in hex is 0x1.6666666666666p+0, which isn't exactly representable as a binary float, so the rounded step is slightly smaller than 1.4. By the time the arange reaches the 2. value it is just below 2.0 (within a couple of machine epsilons).
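
A quick way to see the effect in plain Python (just an illustration of the accumulated rounding; PyTorch's evaluation order differs slightly, but it lands on the same value):

>>> (1.4).hex()                # 1.4 rounds down slightly in binary64
'0x1.6666666666666p+0'
>>> acc = -5.0
>>> for _ in range(5):         # add the rounded step five times
...     acc += 1.4
...
>>> acc                        # just below 2, so floor() returns 1
1.9999999999999996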


albanD commented Oct 29, 2020

Ok that sounds good. I am not sure if there is anything we can do on the pytorch side here?


gchanan commented Oct 30, 2020

this doesn't explain why the behavior changes in 1.7 (I confirmed it does).

coquelin77 (Author)

Does it happen in the nightly build as well? Perhaps the release build for 1.7 was built with an updated compiler and something changed there.


gchanan commented Nov 2, 2020

I confirmed it's a problem on a recent version of master.


ngimel commented Nov 2, 2020

Leaving high priority to figure out why vectorized path gives a different answer.


mruberry commented Nov 2, 2020

Possibly changed by #38697.


gchanan commented Nov 2, 2020

Confirmed it is caused by #38697.

gchanan changed the title from "floor with float64 not working properly after 1.7 update on CPU" to "torch.arange numerics are different after 1.7 update on CPU" on Nov 3, 2020

gchanan commented Nov 3, 2020

Ok so #38697 doesn't really vectorize the computation, it unrolls it.

As pointed out by @ngimel, the difference is basically:

>>> a=torch.tensor(-5., dtype = torch.double)
>>> b=torch.tensor(1.4, dtype=torch.double)

>>> a+5*b - 2  # this was computed before; produces exactly 2
tensor(0., dtype=torch.float64)

>>> a+4*b+b-2  # this is what's computed now
tensor(-4.4409e-16, dtype=torch.float64)


gchanan commented Nov 3, 2020

Also, numpy produces the "exact" result (that doesn't mean it does for all cases, I didn't check the implementation):

>>> import numpy as np
>>> x = np.arange(-5.0, 5.0, 1.4, dtype=np.float64)
>>> np.floor(x)
array([-5., -4., -3., -1.,  0.,  2.,  3.,  4.])

kurtamohler self-assigned this on Nov 5, 2020
kurtamohler (Collaborator)

I noticed that CUDA has a similar problem, but in that case it is not a new issue.

import torch
print('device dtype arange-floor-result')
for device in ['cpu', 'cuda']:
    for dtype in [torch.float32, torch.float64]:
        result = torch.arange(-5, 5, 1.4, device=device, dtype=dtype)
        print(f'{device} {dtype} {result.floor()}')
print()
print('device dtype arange-item-5')
for device in ['cpu', 'cuda']:
    for dtype in [torch.float32, torch.float64]:
        result = torch.arange(-5, 5, 1.4, device=device, dtype=dtype)[5].item()
        print(f'{device} {dtype} {result}')

Running the above script in PyTorch 1.6 shows this:

device dtype arange-floor-result
cpu torch.float32 tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.])
cpu torch.float64 tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.], dtype=torch.float64)
cuda torch.float32 tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], device='cuda:0')
cuda torch.float64 tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], device='cuda:0', dtype=torch.float64)

device dtype arange-item-5
cpu torch.float32 2.0
cpu torch.float64 2.0
cuda torch.float32 1.9999998807907104
cuda torch.float64 1.9999999999999996

Running on the current main branch shows this:

device dtype arange-floor-result
cpu torch.float32 tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.])
cpu torch.float64 tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], dtype=torch.float64)
cuda torch.float32 tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], device='cuda:0')
cuda torch.float64 tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], device='cuda:0', dtype=torch.float64)

device dtype arange-item-5
cpu torch.float32 2.0
cpu torch.float64 1.9999999999999996
cuda torch.float32 1.9999998807907104
cuda torch.float64 1.9999999999999996

Should a separate issue be opened for the difference between CUDA and CPU as well?


kurtamohler commented Nov 5, 2020

It seems to me that we can solve this issue by creating a Vec256::arange_index_offset() function to use instead of the Vec256::arange() function. It would look something like this:

  template<typename step_t>
  static Vec256<T> arange_index_offset(T base = static_cast<T>(0), step_t step = static_cast<step_t>(1), int64_t index_offset = 0) {
    Vec256<T> vec;
    int64_t index_end = index_offset + size();
    for (int64_t i = index_offset; i < index_end; i++) {
      // lane (i - index_offset) gets base + (global index i) * step, so every
      // element is computed with a single multiply from the original start value
      vec.values[i - index_offset] = base + i * step;
    }
    return vec;
  }

Currently, each call to Vec256::arange() underneath the torch.arange() call is given a different base argument, so the element-wise result is like start + (base_offset * step) + (vec256_offset * step). With the introduction of Vec256::arange_index_offset(), the element-wise result would be start + (index_offset + vec256_offset) * step, which should solve this issue as far as I understand.
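
For a concrete sense of the difference between the two groupings, here is the element at index 5 in plain Python doubles (no FMA involved; with 4 doubles per Vec256, element 5 is lane 1 of the second chunk, so base_offset = 4 and vec256_offset = 1):

>>> start, step = -5.0, 1.4
>>> (start + 4 * step) + 1 * step      # current grouping: chunk base first, then lane offset
1.9999999999999996
>>> start + (4 + 1) * step             # proposed grouping: one multiply from the global index
2.0

(As the later comments show, the compiled AVX2 build still differs even with the second grouping, because the compiler fuses the multiply and add into an FMA.)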


kurtamohler commented Nov 6, 2020

Well I implemented arange_index_offset and it did not solve the problem on its own. I tried disabling vectorization, just to see if the simplified formula would fix the numerical issue, but the issue still persists. This is the kernel without vectorization:

      cpu_serial_kernel(
          it,
          [start, step, &idx]() -> scalar_t {
            return start + step * (idx++);
          },
          {p_begin, p_end});

Since that formula doesn't fix the issue, I think the problem is slightly more nuanced. There must be an additional numerical problem somewhere. I'm looking into what it might be.


ngimel commented Nov 6, 2020

It's baffling, because the GPU kernel is doing pretty much what the formula above suggests, and yet it still has numerical issues.


kurtamohler commented Nov 6, 2020

I believe the issue is AVX2. Consider the following program, which essentially performs a simplified torch.arange(-5, 5, 1.4, dtype=torch.double) and prints the result:

arange_test.cpp

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main() {
  double start = -5;
  double step = 1.4;
  int64_t num_steps = 8;

  printf("step: %.20e\n", step);
  printf("start: %.20e\n", start);

  // same per-element formula as the arange kernel: start + step * index
  for (int64_t ind = 0; ind < num_steps; ind++) {
    printf("%ld: %.20e\n", ind, (start + step * ind));
  }
}

Build it with and without AVX2:

Makefile

all: arange_test arange_test_avx2

arange_test: arange_test.cpp
	g++ -o $@ $^ -std=gnu++14 -O3

arange_test_avx2: arange_test.cpp
	g++ -o $@ $^ -std=gnu++14 -O3  -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2 -MD

We get different results with each binary:

$ ./arange_test
step: 1.39999999999999991118e+00
start: -5.00000000000000000000e+00
0: -5.00000000000000000000e+00
1: -3.60000000000000008882e+00
2: -2.20000000000000017764e+00
3: -8.00000000000000710543e-01
4: 5.99999999999999644729e-01
5: 2.00000000000000000000e+00
6: 3.39999999999999857891e+00
7: 4.79999999999999893419e+00

$ ./arange_test_avx2 
step: 1.39999999999999991118e+00
start: -5.00000000000000000000e+00
0: -5.00000000000000000000e+00
1: -3.60000000000000008882e+00
2: -2.20000000000000017764e+00
3: -8.00000000000000266454e-01
4: 5.99999999999999644729e-01
5: 1.99999999999999955591e+00
6: 3.39999999999999946709e+00
7: 4.79999999999999893419e+00

The AVX2 version matches today's PyTorch result if vectorization is turned off, and I suspect the non-AVX2 version matches what PyTorch used to give. The most notable difference is element 5: exactly 2 without AVX2, and slightly below 2 with AVX2.


kurtamohler commented Nov 6, 2020

To summarize, there are two CPU numerical problems:

  • Vec256::arange's formula introduces extra numerical error
  • AVX2 gives a different result than non-AVX2 code

To confirm, I ran torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor() in the following ways to get the following results:

Using the current PyTorch 1.7 release:

$ ATEN_CPU_CAPABILITY=avx2 python
>>> import torch
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor()
tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], dtype=torch.float64)

$ ATEN_CPU_CAPABILITY=avx python
>>> import torch
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor()
tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], dtype=torch.float64)

Using my Vec256::arange_index_offset instead of Vec256::arange:

$ ATEN_CPU_CAPABILITY=avx2 python
>>> import torch
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor()
tensor([-5., -4., -3., -1.,  0.,  1.,  3.,  4.], dtype=torch.float64)


$ ATEN_CPU_CAPABILITY=avx python
>>> import torch
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor()
tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.], dtype=torch.float64)

So we only get the correct result in one of the above four cases: if we use AVX instead of AVX2 and we use Vec256::arange_index_offset instead of Vec256::arange.

I can create a PR to introduce Vec256::arange_index_offset. But I think we still need to discuss what (if anything) we can do about AVX2. Is there a way to disable AVX2 support for just one operation?


ezyang commented Nov 9, 2020

We can certainly disable AVX2 support for one operation; we already compile kernels multiple times per AVX version, so simplest would be just to wrap the operation in a preprocessor macro that omits the DispatchStub registration when AVX2 is detected.

Fixing the problem by disabling AVX2 does seem a bit goofy though...


mruberry commented Nov 10, 2020

I propose we could just revert the PR (or the part of it) that "vectorized" arange. This would leave the CUDA issue, but people don't seem as sensitive to that. We should also add a test case for this behavior, so that if we try to improve the performance of arange again, we detect the change.
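
Something along these lines, as a sketch of the kind of test (the test name is made up, and it pins the pre-1.7 values):

import torch

def test_arange_float64_step_rounding():
    # element 5 of this arange should be exactly 2.0, so floor() must give 2, not 1
    x = torch.arange(-5, 5, 1.4, dtype=torch.float64)
    assert x[5].item() == 2.0
    expected = torch.tensor([-5., -4., -3., -1., 0., 2., 3., 4.], dtype=torch.float64)
    assert torch.equal(x.floor(), expected)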


kurtamohler commented Nov 11, 2020

The problem appears to actually be the -mfma compiler flag. On my machine, PyTorch builds the AVX2 kernels with these flags:

-std=gnu++14 -O3 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2 -MD

But AVX is built with these flags:

-std=gnu++14 -O3 -mavx -DCPU_CAPABILITY=AVX -DCPU_CAPABILITY_AVX -MD

Indeed, if I remove the -mfma flag from my small test program above, the AVX2 binary produces the correct result (meaning that element 5 of the result is exactly 2, not 1.999999...). And if I use our AVX flags, the result is correct as well.

Here's the code for this experiment: https://github.com/kurtamohler/pytorch-perf-test-scripts/tree/master/arange_compiler_flags

Looking at these in godbolt shows that an FMA instruction is in fact used only if -mfma is included.

I think it may be possible to solve this issue by using AVX2 intrinsics explicitly, to prevent the compiler from substituting an FMA. Or perhaps there is some macro that would tell the compiler not to use FMA for a specific section of code. I'll look into whether a macro like that exists.
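
For reference, the rounding difference can be reproduced in plain Python by emulating the exact (unrounded) intermediate product that an FMA keeps; this is only an arithmetic illustration, not how the kernel computes it:

>>> from fractions import Fraction
>>> start, step = -5.0, 1.4
>>> start + step * 5                             # two roundings: 5*step rounds up to exactly 7.0 first
2.0
>>> float(Fraction(start) + 5 * Fraction(step))  # one rounding at the end, which is what an FMA does
1.9999999999999996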

kurtamohler (Collaborator)

Another possible solution is to move the code that we don't want compiled with FMA into a separate file, and avoid passing the -mfma flag for that file.


kurtamohler commented Nov 11, 2020

I tried using intrinsics to see if that would avoid the FMA, and it didn't work at first, but then I found a way to get it working. First, I changed from this:

  template<typename step_t>
  static Vec256<double> arange_index_offset(double base = 0., step_t step = static_cast<step_t>(1), int64_t index_offset = 0) {
    return Vec256<double>(
        base + step * index_offset,
        base + step * (index_offset + 1),
        base + step * (index_offset + 2),
        base + step * (index_offset + 3));
  }

to this:

  template<typename step_t>
  static Vec256<double> arange_index_offset(double base = 0., step_t step = static_cast<step_t>(1), int64_t index_offset = 0) {
    Vec256<double> base_v(base, base, base, base);
    Vec256<double> step_v(step, step, step, step);
    Vec256<double> index_v(index_offset, index_offset + 1, index_offset + 2, index_offset + 3); 
    return _mm256_add_pd(base_v, _mm256_mul_pd(step_v, index_v));
  }

Evidently the _mm256_add_pd and _mm256_mul_pd calls get combined such that I'm effectively calling _mm256_fmadd_pd, so this didn't fix the problem. But then I wondered, what would happen if I just used the multiply intrinsic and then used regular add operations? Like this:

  template<typename step_t>
  static Vec256<double> arange_index_offset(double base = 0., step_t step = static_cast<step_t>(1), int64_t index_offset = 0) {
    Vec256<double> index_v(index_offset, index_offset + 1, index_offset + 2, index_offset + 3); 
    Vec256<double> step_v(step, step, step, step);
    Vec256<double> tmp = _mm256_mul_pd(step_v, index_v);
    return Vec256<double>(
      base + tmp.values[0],
      base + tmp.values[1],
      base + tmp.values[2],
      base + tmp.values[3]);
  }

This worked! Since I'm using an intrinsic only for the multiply and not the add, the two operations don't get combined, and now we get the proper behavior for the example we've been looking at:

$ ATEN_CPU_CAPABILITY=avx2 python
>>> import torch
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).floor()
tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.], dtype=torch.float64)
>>> torch.arange(-5, 5, 1.4, device='cpu', dtype=torch.float64).storage()
 -5.0
 -3.6
 -2.2
 -0.8000000000000007
 0.5999999999999996
 2.0      <---- NOTE: this is exactly 2 now, and not 1.99999... like before
 3.3999999999999986
 4.799999999999999
[torch.DoubleStorage of size 8]

This result is exactly the same for ATEN_CPU_CAPABILITY=avx2, ATEN_CPU_CAPABILITY=avx, and ATEN_CPU_CAPABILITY=default. And it agrees with the Pytorch 1.6 result as well.

I'll admit that it's a bit of an odd solution, but it does work. Is it alright if we go ahead with this?

mruberry (Collaborator)

cc @VitalyFedyunin for the AVX build issues


ezyang commented Nov 13, 2020

@kurtamohler Thank you for the detailed analysis; in particular, narrowing it down to FMA is really useful. You didn't say so explicitly, but it makes sense that FMA would cause this problem, as it changes the rounding behavior of the operation: multiply-add does two rounds, whereas fused multiply-add only does one round. I'm guessing that in this particular case, that's enough extra precision to "witness" the fact that 1.4 isn't actually 1.4 in floating point, and give you the odd rounding behavior. In fact, arguably, the FMA result is "more correct" (because 1.4 is not actually 1.4).

All told, I'm not sure we should fix this. We compile with -mfma because performance matters to us, and so we don't mind slight changes in precision (after all, this is deep learning, cough cough). If the tradeoff is between "results are bit-for-bit identical no matter what level of vectorization you have" and "things might be faster at higher vectorization levels, at the cost of some slight numeric differences", I'd lean toward the latter (note that we are still bit-for-bit reproducible; you just have to stay at a consistent vectorization level). And keeping the FMA also helps us match CUDA behavior.

I'm open to other opinions though. Maybe @ngimel has more thoughts.


kurtamohler commented Nov 13, 2020

Yeah I think that's a reasonable argument @ezyang .

On the topic of better performance, I think using my Vec256::arange_index_offset instead of Vec256::arange should improve performance. I haven't actually measured it, but it does cut out one add and one multiply per call. Furthermore, we could make the operation actually vectorized for AVX, rather than just unrolled. I'd be more than happy to do a bit of benchmarking and update my PR to introduce these performance improvements (and keep using the FMA so results won't change); just let me know if that's something you'd like me to do.
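
A minimal way to compare, using nothing but the standard library (the step and size here are chosen arbitrarily, just to make the runtime measurable):

import timeit
import torch

for dtype in (torch.float32, torch.float64):
    t = timeit.timeit(
        lambda: torch.arange(-5, 5, 1.4e-6, dtype=dtype),  # ~7 million elements
        number=100)
    print(dtype, t)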


ezyang commented Nov 13, 2020

I think I agree arange_index_offset may indeed still be a good thing to do anyway, but it's mostly orthogonal to this issue report :)

@coquelin77 Coming back to you, we think this might just be "expected" behavior. Does this leave you in a bad spot, e.g., it is now not easy for you to get the right behavior one way or another?


ezyang commented Nov 13, 2020

For posterity, it IS possible to selectively disable fma: https://twitter.com/colesbury/status/1327331440992849923

coquelin77 (Author)

It makes things slightly more difficult in a small subset of functions, but I think it is in torch's best interest to target speed. Getting exact numeric consistency while making optimization passes is very difficult, if not impossible. I posted the issue primarily so that you would be aware of the changes brought on by the update.

Thanks for the good work!


ezyang commented Nov 16, 2020

Thanks, I'll close this as not a bug.

BTW, for the public record, another way to implement the arange as originally quoted is to do arange on integers, and then divide by 10 after the fact; you should be able to avoid precision issues entirely that way.
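
For the example in this issue, that would look something like this (the integer arange is exact, and the only rounding is the final division, so the element that should be 2 is exactly 20 / 10 = 2):

>>> import torch
>>> x = torch.arange(-50, 50, 14, dtype=torch.float64) / 10
>>> x
tensor([-5.0000, -3.6000, -2.2000, -0.8000,  0.6000,  2.0000,  3.4000,  4.8000],
       dtype=torch.float64)
>>> x.floor()
tensor([-5., -4., -3., -1.,  0.,  2.,  3.,  4.], dtype=torch.float64)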
