
torch.norm produces incorrect results #20551

Open
ecvgit opened this issue May 15, 2019 · 9 comments
Labels
module: norms and normalization · module: numerical-reproducibility · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ecvgit

ecvgit commented May 15, 2019

🐛 Bug

torch.norm gives incorrect results on CPU in the latest nightly build as well as in 1.1.0 stable.

To Reproduce


>>> import torch
>>> a=torch.rand(2000,2000,64)
>>> b=torch.norm(a)
>>> c=torch.norm(a.cuda())
>>> b,c
(tensor(5792.6187), tensor(9237.8311, device='cuda:0'))

Expected behavior

Both b and c should have the same values.

Environment

PyTorch version: 1.1.0.dev20190514
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: version 2.8.12.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla K40m
GPU 1: Tesla K40m

Nvidia driver version: 387.26
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.1
[pip3] numpy==1.14.3
[pip3] torch==0.4.0
[pip3] torchtext==0.2.3
[pip3] torchvision==0.2.1
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] pytorch-nightly 1.1.0.dev20190514 py3.6_cuda9.0.176_cudnn7.5.1_0 pytorch
[conda] torchtext 0.2.3

Additional context

@soumith
Member

soumith commented May 15, 2019

I think this is a float precision issue.

In float64, it seems to be working fine:

In [1]: import torch

In [2]: a=torch.rand(2000,2000,64)
In [3]: a[0][0][0]
Out[3]: tensor(0.4834)

In [4]: a=torch.rand(2000,2000,64, dtype=torch.float64)

In [5]: b=torch.norm(a)

In [6]: c=torch.norm(a.cuda())

In [7]: b, c
Out[7]:
(tensor(9237.5918, dtype=torch.float64),
 tensor(9237.5918, device='cuda:0', dtype=torch.float64))
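For what it's worth, the CPU number in the original report looks like float32 accumulator saturation rather than ordinary rounding noise (my reading of it, not something taken from the kernel code): near 2**25 the gap between adjacent float32 values is 4, so a running sum of squares of values drawn from [0, 1) effectively cannot grow past that region, and sqrt(2**25) is almost exactly the reported 5792.6187.

import math

# Back-of-the-envelope check (an assumption about the mechanism, not the kernel):
# the reported CPU value is essentially sqrt(2**25).
print(math.sqrt(2 ** 25))   # ~5792.6187, matching b above
# Near 2**25 the spacing between representable float32 values is 4, far larger
# than any single squared input from [0, 1), so a float32 running sum of those
# squares stops growing once it reaches that magnitude.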

@soumith
Member

soumith commented May 15, 2019

cc: @umanwizard to double-check. I think you made norm TensorIterator-compatible, if I remember correctly (sorry if it wasn't you).

@umanwizard
Contributor

No, it was done by @jjsjann123 in #15414

@ecvgit
Author

ecvgit commented May 15, 2019

The code above gives correct results in PyTorch 0.4, but not in PyTorch 1.1.0. If it is a float precision issue, I'm not sure why it works correctly in PyTorch 0.4.

@colesbury
Member

> If it is a float precision issue, I'm not sure why it works correctly in PyTorch 0.4.

PyTorch 0.4 did the accumulation using double: https://github.com/pytorch/pytorch/blob/v0.4.1/aten/src/TH/generic/THTensorMath.cpp#L4307

Now it's using float accumulation:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L57

CUDA uses float accumulation, but is saved because the necessary parallelism forces a form of pairwise summation. We should probably do the same thing for CPU.
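A minimal sketch of that difference, using NumPy as a stand-in (np.cumsum for strictly sequential float32 accumulation, np.sum for NumPy's pairwise summation; neither is the actual ATen kernel):

import numpy as np

# Illustration only, not the ATen kernel. np.cumsum accumulates strictly
# sequentially in float32, like the linked CPU reduction; np.sum uses pairwise
# summation, like the CUDA-style tree reduction. Needs roughly 1 GB of RAM.
x = np.random.rand(1 << 26).astype(np.float32)   # ~67M uniform values
sq = x * x

seq  = np.cumsum(sq, dtype=np.float32)[-1]       # sequential float32 accumulation
pair = np.sum(sq, dtype=np.float32)              # pairwise summation, still float32
ref  = np.sum(sq, dtype=np.float64)              # float64 reference

print(np.sqrt(seq), np.sqrt(pair), np.sqrt(ref))
# The sequential result comes out noticeably low; the pairwise float32 result
# stays close to the float64 reference.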

@jjsjann123
Collaborator

As Sam mentioned, numerical behavior is very different on CPU and GPU because of the level of parallelism. A sacrifice like this (using double instead of float) is necessary on CPU, but it is not needed in the CUDA code and would hurt performance there.
We could probably plumb some logic through so that each device uses a different accumulation type. But the code shared between the CPU and GPU kernels is really not significant; it might be easier to just keep the CPU and GPU kernels separate.
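Until the kernels change, a user-space wrapper along these lines captures the "different accumulation per device" idea as a workaround (just a sketch, not anything the library provides):

import torch

# Sketch of a user-space workaround (an assumption, nothing official): route
# CPU float32 tensors through a float64 norm, where the sequential float32
# accumulation loses precision, and leave other cases on the native kernel.
def frobenius_norm(t):
    if t.is_cuda or t.dtype != torch.float32:
        return torch.norm(t)
    return torch.norm(t.double()).to(t.dtype)

a = torch.rand(2000, 2000, 64)
print(frobenius_norm(a))   # ~9237.x, in line with the CUDA and float64 results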

@colesbury
Member

@jjsjann123 I was suggesting something different: that we use pairwise summation on the CPU in reduction kernels -- not that we switch to double accumulation.

@li-roy added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 16, 2019
@mruberry
Collaborator

This is still an issue.

@foreverlms

This still exists. Could anyone take a look, or change the ATen operator in the torch CPU backend?
