torch.norm on CPU gives incorrect results for large Tensors #15602
cc @xhzhao - I was able to bisect this down to 99c0b96 - do you have time to also take a look? The script I used to bisect this is
For now I'll write a PR to revert the optimizations
cc @colesbury for ReduceOps
@xhzhao - I think you're using at::parallel_reduce incorrectly. parallel_reduce takes a function that reduces a section of the input and a function that combines partial results. For example, say you have a tensor of 10000 entries and want to sum all of its elements. parallel_reduce with a grain_size of 2500 will allocate an intermediate result tensor with 4 elements. It will then execute the function "f" you provide on each chunk of 2500 values (0-2499, 2500-4999, etc.) and write the result for each chunk into the intermediate result tensor. After that it reduces the partial results into a single number using the scalar function "sf", which for sum would be "+". This is similar to TBB's approach, where you need to provide a function to accumulate a subrange, a function to combine two partial results, and an identity.

In the norm kernel code you also pass this plus operator to combine two partial results. That means, for a tensor of 40000 1s (as in the script we use to reproduce this) and the default grain size, parallel_reduce splits the tensor into two chunks, 0-32767 and 32768-39999. Your partial reduction function returns the norm of each chunk, 181.019 and 85.0412 respectively, but combining those two values with add does not give the correct overall result. You need to specify a different combination function, such as squaring, summing, and then taking the square root again (for the 2-norm), but this will lead to numerical instability depending on the values and the number of chunks.
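To illustrate the combination problem described above, here is a minimal Python sketch (not the actual ATen code; the `partial_norm` helper and chunking are illustrative) that mimics how parallel_reduce splits 40000 ones into two chunks and shows why combining the per-chunk 2-norms with plain addition gives the wrong answer:

```python
import math

def partial_norm(chunk):
    # Per-chunk 2-norm, standing in for the partial reduction function "f".
    return math.sqrt(sum(x * x for x in chunk))

data = [1.0] * 40000
grain_size = 32768  # the default grain size mentioned above
chunks = [data[i:i + grain_size] for i in range(0, len(data), grain_size)]
partials = [partial_norm(c) for c in chunks]
# partials ≈ [181.019, 85.041], matching the values quoted in the comment

# Combining partial norms with "+" (as the buggy kernel did) overshoots:
wrong = sum(partials)  # ≈ 266.06

# Combining by squaring, summing, and taking the square root recovers
# the true 2-norm of the whole tensor:
right = math.sqrt(sum(p * p for p in partials))  # ≈ 200.0 == sqrt(40000)
```

As the comment notes, re-squaring the partials can lose precision when chunk values differ greatly in magnitude, which is part of why simply reverting the optimization was the safer immediate fix.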
…5885)
Summary: See pytorch/pytorch#15602
Pull Request resolved: pytorch/pytorch#15885
Differential Revision: D13614841
Pulled By: cpuhrsch
fbshipit-source-id: 5d3e45f499d36ac287dbbc2e45798aa51eb5bfdf
closed via #15885
🐛 Bug
Originally reported at https://discuss.pytorch.org/t/output-of-torch-norm-x-depends-on-size-of-x-not-in-a-good-way/33299/8