torch.norm on CPU gives incorrect results for large Tensors #15602
cc @xhzhao - I was able to bisect this down to 99c0b96 - do you have time to also take a look? The script I used to bisect this is
For now I'll write a PR to revert the optimizations
cc @colesbury for ReduceOps
@xhzhao - I think you're using at::parallel_reduce incorrectly. parallel_reduce takes a function that reduces a section of the input and a function that combines partial results. For example, say you have a tensor of 10000 entries and want to sum all of its elements. parallel_reduce with a grain_size of 2500 will allocate an intermediate result tensor with 4 elements. It will then execute the function "f" you provide on each chunk of 2500 values (0-2499, 2500-4999, etc.) and write the result for each chunk into the intermediate result tensor. After that it reduces the partial results into a single number using the scalar function "sf", which for sum would be "+". This is similar to TBB's approach, where you need to provide a function to accumulate a subrange, a function to combine two partial results, and an identity.

In the norm kernel code you also pass this plus operator to combine two partial results. That means, for a tensor of 40000 1s (as in the script we use to reproduce this) and the default grain size, parallel_reduce splits the tensor into two chunks, 0-32767 and 32768-39999. Your partial reduction function returns the norm of each chunk, 181.019 and 85.0412 respectively, but combining those two values with add does not give the correct overall result. You need to specify a different combination function, such as squaring, summing, and then taking the square root again (for the 2-norm), but this will lead to numerical instability depending on the values and the number of chunks.
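To illustrate the combination problem described above, here is a minimal Python sketch (not the actual ATen code; the `partial_norm` helper and chunking are illustrative) that mimics how parallel_reduce splits 40000 ones into two chunks and shows why combining the per-chunk 2-norms with plain addition gives the wrong answer:

```python
import math

def partial_norm(chunk):
    # Per-chunk 2-norm, standing in for the partial reduction function "f".
    return math.sqrt(sum(x * x for x in chunk))

data = [1.0] * 40000
grain_size = 32768  # the default grain size mentioned above
chunks = [data[i:i + grain_size] for i in range(0, len(data), grain_size)]
partials = [partial_norm(c) for c in chunks]
# partials ≈ [181.019, 85.041], matching the values quoted in the comment

# Combining partial norms with "+" (as the buggy kernel did) overshoots:
wrong = sum(partials)  # ≈ 266.06

# Combining by squaring, summing, and taking the square root recovers
# the true 2-norm of the whole tensor:
right = math.sqrt(sum(p * p for p in partials))  # ≈ 200.0 == sqrt(40000)
```

As the comment notes, re-squaring the partials can lose precision when chunk values differ greatly in magnitude, which is part of why simply reverting the optimization was the safer immediate fix.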
…5885)
Summary: See pytorch/pytorch#15602
Pull Request resolved: pytorch/pytorch#15885
Differential Revision: D13614841
Pulled By: cpuhrsch
fbshipit-source-id: 5d3e45f499d36ac287dbbc2e45798aa51eb5bfdf
closed via #15885
🐛 Bug
Originally reported at https://discuss.pytorch.org/t/output-of-torch-norm-x-depends-on-size-of-x-not-in-a-good-way/33299/8