
Use cascade-summation for floats to avoid numerical instability #39516

Closed
wants to merge 13 commits

Conversation

peterbell10
Collaborator

@peterbell10 peterbell10 commented Jun 4, 2020

Fixes #38716, fixes #37234

This uses a variant of pairwise summation to calculate sums of long lists without noticeable loss of precision. It uses multiple "levels" of accumulator along each axis, each of which is designed to hold the sum of an order of magnitude more values than the previous.

e.g. if there are 2^16 elements, the first level will hold sums of 2^4 elements each, and each subsequent level holds 2^4 times more: 2^4, 2^8, 2^12 and finally 2^16.

This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.

Ref: https://en.wikipedia.org/wiki/Pairwise_summation
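For intuition, here is a minimal recursive sketch of pairwise summation (illustrative only; the PR's kernel uses fixed accumulator levels rather than recursion):

#include <cstdint>

float pairwise_sum(const float* data, int64_t n) {
  // Base case: a short run is summed serially; the rounding error from
  // at most 16 additions is small.
  if (n <= 16) {
    float acc = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
      acc += data[i];
    }
    return acc;
  }
  // Recursive case: each addition combines partial sums over similar
  // numbers of elements, so rounding error grows as O(log n) rather
  // than O(n) for a naive running sum.
  const int64_t mid = n / 2;
  return pairwise_sum(data, mid) + pairwise_sum(data + mid, n - mid);
}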

@dr-ci

dr-ci bot commented Jun 4, 2020

💊 CI failures summary and remediations

As of commit fcea51d (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm: CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build, step "Checkout code".


@peterbell10
Collaborator Author

Here is a comparison of the relative error in the single-precision sum for array lengths from 10 to a billion elements. All runs were single-threaded and without vectorisation, to mimic the Google Colab results from #38716.

You can see this PR is consistently about 100x more accurate than the current implementation. The maximum relative error I got was just 2.4% over a billion elements, compared to a 98% error on master (i.e. completely wrong).

[Graph: relative error vs. array length]

In code:

import torch

torch.set_num_threads(1)  # single-threaded, matching the colab setup
torch.manual_seed(0)
xs = torch.rand(100)  # 100 random fill values

num_lens = 9
lens = torch.logspace(1, num_lens, num_lens).long()  # 10, 100, ..., 10**9
errs = torch.empty((lens.numel(), xs.numel()), dtype=xs.dtype)

for i, l in enumerate(lens.tolist()):
    for j, x in enumerate(xs.tolist()):
        # The strided view keeps the sum on the non-vectorized code path
        arr = torch.full((l * 2,), x)[::2]
        # Relative error against the exact sum x * arr.numel()
        errs[i][j] = abs(arr.sum() - x * arr.numel()) / (x * arr.numel())

error = errs.mean(axis=1)
@peterbell10 peterbell10 marked this pull request as ready for review June 10, 2020 12:02
@peterbell10 peterbell10 changed the title from "WIP: Use tree-based sum for floats to avoid numerical instability" to "Use tree-based sum for floats to avoid numerical instability" Jun 10, 2020
@peterbell10
Collaborator Author

peterbell10 commented Jun 10, 2020

Benchmark results show non-vectorized sums are much faster (up to 6x) than before. However, the vectorized sums are much slower (up to 2x).

Full benchmark results below fold
# Input: R: 1024, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 123.984
Current Execution Time (us) : 51.811

# Input: R: 1024, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 3244.789
Current Execution Time (us) : 3109.085

# Input: R: 1024, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 69.474
Current Execution Time (us) : 31.434

# Input: R: 1024, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 165.043
Current Execution Time (us) : 1390.865

# Input: R: 1024, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 742.743
Current Execution Time (us) : 125.282

# Input: R: 1024, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 6538.515
Current Execution Time (us) : 5955.180

# Input: R: 1024, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 120.366
Current Execution Time (us) : 62.076

# Input: R: 1024, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 524.178
Current Execution Time (us) : 2879.336

# Input: R: 8192, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 2850.641
Current Execution Time (us) : 1636.910

# Input: R: 8192, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 23814.197
Current Execution Time (us) : 24749.496

# Input: R: 8192, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 839.180
Current Execution Time (us) : 707.876

# Input: R: 8192, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 1888.084
Current Execution Time (us) : 11735.923

# Input: R: 8192, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 6094.401
Current Execution Time (us) : 3509.463

# Input: R: 8192, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 51274.606
Current Execution Time (us) : 53312.929

# Input: R: 8192, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 1513.480
Current Execution Time (us) : 1380.813

# Input: R: 8192, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 3438.891
Current Execution Time (us) : 23206.833

@ezyang
Contributor

ezyang commented Jun 10, 2020

Seeing if @ngimel has some comments. I'm not sure what to do about the vectorized slowdown though; 2x is a lot.

@peterbell10
Collaborator Author

I'll see if there's anything I can do to improve the vectorized performance.

@ngimel
Collaborator

ngimel commented Jun 10, 2020

Given that sum is one of the most common functions, it would be good to see a bigger suite of benchmarks: different types, a wider variety of sizes, say up to 3d tensors with reductions over pairs of dimensions, single- and multi-threaded. #38338 (soon to be merged) should make benchmarking easier.

@ngimel ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 10, 2020
@peterbell10
Collaborator Author

peterbell10 commented Jun 10, 2020

The benchmark I included is focused on benchmarking this PR only. Only float is changed in this PR, and for higher dimensions TensorIterator just chooses different combinations of the non-contiguous, inner-vectorized and outer-vectorized reductions. That's why I focused on 2D and float here.

@peterbell10
Collaborator Author

I've updated the vectorized_outer_sum to more closely match reduce128. The updated benchmark figures now show at worst a 10% speed regression and still show up to 6x speed improvements for some non-contiguous inputs.
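For context, the outer-reduction access pattern being tuned here looks roughly like this in plain scalar C++ (a sketch only; the actual kernel works on Vec256 lanes with cascaded accumulators):

#include <cstdint>

// Reduce an R x V row-major matrix over dim 0: walk memory row by row so
// reads stay contiguous, and let the independent output columns vectorize.
void outer_sum(const float* in, int64_t R, int64_t V, float* out) {
  for (int64_t j = 0; j < V; ++j) {
    out[j] = 0.0f;
  }
  for (int64_t r = 0; r < R; ++r) {
    for (int64_t j = 0; j < V; ++j) {
      out[j] += in[r * V + j];
    }
  }
}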

Full benchmark numbers below fold
# Input: R: 1024, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 47.166
PT 1.5.0 Execution Time (us) : 46.152

# Input: R: 1024, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 2930.226
PT 1.5.0 Execution Time (us) : 2658.398

# Input: R: 1024, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 45.676
PT 1.5.0 Execution Time (us) : 44.131

# Input: R: 1024, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 195.049
PT 1.5.0 Execution Time (us) : 1081.341

# Input: R: 1024, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 95.332
PT 1.5.0 Execution Time (us) : 93.709

# Input: R: 1024, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 6161.287
PT 1.5.0 Execution Time (us) : 5741.509

# Input: R: 1024, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 87.839
PT 1.5.0 Execution Time (us) : 84.595

# Input: R: 1024, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 535.933
PT 1.5.0 Execution Time (us) : 2249.186

# Input: R: 8192, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 1435.037
PT 1.5.0 Execution Time (us) : 1460.735

# Input: R: 8192, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 36749.245
PT 1.5.0 Execution Time (us) : 36095.155

# Input: R: 8192, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 1086.657
PT 1.5.0 Execution Time (us) : 1029.092

# Input: R: 8192, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 2194.304
PT 1.5.0 Execution Time (us) : 8854.416

# Input: R: 8192, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 4121.247
PT 1.5.0 Execution Time (us) : 4166.985

# Input: R: 8192, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 78065.462
PT 1.5.0 Execution Time (us) : 78299.460

# Input: R: 8192, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 2154.252
PT 1.5.0 Execution Time (us) : 2145.106

# Input: R: 8192, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 4428.362
PT 1.5.0 Execution Time (us) : 17697.258

A minor refactoring also revealed a flaw in how I was using the first-level accumulator. Fixing this gave a further huge improvement in accuracy for large arrays:

[Graph: relative error vs. array length, after the fix]

@peterbell10
Collaborator Author

As of the latest commit, the non-vectorized cases are 3-6x faster than PyTorch 1.5. The vectorised cases all measure the same, to within a 1-2% margin of error.

[Benchmark graphs: quick, medium and slow cases]

@ngimel
Collaborator

ngimel commented Jun 12, 2020

That's awesome! Do you think more comprehensive correctness tests are needed, especially for reductions over tuples of dimensions (they can't always be coalesced into reduction over a single dimension)? sum is one of the most common operations, and I'm not at all sure that our test suite covers all there is to cover.

return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
}

// Simultaneously sum over n rows at once
Contributor

I personally find it helpful when there's pseudocode for the "naive" version of this algorithm around; it helps me understand how the input values are put together. In this case it should be pretty simple, right?
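For reference, a naive version might look something like this sketch (hypothetical names and signature, not code from the PR):

#include <cstdint>

// Sum over nrows rows at once; each row gets its own independent
// accumulator, and row r reads in[r * row_stride + i] for i in [0, size).
template <typename scalar_t, int64_t nrows>
void naive_multi_row_sum(const scalar_t* in, int64_t row_stride,
                         int64_t size, scalar_t* out) {
  for (int64_t r = 0; r < nrows; ++r) {
    scalar_t acc = 0;
    for (int64_t i = 0; i < size; ++i) {
      acc += in[r * row_stride + i];
    }
    out[r] = acc;
  }
}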

return;
}

// Custom floating point sum for better accuracy
Contributor

There should be a high-level description of what the algorithm does, similar to what you already have in the PR description, in the source code here, to make it easier for people to figure out what is going on.

partial_sums[0] += load<scalar_t>(in_data, in_stride, i);
}

for (int64_t k = 1; k < ilp_factor; ++k) {
Contributor

OOC: how come you don't need an unroll here?

Collaborator Author

I have mainly unrolled operations I expect to execute in parallel on the CPU hardware. Since there is a data dependency between each trip of the loop, I don't expect this to pipeline well in hardware.

template <typename scalar_t>
scalar_t row_sum(const char * C10_RESTRICT in_data,
const int64_t in_stride, const int64_t size) {
constexpr int64_t ilp_factor = 4;
Contributor

I notice row_sum is called with both a non-vectorized scalar type, and a vectorized scalar type. Does the ILP factor matter when you're not vectorizing? Why or why not?

Collaborator Author

To the best of my knowledge, vectorised and scalar additions have roughly the same characteristics in terms of register pressure and instruction latency. Both can be pipelined in the CPU hardware to achieve greater parallelism. I expect this is the main reason why my implementation is so much faster in non-vectorised cases, since there was no ILP before.
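To illustrate the point (a standalone sketch, not the PR's row_sum): with ilp_factor independent accumulators there is no loop-carried dependency on a single register, so the additions can pipeline whether they are scalar or vector instructions:

#include <cstdint>

template <typename scalar_t>
scalar_t ilp_sum(const scalar_t* data, int64_t n) {
  constexpr int64_t ilp_factor = 4;
  scalar_t acc[ilp_factor] = {};  // independent partial sums
  int64_t i = 0;
  for (; i + ilp_factor <= n; i += ilp_factor) {
    for (int64_t k = 0; k < ilp_factor; ++k) {
      acc[k] += data[i + k];  // the k additions are mutually independent
    }
  }
  for (; i < n; ++i) {
    acc[0] += data[i];  // remainder elements
  }
  // Final combine: a short serial dependency chain, so unrolling it here
  // would not help (the point made in the comment above).
  for (int64_t k = 1; k < ilp_factor; ++k) {
    acc[0] += acc[k];
  }
  return acc[0];
}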

@ezyang
Contributor

ezyang commented Jun 15, 2020

I'm not qualified to review the reduction algorithm here. @colesbury / @ngimel do you think either of you would be able to do a detailed review? The new algorithm in SumKernel.cpp seems to be quite involved.

@peterbell10 it might be easier to review if we can get a more detailed algorithmic description, expanding beyond the PR description. Some things I'd be interested in seeing:

  1. How closely related is your new sum_kernel_impl to the old binary_kernel_reduce_vec? I see a number of parallels, such as the treatment of inner/outer reductions separately. Would it be accurate to say we've preserved the high-level structure of binary_kernel_reduce_vec entirely, but replaced its inner reduction loops?
  2. You seem to reuse code between the vectorized and non-vectorized implementations. How does this work? Do we get the expected behavior in the non-vectorized case?
  3. A diagram showing how the inner / outer reductions operate on the data (with rows involved) I think would be really useful for someone trying to understand how the algorithm works in general.

@peterbell10
Collaborator Author

I've added a more detailed description of the sum algorithm in a comment on multi_row_sum. I will work on a diagram of the multidimensional sum structure, but it is more or less the same as the vectorised paths in binary_kernel_reduce_vec.

In fact, I could probably reuse a lot more of the existing reduction machinery if it were templated on a multi_row_reduce operation instead of binary ops. I could work on that change here, or in another PR if we want to remove the duplication.

aten/src/ATen/native/cpu/SumKernel.cpp (outdated review thread, resolved)
auto ux = static_cast<uint64_t>(x);
// Last set bit is floor(log2(x)), floor + 1 is ceil
// except when x is an exact power of 2, so subtract 1 first
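// e.g. x = 8: findLastSet(7) = 2, so 2 + 1 = 3 == log2(8);
//      x = 9: findLastSet(8) = 3, so 3 + 1 = 4 == ceil(log2(9))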
return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
Collaborator

nice! I did not know llvm has it and we have it available.

int64_t out_strides[] = { strides[0], strides[2] };

// Move reduction to be the 1st dim
if (out_strides[0] != 0 && out_strides[1] == 0) {
Collaborator

I thought moving reduction dimensions to the beginning is done by TensorIterator, is it not the case?

Collaborator Author

This doesn't seem to always happen. For example,

torch.rand(100, 1).sum(1)

gives me strides 4, 0.

Collaborator

hmhm, that's probably because when you are reducing over a dimension of size 1, is_dim_reduced does not identify it as a reduction dimension.

bool TensorIterator::is_dim_reduced(int dim) const {
  for (auto& op : operands_) {
    if (op.is_output && op.stride_bytes[dim] == 0 && shape_[dim] > 1) {
      return true;
    }
  }
  return false;
}

TensorIterator logic around reductions is pretty confusing :-(

Collaborator Author

For the basic_loop version of the reduction, that would give a better memory access pattern over the input vector. I'm guessing that's why it's done this way?

Collaborator

I've dug a bit deeper, and it turns out that in this case dimensions are coalesced in TI, so instead of a (100, 1) tensor you have a (100,) tensor. Then bogus (0, 0) strides are pushed to the end of the strides vector as filler, and a bogus size1=1 is sent to the loop. Depending on whether the input tensor is contiguous or not, vectorized_outer_sum or scalar_inner_sum will be called; both produce correct results. I still don't know how both out_strides can be non-zero, but maybe it happens when the out kwarg is specified and for some reason strides for a dimension of size 1 are not reset (their values don't matter).

Collaborator Author

I found a minimal reproducer for the non-zero strided reduction. It also relies on reduction over a length-1 dimension:

torch.rand(2).expand(1, 3, 2).sum(0)

}

// Special case? - not a true reduction
if (out_strides[0] != 0 && out_strides[1] != 0) {
Collaborator

How can this happen?

Collaborator Author

I'm not entirely sure, but it comes up in one of the test_autograd.py test cases.

@ngimel
Collaborator

ngimel commented Jun 15, 2020

I think it makes sense to remove the duplication, but you can work on it in another PR. It would also be great to switch operations that currently use the non-vectorized reduction kernel (norm, std, etc.) to your implementation, because non-vectorized reduction performance is pretty bad.

@peterbell10
Collaborator Author

I have made this diagram; hopefully it makes things clearer:

[Diagram: multi_row_reduction]

@ngimel
Collaborator

ngimel commented Jun 17, 2020

I think this should be good to go, although replacing the most common reduction is scary :-)

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10 peterbell10 changed the title from "Use tree-based sum for floats to avoid numerical instability" to "Use cascade-summationt for floats to avoid numerical instability" Jun 23, 2020
@peterbell10 peterbell10 changed the title from "Use cascade-summationt for floats to avoid numerical instability" to "Use cascade-summation for floats to avoid numerical instability" Jun 23, 2020
@peterbell10
Collaborator Author

I've learned this algorithm has an established name, pairwise summation or cascade summation, which led me to numpy/numpy#3685. So it seems NumPy actually uses this exact algorithm already, with slightly different parameters.

@ngimel
Collaborator

ngimel commented Jun 24, 2020

@peterbell10 FYI, one of the captum tests is failing with this PR. To repro, after installing captum, go to the tests/attr folder and run

pytest -v -s test_feature_permutation.py -k test_single_input

In all likelihood the problem is not with this PR but with captum tests relying on floating-point arithmetic behaviour that cannot be relied upon; still, I can't land with a failing test. I'm looking at the test in more detail.
Edit: yep, it's a test that for some reason expects the sum of an array with permuted elements to differ from the sum of the original array.
