Use cascade-summation for floats to avoid numerical instability #39516
Conversation
Here is a comparison of the relative error in the single precision sum for array lengths from 10 to a billion elements. All were done single threaded and without vectorisation to mimic the Google Colab results from #38716. You can see this PR is consistently about 100x more accurate than the current implementation. The maximum relative error I got was just 2.4% over a billion elements, compared to a 98% error on master (i.e. completely wrong).
Benchmark results show non-vectorized sums are much faster (up to 6x) than before. However, the vectorized sums are much slower (up to 2x). Full benchmark results are below the fold.
Seeing if @ngimel has any comments. I'm not sure what to do about the vectorized slowdown though; 2x is a lot.
I'll see if there's anything I can do to improve the vectorized performance.
Given that sum is one of the most common functions, it would be good to see a bigger suite of benchmarks, e.g. different types, a wider variety of sizes, say up to 3d tensors with reductions over pairs of dimensions, single- and multithreaded. #38338 (soon to be merged) should make benchmarking easier.
The benchmark I included is focused on benchmarking this PR only. Only float is changed in this PR, and higher dimensions are just going to be …
I've updated the benchmarks; full benchmark numbers are below the fold.
A minor refactoring also revealed a flaw in how I was using the first-level accumulator. Fixing this gave a further huge improvement in the accuracy of large arrays:
That's awesome! Do you think more comprehensive correctness tests are needed, especially for reductions over tuples of dimensions (they can't always be coalesced into a reduction over a single dimension)?
This is a direct analog of reduce128, but with the tree accumulator
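For concreteness: a tree accumulator combines the partial sums pairwise, e.g. ((a0 + a1) + (a2 + a3)) rather than the serial (((a0 + a1) + a2) + a3), which keeps the addends at comparable magnitudes.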
return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
}

// Simultaneously sum over n rows at once
I personally find it helpful when there's pseudocode for the "naive" version of this algorithm around, it helps me understand how the input values are put together. In this case it should be pretty simple, right?
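Something along these lines, say (a hypothetical sketch with made-up names, just to show how the inputs are put together; the real kernel also deals with byte strides and vector types):

#include <cstdint>

// Naive version: sum each of the n rows independently, one serial loop per row.
template <typename scalar_t>
void naive_multi_row_sum(const scalar_t* in, int64_t row_stride,
                         int64_t size, int64_t n, scalar_t* out) {
  for (int64_t row = 0; row < n; ++row) {
    scalar_t acc = 0;
    for (int64_t col = 0; col < size; ++col) {
      acc += in[row * row_stride + col];
    }
    out[row] = acc;
  }
}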
return;
}

// Custom floating point sum for better accuracy
There should be a high-level description of what the algorithm does, similar to what you already have in the PR description, in the source code here, to make it easier for people to figure out what is going on.
partial_sums[0] += load<scalar_t>(in_data, in_stride, i);
}

for (int64_t k = 1; k < ilp_factor; ++k) {
OOC: how come you don't need an unroll here?
I have mainly unrolled operations I expect to execute in parallel on the CPU hardware. Since there is a data dependency between each trip of the loop, I don't expect this to pipeline well in hardware.
template <typename scalar_t>
scalar_t row_sum(const char * C10_RESTRICT in_data,
                 const int64_t in_stride, const int64_t size) {
  constexpr int64_t ilp_factor = 4;
I notice row_sum is called with both a non-vectorized scalar type and a vectorized scalar type. Does the ILP factor matter when you're not vectorizing? Why or why not?
To the best of my knowledge, vectorised and scalar additions have roughly the same characteristics in terms of register pressure and instruction latency. Both can be pipelined in the CPU hardware to achieve greater parallelism. I expect this is the main reason why my implementation is so much faster in non-vectorised cases, since there was no ILP before.
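To make the point concrete, a minimal hypothetical sketch (not the PR's actual code) of summing with several independent accumulators:

#include <cstdint>

// Four independent accumulators break the loop-carried dependency of a
// single `acc += x`, so the CPU can overlap the FP adds in its pipeline.
float ilp_sum(const float* data, int64_t n) {
  constexpr int64_t ilp_factor = 4;
  float acc[ilp_factor] = {0.0f};
  int64_t i = 0;
  for (; i + ilp_factor <= n; i += ilp_factor) {
    for (int64_t k = 0; k < ilp_factor; ++k) {
      acc[k] += data[i + k];  // the k chains are independent of each other
    }
  }
  for (; i < n; ++i) {
    acc[0] += data[i];  // serial tail
  }
  // Final combine: every trip depends on the previous one, so unrolling it
  // would not buy extra pipelining (this is the loop asked about above).
  float total = 0.0f;
  for (int64_t k = 0; k < ilp_factor; ++k) {
    total += acc[k];
  }
  return total;
}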
I'm not qualified to review the reduction algorithm here. @colesbury / @ngimel do you think either of you would be able to do a detailed review? The new algorithm in SumKernel.cpp seems to be quite involved. @peterbell10 it might be easier to review if we can get a more detailed algorithmic description, expanding beyond the PR description. Some things I'd be interested in seeing:
I've added a more detailed description of the sum algorithm in a comment in the source code. In fact, I could probably reuse a lot more of the existing reduction machinery if it was templated on a …
auto ux = static_cast<uint64_t>(x);
// Last set bit is floor(log2(x)), floor + 1 is ceil
// except when x is an exact power of 2, so subtract 1 first
return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
nice! I did not know llvm has it and we have it available.
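As a worked check of the trick: for x = 8, ux - 1 = 7 = 0b111, whose last set bit is at index 2, so the result is 2 + 1 = 3 = ceil(log2(8)); for x = 9, ux - 1 = 8 = 0b1000, last set bit at index 3, giving 4 = ceil(log2(9)).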
int64_t out_strides[] = { strides[0], strides[2] };

// Move reduction to be the 1st dim
if (out_strides[0] != 0 && out_strides[1] == 0) {
I thought moving reduction dimensions to the beginning is done by TensorIterator, is it not the case?
This doesn't seem to always happen. For example,
torch.rand(100, 1).sum(1)
gives me strides 4, 0.
hmhm, that's probably because when you are reducing over a dimension of size 1, is_dim_reduced does not identify it as a reduction dimension.
pytorch/aten/src/ATen/native/TensorIterator.cpp
Lines 472 to 479 in 4c7d81f
bool TensorIterator::is_dim_reduced(int dim) const {
  for (auto& op : operands_) {
    if (op.is_output && op.stride_bytes[dim] == 0 && shape_[dim] > 1) {
      return true;
    }
  }
  return false;
}
TensorIterator logic around reductions is pretty confusing :-(
For the basic_loop version of the reduction, that would give a better memory access pattern over the input vector. I'm guessing that's why it's done this way?
I've dug a bit deeper, and it turns out in this case dimensions are coalesced in TI, so instead of a (100, 1) tensor you have a (100,) tensor. Then bogus (0, 0) strides are pushed back to the end of the strides vector to fill in, and a bogus size1=1 is sent to the loop. Depending on whether the input tensor is contiguous or not, vectorized_outer_sum or scalar_inner_sum will be called; both will produce correct results. I still don't know how out_strides can both be non-zero, but maybe it happens when the out kwarg is specified and for some reason strides for a dimension equal to 1 are not reset (their values don't matter).
I found a minimal reproducer for the non-zero strided reduction. It also relies on reduction over a length-1 dimension:
torch.rand(2).expand(1, 3, 2).sum(0)
}

// Special case? - not a true reduction
if (out_strides[0] != 0 && out_strides[1] != 0) {
How can this happen?
I'm not entirely sure, but it comes up in one of the test_autograd.py test cases.
I think it makes sense to remove duplication, but you can work on it in another PR. It would also be great to switch operations that currently use the non-vectorized reduction kernel (norm, std, etc.) to your implementation, because non-vectorized reduction performance is pretty bad.
I think this should be good to go, although replacing the most common reduction is scary :-)
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I've learned this algorithm has an established name, pairwise summation or cascade summation, which led me to numpy/numpy#3685. So it seems NumPy actually already uses this exact algorithm, with slightly different parameters.
@peterbell10 FYI, one of the captum tests is failing with this PR. To repro, after installing captum, go to the tests/attr folder and run …
In all likelihood, the problem is not with this PR but rather with captum tests relying on fp arithmetic that cannot be relied on, but I can't land with a failing test. I'm looking at the test in more detail.
Fixes #38716, fixes #37234
This uses a variant of pairwise summation to calculate sums of long lists without noticeable loss of precision. It uses multiple "levels" of accumulator along each axis, each of which is designed to hold the sum of an order of magnitude more values than the previous.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
Ref: https://en.wikipedia.org/wiki/Pairwise_summation
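For illustration, here is a minimal sketch of the classic recursive formulation of pairwise summation (the same idea; the PR's kernel instead uses the iterative multi-level accumulator scheme described above, with ILP and vectorisation rather than recursion):

#include <cstdint>

// Recursively split the range in half until a small base case, then add the
// two half-sums. The partial results being added always cover a similar
// number of elements, so their magnitudes stay comparable and the rounding
// error grows roughly with log(n) instead of n.
float pairwise_sum(const float* data, int64_t n) {
  if (n <= 8) {  // small serial base case
    float acc = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
      acc += data[i];
    }
    return acc;
  }
  const int64_t half = n / 2;
  return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}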