
Use cascade-summation for floats to avoid numerical instability #39516

Closed
wants to merge 13 commits

Conversation

peterbell10
Collaborator

@peterbell10 peterbell10 commented Jun 4, 2020

Fixes #38716, fixes #37234

This uses a variant of pairwise summation to calculate sums of long lists without noticeable loss of precision. It uses multiple "levels" of accumulator along each axis, each of which is designed to hold the sum of an order of magnitude more values than the previous.

e.g. if there are 2^16 elements, the first level will hold sums of 2^4 elements each, and each subsequent level holds 2^4 times more: 2^4, 2^8, 2^12 and finally 2^16.

This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.

Ref: https://en.wikipedia.org/wiki/Pairwise_summation
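For intuition, here is a minimal recursive sketch of pairwise summation (illustrative only; the PR's kernel uses fixed accumulator levels rather than recursion):

#include <cstdint>

float pairwise_sum(const float* data, int64_t n) {
  // Base case: a short run is summed serially; the rounding error from
  // at most 16 additions is small.
  if (n <= 16) {
    float acc = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
      acc += data[i];
    }
    return acc;
  }
  // Recursive case: each addition combines partial sums over similar
  // numbers of elements, so rounding error grows as O(log n) rather
  // than O(n) for a naive running sum.
  const int64_t mid = n / 2;
  return pairwise_sum(data, mid) + pairwise_sum(data + mid, n - mid);
}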

@dr-ci

dr-ci bot commented Jun 4, 2020

💊 CI failures summary and remediations

As of commit fcea51d (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm: CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build, step "Checkout code".


@peterbell10
Collaborator Author

Here is a comparison of the relative error in the single-precision sum for array lengths from 10 to a billion elements. All runs were single-threaded and without vectorisation, to mimic the Google Colab results from #38716.

You can see this PR is consistently about 100x more accurate than the current implementation. The maximum relative error I got was just 2.4% over a billion elements, compared to a 98% error on master (i.e. completely wrong).

[Graph: relative error vs. array length]

In code:

import torch

torch.set_num_threads(1)  # single-threaded, matching the colab setup
torch.manual_seed(0)
xs = torch.rand(100)  # 100 random fill values

num_lens = 9
lens = torch.logspace(1, num_lens, num_lens).long()  # 10, 100, ..., 10**9
errs = torch.empty((lens.numel(), xs.numel()), dtype=xs.dtype)

for i, l in enumerate(lens.tolist()):
    for j, x in enumerate(xs.tolist()):
        # The strided view keeps the sum on the non-vectorized code path
        arr = torch.full((l * 2,), x)[::2]
        # Relative error against the exact sum x * arr.numel()
        errs[i][j] = abs(arr.sum() - x * arr.numel()) / (x * arr.numel())

error = errs.mean(axis=1)
@peterbell10 peterbell10 marked this pull request as ready for review June 10, 2020 12:02
@peterbell10 peterbell10 changed the title from "WIP: Use tree-based sum for floats to avoid numerical instability" to "Use tree-based sum for floats to avoid numerical instability" Jun 10, 2020
@peterbell10
Collaborator Author

peterbell10 commented Jun 10, 2020

Benchmark results show non-vectorized sums are much faster (up to 6x) than before. However, the vectorized sums are much slower (up to 2x).

Full benchmark results below fold
# Input: R: 1024, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 123.984
Current Execution Time (us) : 51.811

# Input: R: 1024, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 3244.789
Current Execution Time (us) : 3109.085

# Input: R: 1024, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 69.474
Current Execution Time (us) : 31.434

# Input: R: 1024, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 165.043
Current Execution Time (us) : 1390.865

# Input: R: 1024, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 742.743
Current Execution Time (us) : 125.282

# Input: R: 1024, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 6538.515
Current Execution Time (us) : 5955.180

# Input: R: 1024, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 120.366
Current Execution Time (us) : 62.076

# Input: R: 1024, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 524.178
Current Execution Time (us) : 2879.336

# Input: R: 8192, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 2850.641
Current Execution Time (us) : 1636.910

# Input: R: 8192, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 23814.197
Current Execution Time (us) : 24749.496

# Input: R: 8192, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 839.180
Current Execution Time (us) : 707.876

# Input: R: 8192, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 1888.084
Current Execution Time (us) : 11735.923

# Input: R: 8192, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 6094.401
Current Execution Time (us) : 3509.463

# Input: R: 8192, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 51274.606
Current Execution Time (us) : 53312.929

# Input: R: 8192, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 1513.480
Current Execution Time (us) : 1380.813

# Input: R: 8192, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 3438.891
Current Execution Time (us) : 23206.833

@ezyang
Contributor

ezyang commented Jun 10, 2020

Seeing if @ngimel has some comments. I'm not sure what to do about the vectorized slowdown though; 2x is a lot.

@peterbell10
Collaborator Author

I'll see if there's anything I can do to improve the vectorized performance.

@ngimel
Collaborator

ngimel commented Jun 10, 2020

Given that sum is one of the most common functions, it would be good to see a bigger suite of benchmarks: different types, a wider variety of sizes, say up to 3d tensors with reductions over pairs of dimensions, single- and multi-threaded. #38338 (soon to be merged) should make benchmarking easier.

@ngimel ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 10, 2020
@peterbell10
Collaborator Author

peterbell10 commented Jun 10, 2020

The benchmark I included is focused on benchmarking this PR only. Only float is changed in this PR, and for higher dimensions TensorIterator just chooses different combinations of the non-contiguous, inner-vectorized and outer-vectorized reductions. That's why I focused on 2D and float here.

@peterbell10
Collaborator Author

I've updated the vectorized_outer_sum to more closely match reduce128. The updated benchmark figures now show at worst a 10% speed regression and still show up to 6x speed improvements for some non-contiguous inputs.
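For context, the outer-reduction access pattern being tuned here looks roughly like this in plain scalar C++ (a sketch only; the actual kernel works on Vec256 lanes with cascaded accumulators):

#include <cstdint>

// Reduce an R x V row-major matrix over dim 0: walk memory row by row so
// reads stay contiguous, and let the independent output columns vectorize.
void outer_sum(const float* in, int64_t R, int64_t V, float* out) {
  for (int64_t j = 0; j < V; ++j) {
    out[j] = 0.0f;
  }
  for (int64_t r = 0; r < R; ++r) {
    for (int64_t j = 0; j < V; ++j) {
      out[j] += in[r * V + j];
    }
  }
}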

Full benchmark numbers below fold
# Input: R: 1024, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 47.166
PT 1.5.0 Execution Time (us) : 46.152

# Input: R: 1024, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 2930.226
PT 1.5.0 Execution Time (us) : 2658.398

# Input: R: 1024, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 45.676
PT 1.5.0 Execution Time (us) : 44.131

# Input: R: 1024, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 195.049
PT 1.5.0 Execution Time (us) : 1081.341

# Input: R: 1024, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 95.332
PT 1.5.0 Execution Time (us) : 93.709

# Input: R: 1024, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 6161.287
PT 1.5.0 Execution Time (us) : 5741.509

# Input: R: 1024, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 87.839
PT 1.5.0 Execution Time (us) : 84.595

# Input: R: 1024, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 535.933
PT 1.5.0 Execution Time (us) : 2249.186

# Input: R: 8192, V: 512, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 1435.037
PT 1.5.0 Execution Time (us) : 1460.735

# Input: R: 8192, V: 512, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 36749.245
PT 1.5.0 Execution Time (us) : 36095.155

# Input: R: 8192, V: 512, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 1086.657
PT 1.5.0 Execution Time (us) : 1029.092

# Input: R: 8192, V: 512, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 2194.304
PT 1.5.0 Execution Time (us) : 8854.416

# Input: R: 8192, V: 1024, dim: 0, contiguous: True, device: cpu
This PR Execution Time (us) : 4121.247
PT 1.5.0 Execution Time (us) : 4166.985

# Input: R: 8192, V: 1024, dim: 0, contiguous: False, device: cpu
This PR Execution Time (us) : 78065.462
PT 1.5.0 Execution Time (us) : 78299.460

# Input: R: 8192, V: 1024, dim: 1, contiguous: True, device: cpu
This PR Execution Time (us) : 2154.252
PT 1.5.0 Execution Time (us) : 2145.106

# Input: R: 8192, V: 1024, dim: 1, contiguous: False, device: cpu
This PR Execution Time (us) : 4428.362
PT 1.5.0 Execution Time (us) : 17697.258

A minor refactoring also revealed a flaw in how I was using the first-level accumulator. Fixing this gave a further huge improvement in accuracy for large arrays:

[Graph: relative error vs. array length, after the fix]

@peterbell10
Collaborator Author

As of the latest commit, the non-vectorized cases are 3-6x faster than PyTorch 1.5. The vectorised cases all measure the same, to within a 1-2% margin of error.

[Benchmark graphs: quick, medium and slow cases]

@ngimel
Collaborator

ngimel commented Jun 12, 2020

That's awesome! Do you think more comprehensive correctness tests are needed, especially for reductions over tuples of dimensions (they can't always be coalesced into reduction over a single dimension)? sum is one of the most common operations, and I'm not at all sure that our test suite covers all there is to cover.

return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
}

// Simultaneously sum over n rows at once
Contributor

I personally find it helpful when there's pseudocode for the "naive" version of this algorithm around; it helps me understand how the input values are put together. In this case it should be pretty simple, right?
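For reference, a naive version might look something like this sketch (hypothetical names and signature, not code from the PR):

#include <cstdint>

// Sum over nrows rows at once; each row gets its own independent
// accumulator, and row r reads in[r * row_stride + i] for i in [0, size).
template <typename scalar_t, int64_t nrows>
void naive_multi_row_sum(const scalar_t* in, int64_t row_stride,
                         int64_t size, scalar_t* out) {
  for (int64_t r = 0; r < nrows; ++r) {
    scalar_t acc = 0;
    for (int64_t i = 0; i < size; ++i) {
      acc += in[r * row_stride + i];
    }
    out[r] = acc;
  }
}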

return;
}

// Custom floating point sum for better accuracy
Contributor

There should be a high-level description of what the algorithm does, similar to what you already have in the PR description, in the source code here, to make it easier for people to figure out what is going on.

partial_sums[0] += load<scalar_t>(in_data, in_stride, i);
}

for (int64_t k = 1; k < ilp_factor; ++k) {
Contributor

OOC: how come you don't need an unroll here?

Collaborator Author

I have mainly unrolled operations I expect to execute in parallel on the CPU hardware. Since there is a data dependency between each trip of the loop, I don't expect this to pipeline well in hardware.

template <typename scalar_t>
scalar_t row_sum(const char * C10_RESTRICT in_data,
const int64_t in_stride, const int64_t size) {
constexpr int64_t ilp_factor = 4;
Contributor

I notice row_sum is called with both a non-vectorized scalar type, and a vectorized scalar type. Does the ILP factor matter when you're not vectorizing? Why or why not?

Collaborator Author

To the best of my knowledge, vectorised and scalar additions have roughly the same characteristics in terms of register pressure and instruction latency. Both can be pipelined in the CPU hardware to achieve greater parallelism. I expect this is the main reason why my implementation is so much faster in non-vectorised cases, since there was no ILP before.
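To illustrate the point (a standalone sketch, not the PR's row_sum): with ilp_factor independent accumulators there is no loop-carried dependency on a single register, so the additions can pipeline whether they are scalar or vector instructions:

#include <cstdint>

template <typename scalar_t>
scalar_t ilp_sum(const scalar_t* data, int64_t n) {
  constexpr int64_t ilp_factor = 4;
  scalar_t acc[ilp_factor] = {};  // independent partial sums
  int64_t i = 0;
  for (; i + ilp_factor <= n; i += ilp_factor) {
    for (int64_t k = 0; k < ilp_factor; ++k) {
      acc[k] += data[i + k];  // the k additions are mutually independent
    }
  }
  for (; i < n; ++i) {
    acc[0] += data[i];  // remainder elements
  }
  // Final combine: a short serial dependency chain, so unrolling it here
  // would not help (the point made in the comment above).
  for (int64_t k = 1; k < ilp_factor; ++k) {
    acc[0] += acc[k];
  }
  return acc[0];
}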

@ezyang
Contributor

ezyang commented Jun 15, 2020

I'm not qualified to review the reduction algorithm here. @colesbury / @ngimel do you think either of you would be able to do a detailed review? The new algorithm in SumKernel.cpp seems to be quite involved.

@peterbell10 it might be easier to review if we can get a more detailed algorithmic description, expanding beyond the PR description. Some things I'd be interested in seeing:

  1. How closely related is your new sum_kernel_impl to the old binary_kernel_reduce_vec? I see a number of parallels, such as the treatment of inner/outer reductions separately. Would it be accurate to say we've preserved the high-level structure of binary_kernel_reduce_vec entirely, but replaced its inner reduction loops?
  2. You seem to reuse code between the vectorized and non-vectorized implementations. How does this work? Do we get the expected behavior in the non-vectorized case?
  3. A diagram showing how the inner / outer reductions operate on the data (with rows involved) I think would be really useful for someone trying to understand how the algorithm works in general.

@peterbell10
Collaborator Author

I've added a more detailed description of the sum algorithm in a comment on multi_row_sum. I will work on a diagram of the multidimensional sum structure, but it is more or less the same as the vectorised paths in binary_kernel_reduce_vec.

In fact, I could probably reuse a lot more of the existing reduction machinery if it were templated on a multi_row_reduce operation instead of binary ops. I could work on that change here, or in another PR if we want to remove the duplication.

aten/src/ATen/native/cpu/SumKernel.cpp (outdated review thread, resolved)
auto ux = static_cast<uint64_t>(x);
// Last set bit is floor(log2(x)), floor + 1 is ceil
// except when x is an exact power of 2, so subtract 1 first
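// e.g. x = 8: findLastSet(7) = 2, so 2 + 1 = 3 == log2(8);
//      x = 9: findLastSet(8) = 3, so 3 + 1 = 4 == ceil(log2(9))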
return static_cast<int64_t>(llvm::findLastSet(ux - 1)) + 1;
Collaborator

nice! I did not know llvm has it and we have it available.

int64_t out_strides[] = { strides[0], strides[2] };

// Move reduction to be the 1st dim
if (out_strides[0] != 0 && out_strides[1] == 0) {
Collaborator

I thought moving reduction dimensions to the beginning is done by TensorIterator, is it not the case?

Collaborator Author

This doesn't seem to always happen. For example,

torch.rand(100, 1).sum(1)

gives me strides 4, 0.

Collaborator

hmhm, that's probably because when you are reducing over a dimension of size 1, is_dim_reduced does not identify it as a reduction dimension.

bool TensorIterator::is_dim_reduced(int dim) const {
  for (auto& op : operands_) {
    if (op.is_output && op.stride_bytes[dim] == 0 && shape_[dim] > 1) {
      return true;
    }
  }
  return false;
}

TensorIterator logic around reductions is pretty confusing :-(

Collaborator Author

For the basic_loop version of the reduction, that would give a better memory access pattern over the input vector. I'm guessing that's why it's done this way?

Collaborator

I've dug a bit deeper, and it turns out that in this case dimensions are coalesced in TI, so instead of a (100, 1) tensor you have a (100,) tensor. Then bogus (0, 0) strides are pushed to the end of the strides vector as filler, and a bogus size1=1 is sent to the loop. Depending on whether the input tensor is contiguous or not, vectorized_outer_sum or scalar_inner_sum will be called; both produce correct results. I still don't know how both out_strides can be non-zero, but maybe it happens when the out kwarg is specified and for some reason strides for a dimension of size 1 are not reset (their values don't matter).

Collaborator Author

I found a minimal reproducer for the non-zero strided reduction. It also relies on reduction over a length-1 dimension:

torch.rand(2).expand(1, 3, 2).sum(0)

}

// Special case? - not a true reduction
if (out_strides[0] != 0 && out_strides[1] != 0) {
Collaborator

How can this happen?

Collaborator Author

I'm not entirely sure, but it comes up in one of the test_autograd.py test cases.

@ngimel
Collaborator

ngimel commented Jun 15, 2020

I think it makes sense to remove the duplication, but you can work on it in another PR. It would also be great to switch operations that currently use the non-vectorized reduction kernel (norm, std, etc.) to your implementation, because non-vectorized reduction performance is pretty bad.

@peterbell10
Collaborator Author

I have made this diagram; hopefully it makes things clearer:

[Diagram: multi_row_reduction]

@ngimel
Collaborator

ngimel commented Jun 17, 2020

I think this should be good to go, although replacing the most common reduction is scary :-)

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10 peterbell10 changed the title from "Use tree-based sum for floats to avoid numerical instability" to "Use cascade-summationt for floats to avoid numerical instability" Jun 23, 2020
@peterbell10 peterbell10 changed the title from "Use cascade-summationt for floats to avoid numerical instability" to "Use cascade-summation for floats to avoid numerical instability" Jun 23, 2020
@peterbell10
Collaborator Author

I've learned this algorithm has an established name, pairwise summation or cascade summation, which led me to numpy/numpy#3685. So it seems NumPy actually uses this exact algorithm already, with slightly different parameters.

@ngimel
Collaborator

ngimel commented Jun 24, 2020

@peterbell10 FYI, one of the captum tests is failing with this PR. To repro, after installing captum, go to the tests/attr folder and run

pytest -v -s test_feature_permutation.py -k test_single_input

In all likelihood the problem is not with this PR but with captum tests relying on floating-point arithmetic behaviour that cannot be relied upon; still, I can't land with a failing test. I'm looking at the test in more detail.
Edit: yep, it's a test that for some reason expects the sum of an array with permuted elements to differ from the sum of the original array.
