Use deterministic impl of `index_put` and `index` backward CPU when torch.are_deterministic_algorithms_enabled() == True (#51388)
Conversation
CI status: no failures as of commit 2567ee8.
Codecov Report

```
@@           Coverage Diff            @@
##           master    #51388   +/-   ##
========================================
  Coverage    80.85%    80.86%
========================================
  Files         1938      1938
  Lines       211190    211190
========================================
+ Hits        170765    170773       +8
+ Misses       40425     40417       -8
```
test/test_torch.py (outdated)

```
@@ -3456,6 +3456,23 @@ def _test_in_place_broadcastable(t0, t1, t2=None):
            _test_in_place_broadcastable(small2, small_expanded, large_expanded)
            _test_in_place_broadcastable(small2, small, large)

    def test_index_put_nondeterministic_alert(self, device):
```
Is there a test for the dunder getitem behavior?
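For reference, a minimal sketch of what a `__getitem__`-path check could look like, assuming the same CPU serial fallback covers advanced indexing and its backward; the function name, shapes, and indices below are illustrative, not the test added in this PR:

```python
import torch

def check_getitem_backward_is_deterministic():
    # Illustrative only: advanced indexing (Tensor.__getitem__) dispatches to
    # index, and its backward accumulates gradients at repeated indices, which
    # is where nondeterminism can appear on the parallel CPU path.
    torch.use_deterministic_algorithms(True)
    grads = []
    for _ in range(2):
        torch.manual_seed(0)
        x = torch.randn(100, requires_grad=True)
        idx = torch.tensor([0, 0, 5, 5, 5])  # repeated indices
        x[idx].sum().backward()
        grads.append(x.grad)
    assert torch.equal(grads[0], grads[1])

check_getitem_backward_is_deterministic()
```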
```
@@ -3031,6 +3031,9 @@ def method_tests():
        ('__getitem__', torch.randn(S, S, S), (dont_convert([[0, 3], Ellipsis]),), 'adv_index_sub_3'),
        ('__getitem__', torch.randn(S, S, S), (dont_convert([[0, 2, 3], [1, 3, 3],
                                                             torch.LongTensor([0, 0, 2])]),), 'adv_index_var'),
        ('__getitem__', torch.randn(S, S, S), ([torch.LongTensor([0, 2])]),
```
Can we add an OpInfo for this?
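A rough sketch of what such an entry might look like is below; the exact `OpInfo`/`SampleInput` keyword arguments vary between versions of `common_methods_invocations.py`, so treat the argument names here as assumptions rather than the actual API at this commit:

```python
import torch
from torch.testing._internal.common_methods_invocations import OpInfo, SampleInput

# Hypothetical sample-inputs function for an advanced-indexing __getitem__ OpInfo.
def sample_inputs_getitem(op_info, device, dtype, requires_grad):
    t = torch.randn(5, 5, 5, device=device, dtype=dtype, requires_grad=requires_grad)
    return [SampleInput(t, args=([torch.tensor([0, 2], device=device)],))]

# Hypothetical entry; the keyword names (op, dtypes, sample_inputs_func) are assumptions.
getitem_opinfo = OpInfo(
    '__getitem__',
    op=torch.Tensor.__getitem__,
    dtypes=(torch.float32, torch.float64),
    sample_inputs_func=sample_inputs_getitem,
)
```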
Hey @kurtamohler! It's great we're being diligent about deterministic coverage. I made a few comments inline; however, I'm particularly concerned about the potential performance impact of this approach. Can we mitigate it and update the deterministic pattern so that users who have the deterministic flag disabled (the default) don't see a performance impact? Separate question: should we look at adding a builtin_expect (see https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) to PyTorch to ensure that the branch predictor doesn't prioritize branches guarded by the global deterministic flag? If we were using C++20 it'd also be interesting to consider likely/unlikely (https://en.cppreference.com/w/cpp/language/attributes/likely). cc @ngimel
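Not part of the review, but one quick way to sanity-check the overhead concern for default users might be a small timing sketch like the one below (sizes and call counts are arbitrary):

```python
import timeit
import torch

# Compare index_put_ timing with deterministic algorithms off (the default) vs. on,
# to see whether the flag check / serial fallback is visible to default users.
x = torch.zeros(1000)
idx = (torch.randint(0, 1000, (100000,)),)
src = torch.randn(100000)

for flag in (False, True):
    torch.use_deterministic_algorithms(flag)
    t = timeit.timeit(lambda: x.index_put_(idx, src, accumulate=True), number=100)
    print(f"deterministic={flag}: {t * 1e3:.2f} ms for 100 calls")
```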
Title changed from "Use deterministic impl of `index_put` and `index` backward CPU" to "Use deterministic impl of `index_put` and `index` backward CPU when torch.are_deterministic_algorithms_enabled() == True".
Instead of adding a nondeterministic alert, this PR now just uses the serial implementation of `index_put` and `index` backward on CPU when torch.are_deterministic_algorithms_enabled() == True.

Unfortunately, there's no simple way to add tests for this behavior, as far as I know. Maybe soon we should consider creating a daily or weekly test suite that runs deterministic operations several times and ensures that they always return the same result given the same input. It wouldn't be able to guarantee determinism, but it would be good to have some kind of safety net.

@mruberry, I like your idea of adding a branch predictor hint. I'll create a new issue for that.
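For what it's worth, a hedged sketch of the kind of repeat-and-compare check described above; it can only catch violations, not prove determinism, and the op, shapes, and repetition count are illustrative:

```python
import torch

torch.use_deterministic_algorithms(True)

def index_put_accumulate_result():
    # Rebuild identical inputs on every call so runs are comparable.
    torch.manual_seed(0)
    x = torch.zeros(1000)
    idx = (torch.randint(0, 1000, (100000,)),)   # many repeated indices
    src = torch.randn(100000)
    return x.index_put_(idx, src, accumulate=True)

reference = index_put_accumulate_result()
for _ in range(20):
    # Rerun on identical inputs and require bitwise-identical results.
    assert torch.equal(index_put_accumulate_result(), reference)
```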
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves #24751 Resolves #24614 Resolves #24640 Resolves #24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in #51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):

- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing (`index_put` exists, for example)
- Corrected docs

It also adds much more thorough testing of the operations and their gradients.

There is a BC-breaking change: we now check that the inputs do not overlap in the `put_` operation. This was handled in the TH implementation (in some cases; other cases were wrong) by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:

<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)
ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")
    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

`put_(accumulate=False)`

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>ATen GPU (same speed)</summary>

```
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

`put_(accumulate=True)`

<details>
<summary>ATen CPU (x2 speedup)</summary>

```
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

`take()`

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>ATen GPU (same speed)</summary>

```
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: #53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
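For readers unfamiliar with the pair of ops being ported, a tiny usage sketch (values chosen arbitrarily): `Tensor.put_` scatters `source` into the flattened positions given by `index` (summing them when `accumulate=True`), and `torch.take` reads those flattened positions back.

```python
import torch

x = torch.zeros(3, 4)
index = torch.tensor([0, 5, 5, 11])        # linear indices into the flattened x
source = torch.tensor([1., 2., 3., 4.])

x.put_(index, source, accumulate=True)     # repeated index 5 accumulates 2. + 3. = 5.
print(x)
# tensor([[1., 0., 0., 0.],
#         [0., 5., 0., 0.],
#         [0., 0., 0., 4.]])

print(torch.take(x, index))                # reads the same flattened positions back
# tensor([1., 5., 5., 4.])
```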
Fixes #51366