
Implemented torch.linalg.multi_dot #51807

Closed

Conversation

heitorschueroff (Contributor) commented Feb 5, 2021

Stack from ghstack:

Differential Revision: D26375734

Implemented torch.linalg.multi_dot similar to numpy.linalg.multi_dot.

This function does not support broadcasting or batched inputs at the moment.

NOTE
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions, whereas here we only allow them to be either 1D or 2D.
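
For orientation, a minimal usage sketch (my own illustration, not code from this PR; the shapes and printed results are assumptions based on the documented semantics):

```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)

# Equivalent to A @ B @ C, but the multiplication order is chosen to
# minimize the overall cost of the chain.
out = torch.linalg.multi_dot((A, B, C))
print(out.shape)  # torch.Size([10, 50])

# The first and last tensors may also be 1D; they are treated as a row
# vector and a column vector respectively, and those dimensions are
# squeezed out of the result.
v = torch.randn(10)
w = torch.randn(50)
print(torch.linalg.multi_dot((v, A, B, C, w)).shape)  # torch.Size([])
```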

BENCHMARK
In the following benchmarks, the labels are the sizes k0 x k1 x k2 x k3 and the matrices being multiplied have shapes (k0, k1) x (k1, k2) x (k2, k3) x (k3, k0). The baseline simply multiplies the tensors left to right in PyTorch.
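
The gap between the baseline and multi_dot comes from the multiplication order: multi_dot picks a parenthesization that minimizes the estimated number of scalar multiplications (the classic matrix-chain-ordering problem, which is also what numpy.linalg.multi_dot documents), while the baseline always multiplies left to right. Below is a rough sketch of that cost model using one of the shapes from the table; it only illustrates the idea, it is not the PR's implementation, and memory traffic means the FLOP ratio does not translate one-to-one into the measured speedups:

```python
def left_to_right_cost(dims):
    """Scalar multiplications when multiplying strictly left to right.

    dims has length n + 1; matrix i has shape (dims[i], dims[i + 1]).
    """
    cost, rows = 0, dims[0]
    for k in range(1, len(dims) - 1):
        cost += rows * dims[k] * dims[k + 1]
    return cost


def optimal_chain_cost(dims):
    """Minimum scalar multiplications over all parenthesizations (classic DP)."""
    n = len(dims) - 1
    m = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]


# Shapes from the "2247 x 2 x 8050 x 74" row: (2247, 2) x (2, 8050) x (8050, 74) x (74, 2247)
dims = [2247, 2, 8050, 74, 2247]
print(left_to_right_cost(dims))   # 1,748,341,266 multiplications
print(optimal_chain_cost(dims))   # 11,621,974 multiplications, ~150x fewer
```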

cpu results

[-------------------------------  -------------------------------]
                                 |  Baseline  |  PyTorch  |  NumPy
1 threads: -------------------------------------------------------
      725  x 257  x 151  x 49    |    1500    |     900   |   1040
      81   x 1772 x 83   x 37    |     430    |     320   |    390
      3    x 3022 x 78   x 1598  |     110    |     120   |    200
      1082 x 2    x 78   x 5     |     322    |     400   |    630
      7    x 35   x 4077 x 63    |     140    |     140   |    200
      190  x 56   x 8984 x 2     |    2800    |     200   |    230
      79   x 311  x 22   x 500   |     161    |      60   |    100
      2247 x 2    x 8050 x 74    |   89100    |    3190   |   2760
      11   x 123  x 2    x 201   |      11    |      10   |     50
      54   x 1057 x 38   x 3838  |     600    |     330   |    400
8 threads: -------------------------------------------------------
      725  x 257  x 151  x 49    |     280    |     190   |    500
      81   x 1772 x 83   x 37    |     100    |      90   |    200
      3    x 3022 x 78   x 1598  |     600    |     600   |    600
      1082 x 2    x 78   x 5     |     100    |      80   |    910
      7    x 35   x 4077 x 63    |     130    |     100   |    160
      190  x 56   x 8984 x 2     |     900    |      99   |    150
      79   x 311  x 22   x 500   |      88    |      68   |    110
      2247 x 2    x 8050 x 74    |   18000    |    1000   |   3000
      11   x 123  x 2    x 201   |      11    |      11   |     45
      54   x 1057 x 38   x 3838  |     200    |     110   |    150

Times are in microseconds (us).

cuda results

[--------------------------  --------------------------]
                                 |  Baseline  |  PyTorch
1 threads: ---------------------------------------------
      725  x 257  x 151  x 49    |     62     |     40  
      81   x 1772 x 83   x 37    |     59     |     61  
      3    x 3022 x 78   x 1598  |     42     |     59  
      1082 x 2    x 78   x 5     |     36     |    181  
      7    x 35   x 4077 x 63    |    111     |    101  
      190  x 56   x 8984 x 2     |    112     |     60  
      79   x 311  x 22   x 500   |     34     |     35  
      2247 x 2    x 8050 x 74    |    910     |    175  
      11   x 123  x 2    x 201   |     34     |     38  
      54   x 1057 x 38   x 3838  |    139     |    133  

Times are in microseconds (us).

script

from torch.utils import benchmark
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor

fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k2', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k3', minval=1, maxval=10000, distribution='loguniform'),
    ],
    tensors=[
        FuzzedTensor('a', size=('k0', 'k1'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('b', size=('k1', 'k2'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('c', size=('k2', 'k3'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('d', size=('k3', 'k0'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
    ],
    seed=0,
)

results = []
for tensors, tensor_params, params in fuzzer.take(10):
    # description is the column label
    sub_label = f"{params['k0']:<4} x {params['k1']:<4} x {params['k2']:<4} x {params['k3']:<4}"
    for num_threads in (1, 8):
        results.append(benchmark.Timer(
            stmt='a @ b @ c @ d',
            globals=tensors,
            sub_label=sub_label,
            description='Baseline',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='torch.linalg.multi_dot((a, b, c, d))',
            globals=tensors,
            sub_label=sub_label,
            description='PyTorch',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='np.linalg.multi_dot((a, b, c, d))',
            setup='import numpy as np',
            globals=tensors,
            sub_label=sub_label,
            description='NumPy',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()

results = []
for tensors, tensor_params, params in fuzzer.take(10):
    sub_label = f"{params['k0']:<4} x {params['k1']:<4} x {params['k2']:<4} x {params['k3']:<4}"
    tensors = {k: tensors[k].cuda() for k in tensors.keys()}
    results.append(benchmark.Timer(
        stmt='a @ b @ c @ d',
        globals=tensors,
        sub_label=sub_label,
        description='Baseline',
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='torch.linalg.multi_dot((a, b, c, d))',
        globals=tensors,
        sub_label=sub_label,
        description='PyTorch',
    ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()

TODO

- [x] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

heitorschueroff added a commit that referenced this pull request Feb 5, 2021
ghstack-source-id: c3e939b78950eaca6376c8675a263a6f78d1d0fd
Pull Request resolved: #51807
facebook-github-bot (Contributor) commented Feb 5, 2021

💊 CI failures summary and remediations

As of commit 905ce1b (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mruberry (Collaborator) commented Feb 8, 2021

> NOTE
> numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.

This might be logically implied and seems like a reasonable limitation, but I don't think the docs actually say these must be 1 or 2D?

https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html

As you mentioned offline, PyTorch's dot (https://pytorch.org/docs/master/generated/torch.dot.html?highlight=dot#torch.dot) is distinct from NumPy's dot (https://numpy.org/doc/stable/reference/generated/numpy.dot.html). In particular, PyTorch's dot wouldn't support the multi_dot use case. Let's be sure to be clear about that.
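
To make the distinction concrete, a small sketch (mine, not from the PR): numpy.dot on 2D inputs performs matrix multiplication, whereas torch.dot only accepts 1D tensors, so the 2D steps of a multi_dot chain have to go through matmul/mm instead.

```python
import numpy as np
import torch

a, b = np.random.randn(2, 3), np.random.randn(3, 4)
np.dot(a, b)              # NumPy: dot on 2D x 2D is matrix multiplication -> shape (2, 4)

x, y = torch.randn(5), torch.randn(5)
torch.dot(x, y)           # PyTorch: dot is only the inner product of two 1D tensors

ta, tb = torch.randn(2, 3), torch.randn(3, 4)
# torch.dot(ta, tb)       # would raise a RuntimeError: torch.dot expects 1D tensors
torch.matmul(ta, tb)      # 2D multiplication goes through matmul / the @ operator
```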

multi_dot = _add_docstr(_linalg.linalg_multi_dot, r"""
linalg.multi_dot(tensors, *, out=None)

Reviewer's note: the use of "tensors" here is consistent with the language in other functions that accept a list of tensors, like torch.cat (https://pytorch.org/docs/master/generated/torch.cat.html?highlight=cat#torch.cat)

mruberry requested review from IvanYashchuk and removed request for albanD, soulitzer and glaringlee on February 8, 2021 09:28

mruberry (Collaborator) commented:

mypy issues aside this looks pretty good to me, I made a few comments. @IvanYashchuk, would you like to take another look?

IvanYashchuk (Collaborator) left a comment:

I am happy with the fixes in the implementation and that there is no code duplication with chain_matmul now. The only thing we need to remember is to loosen the dtype restriction for the out= variant.

Comment on lines +408 to +413
TORCH_CHECK(
dtype == out.dtype(),
"multi_dot(): expected out tensor to have dtype ",
dtype,
" but got ",
out.dtype());

For the follow-up PR, we should remember to change this requirement to c10::canCast instead.
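
For context, a Python-level illustration of what the current exact-dtype check means for callers (behavior inferred from the TORCH_CHECK above rather than verified against a specific build):

```python
import torch

A, B, C = torch.randn(4, 6), torch.randn(6, 8), torch.randn(8, 2)

out_ok = torch.empty(4, 2)                       # float32, same dtype as the inputs
torch.linalg.multi_dot((A, B, C), out=out_ok)    # accepted: dtypes match exactly

out_wide = torch.empty(4, 2, dtype=torch.float64)
# Rejected by the exact-equality check, even though float32 -> float64 is a
# safe cast; relaxing the check to c10::canCast would permit this call.
# torch.linalg.multi_dot((A, B, C), out=out_wide)   # RuntimeError: dtype mismatch
```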

heitorschueroff added a commit that referenced this pull request Feb 23, 2021
ghstack-source-id: eeb51368c1261412489cb828d4b0259eaeadf6a9
Pull Request resolved: #51807
@heitorschueroff merged this pull request in 0396f49.

This pull request has been reverted by 92a4ee1.

facebook-github-bot deleted the gh/heitorschueroff/48/head branch February 28, 2021 15:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Pull Request resolved: pytorch#51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021