
Implemented torch.linalg.multi_dot #51807

Closed

Conversation

heitorschueroff (Contributor) commented Feb 5, 2021

Stack from ghstack:

Differential Revision: D26375734

Implemented torch.linalg.multi_dot similar to numpy.linalg.multi_dot.

This function does not support broadcasting or batched inputs at the moment.

NOTE
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions, whereas here we only allow them to be either 1D or 2D.
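
For orientation, a minimal usage sketch (my own illustration, not code from this PR; the shapes and printed results are assumptions based on the documented semantics):

```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)

# Equivalent to A @ B @ C, but the multiplication order is chosen to
# minimize the overall cost of the chain.
out = torch.linalg.multi_dot((A, B, C))
print(out.shape)  # torch.Size([10, 50])

# The first and last tensors may also be 1D; they are treated as a row
# vector and a column vector respectively, and those dimensions are
# squeezed out of the result.
v = torch.randn(10)
w = torch.randn(50)
print(torch.linalg.multi_dot((v, A, B, C, w)).shape)  # torch.Size([])
```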

BENCHMARK
In the following benchmarks, the labels are the sizes k0 x k1 x k2 x k3 and the matrices being multiplied have shapes (k0, k1) x (k1, k2) x (k2, k3) x (k3, k0). The baseline simply multiplies the tensors left to right in PyTorch.
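
The gap between the baseline and multi_dot comes from the multiplication order: multi_dot picks a parenthesization that minimizes the estimated number of scalar multiplications (the classic matrix-chain-ordering problem, which is also what numpy.linalg.multi_dot documents), while the baseline always multiplies left to right. Below is a rough sketch of that cost model using one of the shapes from the table; it only illustrates the idea, it is not the PR's implementation, and memory traffic means the FLOP ratio does not translate one-to-one into the measured speedups:

```python
def left_to_right_cost(dims):
    """Scalar multiplications when multiplying strictly left to right.

    dims has length n + 1; matrix i has shape (dims[i], dims[i + 1]).
    """
    cost, rows = 0, dims[0]
    for k in range(1, len(dims) - 1):
        cost += rows * dims[k] * dims[k + 1]
    return cost


def optimal_chain_cost(dims):
    """Minimum scalar multiplications over all parenthesizations (classic DP)."""
    n = len(dims) - 1
    m = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]


# Shapes from the "2247 x 2 x 8050 x 74" row: (2247, 2) x (2, 8050) x (8050, 74) x (74, 2247)
dims = [2247, 2, 8050, 74, 2247]
print(left_to_right_cost(dims))   # 1,748,341,266 multiplications
print(optimal_chain_cost(dims))   # 11,621,974 multiplications, ~150x fewer
```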

cpu results

[-------------------------------  -------------------------------]
                                 |  Baseline  |  PyTorch  |  NumPy
1 threads: -------------------------------------------------------
      725  x 257  x 151  x 49    |    1500    |     900   |   1040
      81   x 1772 x 83   x 37    |     430    |     320   |    390
      3    x 3022 x 78   x 1598  |     110    |     120   |    200
      1082 x 2    x 78   x 5     |     322    |     400   |    630
      7    x 35   x 4077 x 63    |     140    |     140   |    200
      190  x 56   x 8984 x 2     |    2800    |     200   |    230
      79   x 311  x 22   x 500   |     161    |      60   |    100
      2247 x 2    x 8050 x 74    |   89100    |    3190   |   2760
      11   x 123  x 2    x 201   |      11    |      10   |     50
      54   x 1057 x 38   x 3838  |     600    |     330   |    400
8 threads: -------------------------------------------------------
      725  x 257  x 151  x 49    |     280    |     190   |    500
      81   x 1772 x 83   x 37    |     100    |      90   |    200
      3    x 3022 x 78   x 1598  |     600    |     600   |    600
      1082 x 2    x 78   x 5     |     100    |      80   |    910
      7    x 35   x 4077 x 63    |     130    |     100   |    160
      190  x 56   x 8984 x 2     |     900    |      99   |    150
      79   x 311  x 22   x 500   |      88    |      68   |    110
      2247 x 2    x 8050 x 74    |   18000    |    1000   |   3000
      11   x 123  x 2    x 201   |      11    |      11   |     45
      54   x 1057 x 38   x 3838  |     200    |     110   |    150

Times are in microseconds (us).

cuda results

[--------------------------  --------------------------]
                                 |  Baseline  |  PyTorch
1 threads: ---------------------------------------------
      725  x 257  x 151  x 49    |     62     |     40  
      81   x 1772 x 83   x 37    |     59     |     61  
      3    x 3022 x 78   x 1598  |     42     |     59  
      1082 x 2    x 78   x 5     |     36     |    181  
      7    x 35   x 4077 x 63    |    111     |    101  
      190  x 56   x 8984 x 2     |    112     |     60  
      79   x 311  x 22   x 500   |     34     |     35  
      2247 x 2    x 8050 x 74    |    910     |    175  
      11   x 123  x 2    x 201   |     34     |     38  
      54   x 1057 x 38   x 3838  |    139     |    133  

Times are in microseconds (us).

script

from torch.utils import benchmark
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor

fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k2', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k3', minval=1, maxval=10000, distribution='loguniform'),
    ],
    tensors=[
        FuzzedTensor('a', size=('k0', 'k1'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('b', size=('k1', 'k2'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('c', size=('k2', 'k3'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
        FuzzedTensor('d', size=('k3', 'k0'), min_elements=128, max_elements=1000000, probability_contiguous=0.6),
    ],
    seed=0,
)

results = []
for tensors, tensor_params, params in fuzzer.take(10):
    # description is the column label
    sub_label = f"{params['k0']:<4} x {params['k1']:<4} x {params['k2']:<4} x {params['k3']:<4}"
    for num_threads in (1, 8):
        results.append(benchmark.Timer(
            stmt='a @ b @ c @ d',
            globals=tensors,
            sub_label=sub_label,
            description='Baseline',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='torch.linalg.multi_dot((a, b, c, d))',
            globals=tensors,
            sub_label=sub_label,
            description='PyTorch',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='np.linalg.multi_dot((a, b, c, d))',
            setup='import numpy as np',
            globals=tensors,
            sub_label=sub_label,
            description='NumPy',
            num_threads=num_threads,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()

results = []
for tensors, tensor_params, params in fuzzer.take(10):
    sub_label = f"{params['k0']:<4} x {params['k1']:<4} x {params['k2']:<4} x {params['k3']:<4}"
    tensors = {k: tensors[k].cuda() for k in tensors.keys()}
    results.append(benchmark.Timer(
        stmt='a @ b @ c @ d',
        globals=tensors,
        sub_label=sub_label,
        description='Baseline',
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='torch.linalg.multi_dot((a, b, c, d))',
        globals=tensors,
        sub_label=sub_label,
        description='PyTorch',
    ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()

TODO

- [x] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

heitorschueroff added a commit that referenced this pull request Feb 5, 2021
ghstack-source-id: c3e939b78950eaca6376c8675a263a6f78d1d0fd
Pull Request resolved: #51807
facebook-github-bot (Contributor) commented Feb 5, 2021

💊 CI failures summary and remediations

As of commit 905ce1b (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mruberry (Collaborator) commented Feb 8, 2021

> NOTE
> numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.

This might be logically implied and seems like a reasonable limitation, but I don't think the docs actually say these must be 1 or 2D?

https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html

As you mentioned offline, PyTorch's dot (https://pytorch.org/docs/master/generated/torch.dot.html?highlight=dot#torch.dot) is distinct from NumPy's dot (https://numpy.org/doc/stable/reference/generated/numpy.dot.html). In particular, PyTorch's dot wouldn't support the multi_dot use case. Let's be sure to be clear about that.
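
To make the distinction concrete, a small sketch (mine, not from the PR): numpy.dot on 2D inputs performs matrix multiplication, whereas torch.dot only accepts 1D tensors, so the 2D steps of a multi_dot chain have to go through matmul/mm instead.

```python
import numpy as np
import torch

a, b = np.random.randn(2, 3), np.random.randn(3, 4)
np.dot(a, b)              # NumPy: dot on 2D x 2D is matrix multiplication -> shape (2, 4)

x, y = torch.randn(5), torch.randn(5)
torch.dot(x, y)           # PyTorch: dot is only the inner product of two 1D tensors

ta, tb = torch.randn(2, 3), torch.randn(3, 4)
# torch.dot(ta, tb)       # would raise a RuntimeError: torch.dot expects 1D tensors
torch.matmul(ta, tb)      # 2D multiplication goes through matmul / the @ operator
```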

multi_dot = _add_docstr(_linalg.linalg_multi_dot, r"""
linalg.multi_dot(tensors, *, out=None)

Reviewer's note: the use of "tensors" here is consistent with the language in other functions that accept a list of tensors, like torch.cat (https://pytorch.org/docs/master/generated/torch.cat.html?highlight=cat#torch.cat)

mruberry requested review from IvanYashchuk and removed request for albanD, soulitzer and glaringlee on February 8, 2021 09:28

mruberry (Collaborator) commented:

mypy issues aside this looks pretty good to me, I made a few comments. @IvanYashchuk, would you like to take another look?

IvanYashchuk (Collaborator) left a comment:

I am happy with the fixes in the implementation and that there is no code duplication with chain_matmul now. The only thing we need to remember is to loosen the dtype restriction for the out= variant.

Comment on lines +408 to +413
TORCH_CHECK(
dtype == out.dtype(),
"multi_dot(): expected out tensor to have dtype ",
dtype,
" but got ",
out.dtype());

For the follow-up PR, we should remember to change this requirement to c10::canCast instead.
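
For context, a Python-level illustration of what the current exact-dtype check means for callers (behavior inferred from the TORCH_CHECK above rather than verified against a specific build):

```python
import torch

A, B, C = torch.randn(4, 6), torch.randn(6, 8), torch.randn(8, 2)

out_ok = torch.empty(4, 2)                       # float32, same dtype as the inputs
torch.linalg.multi_dot((A, B, C), out=out_ok)    # accepted: dtypes match exactly

out_wide = torch.empty(4, 2, dtype=torch.float64)
# Rejected by the exact-equality check, even though float32 -> float64 is a
# safe cast; relaxing the check to c10::canCast would permit this call.
# torch.linalg.multi_dot((A, B, C), out=out_wide)   # RuntimeError: dtype mismatch
```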

heitorschueroff added a commit that referenced this pull request Feb 23, 2021
ghstack-source-id: eeb51368c1261412489cb828d4b0259eaeadf6a9
Pull Request resolved: #51807
@heitorschueroff merged this pull request in 0396f49.

This pull request has been reverted by 92a4ee1.

facebook-github-bot deleted the gh/heitorschueroff/48/head branch February 28, 2021 15:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Pull Request resolved: pytorch#51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021