
ENH: array agnostic cov, used in cluster #19051

Merged: 11 commits merged into scipy:main on Sep 6, 2023
Conversation

@lucascolley (Member) commented Aug 11, 2023

Reference issue

Part of #18866.

What does this implement/fix?

cov is not in the array API standard.

This PR introduces a new cov utility, based on np.cov, but which only uses functions that are in the standard, so that it is array agnostic.

Note that this is by no means a full replacement for np.cov: it just takes an array and returns the covariance matrix, which is all that the functions using cov in cluster require. A docstring like that of copy above it, perhaps including this clarification, may be nice, but I have omitted it for now; happy to add it if wanted.
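Roughly, the helper has the following shape (a condensed sketch for illustration only, not the exact code added to scipy/_lib/_array_api.py; it assumes a real-valued 2-D input with rows as variables and columns as observations):

    def cov(x, *, xp=None):
        # Array-agnostic covariance: only functions from the array API standard are used.
        xp = array_namespace(x) if xp is None else xp
        X = xp.asarray(x)
        dtype = xp.result_type(X, xp.float64)  # promote to at least float64
        X = xp.astype(X, dtype)
        avg = xp.mean(X, axis=1)
        fact = X.shape[1] - 1                  # N - 1, matching np.cov's default ddof
        X = X - avg[:, None]                   # centre each variable (row)
        return (X @ X.T) / fact                # covariance matrix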

With this new helper, three test skips have been removed from cluster, and a TODO has been addressed.

Additional information

np.cov source: https://github.com/numpy/numpy/blob/v1.25.0/numpy/lib/function_base.py#L2530-L2749

k = np.asarray(k) is introduced since rng.standard_normal(size=k) does not accept k as a numpy.array_api array.
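For illustration, the workaround looks roughly like this (the surrounding test code and variable names are assumptions; only the np.asarray conversion is the point):

    # k arrives as a numpy.array_api array from the array-agnostic test setup,
    # which rng.standard_normal rejects for size=, so convert it first.
    k = np.asarray(k)
    obs = rng.standard_normal(size=k)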

I have kept the "Degrees of freedom <= 0" warning from np.cov, although it isn't needed for cluster.
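For reference, the retained check looks roughly like this in np.cov (a sketch following the source linked above; fact is the number of observations minus the delta degrees of freedom):

    import warnings

    if fact <= 0:
        warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
        fact = 0.0  # the subsequent division then yields nan/inf rather than raising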

@lucascolley marked this pull request as ready for review on August 11, 2023 at 23:47
scipy/_lib/_array_api.py (outdated review thread, resolved)
@rgommers added the "array types" label (items related to array API support and input array validation, see gh-18286) on Aug 12, 2023
@tylerjereddy (Contributor) left a comment:
I was able to reproduce the improvement in test skipping against main (though with the usual caveats related to gh-19018 when GPUs are involved).

Outdated review threads (resolved) on: scipy/_lib/_array_api.py, scipy/cluster/tests/test_vq.py, scipy/cluster/vq.py
@rgommers (Member) left a comment:

This looks quite promising to me, thanks @lucascolley.

I tested this new cov implementation standalone like so:

>>> # ensure to try with `export SCIPY_ARRAY_API=1`
>>> import numpy as np
>>> import cupy as cp
>>> import torch
>>> from scipy._lib._array_api import cov, array_namespace

>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
>>> y = cp.array([[0, 2], [1, 1], [2, 0]]).T
>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T

>>> cov(x)
array([[ 1., -1.],
       [-1.,  1.]])
>>> cov(y)
array([[ 1., -1.],
       [-1.,  1.]])
>>> cov(t)
Traceback (most recent call last):
  Cell In[6], line 1
    cov(t)
  File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/_array_api.py:174 in cov
    dtype = xp.result_type(X, xp.float64)
  File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/array_api_compat/array_api_compat/torch/_aliases.py:136 in result_type
    return torch.result_type(x, y)
TypeError: result_type() received an invalid combination of arguments - got (Tensor, torch.dtype), but expected one of:
 * (Tensor tensor, Tensor other)
...

The one little hiccup there is that an integer array doesn't work for PyTorch. As the timings below show, a float64 array works fine.

Performance-wise things all look good: the generic implementation is pretty competitive, either similar in performance or at most <2x slower (for small arrays and for PyTorch):

>>> %timeit np.cov(x)
27.4 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(x)
50.8 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(y)
186 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
174 µs ± 664 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit torch.cov(t)
23.8 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> # now with large arrays
>>> x = np.random.randn(2, 100_000)
>>> y = cp.array(x)
>>> y.shape
(2, 100000)
>>> t = torch.tensor(x)
>>> t.shape
torch.Size([2, 100000])
>>> %timeit cov(x)
656 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit np.cov(x)
646 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(y)
353 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
374 µs ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(t)
289 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.cov(t)
160 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The integer array hiccup is because of result_type, not because of torch.cov itself:

>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T
>>> torch.cov(t)
tensor([[ 1., -1.],
        [-1.,  1.]])
>>> cov(t)
...
TypeError: result_type() received an invalid combination of arguments

I do have the sense that when we build out a set of these kinds of functions, we want to be calling torch.cov here rather than our generic implementation - that'd be a fallback for unknown libraries.
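For illustration, that kind of dispatch might look something like this (a hypothetical sketch, not something this PR implements; the namespace check and the _generic_cov name are made up):

    def cov(x):
        xp = array_namespace(x)
        if xp.__name__.endswith("torch"):   # hypothetical check for the torch namespace
            import torch
            return torch.cov(x)             # defer to the native, fastest implementation
        return _generic_cov(x, xp=xp)       # array-agnostic fallback for unknown libraries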

@rgommers (Member) commented:
>>> from scipy._lib.array_api_compat.array_api_compat import torch as torch_compat
>>> tc = torch_compat
>>> tc.result_type(tc.float32, tc.float64)
torch.float64
>>> tc.result_type(tc.int32, tc.float64)
...
TypeError: result_type() received an invalid combination of arguments - got (torch.dtype, torch.dtype), but expected one of: ...

This comment in array_api_compat seems to want to do the right thing, but the implementation is wrong:

    # This doesn't result_type(dtype, dtype) for non-array API dtypes
    # because torch.result_type only accepts tensors. This does however, allow
    # cross-kind promotion.
    return torch.result_type(x, y)

It is passing dtypes, not tensors. It does work with tensors:

>>> import torch
>>> torch.result_type(torch.tensor([], dtype=torch.int32),
...                   torch.tensor([], dtype=torch.float64))
torch.float64

@lucascolley maybe you can fix this in array_api_compat? And then we can update our git submodule with the fix in this PR.
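One possible shape for such a fix (a sketch only, mirroring the empty-tensor trick above; the actual array_api_compat patch may look different):

    import torch

    def result_type(x, y):
        # torch.result_type only accepts tensors (or tensor/scalar pairs),
        # so wrap bare dtypes in empty tensors before delegating to it.
        if isinstance(x, torch.dtype):
            x = torch.tensor([], dtype=x)
        if isinstance(y, torch.dtype):
            y = torch.tensor([], dtype=y)
        return torch.result_type(x, y)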

@rgommers (Member) commented Sep 6, 2023:

@lucascolley if you update the array_api_compat git submodule so it contains the fix we need here, then I think this is ready to merge.

@lucascolley (Member, Author) commented:
After updating array_api_compat the problem above is solved:

In [1]: import torch

In [2]: t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T

In [3]: from scipy._lib._array_api import cov

In [4]: cov(t)
Out[4]: 
tensor([[ 1., -1.],
        [-1.,  1.]], dtype=torch.float64)

@lucascolley (Member, Author) commented:
CI failure is in scipy/odr, looks unrelated?

@rgommers added this to the 1.12.0 milestone on Sep 6, 2023
@rgommers (Member) left a comment:
All LGTM now, thanks Lucas! In it goes.

@rgommers merged commit 31ebea4 into scipy:main on Sep 6, 2023
21 of 22 checks passed
if w_sum.shape != avg.shape:
    w_sum = copy(xp.broadcast_to(w_sum, avg.shape), xp=xp)

w_sum = w_sum[0]
A contributor commented:
While working on gh-19186 I noticed that this isn't used. Is it needed?

@lucascolley (Member, Author) replied:
No idea, I tried to stay as true to np.cov as possible.

The contributor replied:
OK. Since w_sum doesn't appear in the rest of the function, the calculation doesn't seem to be needed anymore. I proposed removing it in lucascolley#7.

Labels: array types (items related to array API support and input array validation, see gh-18286), scipy.cluster, scipy._lib
Projects: none yet · Linked issues: none yet · 5 participants