
ENH: array agnostic cov, used in cluster #19051

Merged: 11 commits merged into scipy:main on Sep 6, 2023
Conversation

@lucascolley (Member) commented Aug 11, 2023

Reference issue

Part of #18866.

What does this implement/fix?

cov is not in the array API standard.

This PR introduces a new cov utility, based on np.cov, but which only uses functions that are in the standard, so that it is array agnostic.

Note that this is by no means a full replacement for np.cov: it just takes an array and returns the covariance matrix, which is all that the functions using cov in cluster require. A docstring like that of copy above it, perhaps including this clarification, may be nice, but I have omitted it for now; happy to add it if wanted.
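Roughly, the helper has the following shape (a condensed sketch for illustration only, not the exact code added to scipy/_lib/_array_api.py; it assumes a real-valued 2-D input with rows as variables and columns as observations):

    def cov(x, *, xp=None):
        # Array-agnostic covariance: only functions from the array API standard are used.
        xp = array_namespace(x) if xp is None else xp
        X = xp.asarray(x)
        dtype = xp.result_type(X, xp.float64)  # promote to at least float64
        X = xp.astype(X, dtype)
        avg = xp.mean(X, axis=1)
        fact = X.shape[1] - 1                  # N - 1, matching np.cov's default ddof
        X = X - avg[:, None]                   # centre each variable (row)
        return (X @ X.T) / fact                # covariance matrix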

With this new helper, three test skips have been removed from cluster, and a TODO has been addressed.

Additional information

np.cov source: https://github.com/numpy/numpy/blob/v1.25.0/numpy/lib/function_base.py#L2530-L2749

k = np.asarray(k) is introduced since rng.standard_normal(size=k) does not accept k as a numpy.array_api array.
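For illustration, the workaround looks roughly like this (the surrounding test code and variable names are assumptions; only the np.asarray conversion is the point):

    # k arrives as a numpy.array_api array from the array-agnostic test setup,
    # which rng.standard_normal rejects for size=, so convert it first.
    k = np.asarray(k)
    obs = rng.standard_normal(size=k)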

I have kept the "Degrees of freedom <= 0" warning from np.cov, although it isn't needed for cluster.
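For reference, the retained check looks roughly like this in np.cov (a sketch following the source linked above; fact is the number of observations minus the delta degrees of freedom):

    import warnings

    if fact <= 0:
        warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
        fact = 0.0  # the subsequent division then yields nan/inf rather than raising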

@lucascolley marked this pull request as ready for review on August 11, 2023 at 23:47
scipy/_lib/_array_api.py (outdated review thread, resolved)
@rgommers added the "array types" label (items related to array API support and input array validation, see gh-18286) on Aug 12, 2023
@tylerjereddy (Contributor) left a comment:
I was able to reproduce the improvement in test skipping against main (though with the usual caveats related to gh-19018 when GPUs are involved).

Outdated review threads (resolved) on: scipy/_lib/_array_api.py, scipy/cluster/tests/test_vq.py, scipy/cluster/vq.py
@rgommers (Member) left a comment:

This looks quite promising to me, thanks @lucascolley.

I tested this new cov implementation standalone like so:

>>> # ensure to try with `export SCIPY_ARRAY_API=1`
>>> import numpy as np
>>> import cupy as cp
>>> import torch
>>> from scipy._lib._array_api import cov, array_namespace

>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
>>> y = cp.array([[0, 2], [1, 1], [2, 0]]).T
>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T

>>> cov(x)
array([[ 1., -1.],
       [-1.,  1.]])
>>> cov(y)
array([[ 1., -1.],
       [-1.,  1.]])
>>> cov(t)
Traceback (most recent call last):
  Cell In[6], line 1
    cov(t)
  File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/_array_api.py:174 in cov
    dtype = xp.result_type(X, xp.float64)
  File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/array_api_compat/array_api_compat/torch/_aliases.py:136 in result_type
    return torch.result_type(x, y)
TypeError: result_type() received an invalid combination of arguments - got (Tensor, torch.dtype), but expected one of:
 * (Tensor tensor, Tensor other)
...

The one little hiccup there is that an integer array doesn't work for PyTorch. As the timings below show, a float64 array works fine.

Performance-wise things all look good: the generic implementation is pretty competitive, either similar in performance or at most <2x slower (for small arrays and for PyTorch):

>>> %timeit np.cov(x)
27.4 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(x)
50.8 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(y)
186 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
174 µs ± 664 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit torch.cov(t)
23.8 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> # now with large arrays
>>> x = np.random.randn(2, 100_000)
>>> y = cp.array(x)
>>> y.shape
(2, 100000)
>>> t = torch.tensor(x)
>>> t.shape
torch.Size([2, 100000])
>>> %timeit cov(x)
656 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit np.cov(x)
646 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(y)
353 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
374 µs ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(t)
289 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.cov(t)
160 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The integer array hiccup is because of result_type, not because of torch.cov itself:

>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T
>>> torch.cov(t)
tensor([[ 1., -1.],
        [-1.,  1.]])
>>> cov(t)
...
TypeError: result_type() received an invalid combination of arguments

I do have the sense that when we build out a set of these kinds of functions, we want to be calling torch.cov here rather than our generic implementation - that'd be a fallback for unknown libraries.
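For illustration, that kind of dispatch might look something like this (a hypothetical sketch, not something this PR implements; the namespace check and the _generic_cov name are made up):

    def cov(x):
        xp = array_namespace(x)
        if xp.__name__.endswith("torch"):   # hypothetical check for the torch namespace
            import torch
            return torch.cov(x)             # defer to the native, fastest implementation
        return _generic_cov(x, xp=xp)       # array-agnostic fallback for unknown libraries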

@rgommers (Member) commented:
>>> from scipy._lib.array_api_compat.array_api_compat import torch as torch_compat
>>> tc = torch_compat
>>> tc.result_type(tc.float32, tc.float64)
torch.float64
>>> tc.result_type(tc.int32, tc.float64)
...
TypeError: result_type() received an invalid combination of arguments - got (torch.dtype, torch.dtype), but expected one of: ...

This comment in array_api_compat seems to want to do the right thing, but the implementation is wrong:

    # This doesn't result_type(dtype, dtype) for non-array API dtypes
    # because torch.result_type only accepts tensors. This does however, allow
    # cross-kind promotion.
    return torch.result_type(x, y)

It is passing dtypes, not tensors. It does work with tensors:

>>> import torch
>>> torch.result_type(torch.tensor([], dtype=torch.int32),
...                   torch.tensor([], dtype=torch.float64))
torch.float64

@lucascolley maybe you can fix this in array_api_compat? And then we can update our git submodule with the fix in this PR.
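One possible shape for such a fix (a sketch only, mirroring the empty-tensor trick above; the actual array_api_compat patch may look different):

    import torch

    def result_type(x, y):
        # torch.result_type only accepts tensors (or tensor/scalar pairs),
        # so wrap bare dtypes in empty tensors before delegating to it.
        if isinstance(x, torch.dtype):
            x = torch.tensor([], dtype=x)
        if isinstance(y, torch.dtype):
            y = torch.tensor([], dtype=y)
        return torch.result_type(x, y)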

@rgommers (Member) commented Sep 6, 2023:

@lucascolley if you update the array_api_compat git submodule so it contains the fix we need here, then I think this is ready to merge.

@lucascolley (Member, Author) commented:
After updating array_api_compat the problem above is solved:

In [1]: import torch

In [2]: t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T

In [3]: from scipy._lib._array_api import cov

In [4]: cov(t)
Out[4]: 
tensor([[ 1., -1.],
        [-1.,  1.]], dtype=torch.float64)

@lucascolley (Member, Author) commented:
CI failure is in scipy/odr, looks unrelated?

@rgommers added this to the 1.12.0 milestone on Sep 6, 2023
@rgommers (Member) left a comment:
All LGTM now, thanks Lucas! In it goes.

@rgommers merged commit 31ebea4 into scipy:main on Sep 6, 2023
21 of 22 checks passed
if w_sum.shape != avg.shape:
    w_sum = copy(xp.broadcast_to(w_sum, avg.shape), xp=xp)

w_sum = w_sum[0]
A contributor commented:
While working on gh-19186 I noticed that this isn't used. Is it needed?

@lucascolley (Member, Author) replied:
No idea, I tried to stay as true to np.cov as possible.

The contributor replied:
OK. Since w_sum doesn't appear in the rest of the function, the calculation doesn't seem to be needed anymore. I proposed removing it in lucascolley#7.

Labels: array types (items related to array API support and input array validation, see gh-18286), scipy.cluster, scipy._lib
Projects: none yet · Linked issues: none yet · 5 participants