ENH: array agnostic `cov`, used in `cluster` #19051
I was able to reproduce the improvement in test skipping against main
(though with the usual caveats related to gh-19018 when GPUs are involved).
This looks quite promising to me, thanks @lucascolley. I tested this new `cov` implementation standalone like so:
>>> # ensure to try with `export SCIPY_ARRAY_API=1`
>>> import numpy as np
>>> import cupy as cp
>>> import torch
>>> from scipy._lib._array_api import cov, array_namespace
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
>>> y = cp.array([[0, 2], [1, 1], [2, 0]]).T
>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T
>>> cov(x)
array([[ 1., -1.],
[-1., 1.]])
>>> cov(y)
array([[ 1., -1.],
[-1., 1.]])
>>> cov(t)
Traceback (most recent call last):
Cell In[6], line 1
cov(t)
File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/_array_api.py:174 in cov
dtype = xp.result_type(X, xp.float64)
File ~/code/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/array_api_compat/array_api_compat/torch/_aliases.py:136 in result_type
return torch.result_type(x, y)
TypeError: result_type() received an invalid combination of arguments - got (Tensor, torch.dtype), but expected one of:
* (Tensor tensor, Tensor other)
...
The one little hiccup there is that an integer array doesn't work here for PyTorch. As you can see below, a float64 array works fine:
Performance-wise, things all look good: the generic implementation is pretty competitive, either similar in performance or at most <2x slower (for small arrays and for PyTorch):
>>> %timeit np.cov(x)
27.4 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(x)
50.8 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit cov(y)
186 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
174 µs ± 664 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit torch.cov(t)
23.8 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> # now with large arrays
>>> x = np.random.randn(2, 100_000)
>>> y = cp.array(x)
>>> y.shape
(2, 100000)
>>> t = torch.tensor(x)
>>> t.shape
torch.Size([2, 100000])
>>> %timeit cov(x)
656 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit np.cov(x)
646 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(y)
353 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cp.cov(y)
374 µs ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit cov(t)
289 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.cov(t)
160 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The integer array hiccup is because of `result_type`, not because of `torch.cov` itself:
>>> t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T
>>> torch.cov(t)
tensor([[ 1., -1.],
[-1., 1.]])
>>> cov(t)
...
TypeError: result_type() received an invalid combination of arguments
I do have the sense that when we build out a set of these kinds of functions, we will want to call `torch.cov` here rather than our generic implementation; the generic version would then be a fallback for unknown libraries.
>>> from scipy._lib.array_api_compat.array_api_compat import torch as torch_compat
>>> tc = torch_compat
>>> tc.result_type(tc.float32, tc.float64)
torch.float64
>>> tc.result_type(tc.int32, tc.float64)
...
TypeError: result_type() received an invalid combination of arguments - got (torch.dtype, torch.dtype), ...

This comment in `array_api_compat`'s torch aliases:

# This doesn't result_type(dtype, dtype) for non-array API dtypes
# because torch.result_type only accepts tensors. This does however, allow
# cross-kind promotion.
return torch.result_type(x, y)

is the problem: it is passing dtypes, not tensors. It does work with tensors:

>>> import torch
>>> torch.result_type(torch.tensor([], dtype=torch.int32),
...                   torch.tensor([], dtype=torch.float64))
torch.float64

@lucascolley maybe you can fix this in …
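One possible shape for such a fix, shown here only as a sketch (this is not necessarily the change that landed in `array_api_compat`): wrap bare dtypes in empty tensors before calling `torch.result_type`, since it only accepts tensors.

```python
import torch

def result_type(x, y):
    # Sketch: torch.result_type only accepts tensors, so wrap bare
    # torch.dtype arguments in empty 1-D tensors first. 1-D rather than
    # 0-D, because PyTorch treats 0-D tensors like scalars during type
    # promotion, which could change the promoted dtype.
    if isinstance(x, torch.dtype):
        x = torch.empty((0,), dtype=x)
    if isinstance(y, torch.dtype):
        y = torch.empty((0,), dtype=y)
    return torch.result_type(x, y)
```

With this, both `result_type(tensor, dtype)` and `result_type(dtype, dtype)` go through torch's usual tensor promotion rules.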
@lucascolley if you update the …
After updating:

In [1]: import torch
In [2]: t = torch.tensor([[0, 2], [1, 1], [2, 0]]).T
In [3]: from scipy._lib._array_api import cov
In [4]: cov(t)
Out[4]:
tensor([[ 1., -1.],
        [-1.,  1.]], dtype=torch.float64)
CI failure is in …
All LGTM now, thanks Lucas! In it goes.
if w_sum.shape != avg.shape:
    w_sum = copy(xp.broadcast_to(w_sum, avg.shape), xp=xp)

w_sum = w_sum[0]
While working on gh-19186 I noticed that this isn't used. Is it needed?
No idea, I tried to stay as true to `np.cov` as possible.
OK. Since `w_sum` doesn't appear in the rest of the function, the calculation doesn't seem to be needed anymore. I proposed removing it in lucascolley#7.
Reference issue

Part of #18866.

What does this implement/fix?

`cov` is not in the array API standard. This PR introduces a new `cov` utility, based on `np.cov`, but which only uses functions that are in the standard, so that it is array agnostic.

Note that this is not nearly a replacement for `np.cov`: it just takes an array and returns the covariance. This is all that is required by the functions which use `cov` in `cluster`. A docstring like that of `copy` above it, perhaps including this clarification, may be nice, but I have omitted it for now. Happy to add this in if it is wanted.

With this new helper, three test skips have been removed from `cluster`, and a `TODO` has been addressed.

Additional information

`np.cov` source: https://github.com/numpy/numpy/blob/v1.25.0/numpy/lib/function_base.py#L2530-L2749

`k = np.asarray(k)` is introduced since `rng.standard_normal(size=k)` is not happy with taking `k` as a `numpy.array_api` array.

I have kept in the `Degrees of freedom <= 0` warning from `np.cov`, but this isn't needed for `cluster`.
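For illustration, the core of such a helper might look like the following. This is a minimal sketch only: the actual implementation in this PR follows `np.cov` more closely and differs in details such as dtype promotion via `result_type`.

```python
import warnings
import numpy as np

def cov(x, xp=np):
    # Minimal array-agnostic covariance sketch: x has shape
    # (m variables, n observations); returns the (m, m) covariance
    # matrix with the unbiased n - 1 normalization, as np.cov does
    # by default. Only standard array API operations are used; xp is
    # the array namespace (numpy here for demonstration).
    x = xp.asarray(x, dtype=xp.float64)
    fact = x.shape[1] - 1
    if fact <= 0:
        # mirror np.cov's "Degrees of freedom <= 0" warning
        warnings.warn("Degrees of freedom <= 0 for slice",
                      RuntimeWarning, stacklevel=2)
        fact = 0.0
    x = x - xp.mean(x, axis=1, keepdims=True)  # center each variable
    return (x @ x.T) / fact
```

On the small example from the review above, this agrees with `np.cov`.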