Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: np-coercible array-likes are not accepted with array API flag set #19118

Closed
lucascolley opened this issue Aug 23, 2023 · 13 comments · Fixed by #19187
Closed

BUG: np-coercible array-likes are not accepted with array API flag set #19118

lucascolley opened this issue Aug 23, 2023 · 13 comments · Fixed by #19187
Labels
array types Items related to array API support and input array validation (see gh-18286) defect A clear bug or issue that prevents SciPy from being installed or used as expected
Milestone

Comments

@lucascolley
Copy link
Member

lucascolley commented Aug 23, 2023

Describe your issue.

From offline discussion with @rgommers, behaviour with np-coercible array-likes should not have regressed in #18668, but it has.

One potential solution is to attempt to use np.asarray in compliance_scipy on types unrecognised by array-api-compat, a (definitely sub-optimal but I think correct) draft of which I have here: lucascolley@7ecfcc6.

Reproducing Code Example

>>> python dev.py ipython

>>> from scipy.cluster.vq import whiten

>>> whiten([[1.9, 2.3, 1.7],
...:       [1.5, 2.5, 2.2],
...:       [0.8, 0.6, 1.7,]])

# reset

>>> SCIPY_ARRAY_API=True python dev.py ipython

>>> from scipy.cluster.vq import whiten

>>> whiten([[1.9, 2.3, 1.7],
...:       [1.5, 2.5, 2.2],
...:       [0.8, 0.6, 1.7,]])

Error message

(scipy-dev) lucascolley@Lucass-MacBook-Air-2 scipy % python dev.py ipython 
💻  ninja -C /Users/lucascolley/programming/scipy/build -j8
ninja: Entering directory `/Users/lucascolley/programming/scipy/build'
[4/4] Generating scipy/generate-version with a custom command
Build OK
💻  meson install -C build --only-changed
Installing, see meson-install.log...
Installation OK
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from scipy.cluster.vq import whiten

In [2]: whiten([[1.9, 2.3, 1.7],
   ...:         [1.5, 2.5, 2.2],
   ...:         [0.8, 0.6, 1.7,]])
Out[2]: 
array([[4.17944278, 2.69811351, 7.21248917],
       [3.29956009, 2.93273208, 9.33380951],
       [1.75976538, 0.7038557 , 7.21248917]])


zsh: suspended  python dev.py ipython
(scipy-dev) lucascolley@Lucass-MacBook-Air-2 scipy % SCIPY_ARRAY_API=True python dev.py ipython
💻  ninja -C /Users/lucascolley/programming/scipy/build -j8
ninja: Entering directory `/Users/lucascolley/programming/scipy/build'
[4/4] Generating scipy/generate-version with a custom command
Build OK
💻  meson install -C build --only-changed
Installing, see meson-install.log...
Installation OK
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from scipy.cluster.vq import whiten

In [2]: whiten([[1.9, 2.3, 1.7],
   ...:         [1.5, 2.5, 2.2],
   ...:         [0.8, 0.6, 1.7,]])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 whiten([[1.9, 2.3, 1.7],
      2         [1.5, 2.5, 2.2],
      3         [0.8, 0.6, 1.7,]])

File ~/programming/scipy/build-install/lib/python3.10/site-packages/scipy/cluster/vq.py:134, in whiten(obs, check_finite)
     87 def whiten(obs, check_finite=True):
     88     """
     89     Normalize a group of observations on a per feature basis.
     90 
   (...)
    132 
    133     """
--> 134     xp = array_namespace(obs)
    135     obs = as_xparray(obs, check_finite=check_finite, xp=xp)
    136     std_dev = xp.std(obs, axis=0)

File ~/programming/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/_array_api.py:93, in array_namespace(*arrays)
     89     return array_api_compat_numpy
     91 arrays = [array for array in arrays if array is not None]
---> 93 compliance_scipy(*arrays)
     95 return array_api_compat.array_namespace(*arrays)

File ~/programming/scipy/build-install/lib/python3.10/site-packages/scipy/_lib/_array_api.py:46, in compliance_scipy(*arrays)
     44     raise TypeError("'numpy.matrix' are not supported")
     45 elif not array_api_compat.is_array_api_obj(array):
---> 46     raise TypeError("Only support Array API compatible arrays")
     47 elif array.dtype is np.dtype('O'):
     48     raise TypeError('object arrays are not supported')

TypeError: Only support Array API compatible arrays

SciPy/NumPy/Python version and system information

1.12.0.dev0+1533.23ed5fe 1.25.2 sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /Users/lucascolley/mambaforge/envs/scipy-dev/include
    lib directory: /Users/lucascolley/mambaforge/envs/scipy-dev/lib
    name: openblas
    openblas configuration: USE_64BITINT=0 DYNAMIC_ARCH=0 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK=0 NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP=1 VORTEX MAX_THREADS=128
    pc file directory: /Users/lucascolley/mambaforge/envs/scipy-dev/lib/pkgconfig
    version: 0.3.23
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /Users/lucascolley/mambaforge/envs/scipy-dev/include
    lib directory: /Users/lucascolley/mambaforge/envs/scipy-dev/lib
    name: openblas
    openblas configuration: USE_64BITINT=0 DYNAMIC_ARCH=0 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK=0 NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP=1 VORTEX MAX_THREADS=128
    pc file directory: /Users/lucascolley/mambaforge/envs/scipy-dev/lib/pkgconfig
    version: 0.3.23
  pybind11:
    detection method: pkgconfig
    include directory: /Users/lucascolley/mambaforge/envs/scipy-dev/include
    name: pybind11
    version: 2.11.1
Compilers:
  c:
    args: -ftree-vectorize, -fPIC, -fPIE, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/lucascolley/mambaforge/envs/scipy-dev/include, -D_FORTIFY_SOURCE=2, -isystem,
      /Users/lucascolley/mambaforge/envs/scipy-dev/include
    commands: arm64-apple-darwin20.0.0-clang
    linker: ld64
    linker args: -Wl,-pie, -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs,
      -Wl,-rpath,/Users/lucascolley/mambaforge/envs/scipy-dev/lib, -L/Users/lucascolley/mambaforge/envs/scipy-dev/lib,
      -ftree-vectorize, -fPIC, -fPIE, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/lucascolley/mambaforge/envs/scipy-dev/include, -D_FORTIFY_SOURCE=2, -isystem,
      /Users/lucascolley/mambaforge/envs/scipy-dev/include
    name: clang
    version: 15.0.7
  c++:
    args: -ftree-vectorize, -fPIC, -fPIE, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include
    commands: arm64-apple-darwin20.0.0-clang++
    linker: ld64
    linker args: -Wl,-pie, -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs,
      -Wl,-rpath,/Users/lucascolley/mambaforge/envs/scipy-dev/lib, -L/Users/lucascolley/mambaforge/envs/scipy-dev/lib,
      -ftree-vectorize, -fPIC, -fPIE, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include
    name: clang
    version: 15.0.7
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.0
  fortran:
    args: -march=armv8.3-a, -ftree-vectorize, -fPIC, -fno-stack-protector, -O2, -pipe,
      -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include
    commands: /Users/lucascolley/mambaforge/envs/scipy-dev/bin/arm64-apple-darwin20.0.0-gfortran
    linker: ld64
    linker args: -Wl,-pie, -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs,
      -Wl,-rpath,/Users/lucascolley/mambaforge/envs/scipy-dev/lib, -L/Users/lucascolley/mambaforge/envs/scipy-dev/lib,
      -march=armv8.3-a, -ftree-vectorize, -fPIC, -fno-stack-protector, -O2, -pipe,
      -isystem, /Users/lucascolley/mambaforge/envs/scipy-dev/include
    name: gcc
    version: 12.3.0
  pythran:
    include directory: ../../../mambaforge/envs/scipy-dev/lib/python3.10/site-packages/pythran
    version: 0.13.1
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  cross-compiled: false
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/lucascolley/mambaforge/envs/scipy-dev/bin/python3.10
  version: '3.10'
@lucascolley lucascolley added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Aug 23, 2023
@lucascolley
Copy link
Member Author

cc @mdhaber since this bug was discovered in #19023.

@j-bowhay j-bowhay added the array types Items related to array API support and input array validation (see gh-18286) label Aug 23, 2023
@mdhaber
Copy link
Contributor

mdhaber commented Aug 27, 2023

https://scipy.github.io/devdocs/dev/api-dev/array_api.html currently says about the flag:

This both enables array API standard support and the more strict input validation for array-like arguments.

And that the standard

allows users to use any Array API compatible array library with SciPy

Is that what we want it to do? That is, without the flag, original behavior is maintained - we always use vanilla NumPy - and with the flag, we use array_namespace to use the right array library and do more strict input validation?

One potential solution is to attempt to use np.asarray in compliance_scipy on types unrecognised by array-api-compat

Maybe, but maybe it should be asanyarray, or whatever the function used before it was updated to use array_namespace. For backward compatibility, it shouldn't suddenly start converting matrices and naked arrays to arrays.

@lucascolley
Copy link
Member Author

Looking at the code, the refusal of array-likes seems intentional to me. However, the RFC is pretty clear:

Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with np.asarray.

But yes, if it is the case that some array-likes are not currently "converted to NumPy arrays", then this quotation seems to just be saying 'preserve the current behaviour'.

The parts of linalg that I have been looking at recently do seem to use np.asarray rather than np.asanyarray, but if this is not the case for the whole project, then some more thought may need to be put in to how we realise this "more strict input validation". It would certainly be nice to have consistent validation across the whole project, but I'm not sure how difficult this will be to achieve, when backwards-compatibility and release-cycles are considered.

@tupui
Copy link
Member

tupui commented Aug 27, 2023

It's was indeed an intentional behavior as Matt describes.

We had a few iterations on that during the first PR and said that the new behavior, flag ON, should only accept standard arrays. Hence fail with array like and we don't coerce (also coercing would mean that we would make an implicit decision in case of GPU).

Often times, we see that in tests, the use of lists is just a convenience and not really motivated.

(I am not opposed to any change, just explaining the background of the current state.)

@rgommers
Copy link
Member

But yes, if it is the case that some array-likes are not currently "converted to NumPy arrays", then this quotation seems to just be saying 'preserve the current behaviour'.

Indeed. I could have spelled this out even clearer in the RFC I think (using the word "array-like"), but my intent was not to change how array-like's work today. I only intended for known-problematic cases like np.matrix and masked arrays to be rejected. Not allowing lists and other commonly used input types would be a very large backwards compatibility break; probably large enough to make it much more difficult to transition to the new behavior as the default one.

or whatever the function used before it was updated to use array_namespace. For backward compatibility,

Either this, or we may want to use asanyarray consistently in other to allow subclasses now that those are made "safe". A I had the former in mind, but both would be reasonable.

We had a few iterations on that during the first PR and said that the new behavior, flag ON, should only accept standard arrays. Hence fail with array like

I think I missed that discussion during review of gh-18668, and I cannot find it back there. Maybe it's in one of the many resolved comments; that's pretty hard to search for.

@mdhaber
Copy link
Contributor

mdhaber commented Aug 28, 2023

masked arrays to be rejected

While we're on this again, can we add parameters to array_namespace to allow masked arrays and object arrays? Many stats functions have special support for masked arrays, and object arrays that contain mpmath numbers are useful.

@rgommers
Copy link
Member

Those are two separate topics, which I think need qualitatively different solutions:

  • Object arrays are for testing purposes only, and you'd want them for (potentially) any APIs, right? You're mostly interested in stats and special, but it could be useful elsewhere too for testing. I think this is probably best done with a private environment variable that's only used inside of the if is_object_array: check.
  • The APIs in stats.mstats that explicitly support masked arrays will need something more robust probably - indeed either a modification to array_namespace or a separate function to replace array_namespace.
    • Also, do we have functions in stats that need to support masked arrays? I'm asking about API, not implementation. I'm aware that there's shared implementations between stats and mstats, but I think that ideally all the masked array accepting public APIs would be in mstats only.

@mdhaber
Copy link
Contributor

mdhaber commented Aug 28, 2023

Object Arrays

Object arrays are for testing purposes only,

Initially, that's fine. Eventually, I think there is value in making this public.

It is time consuming to ensure that functions return accurate results for all valid inputs. For example, I just merged gh-18714 after dozens of similar PRs. Going forward, I think it is more efficient to use arbitrary precision arithmetic to help us measure and document the reliable domains of float64 implementations, and we'd allow users to rely on arbitrary precision arithmetic for values outside the documented domain. The reliable domains of float64 implementations can be expanded as time permits (e.g. when there are no bugs to fix) or when there are both interested contributors and reviewers.

you'd want them for (potentially) any APIs, right.

Yes. It would be similar to the Array API. For some functions, it would be possible to support arbitrary precision end to end; for others, it would not, and we either raise an error (better, I think) or continue in finite precision.


Masked Arrays

I'm asking about API, not implementation.

Use of masked arrays for implementation (e.g. of NaN policy) has led to a lot of reported bugs. It can be done correctly, but it is not recommended because it has proven tricky to do so.

I think that ideally all the masked array accepting public APIs would be in mstats only.

That's fine if we can still declare mstats to be legacy. But resources are not currently sufficient to maintain parallel versions of scipy.stats, especially if the only reason to do so would be the moderate speed advantage of masked arrays in some cases (e.g. many short slices). If we need to continue to maintain functions that support masked arrays, the maintained versions should be the stats ones with the _axis_nan_policy wrapper.

@rgommers
Copy link
Member

If we need to continue to maintain functions that support masked arrays, the maintained versions should be the stats ones with the _axis_nan_policy wrapper.

That's an implementation rather than an API thing though. It looks to me like it'd be better to have a single implementation, and a separate set of APIs. In the stats namespace, support for masked arrays will always be patchy, so rather not have it at all. And instead have a clear rule: stats functions will raise on masked array input, stats.mstats functions will work correctly. There should not be much overhead in that split, and it's a lot easier to make consistent and explain to users.

especially if the only reason to do so would be the moderate speed advantage of masked arrays in some cases (e.g. many short slices)

Agreed. The other option would be to drop it completely. I'm not sure performance is the rationale for having masked arrays though - it seems more like a functionality thing to me than a performance thing.

@lucascolley
Copy link
Member Author

or whatever the function used before it was updated to use array_namespace. For backward compatibility,

Either this, or we may want to use asanyarray consistently in other to allow subclasses now that those are made "safe". A I had the former in mind, but both would be reasonable.

Okay, sounds like we do want a fix for this. Should I submit a PR of something like my draft but with asanyarray? Or do we want to explore the possibility of having different behaviour for different functions first?

@mdhaber
Copy link
Contributor

mdhaber commented Sep 5, 2023

@lucascolley Let's continue with something like your draft to keep this moving.
After that, would you be willing to submit a followup that would allow some other types of arrays through to certain functions (once we figure out what we want)?

@mdhaber
Copy link
Contributor

mdhaber commented Sep 5, 2023

It looks to me like it'd be better to have a single implementation, and a separate set of APIs

OK, we can consider that in a separate thread once we start working on this again. Currently, there are a lot of stats functions that already accept masked arrays, though. (For some, this is documented; for many, it is not.)

I'm not sure performance is the rationale for having masked arrays though - it seems more like a functionality thing to me than a performance thing.

My comment about performance was about whether the implementation actually uses NumPy's masked array functionality to perform the calculations. _axis_nan_policy accepts masked arrays, but it separate them into a data array and a mask, both of which are normal arrays. Then it loops over the slices and performs the regular stats (not mstats) function on each compressed slice. This avoids the need for a separate mstats implementation at the cost of performance in some cases. In other cases, the _axis_nan_policy version is comparable or faster, even with the overhead, because masked array tend to be pretty slow.

import numpy as np
from scipy import stats
rng = np.random.default_rng(8946529483563963)
n = 100000
x = rng.normal(size=n)
mask = rng.random(size=n) > 0.5
xm = np.ma.MaskedArray(x, mask)
%timeit stats.ttest_1samp(xm, 0)  # 2.82 ms ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit stats.mstats.ttest_1samp(xm, 0)  # 3.84 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# for shorter slices, or if there are only a few masked elements, mstats can be a little faster
# but I suspect that consistent, correct functionality is more important than the speed

But yes, my assumption has been that if we retain the ability to work with masked arrays, we can't guarantee that these will be as fast as running the function on regular arrays.

@rgommers
Copy link
Member

rgommers commented Sep 5, 2023

[...] But yes, my assumption has been that if we retain the ability to work with masked arrays, we can't guarantee that these will be as fast as running the function on regular arrays.

Fully agreed. Using masked arrays causes a function to do extra work, so it's going to be slower in many cases. And we have to deal with suboptimal infrastructure from numpy here.

OK, we can consider that in a separate thread once we start working on this again.

Sounds good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array types Items related to array API support and input array validation (see gh-18286) defect A clear bug or issue that prevents SciPy from being installed or used as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants