
Avoid np.asarray call in check_array for duck-typed arrays #11447

Open · mrocklin opened this issue Jul 5, 2018 · 17 comments

@mrocklin commented Jul 5, 2018

Would it be reasonable to avoid np.asarray calls in check_array if the input is array-like?

My guess is that the general answer is "No. The np.asarray call is important to guarantee consistency within scikit-learn's estimators. We strongly value consistency."

The context here is that after the recently merged #11308, some scikit-learn transformers like RobustScaler that used to work on dask arrays no longer work, because they now auto-coerce their inputs into numpy arrays. This likely comes up in other situations as well.
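
For illustration, a minimal sketch of the coercion (assuming dask is installed): fit() funnels X through check_array, which materializes the lazy dask array into numpy.

```python
import dask.array as da
from sklearn.preprocessing import RobustScaler

X = da.random.random((1000, 10), chunks=(100, 10))  # lazy, chunked array

# fit() coerces X via np.asarray, computing the whole graph eagerly
scaler = RobustScaler().fit(X)
print(type(scaler.center_))  # numpy.ndarray -- laziness is lost
```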

@jnothman commented Jul 6, 2018 via email

@mrocklin commented Jul 6, 2018

Dask is fine; Dask-ML had issues. It looks like the Dask-ML RobustScaler inherits from scikit-learn's and reuses the transform method (I guess it previously only used dask-array-compatible operations).

Inside check_array there are a few calls to np.asarray, which cause a dask array to become a numpy array.
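
A small demonstration, assuming dask is installed:

```python
import dask.array as da
from sklearn.utils import check_array

X = da.ones((1000, 10), chunks=(100, 10))
X_checked = check_array(X)

print(type(X))          # dask.array.core.Array
print(type(X_checked))  # numpy.ndarray -- the graph was computed eagerly
```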

@mrocklin commented Jul 6, 2018

It looks like RobustScaler has a _check_array method. Perhaps by relying on class methods we can allow downstream projects to define behavior.

@mrocklin commented Jul 6, 2018

Ah, no, I was confused. My mistake.

@jnothman commented Jul 7, 2018 via email

@mrocklin commented Jul 7, 2018 via email

@jnothman commented Jul 7, 2018

See also some related discussion at #11043

I do think this is an issue we should be dealing with. I would be very interested in seeing a PR which:

  • modified check_array to not cast dask arrays to numpy arrays (by ducktyping or subtype checking, I don't mind for an initial proof of concept)
  • modified sklearn/utils/estimator_checks.py to pass dask arrays into estimators for fitting and prediction (perhaps read_only_memmap is a fair model?)
  • included a dask dependency in at least one CI run

I'd like to see if we can get tests passing out of the box, or whether this needs to be enabled on a per-estimator basis. I'd like to see whether there's a sensible way to duck type this, or whether we need to make specific exceptions for a while.

I don't think you should expect this support for the coming release, unfortunately.
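
A hypothetical sketch of the first bullet's duck-typing check (nothing like this exists in sklearn; the attribute probe is just one plausible heuristic):

```python
import numpy as np

def _is_duck_array(X):
    """Heuristic test for numpy-like arrays that should pass through."""
    return (
        hasattr(X, "__array_function__")  # NEP 18 dispatch protocol
        and hasattr(X, "shape")
        and hasattr(X, "dtype")
        and not isinstance(X, np.ndarray)
    )

def check_array_sketch(X):
    if _is_duck_array(X):
        return X          # leave dask/CuPy/etc. arrays untouched
    return np.asarray(X)  # current behavior for everything else
```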

@mrocklin commented Jul 7, 2018

No worries about the coming release. I'm playing a long game here with Dask/SKLearn interactions :) The improved joblib/dask interaction is something I'm eagerly waiting to see in a sklearn release, so I would be sad to hold things up.

I can probably find someone to help with the check_array PR. Maybe this is a good SciPy sprint task.

I'm not sure I understand the estimator_checks comment. There will be many cases where passing through dask arrays definitely won't make sense (any time Cython code assumes the numpy memory model, for example). My guess is that checks applied to all estimators would raise too many special-case problems. I may be misinterpreting this comment.

Also cc'ing @TomAugspurger

@jnothman commented Jul 7, 2018 via email

@jorisvandenbossche commented

> something like accept={'array','array-like','frame-like'} in check_array...

Yes, I think we need to have a general discussion about check_array, since a similar question is how to deal with dataframes (although of course they are much less array-like), see also #11401 (comment)
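
For concreteness, a runnable sketch of how the quoted accept= idea might behave; the keyword does not exist in check_array today (only accept_sparse and related flags do), so it's emulated in a wrapper here:

```python
import pandas as pd
from sklearn.utils import check_array

def check_array_with_accept(X, accept=frozenset({'array'})):
    # Hypothetical emulation of the proposed keyword: the caller
    # declares which container kinds it handles; anything else is
    # coerced through the existing check_array path.
    if isinstance(X, pd.DataFrame) and 'frame-like' in accept:
        return X
    return check_array(X)

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
out = check_array_with_accept(df, accept={'array', 'frame-like'})
print(type(out))  # pandas.DataFrame -- preserved instead of coerced
```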

@glemaitre commented

It could be a good start to have a CI build running the tests of dask-ml. We could detect which things we are breaking.

@jakirkham commented

Was just about to open an issue like this. Thankfully someone beat me to it by a year. 😄

Simply replace Dask with CuPy and dask-ml with cuML to get the issue I'm encountering.
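
For illustration, the CuPy flavor of the same problem (a sketch; requires a CUDA device, and the exact behavior depends on the CuPy version):

```python
# CuPy refuses implicit device-to-host conversion, so the np.asarray
# call inside check_array raises instead of silently copying.
import cupy as cp
from sklearn.utils import check_array

X = cp.ones((1000, 10))
check_array(X)  # TypeError: Implicit conversion to a NumPy array ...
```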

Agree it would be great to support array-like things in scikit-learn. What would be our next steps here?

@mrocklin commented

My guess is that @jnothman's comment is likely still valid.

For testing, my guess is that we could use dask.array as a stand-in for numpy-array-like, and that would probably also cover CuPy.
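
A hypothetical shape for such a check (the helper name is made up; nothing like it exists in estimator_checks today):

```python
import dask.array as da

def check_fit_preserves_duck_arrays(estimator):
    """Hypothetical estimator check: fitting and transforming a dask
    array should neither error nor densify it to numpy."""
    X = da.random.random((100, 5), chunks=(20, 5))
    Xt = estimator.fit(X).transform(X)
    assert isinstance(Xt, da.Array), "transform() densified the input"
```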

@amueller commented

Also see #14702 for us going in the opposite direction ;)

@jakirkham can you provide an example of estimators that you're interested in?
I assume that for 90% of scikit-learn this would make no sense. Some of the preprocessing methods might work, but I would be surprised if any of the machine learning worked with CuPy. Even if it did, it would be much, much slower than the GPU-based implementations available elsewhere.

@amueller commented Jan 9, 2020

In the interest of seeing what this would buy us, I made a list of estimators where this could potentially help:
https://docs.google.com/spreadsheets/d/1LCItphTxKwJWERUehItK_Tacef8jkXAxGyJQEvR3xcU/edit?usp=sharing

The first column lists the estimators that do not call into Cython or scipy.optimize; they might use scipy.linalg or some scipy distributions, but those might be fixable.

The second column calls scipy.optimize and is unlikely to be fixable with NEP 13 and NEP 18, and the third column is definitely impossible without a rewrite, as it makes substantial use of Cython.

There are also some places where some trickery might be needed or useful. For example, in the neighbors estimators we can easily support 'brute', so we should probably default to that if the object is a NEP 18 array. Similarly, for NMF we can easily support the multiplicative updates, so we might want to default to that for these data types (a sketch follows below).
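
For illustration, a sketch of what that defaulting trickery could look like (the helper name and the NEP 18 probe are assumptions, not sklearn code):

```python
import numpy as np

def _default_neighbors_algorithm(X):
    # Hypothetical helper: NEP 18 arrays get the pure-numpy 'brute'
    # path, since the tree-based paths go through Cython.
    if hasattr(X, "__array_function__") and not isinstance(X, np.ndarray):
        return "brute"
    return "auto"  # existing default; may pick tree-based Cython code
```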

And "of course" we need to use numpy.linalg whenever we want to support these arrays, not scipy.linalg, meaning we likely have to wrap the common linalg methods to do the dispatch.

cc @thomasjpfan

@jakirkham commented

Thanks for putting that list together @amueller!

I think what is most interesting to us is what you have called meta-estimators in the spreadsheet (column 4). These are interesting because we can combine them with things like Dask, CuPy, and/or other libraries where the array type is handled correctly by some estimator we provide. Just to pick one, GridSearchCV could be a nice place to start and get a feel for what this takes.
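
As a hedged sketch of that idea (cuml.linear_model.Ridge and its alpha parameter are assumptions about cuML's API; whether the arrays survive CV splitting without a host copy is exactly what would need verifying):

```python
import cupy as cp
from sklearn.model_selection import GridSearchCV
from cuml.linear_model import Ridge  # assumed cuML estimator

X = cp.random.rand(1000, 10)
y = cp.random.rand(1000)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)  # works only if CV splitting keeps X, y as CuPy arrays
```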

Though it is also interesting to see a fair number of estimators fit in column 1. So it's good to hear that those might be usable with other array types. It might be interesting to explore this later on.

Does this seem like a reasonable place to start?

cc @JohnZed

@amueller commented Feb 5, 2020

@jakirkham yes, that sounds good. Sorry for the slow reply.
