
Avoid np.asarray call in check_array for duck-typed arrays #11447

Open · mrocklin opened this issue Jul 5, 2018 · 17 comments

@mrocklin commented Jul 5, 2018

Would it be reasonable to avoid np.asarray calls in check_array if the input is array-like?

My guess is that the general answer is "No. The np.asarray call is important to guarantee consistency within scikit-learn's estimators. We strongly value consistency."

The context here is that after the recently merged #11308, some scikit-learn transformers like RobustScaler that used to work on dask arrays no longer work, because they now auto-coerce their inputs into numpy arrays. This likely comes up in other situations as well.
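
For illustration, a minimal sketch of the coercion (assuming dask is installed): fit() funnels X through check_array, which materializes the lazy dask array into numpy.

```python
import dask.array as da
from sklearn.preprocessing import RobustScaler

X = da.random.random((1000, 10), chunks=(100, 10))  # lazy, chunked array

# fit() coerces X via np.asarray, computing the whole graph eagerly
scaler = RobustScaler().fit(X)
print(type(scaler.center_))  # numpy.ndarray -- laziness is lost
```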

@jnothman commented Jul 6, 2018 via email

@mrocklin commented Jul 6, 2018

Dask is fine; Dask-ML had issues. It looks like the Dask-ML RobustScaler inherits from scikit-learn's and reuses the transform method (I guess it previously only used dask-array-compatible operations).

Inside check_array there are a few calls to np.asarray, which cause a dask array to become a numpy array.
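
A small demonstration, assuming dask is installed:

```python
import dask.array as da
from sklearn.utils import check_array

X = da.ones((1000, 10), chunks=(100, 10))
X_checked = check_array(X)

print(type(X))          # dask.array.core.Array
print(type(X_checked))  # numpy.ndarray -- the graph was computed eagerly
```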

@mrocklin commented Jul 6, 2018

It looks like RobustScaler has a _check_array method. Perhaps by relying on class methods we can allow downstream projects to define behavior.

@mrocklin commented Jul 6, 2018

Ah, no, I was confused. My mistake.

@jnothman commented Jul 7, 2018 via email

@mrocklin commented Jul 7, 2018 via email

@jnothman commented Jul 7, 2018

See also some related discussion at #11043

I do think this is an issue we should be dealing with. I would be very interested in seeing a PR which:

  • modified check_array to not cast dask arrays to numpy arrays (by ducktyping or subtype checking, I don't mind for an initial proof of concept)
  • modified sklearn/utils/estimator_checks.py to pass dask arrays into estimators for fitting and prediction (perhaps read_only_memmap is a fair model?)
  • included a dask dependency in at least one CI run

I'd like to see if we can get tests passing out of the box, or whether this needs to be enabled on a per-estimator basis. I'd like to see whether there's a sensible way to duck type this, or whether we need to make specific exceptions for a while.

I don't think you should expect this support for the coming release, unfortunately.
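
A hypothetical sketch of the first bullet's duck-typing check (nothing like this exists in sklearn; the attribute probe is just one plausible heuristic):

```python
import numpy as np

def _is_duck_array(X):
    """Heuristic test for numpy-like arrays that should pass through."""
    return (
        hasattr(X, "__array_function__")  # NEP 18 dispatch protocol
        and hasattr(X, "shape")
        and hasattr(X, "dtype")
        and not isinstance(X, np.ndarray)
    )

def check_array_sketch(X):
    if _is_duck_array(X):
        return X          # leave dask/CuPy/etc. arrays untouched
    return np.asarray(X)  # current behavior for everything else
```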

@mrocklin commented Jul 7, 2018

No worries about the coming release. I'm playing a long game here with Dask/SKLearn interactions :) The improved joblib/dask interaction is something I'm eagerly waiting to see in a sklearn release, so I would be sad to hold things up.

I can probably find someone to help with the check_array PR. Maybe this is a good SciPy sprint task.

I'm not sure I understand the estimator_checks comment. There will be many cases where passing through dask arrays definitely won't make sense (any time Cython code assumes the numpy memory model, for example). My guess is that checks applied to all estimators would raise too many special-case problems. I may be misinterpreting this comment.

Also cc'ing @TomAugspurger

@jnothman commented Jul 7, 2018 via email

@jorisvandenbossche commented

> something like accept={'array','array-like','frame-like'} in check_array...

Yes, I think we need to have a general discussion about check_array, since a similar question is how to deal with dataframes (although of course they are much less array-like), see also #11401 (comment)
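
For concreteness, a runnable sketch of how the quoted accept= idea might behave; the keyword does not exist in check_array today (only accept_sparse and related flags do), so it's emulated in a wrapper here:

```python
import pandas as pd
from sklearn.utils import check_array

def check_array_with_accept(X, accept=frozenset({'array'})):
    # Hypothetical emulation of the proposed keyword: the caller
    # declares which container kinds it handles; anything else is
    # coerced through the existing check_array path.
    if isinstance(X, pd.DataFrame) and 'frame-like' in accept:
        return X
    return check_array(X)

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
out = check_array_with_accept(df, accept={'array', 'frame-like'})
print(type(out))  # pandas.DataFrame -- preserved instead of coerced
```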

@glemaitre commented

It could be a good start to have a CI build running the tests of dask-ml. We could detect which things we are breaking.

@jakirkham commented

Was just about to open an issue like this. Thankfully someone beat me to it by a year. 😄

Simply replace Dask with CuPy and dask-ml with cuML to get the issue I'm encountering.
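
For illustration, the CuPy flavor of the same problem (a sketch; requires a CUDA device, and the exact behavior depends on the CuPy version):

```python
# CuPy refuses implicit device-to-host conversion, so the np.asarray
# call inside check_array raises instead of silently copying.
import cupy as cp
from sklearn.utils import check_array

X = cp.ones((1000, 10))
check_array(X)  # TypeError: Implicit conversion to a NumPy array ...
```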

Agree it would be great to support array-like things in scikit-learn. What would be our next steps here?

@mrocklin commented

My guess is that @jnothman's comment is likely still valid.

For testing, my guess is that we could use dask.array as a stand-in for numpy-array-like, and that would probably also cover CuPy.
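
A hypothetical shape for such a check (the helper name is made up; nothing like it exists in estimator_checks today):

```python
import dask.array as da

def check_fit_preserves_duck_arrays(estimator):
    """Hypothetical estimator check: fitting and transforming a dask
    array should neither error nor densify it to numpy."""
    X = da.random.random((100, 5), chunks=(20, 5))
    Xt = estimator.fit(X).transform(X)
    assert isinstance(Xt, da.Array), "transform() densified the input"
```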

@amueller commented

Also see #14702 for us going in the opposite direction ;)

@jakirkham can you provide an example of estimators that you're interested in?
I assume that for 90% of scikit-learn this would make no sense. Some of the preprocessing methods might work, but I would be surprised if any of the machine learning worked with CuPy. Even if it did, it would be much, much slower than the GPU-based implementations available elsewhere.

@amueller commented Jan 9, 2020

In the interest of seeing what this would buy us, I made a list of estimators where this could potentially help:
https://docs.google.com/spreadsheets/d/1LCItphTxKwJWERUehItK_Tacef8jkXAxGyJQEvR3xcU/edit?usp=sharing

The first column lists the estimators that do not call into Cython or scipy.optimize; they might use scipy.linalg or some scipy distributions, but those might be fixable.

The second column calls scipy.optimize and is unlikely to be fixable with NEP 13 and NEP 18, and the third column is definitely impossible without a rewrite, as it makes substantial use of Cython.

There are also some places where some trickery might be needed or useful. For example, in the neighbors estimators we can easily support 'brute', so we should probably default to that if the object is a NEP 18 array. Similarly, for NMF we can easily support the multiplicative updates, so we might want to default to that for these data types (a sketch follows below).
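
For illustration, a sketch of what that defaulting trickery could look like (the helper name and the NEP 18 probe are assumptions, not sklearn code):

```python
import numpy as np

def _default_neighbors_algorithm(X):
    # Hypothetical helper: NEP 18 arrays get the pure-numpy 'brute'
    # path, since the tree-based paths go through Cython.
    if hasattr(X, "__array_function__") and not isinstance(X, np.ndarray):
        return "brute"
    return "auto"  # existing default; may pick tree-based Cython code
```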

And "of course" we need to use numpy.linalg whenever we want to support these arrays, not scipy.linalg, meaning we likely have to wrap the common linalg methods to do the dispatch.

cc @thomasjpfan

@jakirkham commented

Thanks for putting that list together @amueller!

I think what is most interesting to us is what you have called meta-estimators in the spreadsheet (column 4). These are interesting because we can combine them with things like Dask, CuPy, and/or other libraries where the array type is handled correctly by some estimator we provide. Just to pick one, GridSearchCV could be a nice place to start and get a feel for what this takes.
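
As a hedged sketch of that idea (cuml.linear_model.Ridge and its alpha parameter are assumptions about cuML's API; whether the arrays survive CV splitting without a host copy is exactly what would need verifying):

```python
import cupy as cp
from sklearn.model_selection import GridSearchCV
from cuml.linear_model import Ridge  # assumed cuML estimator

X = cp.random.rand(1000, 10)
y = cp.random.rand(1000)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)  # works only if CV splitting keeps X, y as CuPy arrays
```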

Though it is also interesting to see a fair number of estimators fit in column 1. So it's good to hear that those might be usable with other array types. It might be interesting to explore this later on.

Does this seem like a reasonable place to start?

cc @JohnZed

@amueller commented Feb 5, 2020

@jakirkham yes, that sounds good. Sorry for the slow reply.
