
Make more of the "tools" of scikit-learn Array API compatible #26024

Open
betatim opened this issue Mar 30, 2023 · 34 comments
Labels
API Array API Meta-issue General issue associated to an identified list of tasks

Comments

@betatim
Member

betatim commented Mar 30, 2023

🚨 🚧 This issue requires a bit of patience and experience to contribute to 🚧 🚨

Please mention this issue when you create a PR, but please don't write "closes #26024" or "fixes #26024".

scikit-learn contains lots of useful tools in addition to its many estimators: for example metrics, pipelines, pre-processing and model selection. These are useful to, and used by, people who do not necessarily use an estimator from scikit-learn. This is great.

The fact that many users install scikit-learn "just" to use train_test_split is a testament to how useful it is to provide easy-to-use tools that do the right(!) thing, instead of everyone implementing them from scratch because it is "easy" and making mistakes along the way.

In this issue I'd like to collect and track work related to making it easier to use all these "tools" from scikit-learn even if you are not using NumPy arrays for your data. In particular, thanks to the Array API standard, it should be "not too much work" to make things usable with data that is in an array conforming to the Array API standard.

There is work in #25956 and #22554 that adds the basic infrastructure needed to use "array API arrays". Right now you need to check out #25956 (this is part of the reason why this is a draft issue).

The goal of this issue is to make code like the following work:

>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn import config_context
>>> from sklearn.datasets import make_classification
>>> import torch
>>> X_np, y_np = make_classification(random_state=0)
>>> X_torch = torch.asarray(X_np, device="cuda", dtype=torch.float32)
>>> y_torch = torch.asarray(y_np, device="cuda", dtype=torch.float32)

>>> with config_context(array_api_dispatch=True):
...     # For example using MinMaxScaler on PyTorch tensors
...     scale = MinMaxScaler()
...     X_trans = scale.fit_transform(X_torch, y_torch)
...     assert type(X_trans) == type(X_torch)
...     assert X_trans.device == X_torch.device

The first step is to compile a list of tools that are in scope for this. The next step (or maybe part of the first) is to check which of them already "just work". After that is done we can start the work (one PR per class/function) making changes. Hopefully by then #25956 is ready or already merged.


  • The main reason I created this issue is to get some feedback, let people know I'm thinking about this and to have a place to work on this/collect notes. If you want to join in that would be great, but for now there is no need for code changes/PRs.
@github-actions github-actions bot added the Needs Triage Issue requires triage label Mar 30, 2023
@betatim
Member Author

betatim commented Mar 30, 2023

Mark an estimator or function as done if it not only "doesn't raise an exception" but also outputs a sensible value. The latter is something that will require a human at the start, but maybe later we can write a test for it.

Below is a list of preprocessors and metrics. The lists are pretty long already, so I won't add more until we make progress (or decide that a different area is a better starting point).

The next thing is to work out whether there is some generic advice for "fixing" these.

NOTE: it's possible to test the changes in your pull request on a CUDA GPU host for free with the help of this notebook on Google Colab: https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c


Transformers from sklearn.preprocessing:

Details Code used to create the list (assumes the X_torch and y_torch tensors from the snippet above):
from sklearn import config_context
from sklearn.utils import discovery

for name, Trn in discovery.all_estimators(type_filter="transformer"):
    if Trn.__module__.startswith("sklearn.preprocessing."):
        with config_context(array_api_dispatch=True):
            tr = Trn()
            try:
                tr.fit_transform(X_torch, y_torch)
                print(f"* [ ] {name} - no exception with pytorch X")
            except Exception:
                print(f"* [ ] {name}")

Metrics from sklearn.metrics:

Details Code used to create the list (assumes the y_torch tensor from the snippet above):
from sklearn import config_context
from sklearn.utils import discovery

for name, func in discovery.all_functions():
    if func.__module__.startswith("sklearn.metrics."):
        with config_context(array_api_dispatch=True):
            try:
                func(y_torch, y_torch)
                print(f"* [ ] {name} - no exception with pytorch y")
            except Exception:
                print(f"* [ ] {name}")

@thomasjpfan thomasjpfan added API RFC and removed Needs Triage Issue requires triage labels Apr 6, 2023
@betatim
Member Author

betatim commented Apr 19, 2023

It turns out it is trickier than you'd think. For example, in MinMaxScaler we compute the min/max of the features, ignoring NaNs. However, there doesn't seem to be a nice way to do this with the Array API (yet). Filed data-apis/array-api#621 to discuss it.

@betatim
Member Author

betatim commented Apr 20, 2023

A more comprehensible version (with fewer sentence fragments) of the text below is in https://github.com/scikit-learn/scikit-learn/pull/25956/files#r1172450244


Some thoughts: should we add a "compat layer" in scikit-learn to add things like nanmin to the namespace? Should we contribute it to the array-api-compat project? Should we lobby for it to be part of the Array API itself?

cc @thomasjpfan maybe you have thoughts/opinion and/or time to chat about this.

@ogrisel
Member

ogrisel commented Apr 20, 2023

I am fine with starting with a private helper in scikit-learn and discussing with the maintainers of the array-api-compat project if they think that such extensions to the spec API can make it upstream into array-api-compat or not.

@thomasjpfan
Member

Moving my comment from https://github.com/scikit-learn/scikit-learn/pull/25956/files#r1172771551 here regarding adding more methods to scikit-learn's compat layer:

Ultimately, we decided to add methods to the wrappers only when they are going to be in the Array API spec. If we add methods that are not part of the spec, then we would have an "enhanced Array API" namespace, which can be a little confusing to contributors. Concretely, they would see xp.nanmin, which is not part of the spec, and discover that scikit-learn extends the spec. On the other hand, with a private _nanmin(..., xp=xp) helper function, it is clear that we are implementing something that is not part of the spec.

@betatim betatim added Array API and removed RFC labels May 4, 2023
@ogrisel
Member

ogrisel commented May 17, 2023

As discussed in data-apis/array-api#627 (related to a potential xp.linalg.lu), this could include things that seem to be long-term objectives, even if some libraries such as numpy still do not expose an implementation.

@EdAbati
Contributor

EdAbati commented Aug 18, 2023

Hi all, I am working on MaxAbsScaler :)

@elindgren
Contributor

I've just added a PR for the r2_score, #27102 :>

@OmarManzoor
Contributor

Hi @betatim, @ogrisel

I would like to try working on f1_score. Since precision and recall are computed by the same underlying function, I think they should be handled at the same time too?

@ogrisel
Member

ogrisel commented Aug 21, 2023

Yes, that would make sense.

@EdAbati
Contributor

EdAbati commented Aug 21, 2023

I'll try converting zero_one_loss.

@betatim betatim changed the title [Draft] Make more of the "tools" of scikit-learn Array API compatible Make more of the "tools" of scikit-learn Array API compatible Aug 22, 2023
@rotuna

rotuna commented Sep 21, 2023

Hi!

Inspired by the lightning talk at the Swiss Python Summit.

I'll work on the OneHotEncoder

@rotuna

rotuna commented Sep 26, 2023

OneHotEncoder seems to work fine with the API.

The following is what I used to test it (it's basically the same as the first example in the OneHotEncoder docs).

>>> from sklearn.preprocessing import OneHotEncoder
>>> import numpy.array_api as xp
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = xp.asarray([[1, 1], [2, 3], [2, 2]])
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array([1, 2]), array([1, 2, 3])]
>>> enc.transform(xp.asarray([[2, 1], [1, 4]])).toarray()
array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 0.]])
>>> enc.inverse_transform([[1., 0., 1., 0., 0.], [0., 1., 0., 0., 0.]])
array([[1, 1],
       [2, None]], dtype=object)
>>> enc.get_feature_names_out(['g1', 'g2'])
array(['g1_1', 'g1_2', 'g2_1', 'g2_2', 'g2_3'], dtype=object)

I'll check Binarizer next, unless you think there is something else that needs to be checked for OneHotEncoder @betatim

I don't really have access to a GPU to check this with CuPy; I hope that's not a problem.

@EdAbati
Contributor

EdAbati commented Sep 27, 2023

Hey @rotuna,
(disclaimer: I am not part of the sklearn team, therefore I may say something wrong 😄 but I have done some PRs for this issue)

I think that we should also test that OneHotEncoder works with other types of arrays (I tested locally using pytorch with the devices "cpu" and "mps").
There is a common test that should be used for estimators, and you can find it here:

def test_scaler_array_api_compliance(estimator, check, array_namespace, device, dtype):

I'd start by adding the estimator to the list there and seeing if something fails. If you are lucky and everything works because there are no numpy-specific operations in the implementation, I think you can just make a PR adding OneHotEncoder to that test :)

@EdAbati
Contributor

EdAbati commented Oct 9, 2023

I am working on KernelCenterer and Normalizer, PRs coming soon

@Tialo
Contributor

Tialo commented May 24, 2024

I was working on entropy, and when I used torch.tensor (with and without set_config(array_api_dispatch=True)) it did not raise exceptions even without changing the code. Should the function be left in its current state and only a test be added? Or should I replace np with xp anyway?

Also, when I used get_namespace and the xp namespace, it only added some overhead, slowing the function down (it's fast anyway).
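For reference, the quantity entropy computes is H = -Σ p_i log(p_i) over the empirical label proportions. A NumPy-only sketch of that computation (hypothetical, not scikit-learn's actual implementation; under the Array API, np.unique(..., return_counts=True) corresponds to xp.unique_counts, and np.log/np.sum have direct xp equivalents):

```python
import numpy as np


def entropy_sketch(labels):
    # Empirical entropy of a labeling, in nats.
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))
```

Because every operation involved has a spec-level counterpart, converting such a function to be namespace-agnostic is mostly mechanical, which matches the observation that it already "just works" apart from dispatch overhead.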

@OmarManzoor
Contributor

I would like to try out euclidean_distances and rbf_kernel from the set of pairwise metrics, after the current set of metrics I am working on is finalized.

@ogrisel
Member

ogrisel commented Jun 6, 2024

For information, I edited the above comment with the list of estimators / functions to focus on, adding a link to this notebook, which can be very helpful for debugging failing tests on a CUDA GPU for free using Google Colab or similar.

@EmilyXinyi
Contributor

Hi! I would like to work on d2_tweedie_score! Thanks!

@EmilyXinyi
Contributor

Working on mean_poisson_deviance and cosine_distance :)

@EdAbati
Contributor

EdAbati commented Jun 7, 2024

Working on max_error :)

@EmilyXinyi
Contributor

Looking at mean_gamma_deviance :)

@ogrisel
Member

ogrisel commented Jun 14, 2024

@elindgren @lithomas1 @EdAbati @Tialo @EmilyXinyi since you all have had some array API PRs merged in main, feel free to review each other's PRs and look for improvements similar to those suggested in reviews of your own past PRs.

@EmilyXinyi
Contributor

Working on mean_absolute_percentage_error :)

@EdAbati
Contributor

EdAbati commented Jun 23, 2024

Working on dcg_score :)

@OmarManzoor
Contributor

@ogrisel Are we supposed to handle the latest version of array-api-strict, which is 2.0? Some tests are now failing:

FAILED sklearn/model_selection/tests/test_search.py::test_array_api_search_cv_classifier[GridSearchCV-array_api_strict-None-None] - ValueError: 
FAILED sklearn/model_selection/tests/test_search.py::test_array_api_search_cv_classifier[RandomizedSearchCV-array_api_strict-None-None] - ValueError: 
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y0-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y1-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y2-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/tests/test_common.py::test_estimators[LinearDiscriminantAnalysis()-check_array_api_input(array_namespace=array_api_strict,dtype_name=None,device=None)] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
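For context, the "array iteration is not allowed" failures come from code that loops over array elements at the Python level, which array-api-strict 2.0 forbids. One way to express an isin-style check without such iteration is broadcasting plus a reduction; a minimal sketch (hypothetical, not necessarily the fix scikit-learn adopted):

```python
import numpy as np


def isin_sketch(element, test_elements, xp=np):
    # Compare every element against every test element via broadcasting,
    # then reduce over the test axis; no Python-level iteration over arrays.
    element = xp.asarray(element)
    test = xp.reshape(xp.asarray(test_elements), (-1,))
    return xp.any(element[..., None] == test, axis=-1)
```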

@ogrisel
Member

ogrisel commented Jun 28, 2024

@ogrisel Are we supposed to handle the latest version of array-api-strict, which is 2.0? Some tests are now failing

@OmarManzoor Interesting. So far, the currently open PRs run with version 1.1.1 from the lock files of the CI.

But indeed our lock-file bot will attempt to open a PR to bump the versions of the dependencies on Monday, and this will fail with the error you reported, so feel free to open a dedicated PR to start fixing those.

You can already trigger the update of the lock file for pylatest_conda_forge_mkl_linux in such a PR by using https://github.com/scikit-learn/scikit-learn/blob/main/build_tools/update_environments_and_lock_files.py with the appropriate --select-build flag. I'll let you read the doc at the beginning of that script for details.

@EmilyXinyi
Contributor

Looking at paired_euclidean_distances :)
