
Make more of the "tools" of scikit-learn Array API compatible #26024

Open
betatim opened this issue Mar 30, 2023 · 34 comments
Labels
API Array API Meta-issue General issue associated to an identified list of tasks

Comments

@betatim
Member

betatim commented Mar 30, 2023

🚨 🚧 This issue requires a bit of patience and experience to contribute to 🚧 🚨

Please mention this issue when you create a PR, but please don't write "closes #26024" or "fixes #26024".

scikit-learn contains lots of useful tools in addition to its many estimators: for example metrics, pipelines, pre-processing and model selection. These are useful to, and used by, people who do not necessarily use an estimator from scikit-learn. This is great.

The fact that many users install scikit-learn "just" to use train_test_split is a testament to how useful it is to provide easy-to-use tools that do the right(!) thing, instead of everyone implementing them from scratch because it is "easy" and making mistakes along the way.

In this issue I'd like to collect and track work related to making it easier to use all these "tools" from scikit-learn even if you are not using NumPy arrays for your data. In particular, thanks to the Array API standard, it should be "not too much work" to make things usable with data that is in an array conforming to the Array API standard.

There is work in #25956 and #22554 that adds the basic infrastructure needed to use "array API arrays". Right now you need to check out #25956 (this is part of the reason why this is a draft issue).

The goal of this issue is to make code like the following work:

>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn import config_context
>>> from sklearn.datasets import make_classification
>>> import torch
>>> X_np, y_np = make_classification(random_state=0)
>>> X_torch = torch.asarray(X_np, device="cuda", dtype=torch.float32)
>>> y_torch = torch.asarray(y_np, device="cuda", dtype=torch.float32)

>>> with config_context(array_api_dispatch=True):
...     # For example using MinMaxScaler on PyTorch tensors
...     scale = MinMaxScaler()
...     X_trans = scale.fit_transform(X_torch, y_torch)
...     assert type(X_trans) == type(X_torch)
...     assert X_trans.device == X_torch.device

The first step is to compile a list of tools that are in scope for this. The next step (or maybe part of the first) is to check which of them already "just work". After that is done we can start the work (one PR per class/function) making changes. Hopefully by then #25956 is ready or already merged.


  • The main reason I created this issue is to get some feedback, let people know I'm thinking about this and to have a place to work on this/collect notes. If you want to join in that would be great, but for now there is no need for code changes/PRs.
@github-actions github-actions bot added the Needs Triage Issue requires triage label Mar 30, 2023
@betatim
Member Author

betatim commented Mar 30, 2023

Mark an estimator or function as done if it not only "doesn't raise an exception" but also outputs a sensible value. The latter is something that will require a human at the start, but maybe later we can write a test for it.

Below is a list of preprocessors and metrics. The lists are pretty long already, so I won't add more until we make progress (or decide that a different area is a better starting point).

The next thing is to work out whether there is some generic advice for "fixing" these.

NOTE: it's possible to test the changes in your pull request on a CUDA GPU host for free with the help of this notebook on Google Colab: https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c


Transformers from sklearn.preprocessing:

Details Code used to create the list (assumes the X_torch and y_torch tensors from the snippet above):
from sklearn import config_context
from sklearn.utils import discovery

for name, Trn in discovery.all_estimators(type_filter="transformer"):
    if Trn.__module__.startswith("sklearn.preprocessing."):
        with config_context(array_api_dispatch=True):
            tr = Trn()
            try:
                tr.fit_transform(X_torch, y_torch)
                print(f"* [ ] {name} - no exception with pytorch X")
            except Exception:
                print(f"* [ ] {name}")

Metrics from sklearn.metrics:

Details Code used to create the list (assumes the y_torch tensor from the snippet above):
from sklearn import config_context
from sklearn.utils import discovery

for name, func in discovery.all_functions():
    if func.__module__.startswith("sklearn.metrics."):
        with config_context(array_api_dispatch=True):
            try:
                func(y_torch, y_torch)
                print(f"* [ ] {name} - no exception with pytorch y")
            except Exception:
                print(f"* [ ] {name}")

@thomasjpfan thomasjpfan added API RFC and removed Needs Triage Issue requires triage labels Apr 6, 2023
@betatim
Member Author

betatim commented Apr 19, 2023

It turns out it is trickier than you'd think. For example, in MinMaxScaler we compute the min/max of the features, ignoring NaNs. However, there doesn't seem to be a nice way to do this with the Array API (yet). Filed data-apis/array-api#621 to discuss it.

@betatim
Member Author

betatim commented Apr 20, 2023

A more comprehensible version (with fewer sentence fragments) of the text below is in https://github.com/scikit-learn/scikit-learn/pull/25956/files#r1172450244


Some thoughts: should we add a "compat layer" in scikit-learn to add things like nanmin to the namespace? Should we contribute it to the array-api-compat project? Should we lobby for it to be part of the Array API itself?

cc @thomasjpfan maybe you have thoughts/opinion and/or time to chat about this.

@ogrisel
Member

ogrisel commented Apr 20, 2023

I am fine with starting with a private helper in scikit-learn and discussing with the maintainers of the array-api-compat project if they think that such extensions to the spec API can make it upstream into array-api-compat or not.

@thomasjpfan
Member

Moving my comment from https://github.com/scikit-learn/scikit-learn/pull/25956/files#r1172771551 here regarding adding more methods to scikit-learn's compat layer:

Ultimately, we decided to add methods to the wrappers only when they are going to be in the Array API spec. If we add methods that are not part of the spec, then we would have an "enhanced Array API" namespace, which can be a little confusing to contributors. Concretely, they would see xp.nanmin, which is not part of the spec, and discover that scikit-learn extends the spec. On the other hand, with a private _nanmin(..., xp=xp) helper function, it is clear that we are implementing something that is not part of the spec.

@betatim betatim added Array API and removed RFC labels May 4, 2023
@ogrisel
Member

ogrisel commented May 17, 2023

As discussed in data-apis/array-api#627 (related to a potential xp.linalg.lu), this could include things that seem to be long-term objectives, even if some libraries such as numpy still do not expose an implementation.

@EdAbati
Contributor

EdAbati commented Aug 18, 2023

Hi all, I am working on MaxAbsScaler :)

@elindgren
Contributor

I've just added a PR for the r2_score, #27102 :>

@OmarManzoor
Contributor

Hi @betatim, @ogrisel

I would like to try working on f1_score. Since precision and recall are computed by the same underlying function, I think they should be handled at the same time too?

@ogrisel
Member

ogrisel commented Aug 21, 2023

Yes, that would make sense.

@EdAbati
Contributor

EdAbati commented Aug 21, 2023

I'll try converting zero_one_loss.

@betatim betatim changed the title [Draft] Make more of the "tools" of scikit-learn Array API compatible Make more of the "tools" of scikit-learn Array API compatible Aug 22, 2023
@rotuna

rotuna commented Sep 21, 2023

Hi!

Inspired by the lightning talk at the Swiss Python Summit.

I'll work on the OneHotEncoder

@rotuna

rotuna commented Sep 26, 2023

OneHotEncoder seems to work fine with the API.

The following is what I used to test it (it's basically the same as the first example in the OneHotEncoder docs).

>>> from sklearn.preprocessing import OneHotEncoder
>>> import numpy.array_api as xp
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = xp.asarray([[1, 1], [2, 3], [2, 2]])
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array([1, 2]), array([1, 2, 3])]
>>> enc.transform(xp.asarray([[2, 1], [1, 4]])).toarray()
array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 0.]])
>>> enc.inverse_transform([[1., 0., 1., 0., 0.], [0., 1., 0., 0., 0.]])
array([[1, 1],
       [2, None]], dtype=object)
>>> enc.get_feature_names_out(['g1', 'g2'])
array(['g1_1', 'g1_2', 'g2_1', 'g2_2', 'g2_3'], dtype=object)

I'll check Binarizer next, unless you think there is something else that needs to be checked for OneHotEncoder @betatim

I don't really have access to a GPU to check this with CuPy; I hope that's not a problem.

@EdAbati
Contributor

EdAbati commented Sep 27, 2023

Hey @rotuna,
(disclaimer: I am not part of the sklearn team, therefore I may say something wrong 😄 but I have done some PRs for this issue)

I think that we should also test that OneHotEncoder works with other types of arrays (I tested locally using pytorch with the devices "cpu" and "mps").
There is a common test that should be used for estimators, and you can find it here:

def test_scaler_array_api_compliance(estimator, check, array_namespace, device, dtype):

I'd start by adding the estimator to the list there and seeing if something fails. If you are lucky and everything works because there are no numpy-specific operations in the implementation, I think you can just make a PR adding OneHotEncoder to that test :)

@EdAbati
Contributor

EdAbati commented Oct 9, 2023

I am working on KernelCenterer and Normalizer, PRs coming soon

@Tialo
Contributor

Tialo commented May 24, 2024

I was working on entropy, and when I used torch.tensor (with and without set_config(array_api_dispatch=True)) it did not raise exceptions even without changing the code. Should the function be left in its current state and only a test be added? Or should I replace np with xp anyway?

Also, when I used get_namespace and the xp namespace, it only added some overhead, slowing the function down (it's fast anyway).
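For reference, the quantity entropy computes is H = -Σ p_i log(p_i) over the empirical label proportions. A NumPy-only sketch of that computation (hypothetical, not scikit-learn's actual implementation; under the Array API, np.unique(..., return_counts=True) corresponds to xp.unique_counts, and np.log/np.sum have direct xp equivalents):

```python
import numpy as np


def entropy_sketch(labels):
    # Empirical entropy of a labeling, in nats.
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))
```

Because every operation involved has a spec-level counterpart, converting such a function to be namespace-agnostic is mostly mechanical, which matches the observation that it already "just works" apart from dispatch overhead.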

@OmarManzoor
Contributor

I would like to try out euclidean_distances and rbf_kernel from the set of pairwise metrics, after the current set of metrics I am working on is finalized.

@ogrisel
Member

ogrisel commented Jun 6, 2024

For information, I edited the above comment with the list of estimators / functions to focus on, adding a link to this notebook, which can be very helpful for debugging failing tests on a CUDA GPU for free using Google Colab or similar.

@EmilyXinyi
Contributor

Hi! I would like to work on d2_tweedie_score! Thanks!

@EmilyXinyi
Contributor

Working on mean_poisson_deviance and cosine_distance :)

@EdAbati
Contributor

EdAbati commented Jun 7, 2024

Working on max_error :)

@EmilyXinyi
Contributor

Looking at mean_gamma_deviance :)

@ogrisel
Member

ogrisel commented Jun 14, 2024

@elindgren @lithomas1 @EdAbati @Tialo @EmilyXinyi since you all have had some array API PRs merged in main, feel free to review each other's PRs and look for improvements similar to those suggested in reviews of your own past PRs.

@EmilyXinyi
Contributor

Working on mean_absolute_percentage_error :)

@EdAbati
Contributor

EdAbati commented Jun 23, 2024

Working on dcg_score :)

@OmarManzoor
Contributor

@ogrisel Are we supposed to handle the latest version of array-api-strict, which is 2.0? Some tests are now failing:

FAILED sklearn/model_selection/tests/test_search.py::test_array_api_search_cv_classifier[GridSearchCV-array_api_strict-None-None] - ValueError: 
FAILED sklearn/model_selection/tests/test_search.py::test_array_api_search_cv_classifier[RandomizedSearchCV-array_api_strict-None-None] - ValueError: 
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y0-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y1-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y2-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/tests/test_common.py::test_estimators[LinearDiscriminantAnalysis()-check_array_api_input(array_namespace=array_api_strict,dtype_name=None,device=None)] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int16-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int32-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[int64-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-True-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-True-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-False-True-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
FAILED sklearn/utils/tests/test_array_api.py::test_isin[uint8-14-False-False-array_api_strict-None-None] - TypeError: array iteration is not allowed in array-api-strict
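For context, the "array iteration is not allowed" failures come from code that loops over array elements at the Python level, which array-api-strict 2.0 forbids. One way to express an isin-style check without such iteration is broadcasting plus a reduction; a minimal sketch (hypothetical, not necessarily the fix scikit-learn adopted):

```python
import numpy as np


def isin_sketch(element, test_elements, xp=np):
    # Compare every element against every test element via broadcasting,
    # then reduce over the test axis; no Python-level iteration over arrays.
    element = xp.asarray(element)
    test = xp.reshape(xp.asarray(test_elements), (-1,))
    return xp.any(element[..., None] == test, axis=-1)
```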

@ogrisel
Member

ogrisel commented Jun 28, 2024

@ogrisel Are we supposed to handle the latest version of array-api-strict, which is 2.0? Some tests are now failing

@OmarManzoor Interesting. So far, the currently open PRs run with version 1.1.1 from the lock files of the CI.

But indeed our lock-file bot will attempt to open a PR to bump the versions of the dependencies on Monday, and this will fail with the error you reported, so feel free to open a dedicated PR to start fixing those.

You can already trigger the update of the lock file for pylatest_conda_forge_mkl_linux in such a PR by using https://github.com/scikit-learn/scikit-learn/blob/main/build_tools/update_environments_and_lock_files.py with the appropriate --select-build flag. I'll let you read the doc at the beginning of that script for details.

@EmilyXinyi
Contributor

Looking at paired_euclidean_distances :)
