
[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based #856

Closed
NV-jpt opened this issue Aug 30, 2021 · 3 comments · Fixed by #858

Comments


NV-jpt commented Aug 30, 2021

General Introduction and Relevance

Currently, many imbalanced-learn samplers require that an estimator explicitly inherit from scikit-learn Mixin classes in order to determine what kind of estimator it is; this check is often performed by check_neighbors_object().

This inheritance-based programming model provides consistency, but it also limits flexibility. Switching these checks to duck typing rather than inheritance would preserve that consistency while allowing users to pair imbalanced-learn with other scikit-learn API-compliant libraries that cannot directly subclass scikit-learn, such as cuML. Using cuML with imbalanced-learn can be significantly faster: the examples below achieve roughly 180x and 680x speedups with the example configuration.

Additional Context

A key limitation today is that users cannot cleanly integrate estimators from libraries that do not directly depend on scikit-learn but adhere to the same API contract.

A duck-typing-based programming model would make imbalanced-learn more flexible; duck typing in the PyData ecosystem is becoming increasingly powerful as more libraries standardize on numpy- and scikit-learn-like API contracts rather than inheritance (Dask, xarray, Pangeo, CuPy, RAPIDS, and more).

The TPOT community recently adopted duck typing to allow GPU-accelerated cuML estimators and has seen great results: https://github.com/EpistasisLab/tpot/blob/master/tutorials/Higgs_Boson.ipynb

Proposal

In check_neighbors_object(), convert the Mixin class check to instead check key attributes that determine whether an object is a certain type of estimator:

Instead of checking for KNeighborsMixin, check for the key attributes that make an object KNeighborsMixin-like (see the sketch after this list):

  • kneighbors()
  • kneighbors_graph()
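
A minimal sketch of what the duck-typed helper could look like (the structure mirrors the existing check_neighbors_object(); the exact implementation here is illustrative, not final):

```python
# Sketch of the proposed duck-typed check; illustrative only.
from numbers import Integral

from sklearn.base import clone
from sklearn.neighbors import NearestNeighbors


def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
    """Return a nearest-neighbours estimator from an int or an object."""
    if isinstance(nn_object, Integral):
        return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)
    # Duck typing: accept any object that quacks like KNeighborsMixin,
    # rather than requiring `isinstance(nn_object, KNeighborsMixin)`.
    if all(hasattr(nn_object, attr) for attr in ("kneighbors", "kneighbors_graph")):
        # clone() only requires the scikit-learn get_params/set_params
        # contract, which API-compliant libraries such as cuML satisfy.
        return clone(nn_object)
    raise ValueError(
        f"{nn_name} must be an int or an object with 'kneighbors' and "
        "'kneighbors_graph' methods."
    )
```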

If there is potential interest, I'd be happy to open up a pull request for further discussion.

Potential Impact

Adopting duck typing for these checks would immediately allow using cuML (a GPU-accelerated machine learning library with a scikit-learn-like API) with imbalanced-learn. It would also open the door to other libraries, current and future, without requiring any knowledge of or involvement from imbalanced-learn.
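
For instance, once such a check is in place, a GPU-backed neighbours estimator could be passed directly to a sampler. A sketch, assuming cuML and a CUDA-capable GPU are available (SMOTE's k_neighbors parameter already accepts an estimator object in place of an int):

```python
# Sketch: pairing a cuML NearestNeighbors with imbalanced-learn's SMOTE.
# Assumes the duck-typed check has landed and cuML is installed.
from cuml.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy dataset.
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# n_neighbors=6 matches the default k_neighbors=5: when given an int,
# imbalanced-learn adds one extra neighbour internally, while an
# estimator object is used as provided.
knn = NearestNeighbors(n_neighbors=6)
X_res, y_res = SMOTE(k_neighbors=knn).fit_resample(X, y)
```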

As a concrete example of the benefit, using cuML estimators on a GPU instead of scikit-learn can provide a large speedup, and faster estimators can make a significant difference in total runtime. The benchmark graphs below give two small examples (the reproducing gist is linked at the bottom) that correspond to 180x and 680x speedups using cuML on a GPU versus the default scikit-learn on CPU cores.

This proposed change would affect a number of samplers, allowing the integration of estimators that do not directly subclass scikit-learn. The list includes the ADASYN, SMOTE, BorderlineSMOTE, SMOTENC, EditedNearestNeighbours, and RepeatedEditedNearestNeighbours samplers.

Other samplers, such as CondensedNearestNeighbour, KMeansSMOTE, SMOTEN, and SMOTEENN, incorporate additional steps in their _validate_estimator() methods and will require further modifications to work seamlessly with estimators that do not directly subclass scikit-learn.

[Benchmark figures: ADASYN and SMOTE runtimes, cuML (GPU) vs scikit-learn (CPU)]

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698 (2.2 GHz, 16 cores) and an NVIDIA V100 32 GB GPU

Benchmarking Code:
https://gist.github.com/NV-jpt/276be3fe57b0ca384dbdabeba4a7e643

glemaitre (Member) commented

We would be happy to integrate such changes. We already adopt this strategy in KMeansSMOTE regarding the KMeans estimator (to easily switch from KMeans to MiniBatchKMeans), so in the same spirit we would be fine accepting something like that for the neighbours. We had already thought about it, since switching to approximate nearest-neighbour algorithms would scale better.
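
For reference, here is what that existing KMeansSMOTE pattern looks like from the user's side (an illustrative sketch; kmeans_estimator already accepts a KMeans-like object in place of an int):

```python
# Sketch: KMeansSMOTE already takes any KMeans-like object through its
# `kmeans_estimator` parameter, e.g. MiniBatchKMeans instead of KMeans.
from imblearn.over_sampling import KMeansSMOTE
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.9, 0.1], random_state=0
)

sampler = KMeansSMOTE(
    kmeans_estimator=MiniBatchKMeans(n_clusters=10, random_state=0),
    random_state=0,
)
# Depending on the data, cluster_balance_threshold may need tuning for
# KMeansSMOTE to find clusters with enough minority samples.
X_res, y_res = sampler.fit_resample(X, y)
```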

Regarding taking advantage of other array-like libraries (dask, cuML), we are also happy to look at it. I think we should still make it optional (we don't want to force a user to install dask or cuML by default). When working on #777, where I wanted to check what it takes to support dask arrays for Random*Sampler, I saw that we indeed need to bypass the array-like validation in scikit-learn, and thus apply the same validation bypass in our base sampler class. I think this is feasible, but it might be the trickiest part.

Bottom line: we would be happy to have the contributions, provided we can get a couple of tests checking that everything works with the backend on one of the CI builds.


NV-jpt commented Sep 2, 2021

Thank you for the positive response! I will open up a pull request for further discussion.

NV-jpt changed the title from "[ENH] Switch estimator operator logical check in check_neighbors_object() to be interface based rather than inheritance based" to "[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based" on Sep 13, 2021

rrfaria commented Nov 16, 2021

It would be awesome if I could use cuML, because right now, with a huge dataset, the standard method takes too long to oversample.

Thank you so much!
