[docs] Add a prototype estimator for array api enabling #2534

icfaust · 2025-06-11T10:56:49Z

Description

First start by generating code which represents the basic requirements of a sklearnex and onedal estimator.

This PR serves two purposes: to ease understanding of the codebase for external development and standardize development
occurring for array API support.

Next will be to make the necessary doc page links to various aspects to act as a guide for array API development. Which will help in external user contribution.

My goal will be to see if I can get an LLM with this information to generate StandardScaler using BasicStatistics. If it can, that means an LLM can help guide a user with this starting prompt in more difficult scenarios.

PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

I have reviewed my changes thoroughly before submitting this pull request.
I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have added a respective label(s) to PR if I have a permission for that.
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Performance

I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
I have provided justification why performance has changed or why changes are not expected.
I have provided justification why quality metrics have changed or why changes are not expected.
I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

codecov · 2025-06-11T11:48:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag	Coverage Δ
azure	`?`
github	`71.62% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
sklearnex/basic_statistics/basic_statistics.py	`89.70% <ø> (-0.15%)`	⬇️

... and 46 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

david-cortes-intel · 2025-06-11T11:05:37Z

sklearnex/tests/prototypes.py

+#
+# 1) All sklearnex estimators must inherit oneDALestimator and the sklearn
+# estimator that it is replicating (i.e. before in the mro).  If there is
+# not an equivalent sklearn estimator, then sklearn's BaseEstimator must be


Should also inherit from the corresponding type for what the estimator does, like RegressorMixin.

Added snippet to refer to the Mixins, though in most cases that should be handled by the underlying sklearn estimator, we need to be careful with sklearnex-only versions (and therefore good call).

david-cortes-intel · 2025-06-11T11:07:33Z

sklearnex/tests/prototypes.py

+# inherited.
+#
+# 2) ``check_is_fitted`` is required for any method in an estimator which
+# requires first calling ``fit`` or ``partial_fit``. This is a sklearn


It's actually not a requirement to call this specific sklearn function within .fit, only to make the estimator work correctly when that function is called on it:
https://scikit-learn.org/stable/developers/develop.html#developer-api-for-check-is-fitted

david-cortes-intel · 2025-06-11T11:09:24Z

sklearnex/tests/prototypes.py

+#
+# Sklearnex estimator methods can be thought of in 3 major tiers.
+#
+# Tier 1: Those methods which offload to oneDAL using ``dispatch``. Typical


Suggested change

# Tier 1: Those methods which offload to oneDAL using ``dispatch``. Typical

# Tier 1: Methods which offload to oneDAL using ``dispatch``. Typical

david-cortes-intel · 2025-06-11T11:10:33Z

sklearnex/tests/prototypes.py

+# examples are ``fit`` and ``predict``. They use a direct equivalent oneDAL
+# function for evaluation. These methods are of highest priority and have


Suggested change

# examples are ``fit`` and ``predict``. They use a direct equivalent oneDAL

# function for evaluation. These methods are of highest priority and have

# examples are ``fit`` and ``predict``. They use a direct equivalent function

# from oneDAL. These methods are of highest priority and have

david-cortes-intel · 2025-06-11T11:16:27Z

sklearnex/tests/prototypes.py

+    # sklearn, giving the ability to use raw data directly and avoiding
+    # sklearn pre-processing checks as necessary.
+    #
+    # 3) Pybind11 interfaces should be not be made public to the user unless


Suggested change

# 3) Pybind11 interfaces should be not be made public to the user unless

# 3) Pybind11 interfaces should not be be made public to the user unless

david-cortes-intel · 2025-06-11T11:25:29Z

sklearnex/tests/prototypes.py

+    # 4) Either sklearn is called, or a object from onedal is created and called
+    # using the input data. This process is handled in a function which has the
+    # prefix "_onedal_" followed by the method name. When fitting data, the
+    # returned onedal estimator object is stored as the ``_onedal_estimator``


How should serialization work for such objects? Should it all happen in PyBind11?

Thanks for the heads up on that. It is yet to be addressed in the equivalent onedal/tests/prototype.py file, and will be guaranteed to exist when PR is moved from draft.

david-cortes-intel · 2025-06-11T11:26:50Z

sklearnex/tests/prototypes.py

+    # 3) ``_onedal_gpu_supported`` or ``_onedal_cpu_supported`` creates a
+    # PatchingConditionsChain object, takes the input data and estimator
+    # parameters, and evaluates whether the estimator and data can be run using
+    # oneDAL. This information is logged to the `sklearnex` logger.


Logger is not imported in this example.

david-cortes-intel · 2025-06-11T11:28:16Z

sklearnex/tests/prototypes.py

+
+
+################
+# IMPORT NOTES #


Missing the requirement that estimators based off existing sklearn classes should add entries to the support table in the docs when added.

Thanks for the suggestion, added the section "Documentation notes".

david-cortes-intel · 2025-06-11T11:28:46Z

sklearnex/tests/prototypes.py

+# are ``score`` and ``predict_log_proba``. Oftentimes the additional
+# calculations are trivial, meaning benchmarking is not required.
+#
+# Tier 3: Those methods which directly use sklearn functionality. Typically


Missing the mechanism that we're using to inherit docstrings.

david-cortes-intel · 2025-06-11T11:52:28Z

sklearnex/tests/prototypes.py

+
+    def fit(self, X, y):
+        #
+        pass


This one should always return 'self'.

Vika-F

Thank you for adding this example! It makes many aspects of sklearnex implementation much clearer.

It would be also good to place a link to this file somewhere here:
https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/doc/sources/contribute.rst

Another [not super-important, but I have to say about it] thing that bothers me a bit is: how to maintain the validity of the recommendations here? For example, this get_namespace functionality was implemented several months ago. How the developer of a new product-wide decorator or method would know that this file also needs to be updated?

Vika-F · 2025-06-20T17:14:23Z

sklearnex/tests/prototypes.py

+# requires first calling ``fit`` or ``partial_fit``. This is a sklearn
+# requirement.
+#
+# 2) Every estimator should be decorated by ``control_n_jobs`` to properly


Suggested change

# 2) Every estimator should be decorated by ``control_n_jobs`` to properly

# 3) Every estimator should be decorated by ``control_n_jobs`` to properly

And so on.

Vika-F · 2025-06-20T17:26:01Z

sklearnex/tests/prototypes.py

+    # This includes inheriting the proper sklearn Mixin classes depending
+    # on the sklearnex estimator functionality.
+    #
+    # All estimators should be defined in a python file located in a folder


Suggested change

# All estimators should be defined in a python file located in a folder

# All estimators should be defined in a Python file located in a folder

"The name of our favorite programming language is always capitalized."
https://devguide.python.org/documentation/style-guide/#specific-words

Vika-F · 2025-06-20T17:37:14Z

sklearnex/tests/prototypes.py

+    # Sklearnex estimators follow a Matryoshka doll pattern with respect to
+    # the underlying oneDAL library. The sklearnex estimator is a
+    # public-facing API which mimics sklearn. Sklearnex estimators will
+    # create another estimator, defined in the ``onedal`` module, for
+    # having a python interface with oneDAL. Finally, this python object
+    # will use pybind11 to call oneDAL directly via pybind11-generated
+    # objects and functions This is known as the ``backend``. These are
+    # separate entities and do not inherit from one another. The clear
+    # separation has utility so long that the following rules are followed:


Can you bring a bit more structure to this part?
Because I think it is very important for the understanding of overall sklearnex implementation. But it is rather hard to grasp the idea when it is written as a single text block. Though I like the Matryoshka doll association =)

It can be something like:

The sklearnex estimator is a public facing API ...

The onedal estimator ...

The pybind11 backend ...
These are separate entities...

Vika-F · 2025-06-20T17:50:00Z

sklearnex/tests/prototypes.py

+        X, y = validate_data(
+            X, y, dtype=[xp.float64, xp.float32], ensure_all_finite=False
+        )


Maybe a stupid question, but shouldn't it check only dtype=[xp.float64] if only_float64 == True?

Vika-F · 2025-06-20T17:53:08Z

sklearnex/tests/prototypes.py

+        # See sklearn conventions about trailing underscores for fitted
+        # values.
+
+    def _onedal_predict(self, X, y, queue=None):


Suggested change

def _onedal_predict(self, X, y, queue=None):

def _onedal_predict(self, X, queue=None):

Vika-F · 2025-06-20T17:54:56Z

sklearnex/tests/prototypes.py

+
+        # In the ``fit`` method, a python onedal estimator object is
+        # generated.
+        self._onedal_estimator = onedal_PrototypeEstimator(check=True)


Why check=True and not check=check here?
Also, why only_float64 is not passed?

Vika-F · 2025-06-20T17:55:16Z

sklearnex/tests/prototypes.py

+        # The first step is to always acquire the namespace of input data
+        # This is important for getting the proper data types and possibly
+        # other conversions.
+        xp, _ = get_namespace(X, y)


Suggested change

xp, _ = get_namespace(X, y)

xp, _ = get_namespace(X)

Vika-F · 2025-06-20T17:58:34Z

sklearnex/tests/prototypes.py

+
+        elif method_name == "predict":
+            (X,) = data
+            xp = get_namespace(X, y)


Suggested change

xp = get_namespace(X, y)

xp = get_namespace(X)

Vika-F · 2025-06-20T17:59:09Z

sklearnex/tests/prototypes.py

+
+        elif method_name == "predict":
+            (X,) = data
+            xp = get_namespace(X, y)


Suggested change

xp = get_namespace(X, y)

xp = get_namespace(X)

icfaust added 3 commits June 11, 2025 12:48

initial information

3829036

more changes

1a4627c

missing space

90df3e3

david-cortes-intel reviewed Jun 11, 2025

View reviewed changes

icfaust added 9 commits June 11, 2025 14:22

interim changes

c4a6fdc

interim changes

b54fe71

interim update

befe555

step before adding tags and explanation

8f08da0

this will probably require splitting into a separate PR

7c06762

forgot return

2ab1084

fixes for CI

b34e741

move to follow proper development

7a4df2c

add docs

23a15d3

Vika-F reviewed Jun 20, 2025

View reviewed changes

	# Tier 1: Those methods which offload to oneDAL using ``dispatch``. Typical
	# Tier 1: Methods which offload to oneDAL using ``dispatch``. Typical

		# examples are ``fit`` and ``predict``. They use a direct equivalent oneDAL
		# function for evaluation. These methods are of highest priority and have

	# 3) Pybind11 interfaces should be not be made public to the user unless
	# 3) Pybind11 interfaces should not be be made public to the user unless

	# 2) Every estimator should be decorated by ``control_n_jobs`` to properly
	# 3) Every estimator should be decorated by ``control_n_jobs`` to properly

	# All estimators should be defined in a python file located in a folder
	# All estimators should be defined in a Python file located in a folder

	def _onedal_predict(self, X, y, queue=None):
	def _onedal_predict(self, X, queue=None):

[docs] Add a prototype estimator for array api enabling #2534

Are you sure you want to change the base?

[docs] Add a prototype estimator for array api enabling #2534

Uh oh!

Conversation

icfaust commented Jun 11, 2025

Description

Uh oh!

codecov bot commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Vika-F left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Jun 11, 2025 •

edited

Loading