[enhancement] accelerate array_api inputs for sklearnex's validate_data
and _check_sample_weight
#2296
Description
Continues dlpack work from #2275.
`validate_data` and `_check_sample_weight` do not follow standard sklearnex offloading practice: they always compute wherever the data already resides (the computation is extraordinarily simple, so data movement could erase any speedup provided by oneDAL), and they do not patch out sklearn functions. They must therefore be enabled separately for array_api support. Since they are included in every zero-copy array_api-supported algorithm, this is a prerequisite for enabling every other estimator.

Previously this aspect was controlled by looking for the `flags` attribute, which is not part of the array_api standard. The array_api standard does not include Python-facing attributes or methods that show whether an array is C-contiguous or F-contiguous; it does, however, require DLPack support, so the attributes of a DLPack tensor can be checked for the memory layout instead. This PR introduces a special onedal backend function which extracts and checks the necessary memory layout (without taking ownership of the tensor). A Python function is added which first checks for and queries the `flags` or `__dlpack__` attributes. If neither is available, it returns False, triggering sklearn's `_assert_all_finite`. This is done because `to_table` would otherwise attempt to convert the data to a contiguous memory layout, which again would ruin the performance gain.

The PR should start as a draft, then move to the ready-for-review state after CI passes and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with only a docs update doesn't require performance checkboxes, while a PR with any change to actual code should keep those checkboxes and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance