[CI, enhancement] add pytorch+gpu testing ci #2494
Conversation
Codecov Report
Attention: Patch coverage is
Flags with carried forward coverage won't be shown.
... and 18 files with indirect coverage changes
onedal/datatypes/_data_conversion.py
Outdated
@@ -18,6 +18,9 @@
from onedal import _default_backend as backend

kDLCPU = backend.kDLCPU
Are those global symbols necessary?
You're highlighting something I did that wasn't so clean. The dlpack data should be on CPU, and the surest way to check that is to compare against the kDLCPU enum. This will only be used in _transfer_to_host, where kDLOneAPI will be used to recognize whether a queue can be extracted. I need to get it into onedal/_device_offload.py; do you think I should import it from the backend there directly? I put it in _data_conversion since it's a data-conversion topic, but that required importing it through 2-3 spots to get it into _device_offload. Let me know what you think.
I would always import it straight from the original source. Let me know if that causes any issues.
Done
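For reference, a minimal sketch of the device check being discussed. It is illustrative only: the helper name _dlpack_device_kind is hypothetical, and the real logic in _transfer_to_host / onedal/_device_offload.py may differ.

from onedal import _default_backend as backend

def _dlpack_device_kind(array):
    # __dlpack_device__ returns a (device_type, device_id) pair per the dlpack
    # protocol; compare the type against the enums exposed by the backend.
    device_type, _ = array.__dlpack_device__()
    if device_type == backend.kDLCPU:
        return "cpu"   # data is already on the host
    if device_type == backend.kDLOneAPI:
        return "sycl"  # a SYCL queue may be extractable from the data
    return "other"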
sklearnex/utils/validation.py
Outdated
# this try-catch is a PyTorch-specific fix, as Tensor.size is a function.
# The try-catch minimizes changes to most common code path (numpy arrays).
try:
    too_small = X.size < 32768
except TypeError:
    too_small = math.prod(X.shape) < 32768
Wouldn't this work?

Suggested change:
too_small = (math.prod(X.shape) if xp is torch else X.size) < 32768
It's definitely cleaner. The only problem is that we would have to import torch, which should be avoided unless absolutely necessary. I could modify your suggestion to use array_api_compat's is_torch_array
(https://github.com/data-apis/array-api-compat/blob/main/array_api_compat/common/_helpers.py#L149). Let me know what you think.
Yeah, so we do this?
is_torch = is_torch_array(X)
too_small = (math.prod(X.shape) if is_torch else X.size) < 32768
Done
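For reference, a self-contained sketch of the agreed-upon check (assuming array_api_compat is available; placement and naming in sklearnex/utils/validation.py may differ, and _too_small_for_daal is a hypothetical helper name):

import math
from array_api_compat import is_torch_array

def _too_small_for_daal(X):
    # torch.Tensor.size is a method rather than an element count, so compute
    # the element count from the shape for torch arrays and use the cheaper
    # .size attribute for everything else (e.g. numpy arrays).
    is_torch = is_torch_array(X)
    return (math.prod(X.shape) if is_torch else X.size) < 32768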
Looks good to me beyond resolution of the open conversations. But this change is large enough that it'd be good to wait until we can validate with internal CI before merging.
Also, huge props for the thorough description; it makes reviewing a lot easier.
.github/workflows/ci.yml
Outdated
bash .ci/scripts/describe_system.sh
- name: Install test requirements
  run: |
    pip install -r dependencies-dev
I believe installation of dependencies-dev can be skipped for the test env.
Oh interesting. You are absolutely right: if we have implemented requirements-test.txt properly, then dependencies-dev shouldn't be necessary. I'll remove it and see what happens. If it works, we would now have a way of publicly testing that they work as intended separately (which we otherwise don't currently do).
Done! Looks like it works.
@icfaust It's a general world-wide GCP outage.
/intelci: run
1 similar comment
/intelci: run
Had to switch to just using
/intelci: run
Squashed commit history: repeated "Update" commits to _dataframes_support.py, conftest.py, run_test.sh, run_sklearn_tests.sh, ci.yml, validation.py, _device_offload.py, _data_conversion.py, __init__.py, table.cpp, and deselected_tests.yaml.
Description
This PR introduces a public GPU CI job to sklearnex. It is not fully featured, but it does provide the first public GPU testing. Due to issues with n_jobs support (which are being addressed in #2364), run times are extremely long but viable. The GPU is currently only used in the sklearn conformance steps, not in sklearnex/onedal testing, because it tests without dpctl installed for GPU offloading. In the future it will extract queues from data in combination with PyTorch, which has Intel GPU capabilities since PyTorch 2.4 (https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html). This will allow GPU testing in the other steps.

This CI is important for at least 3 reasons. sklearn tests array_api using the CuPy, PyTorch and array_api_strict frameworks, and PyTorch is the only non-__sycl_usm_array_interface__ GPU data framework which is expected to work for both sklearn and sklearnex. Therefore 1) it provides an array_api-only GPU testing framework to validate with sklearn conformance, 2) it is likely the first entry point for users who wish to use Intel GPU data natively (due to the user base size), and 3) it validates that sklearnex can properly function without dpctl installed for GPU use, removing limitations on Python versions and dependency stability issues. Note that PyTorch DOES NOT FOLLOW THE ARRAY_API STANDARD; sklearn uses array_api_compat to shoe-horn in PyTorch support. There are quirks associated with PyTorch, and they should be tested by sklearnex. This has impacts on how we design our estimators, as checking for __array_namespace__ is insufficient if we wish to support PyTorch (see the short illustration after this overview).

Unlike other public runners, it takes the strategy of splitting the build and test steps into separate jobs. The test step occurs on a CPU-only runner and on a GPU runner at the same time. It does not use a virtual environment like Conda or venv for simplicity; however, it can reuse all of the previously written infrastructure.

It uses Python 3.12 and sklearn 1.4 for simplicity (i.e. to duplicate other GPU testing systems). This will be updated in a follow-up PR as it becomes further used (likely requiring different deselections).

When successful, a large increase in code coverage should be observed in codecov, as code coverage is also made available. This should be very important for validating the array_api changes coming to the codebase soon, which would otherwise be obscured by dpctl.
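As a short illustration of that point (assuming array_api_compat is installed; get_namespace_for is a hypothetical name), the namespace has to come from the compatibility layer rather than from the attribute alone, since torch tensors do not expose __array_namespace__:

import array_api_compat

def get_namespace_for(x):
    # A bare attribute check misses torch tensors; array_api_compat knows how
    # to wrap torch, numpy, cupy, etc. into array-API-compatible namespaces.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return array_api_compat.array_namespace(x)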
This required the following changes:
- Changes to run_sklearn_tests.sh were required to get the GPU deselections to work publicly.
- assert_all_finite would fail in combination with array_api dispatching; changes are made in daal4py to only use DAAL when the input is numpy or a dataframe. As PyTorch has a different use for the size attribute, changes needed to be made for it.
- Without dpctl, the __dlpack_device__ attribute is used instead, and then asarray if __array__ is available, or from_dlpack because the __dlpack__ attribute is available (a rough sketch follows this list). This required exposing some dlpack enums for verification (related to array-api-compat and enabling array api conformance tests #2079).
- Deselected test_learning_curve_some_failing_fits_warning[42] because of an unknown issue with _intercept_ and SVC on GPU. (Must be investigated)
This will require the following PRs afterwards (by theme):
- Changes to onedal/tests/utils/_dataframes_support.py and onedal/tests/utils/_device_selection.py to have public GPU testing in sklearnex.
- Changes to from_data in onedal/utils/_sycl_queue_manager.py to extract queues from __dlpack__ data (a special PyTorch interface is already in place in pybind11).
- Handle imports of torch, dpnp, and dpctl.tensor due to load times in a centralized way (likely following the strategy laid out in array_api_compat; a possible shape is sketched after this list).
- Investigate SVC and the _intercept_ attribute for the test_learning_curve_some_failing_fits_warning[42] sklearn conformance test.
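One possible shape for that centralized import handling, following the spirit of array_api_compat (a sketch only; the function names are hypothetical, and no heavy framework is imported just to identify data):

import sys

def _is_torch_tensor(x):
    # If torch was never imported by the caller, x cannot be a torch tensor,
    # so consult sys.modules instead of importing torch here.
    torch = sys.modules.get("torch")
    return torch is not None and isinstance(x, torch.Tensor)

def _is_dpctl_usm_ndarray(x):
    dpt = sys.modules.get("dpctl.tensor")
    return dpt is not None and isinstance(x, dpt.usm_ndarray)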
No performance benchmarks necessary.
PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require checkboxes for performance, while a PR with any change to actual code should have checkboxes justifying how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance