Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT downcast indices dtype when converting sparse arrays #27372

Merged
merged 24 commits into from Oct 30, 2023

Conversation

glemaitre
Copy link
Member

The indices dtype of sparse arrays is different from sparse matrices. This PR modifies check_array to have a consistent behaviour.

The reason to do so is to not have any regression on low-level code that typed indices to be 32-bits precision as seen in #27240. Not that this typing is not only in scikit-learn which makes it more difficult to handle and can lead to regression in the future.

The main issue is the conversion from DIA arrays to CSR/COO arrays.

@github-actions
Copy link

github-actions bot commented Sep 14, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 31eebd7. Link to the linter CI: here

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass of comments:

sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
glemaitre and others added 2 commits September 15, 2023 21:46
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass, mostly about wording, and making the test cases easier to understand.

doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Member Author

@ogrisel I reworked completely the test for the new helper. Sorry to not have not check it.
I parametrize it as well. I think this is more readable.

I also rename the variable in the _ensure_sparse_format. I checked and this is only used one and it is private so it should be a problem.

@glemaitre
Copy link
Member Author

I checked codecov and we should not have an issue here.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of feedback.

Once #27346 is merged in main, we should merge main into this branch and push a commit with [pyodide] in the commit message to check that it works as expected on that platform or skip tests that cannot be run on that platform.

sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass of suggestions to further simplify the code and improve comments + cosmetic changes.

Besides this, LGTM.

sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
# be converted to int32,
(
{"arrays": np.array([1], dtype=np.int64), "check_contents": True},
np.dtype("int32"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.dtype("int32"),
np.int32,

"arrays": np.array([np.iinfo(np.int32).max + 1], dtype=np.uint32),
"check_contents": True,
},
np.dtype("int64"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.dtype("int64"),
np.int64,

"check_contents": True,
"maxval": np.iinfo(np.int32).max + 1,
},
np.dtype("int64"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.dtype("int64"),
np.int64,

"check_contents": True,
"maxval": 1,
},
np.dtype("int64"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.dtype("int64"),
np.int64,

[
({}, np.dtype("int32")), # default behaviour
({"maxval": np.iinfo(np.int32).max}, np.dtype("int32")),
({"maxval": np.iinfo(np.int32).max + 1}, np.dtype("int64")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please do a big search and replace for np.dtype("int32") to just np.int32 and similar for np.dtype("int64")?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

Comment on lines 580 to 608
# With SciPy sparse arrays, conversion from DIA format to COO, CSR, or BSR triggers
# the use of `np.int64` indices even if the data is such that it could be more
# efficiently represented with `np.int32` indices.
# https://github.com/scipy/scipy/issues/19245
# Since not all scikit-learn algorithms support large indices, the following code
# downcasts to `np.int32` indices when it's safe to do so.
if (
sparse_container_type_name == "dia_array"
and changed_format
and accept_sparse[0] in ("csr", "coo")
):
if accept_sparse[0] == "csr":
index_dtype = _smallest_admissible_index_dtype(
arrays=(sparse_container.indptr, sparse_container.indices),
maxval=max(sparse_container.nnz, sparse_container.shape[1]),
check_contents=True,
)
sparse_container.indices = sparse_container.indices.astype(
index_dtype, copy=False
)
sparse_container.indptr = sparse_container.indptr.astype(
index_dtype, copy=False
)
else: # accept_sparse[0] == "coo"
index_dtype = _smallest_admissible_index_dtype(
maxval=max(sparse_container.shape)
)
sparse_container.row = sparse_container.row.astype(index_dtype, copy=False)
sparse_container.col = sparse_container.col.astype(index_dtype, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to _fixes.py with a clear comment as when we can remove this? aka when the next scipy release becomes our minimum requirement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@glemaitre
Copy link
Member Author

@adrinjalali Would you be able to merge this PR after that I moved the code into fixes.py?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE!

@adrinjalali adrinjalali merged commit 56f6477 into scikit-learn:main Oct 30, 2023
27 checks passed
@glemaitre
Copy link
Member Author

Thanks @adrinjalali

RUrlus pushed a commit to RUrlus/scikit-learn that referenced this pull request Oct 30, 2023
…arn#27372)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2023
…arn#27372)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…arn#27372)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants