New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT downcast indices dtype when converting sparse arrays #27372
MAINT downcast indices dtype when converting sparse arrays #27372
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass of comments:
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another pass, mostly about wording, and making the test cases easier to understand.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel I reworked completely the test for the new helper. Sorry to not have not check it. I also rename the variable in the |
I checked codecov and we should not have an issue here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of feedback.
Once #27346 is merged in main
, we should merge main
into this branch and push a commit with [pyodide]
in the commit message to check that it works as expected on that platform or skip tests that cannot be run on that platform.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another pass of suggestions to further simplify the code and improve comments + cosmetic changes.
Besides this, LGTM.
# be converted to int32, | ||
( | ||
{"arrays": np.array([1], dtype=np.int64), "check_contents": True}, | ||
np.dtype("int32"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.dtype("int32"), | |
np.int32, |
"arrays": np.array([np.iinfo(np.int32).max + 1], dtype=np.uint32), | ||
"check_contents": True, | ||
}, | ||
np.dtype("int64"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.dtype("int64"), | |
np.int64, |
"check_contents": True, | ||
"maxval": np.iinfo(np.int32).max + 1, | ||
}, | ||
np.dtype("int64"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.dtype("int64"), | |
np.int64, |
"check_contents": True, | ||
"maxval": 1, | ||
}, | ||
np.dtype("int64"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.dtype("int64"), | |
np.int64, |
[ | ||
({}, np.dtype("int32")), # default behaviour | ||
({"maxval": np.iinfo(np.int32).max}, np.dtype("int32")), | ||
({"maxval": np.iinfo(np.int32).max + 1}, np.dtype("int64")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please do a big search and replace for np.dtype("int32")
to just np.int32
and similar for np.dtype("int64")
?
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
sklearn/utils/validation.py
Outdated
# With SciPy sparse arrays, conversion from DIA format to COO, CSR, or BSR triggers | ||
# the use of `np.int64` indices even if the data is such that it could be more | ||
# efficiently represented with `np.int32` indices. | ||
# https://github.com/scipy/scipy/issues/19245 | ||
# Since not all scikit-learn algorithms support large indices, the following code | ||
# downcasts to `np.int32` indices when it's safe to do so. | ||
if ( | ||
sparse_container_type_name == "dia_array" | ||
and changed_format | ||
and accept_sparse[0] in ("csr", "coo") | ||
): | ||
if accept_sparse[0] == "csr": | ||
index_dtype = _smallest_admissible_index_dtype( | ||
arrays=(sparse_container.indptr, sparse_container.indices), | ||
maxval=max(sparse_container.nnz, sparse_container.shape[1]), | ||
check_contents=True, | ||
) | ||
sparse_container.indices = sparse_container.indices.astype( | ||
index_dtype, copy=False | ||
) | ||
sparse_container.indptr = sparse_container.indptr.astype( | ||
index_dtype, copy=False | ||
) | ||
else: # accept_sparse[0] == "coo" | ||
index_dtype = _smallest_admissible_index_dtype( | ||
maxval=max(sparse_container.shape) | ||
) | ||
sparse_container.row = sparse_container.row.astype(index_dtype, copy=False) | ||
sparse_container.col = sparse_container.col.astype(index_dtype, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to _fixes.py
with a clear comment as when we can remove this? aka when the next scipy release becomes our minimum requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
@adrinjalali Would you be able to merge this PR after that I moved the code into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NICE!
Thanks @adrinjalali |
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
The indices dtype of sparse arrays is different from sparse matrices. This PR modifies
check_array
to have a consistent behaviour.The reason to do so is to not have any regression on low-level code that typed indices to be 32-bits precision as seen in #27240. Not that this typing is not only in scikit-learn which makes it more difficult to handle and can lead to regression in the future.
The main issue is the conversion from DIA arrays to CSR/COO arrays.