Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC _convert_container #28681

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

Charlie-XIAO
Copy link
Contributor

@Charlie-XIAO Charlie-XIAO commented Mar 23, 2024

Follow up of #28521, see #28521 (comment).

This PR splits constructor_name into constructor_type and constructor_lib, and adds two parameters sparse_container and sparse_format to reflect dense versus sparse. /cc @glemaitre

Mapping from the original constructor_name to the specifications in this PR

constructor_type constructor_lib sparse_container sparse_format
Default "pandas" None "csr"
"list" "list"
"tuple" "tuple"
"slice" "slice"
"array" "array"
"sparse"/"sparse_csr" "array" "matrix"
"sparse_csc" "array" "matrix" "csc"
"sparse_csr_array" "array" "array"
"sparse_csc_array" "array" "array" "csc"
"dataframe"/"pandas" "dataframe"
"polars" "dataframe" "polars"
"pyarrow" "dataframe" "pyarrow"
"series" "series"
"polars_series" "series" "polars"
"index" "index"

Some rules of when each parameter is applicable

  • sparse_container has no effect unless constructor_type="array"
  • sparse_format has no effect unless constructor_type="array" and sparse_container is not None
  • constructor_lib has no effect unless constructor_type is one of "dataframe", "series", and "index"
  • min_version has no effect unless constructor_lib is used
  • column_names (renamed from columns_name) has no effect unless constructor_type="dataframe"
  • categorical_feature_names has no effect unless constructor_type="dataframe"

Constraints on the shape of the input container

  • constructor_type="slice" requires 1-dimensional container with exactly 2 elements
  • constructor_type="dataframe" requires 2-dimensional container (previously it converts (n,) shape to (n, 1) shape)
  • constructor_type="series" requires 1-dimensional container (previously it creates series with each element being a list when given 2-dimensional container)
  • constructor_type="index" requires 1-dimensional container (similar to "series")

Copy link

github-actions bot commented Mar 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5d89d71. Link to the linter CI: here

@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review March 24, 2024 04:03
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Charlie-XIAO. Here are some comments. I'm +1 for the refactoring of _convert_container.

sklearn/utils/_testing.py Outdated Show resolved Hide resolved
sklearn/utils/_testing.py Outdated Show resolved Hide resolved
sklearn/utils/_testing.py Outdated Show resolved Hide resolved
sklearn/utils/_testing.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_indexing.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Member

Once the CIs are green, I'll review ;) (I am buying a bit of time).
But indeed, I really like the refactoring. Thanks for taking care of it @Charlie-XIAO.

@glemaitre glemaitre self-requested a review March 28, 2024 06:51
@Charlie-XIAO
Copy link
Contributor Author

Now I have resolved all suggestions from Jérémie, and CI is green :)

@glemaitre
Copy link
Member

I'm going to resolve the conflicts and make a review.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments before to look at the tests.

sklearn/utils/_testing.py Outdated Show resolved Hide resolved
if dtype is not None:
container = container.astype(dtype, copy=False)
else:
container = np.asarray(container, dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get it right, we are going to convert any dataframe to an array at this stage. I'm not sure this is wise since I could imagine converting a dataframe pandas into a dataframe polars but I would not need to go back to numpy, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some way, I would as much as possible delay calling asarray to be sure that we actually need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think that much of performance given this is a testing utility, but your point is valid. This would mean refactoring and I will do it when I get some full block of time :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think that much of performance given this is a testing utility, but your point is valid.

I'm a bit more scared about the dtype casting where will enforce a single dtype while we might be interested in dataframe with heterogeneous dtype and just trying to test for pandas and polars. Passing by NumPy in the middle could make it a mess :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very true, missed this point.

sklearn/utils/_testing.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants