RFC _convert_container
#28681
Thanks for the PR @Charlie-XIAO. Here are some comments. I'm +1 for the refactoring of `_convert_container`.
Once the CIs are green, I'll review ;) (I am buying a bit of time).
Now I have resolved all suggestions from Jérémie, and CI is green :)
I'm going to resolve the conflicts and make a review.
A couple of comments before I look at the tests.
if dtype is not None:
    container = container.astype(dtype, copy=False)
else:
    container = np.asarray(container, dtype=dtype)
If I get it right, we are going to convert any dataframe to an array at this stage. I'm not sure this is wise: I could imagine converting a pandas dataframe into a polars dataframe, and that should not need a round trip through NumPy, should it?
In a way, I would delay calling `asarray` as much as possible, to be sure that we actually need it.
I didn't think that much of performance given this is a testing utility, but your point is valid. This would mean refactoring and I will do it when I get some full block of time :)
> I didn't think that much of performance given this is a testing utility, but your point is valid.

I'm a bit more scared about the `dtype` casting, which will enforce a single dtype, while we might be interested in a dataframe with heterogeneous dtypes and just want to test both pandas and polars. Passing through NumPy in the middle could make it a mess :)
Very true, missed this point.
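
The dtype concern raised in this thread can be illustrated with a small sketch. It uses only NumPy and plain Python lists as a stand-in for a dataframe with heterogeneous columns; the column names and values are made up for illustration and are not taken from the PR.

```python
import numpy as np

# Hypothetical heterogeneous "table": an integer, a string and a float column.
columns = {
    "id": [1, 2],
    "label": ["a", "b"],
    "score": [0.5, 1.5],
}

# Converting the whole table at once forces a single common dtype.
rows = [list(row) for row in zip(*columns.values())]
as_one_array = np.asarray(rows)
print(as_one_array.dtype.kind)  # "U": every value became a string

# Converting column by column keeps one dtype per column.
per_column = {name: np.asarray(values) for name, values in columns.items()}
print({name: col.dtype.kind for name, col in per_column.items()})
```

A dataframe-to-dataframe conversion that avoids the intermediate array would not hit this coercion, which is the motivation for delaying `asarray`.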
Follow-up to #28521, see #28521 (comment).
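This PR splits `constructor_name` into `constructor_type` and `constructor_lib`, and adds two parameters `sparse_container` and `sparse_format` to reflect dense versus sparse. /cc @glemaitre

Mapping from the original `constructor_name` to the specifications in this PR

| Original `constructor_name` | `constructor_type` | `constructor_lib` | `sparse_container` | `sparse_format` |
|---|---|---|---|---|
| (default) | | `"pandas"` | `None` | `"csr"` |
| `"list"` | `"list"` | | | |
| `"tuple"` | `"tuple"` | | | |
| `"slice"` | `"slice"` | | | |
| `"array"` | `"array"` | | | |
| `"sparse"` / `"sparse_csr"` | `"array"` | | `"matrix"` | |
| `"sparse_csc"` | `"array"` | | `"matrix"` | `"csc"` |
| `"sparse_csr_array"` | `"array"` | | `"array"` | |
| `"sparse_csc_array"` | `"array"` | | `"array"` | `"csc"` |
| `"dataframe"` / `"pandas"` | `"dataframe"` | | | |
| `"polars"` | `"dataframe"` | `"polars"` | | |
| `"pyarrow"` | `"dataframe"` | `"pyarrow"` | | |
| `"series"` | `"series"` | | | |
| `"polars_series"` | `"series"` | `"polars"` | | |
| `"index"` | `"index"` | | | |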
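
As a rough sketch, the mapping above can be written as a lookup table. The helper name `translate_constructor_name` and both dictionaries are hypothetical and not part of the PR; the sketch also assumes that a parameter not listed for a given name keeps its default (`"pandas"`, `None`, `"csr"`).

```python
# Assumed defaults of the new parameters (constructor_type has no default).
DEFAULTS = {
    "constructor_lib": "pandas",
    "sparse_container": None,
    "sparse_format": "csr",
}

# Hypothetical lookup: legacy constructor_name -> non-default new parameters.
LEGACY_TO_NEW = {
    "list": {"constructor_type": "list"},
    "tuple": {"constructor_type": "tuple"},
    "slice": {"constructor_type": "slice"},
    "array": {"constructor_type": "array"},
    "sparse": {"constructor_type": "array", "sparse_container": "matrix"},
    "sparse_csr": {"constructor_type": "array", "sparse_container": "matrix"},
    "sparse_csc": {
        "constructor_type": "array",
        "sparse_container": "matrix",
        "sparse_format": "csc",
    },
    "sparse_csr_array": {"constructor_type": "array", "sparse_container": "array"},
    "sparse_csc_array": {
        "constructor_type": "array",
        "sparse_container": "array",
        "sparse_format": "csc",
    },
    "dataframe": {"constructor_type": "dataframe"},
    "pandas": {"constructor_type": "dataframe"},
    "polars": {"constructor_type": "dataframe", "constructor_lib": "polars"},
    "pyarrow": {"constructor_type": "dataframe", "constructor_lib": "pyarrow"},
    "series": {"constructor_type": "series"},
    "polars_series": {"constructor_type": "series", "constructor_lib": "polars"},
    "index": {"constructor_type": "index"},
}


def translate_constructor_name(name):
    """Return the full set of new keyword arguments for a legacy name."""
    if name not in LEGACY_TO_NEW:
        raise ValueError(f"Unknown constructor_name: {name!r}")
    # Start from the defaults and apply the per-name overrides on top.
    return {**DEFAULTS, **LEGACY_TO_NEW[name]}


print(translate_constructor_name("sparse_csc"))
print(translate_constructor_name("polars_series"))
```

A table like this could help when migrating existing test parametrizations from the old single string to the new keyword arguments.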
Some rules of when each parameter is applicable

- `sparse_container` has no effect unless `constructor_type="array"`
- `sparse_format` has no effect unless `constructor_type="array"` and `sparse_container` is not `None`
- `constructor_lib` has no effect unless `constructor_type` is one of `"dataframe"`, `"series"`, and `"index"`
- `min_version` has no effect unless `constructor_lib` is used
- `column_names` (renamed from `columns_name`) has no effect unless `constructor_type="dataframe"`
- `categorical_feature_names` has no effect unless `constructor_type="dataframe"`
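
The applicability rules above can be sketched as a small checker that reports which explicitly passed parameters would be ignored. The function `ineffective_parameters` is hypothetical and not part of the PR, and it reads "`constructor_lib` is used" as "`constructor_type` is `"dataframe"`, `"series"`, or `"index"`", which is an interpretation.

```python
# constructor_type values for which constructor_lib matters (per the rules).
LIB_TYPES = {"dataframe", "series", "index"}


def ineffective_parameters(constructor_type, **passed):
    """Return the sorted names of passed parameters that would have no effect.

    Only parameters given explicitly in ``passed`` are examined.
    """
    ignored = set()
    is_array = constructor_type == "array"
    lib_used = constructor_type in LIB_TYPES

    if "sparse_container" in passed and not is_array:
        ignored.add("sparse_container")
    # sparse_format also needs a non-None sparse_container to matter.
    if "sparse_format" in passed and not (
        is_array and passed.get("sparse_container") is not None
    ):
        ignored.add("sparse_format")
    if "constructor_lib" in passed and not lib_used:
        ignored.add("constructor_lib")
    if "min_version" in passed and not lib_used:
        ignored.add("min_version")
    for name in ("column_names", "categorical_feature_names"):
        if name in passed and constructor_type != "dataframe":
            ignored.add(name)
    return sorted(ignored)


print(ineffective_parameters("array", sparse_format="csc"))
print(ineffective_parameters("series", constructor_lib="polars", column_names=["a"]))
```

Returning the ignored names instead of raising keeps the sketch neutral on whether ignored parameters should be an error or be silently accepted; the PR text does not say.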
Constraints on the shape of the input container

- `constructor_type="slice"` requires a 1-dimensional container with exactly 2 elements
- `constructor_type="dataframe"` requires a 2-dimensional container (previously it converts `(n,)` shape to `(n, 1)` shape)
- `constructor_type="series"` requires a 1-dimensional container (previously it creates a series with each element being a list when given a 2-dimensional container)
- `constructor_type="index"` requires a 1-dimensional container (similar to `"series"`)
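
The shape constraints above could be enforced along these lines. This is a minimal sketch for nested lists and tuples only; `nested_ndim` and `check_container_shape` are hypothetical names, and the error messages are illustrative rather than what the PR emits.

```python
# Required number of dimensions per constructor_type, taken from the list above.
EXPECTED_NDIM = {"slice": 1, "dataframe": 2, "series": 1, "index": 1}


def nested_ndim(container):
    """Return the nesting depth of a rectangular list/tuple (scalars are 0)."""
    ndim = 0
    while isinstance(container, (list, tuple)):
        ndim += 1
        if not container:  # empty level: cannot descend further
            break
        container = container[0]
    return ndim


def check_container_shape(container, constructor_type):
    """Raise ValueError when the container violates the stated constraint."""
    expected = EXPECTED_NDIM.get(constructor_type)
    if expected is None:
        return  # no shape constraint listed for this constructor_type
    ndim = nested_ndim(container)
    if ndim != expected:
        raise ValueError(
            f"constructor_type={constructor_type!r} requires a "
            f"{expected}-dimensional container, got {ndim} dimension(s)"
        )
    if constructor_type == "slice" and len(container) != 2:
        raise ValueError(
            "constructor_type='slice' requires exactly 2 elements, "
            f"got {len(container)}"
        )


check_container_shape([0, 5], "slice")  # passes
check_container_shape([[1, 2], [3, 4]], "dataframe")  # passes
try:
    # A 1-dimensional input is rejected instead of being reshaped to (n, 1).
    check_container_shape([1, 2, 3], "dataframe")
except ValueError as exc:
    print(exc)
```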