-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Series scitype support for Polars #6485
base: main
Are you sure you want to change the base?
Conversation
I'm confused about what should be the naming convention for polars |
There is no fixed naming convention, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks reasonable, nice! I think you added all necessary objects to add the new data specification.
One issue that is still valid is that you are modifying the logic for the same method used for Table
. This seems to make some metadata fields used by the Table
based polars mtype wrong, or at least some redundant computations are executed.
What is index_cols
, in the newly added code, in the Table
case? Should we not rather skip the new logic entirely in that case?
Updated the code which doesn't compute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Can you please add polars
to the all_extras
dependency sets explicitly, so we also test these?
Another change request, for discussion: if no index column is present in the Series
case, we should interpret as range index. Do you think that is a sound choice, or is this going to complicate things?
The tests will fail for |
Apologies, I did not spot this - what I meant is, if a This is related to the relation between abstract data type and implementation: If you have not seen ADT before, I recommend reading up quickly on them. In that framework, the ADT is a DataFrame with index. As polars df do not have native index, we need to assign an interpretation of an index - but we do not need to assign the index physically. However, I can also see how it makes sense to make the internals uniform, in this case adding an actual range index column.
|
@fkiraly Let me know if I have the correct idea of what you are talking about, and I'll start working on it ASAP |
Making a deepcopy is one way to not mutate inputs, but the requirement is more general and does not require making a deepcopy in general. Suppose we have a function In such a case, the mutation is called an unintended "side effect". Functions that do not mutate arguments and are deterministic are also called "pure functions". The above contains some keywords that you can search and which are covered extensively by internet sources: python mutable types, side effects, pure functions. |
I think currently the |
Can you please give a reference, which tests are passing, and what are you running locally (preferably full code)? My guess is that the issues go away once you ensure that all functions do not mutate inputs. |
@fkiraly I fixed the mutation issue. The functions don't mutate the input dataframe anymore. I think I have addressed all the concerns mentioned above. Series doesn't require to have a index column anymore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Looks like the mutation issue was in code I wrote?
I have reviewed in detail, some change requests:
- skip
is_monotonically
in thelazy
case, otherwise we need to compute the frame - I think the check for
n_instances
is not correct. It should be the number of unique levels when you remove the time stamp level. For lazy, we also need to returnNA
whenever we would be forced tocompute
.
Style:
- give the
scitypes
dict a better name. - Assign
n_vars
=len(index_cols)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm not actively monitoring this work, so only commenting instead of a thorough review.)
- I think we should not add dependencies directly without any lower/upper bound specification.
- Why is pyarrow getting added in polars support PR?
Yes adding dependency bound will be a good practice, and I'm thinking to add a upper bound for polars according to the current version. |
In that case, I feel it makes more sense to add the dependency as "polars[pandas]" instead a separate specification of "pyarrow". That should be enough. |
@yarnabrina yes this was a discussion topic on discord. I'm also thinking to add the same. FYI @fkiraly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked what the failures mean? They seem to be systematic, and related to the change. I am not sure whether they are relevant, though. The test has a loop over mtypes.
how odd that this is specific to macos. #6574 |
I have addressed the comments and requested changes. There's a small concern regarding polars behaviour with string datetime format in pandas, as polars only supports datetime object. Therefore the conversion util fails when index is of type
This seems out of scope for this PR, and needs to be tracked with a separate issue. |
Hm, the current expectation is that all containers that pass the check are converted without error - it can be a lossy conversion, e.g., losing information about the original index type. So, instead of leaving this with a breaking error, I would just do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine to me!
Remaining quesions:
- can you make sure with @julian-fong that the column name convention is aligned with
skpro
? - could you make sure that no error is raised for
PeriodIndex
basedpandas
, and that that case is covered by a test? You can always add a separate test in one of thetest_
files if the examples format is too rigid.
In As for the columns, thats where different functionality could potentially occur. Since sktime doesn't have multi-index columns, this won't apply, but for knowledge's sake in case there was: current implementation in skpro is to use double undercodes before, after, and separating index column levels. For example: A multi-column pandas dataframe that looks like this:
will be mapped to
As for single level column pandas Dataframes, we won't be modifying or including any underscores, so polars outputs would look like:
Hopefully that is more or less whats been designed in |
@julian-fong yess the index columns are converted to naming convention of |
I have made the requested changes and also added tests for dataframe with PeriodIndex in Series as well as Multi-index. Now the conversion util internally handles |
Its a good idea, we could adopt a similar idea for |
This PR adds series scitype support for polars . #5423