[ENH] Series scitype support for Polars #6485

Open
wants to merge 14 commits into
base: main

pranavvp16
Contributor

This PR adds Series scitype support for polars (#5423).

@fkiraly fkiraly added the module:datatypes and enhancement labels May 27, 2024
@pranavvp16 pranavvp16 marked this pull request as ready for review May 29, 2024 13:25
@pranavvp16
Contributor Author

I'm unsure what the naming convention should be for the polars convert_dict and for the polars mtype overall: polars.DataFrame or pl.DataFrame. Which one is preferred?

@pranavvp16 pranavvp16 requested a review from fkiraly June 4, 2024 08:47
@fkiraly
Collaborator

fkiraly commented Jun 4, 2024

I'm unsure what the naming convention should be for the polars convert_dict and for the polars mtype overall: polars.DataFrame or pl.DataFrame. Which one is preferred?

There is no fixed naming convention; pl.DataFrame works.

Collaborator

@fkiraly fkiraly left a comment

Overall, this looks reasonable, nice! I think you added all necessary objects to add the new data specification.

One issue that is still valid is that you are modifying the logic for the same method used for Table. This seems to make some metadata fields used by the Table based polars mtype wrong, or at least some redundant computations are executed.

What is index_cols, in the newly added code, in the Table case? Should we not rather skip the new logic entirely in that case?

@pranavvp16
Contributor Author

Overall, this looks reasonable, nice! I think you added all necessary objects to add the new data specification.

One issue that is still valid is that you are modifying the logic for the same method used for Table. This seems to make some metadata fields used by the Table based polars mtype wrong, or at least some redundant computations are executed.

What is index_cols, in the newly added code, in the Table case? Should we not rather skip the new logic entirely in that case?

Updated the code so that index_cols is not computed if the scitype is Table. In that case index_cols remains an empty list and does not affect the metadict; index columns are only searched for when the scitype is Series, Panel, or Hierarchical.
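A minimal sketch of the gating described above, not the PR's actual code. The helper name `get_index_cols`, the constants, and the use of a plain list of column names in place of a polars frame are assumptions made for illustration; the `__index__` prefix follows the column naming convention discussed in this thread.

```python
# Hypothetical helper: a list of column names stands in for a polars frame.
INDEX_PREFIX = "__index__"
SCITYPES_WITH_INDEX = ("Series", "Panel", "Hierarchical")


def get_index_cols(columns, scitype):
    """Return the index-like column names, or [] for scitypes without an index."""
    if scitype not in SCITYPES_WITH_INDEX:
        # Table case: skip the search entirely, so index_cols stays empty
        return []
    return [col for col in columns if col.startswith(INDEX_PREFIX)]


columns = ["__index__time", "a", "b"]
series_index_cols = get_index_cols(columns, "Series")
table_index_cols = get_index_cols(columns, "Table")
```

With this shape, the Table branch never inspects the columns, so no index-related metadata is derived for it.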

Collaborator

@fkiraly fkiraly left a comment

Looks good!

Can you please add polars to the all_extras dependency sets explicitly, so we also test these?

Another change request, for discussion: if no index column is present in the Series case, we should interpret as range index. Do you think that is a sound choice, or is this going to complicate things?

@pranavvp16
Contributor Author

Looks good!

Can you please add polars to the all_extras dependency sets explicitly, so we also test these?

Another change request, for discussion: if no index column is present in the Series case, we should interpret as range index. Do you think that is a sound choice, or is this going to complicate things?

The tests will fail for metadata_inference as we don't have the is_equally_spaced key in the dict. I think I'll need to add it. Should I add it in this PR?
Also adding a range index is a good idea and won't cause any trouble.

@pranavvp16 pranavvp16 requested a review from fkiraly June 10, 2024 15:33
@fkiraly
Collaborator

fkiraly commented Jun 10, 2024

Also adding a range index is a good idea and won't cause any trouble.

Apologies, I did not spot this - what I meant is, if a polars.DataFrame is passed and it has no index, we consider it having a RangeIndex, but we do not add it physically.

This is related to the relation between abstract data type and implementation:
https://en.wikipedia.org/wiki/Abstract_data_type#Implementation

If you have not seen ADT before, I recommend reading up quickly on them.

In that framework, the ADT is a DataFrame with index. As polars DataFrames do not have a native index, we need to assign an interpretation of an index - but we do not need to assign the index physically.
I was suggesting that in the absence of __index__ columns, the polars.DataFrame is interpreted as having RangeIndex.
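The interpretation above could be sketched roughly as follows. This is an illustration under assumptions, not sktime code: `implied_range_index` is a hypothetical name, and a list of column names plus a row count stands in for a polars.DataFrame.

```python
def implied_range_index(columns, n_rows):
    """Return the index implied by the absence of __index__ columns.

    If at least one column name starts with "__index__", the frame carries
    its own index and None is returned. Otherwise the frame is interpreted
    as having a range index 0..n_rows-1, without adding any column.
    """
    has_index_col = any(col.startswith("__index__") for col in columns)
    if has_index_col:
        return None
    return range(n_rows)


no_index = implied_range_index(["a", "b"], 3)
with_index = implied_range_index(["__index__", "a"], 3)
```

Nothing is written back to the frame here: the range exists only as the interpretation used by checks and converters.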

However, I can also see how it makes sense to make the internals uniform, in this case adding an actual range index column.
If you would like to do that, let's do it as follows:

  • add this logic to the identity conversion function, polars.DataFrame to polars.DataFrame
  • do not require an __index__ column to be present in the checks
  • ensure to add tests
  • do not add this in the Table type
  • do not mutate the input table, also add tests for this

@pranavvp16
Contributor Author

  • What I infer is that I should create a new function and move the logic that adds __index__ into an identity function, i.e., polars_dataframe_to_polars_dataframe.
  • I'm unable to understand which checks don't require __index__ to be present. Can you be more specific, or please comment on the check in code?
  • Do not mutate the input table means creating a deep copy? So the input table remains as it is, while internally it is assigned the __index__ columns if it doesn't contain any.
  • Last thing regarding failing tests, the pandas to polars conversion tests are passing locally in my setup. But looking at the failures I think they are arising from the conversion utilities directly renaming the input pandas columns leading to addition of __index__ in the original input pandas frame. The solution to this could be creating a deep copy of the pandas df so it doesn't rename the columns, or checking whether the __index__ naming convention already exists.

@fkiraly Let me know if I have the correct idea of what you are talking about, and I'll start working on it ASAP

@fkiraly
Collaborator

fkiraly commented Jun 12, 2024

Do not mutate the input table means creating a deep copy?

Making a deepcopy is one way to not mutate inputs, but the requirement is more general and does not always require a deepcopy.

Suppose we have a function my_fun(x, y, z) with arguments x, y, z. "not mutating inputs" is asking for x, y, z to remain unchanged by calls of my_fun. This is especially relevant if the arguments can be mutable objects, and if a function looks like it should not mutate arguments, but in fact does.

In such a case, the mutation is called an unintended "side effect".

Functions that do not mutate arguments and are deterministic are also called "pure functions".

The above contains some keywords that you can search and which are covered extensively by internet sources: python mutable types, side effects, pure functions.
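To make the distinction concrete, here is a small sketch contrasting a mutating function with a pure one. It is illustrative only: the function names are hypothetical, and a dict mapping column names to lists stands in for a data frame.

```python
def add_index_mutating(table):
    """Adds an __index__ column in place: the caller's dict is changed."""
    n_rows = len(next(iter(table.values()), []))
    table["__index__"] = list(range(n_rows))  # side effect on the argument
    return table


def add_index_pure(table):
    """Returns a new dict with an __index__ column; the argument is untouched."""
    n_rows = len(next(iter(table.values()), []))
    # A new outer dict is enough: existing column lists are reused, not modified
    return {"__index__": list(range(n_rows)), **table}


original = {"a": [1, 2, 3]}
result = add_index_pure(original)
keys_after_pure = list(original)

add_index_mutating(original)
keys_after_mutating = list(original)
```

The pure variant makes only a shallow copy of the container, which shows that avoiding mutation does not by itself require a deepcopy.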

@fkiraly
Collaborator

fkiraly commented Jun 12, 2024

I'm unable to understand which checks don't require __index__ to be present. Can you be more specific, or please comment on the check in code?

I think currently the Series checks do not require the index, so there is no code location to point to. I added this as part of a bullet point list which specifies how to deal with the RangeIndex-like case.

@fkiraly
Collaborator

fkiraly commented Jun 12, 2024

Last thing regarding failing tests, the pandas to polars conversion tests are passing locally in my setup. But looking at the failures I think they are arising from the conversion utilities directly renaming the input pandas columns leading to addition of __index__ in the original input pandas frame.

Can you please give a reference, which tests are passing, and what are you running locally (preferably full code)?

My guess is that the issues go away once you ensure that all functions do not mutate inputs.

@pranavvp16
Contributor Author

pranavvp16 commented Jun 12, 2024

@fkiraly I fixed the mutation issue. The functions don't mutate the input dataframe anymore. I think I have addressed all the concerns mentioned above. Series no longer requires an index column.

Collaborator

@fkiraly fkiraly left a comment

Great! Looks like the mutation issue was in code I wrote?

I have reviewed in detail, some change requests:

  • skip is_monotonically in the lazy case, otherwise we need to compute the frame
  • I think the check for n_instances is not correct. It should be the number of unique levels when you remove the time stamp level. For lazy, we also need to return NA whenever we would be forced to compute.
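The n_instances rule in the list above might look like this in outline. This is a sketch under assumptions: `count_instances` is a hypothetical name, index rows are given as tuples with the time stamp as the last level, and the string "NA" is used as a placeholder for the value returned in the lazy case.

```python
def count_instances(index_rows, is_lazy=False):
    """Count unique index values after dropping the last (time stamp) level.

    index_rows: iterable of tuples, one per row, time stamp assumed last.
    is_lazy: if True, counting would force computation, so "NA" is returned.
    """
    if is_lazy:
        return "NA"  # placeholder value, assumed for illustration
    return len({row[:-1] for row in index_rows})


index_rows = [("store_a", 0), ("store_a", 1), ("store_b", 0)]
n_instances = count_instances(index_rows)
n_instances_lazy = count_instances(index_rows, is_lazy=True)
```

For a single-level (Series) index, every row reduces to the empty tuple, so the count is 1 for any non-empty input.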

Style:

  • give the scitypes dict a better name.
  • Assign n_vars = len(index_cols)

Collaborator

@yarnabrina yarnabrina left a comment

(I'm not actively monitoring this work, so only commenting instead of a thorough review.)

  1. I think we should not add dependencies directly without any lower/upper bound specification.
  2. Why is pyarrow getting added in polars support PR?

@pranavvp16
Contributor Author

(I'm not actively monitoring this work, so only commenting instead of a thorough review.)

  1. I think we should not add dependencies directly without any lower/upper bound specification.
  2. Why is pyarrow getting added in polars support PR?

Yes, adding a dependency bound is good practice, and I'm thinking of adding an upper bound for polars based on the current version.
For pyarrow, it is required for the pandas to polars conversion issue thread

@yarnabrina
Collaborator

For pyarrow, it is required for the pandas to polars conversion issue thread

In that case, I feel it makes more sense to add the dependency as "polars[pandas]" instead of a separate specification of "pyarrow". That should be enough.

Ref. https://github.com/pola-rs/polars/blob/760067c75c51a1aa7d09f3194e1bad10e9275e01/py-polars/pyproject.toml#L56

@pranavvp16
Contributor Author

@yarnabrina yes, this was a discussion topic on Discord. I'm also thinking of adding the same. FYI @fkiraly

Collaborator

@fkiraly fkiraly left a comment

Have you checked what the failures mean? They seem to be systematic, and related to the change. I am not sure whether they are relevant, though. The test has a loop over mtypes.

@fkiraly
Collaborator

fkiraly commented Jun 22, 2024

how odd that this is specific to macos.
the "monthly" frequency from pandas seems to be playing a role, perhaps related?

#6574
#6572
Here is the pandas bug report: pandas-dev/pandas#58974
Related: #6245

@pranavvp16 pranavvp16 requested a review from fkiraly June 26, 2024 05:02
@pranavvp16
Contributor Author

I have addressed the comments and requested changes. There's a small concern regarding how polars handles the string datetime format in pandas, as polars only supports datetime objects. The conversion util therefore fails when the index is of type PeriodIndex; it needs to be converted into a DatetimeIndex using to_timestamp. Possible solutions:

  1. Raise error asking user to convert index into Datetime object
  2. Handle string datetime formats internally with to_timestamp

This seems out of scope for this PR, and needs to be tracked with a separate issue.

@fkiraly
Collaborator

fkiraly commented Jun 26, 2024

Hm, the current expectation is that all containers that pass the check are converted without error - it can be a lossy conversion, e.g., losing information about the original index type.

So, instead of leaving this with a breaking error, I would just do to_timestamp in case of PeriodIndex. I would also add a test for this case - seems not tested in the current set of tests.

Collaborator

@fkiraly fkiraly left a comment

Seems fine to me!

Remaining questions:

  • can you make sure with @julian-fong that the column name convention is aligned with skpro?
  • could you make sure that no error is raised for PeriodIndex based pandas, and that that case is covered by a test? You can always add a separate test in one of the test_ files if the examples format is too rigid.

@julian-fong
Contributor

julian-fong commented Jun 26, 2024

Seems fine to me!

Remaining questions:

  • can you make sure with @julian-fong that the column name convention is aligned with skpro?

In skpro, the current convention if the user wishes to retain the index column is to set a new column named __index__. Since there are no multi-indices in skpro tables, it's usually just one column. In this PR, multi-indices will be converted to __index__{name}, so that is pretty much in line with the convention skpro is using. As a side note: I've designed it so that if a user has already passed a name via df.index.name, then we copy that over as the column name in the polars dataframe, without underscores or any "index" text.

As for the columns, that's where different functionality could potentially occur. Since sktime doesn't have multi-index columns, this won't apply, but for knowledge's sake in case it did: the current implementation in skpro is to use double underscores before, after, and between the column levels. For example, a multi-column pandas DataFrame that looks like this:

        target                        
          0.05        0.10        0.25
4    66.772658   78.870513   99.085499
63   22.517743   34.615598   54.830583

will be mapped to

┌──────────────────┬─────────────────┬──────────────────┐
│ __target__0.05__ ┆ __target__0.1__ ┆ __target__0.25__ │
│ ---              ┆ ---             ┆ ---              │
│ f64              ┆ f64             ┆ f64              │
╞══════════════════╪═════════════════╪══════════════════╡

As for single-level column pandas DataFrames, we won't be modifying or including any underscores, so polars outputs would look like:

┌─────────────┐
│ target      │
│ ---         │
│ f64         │
╞═════════════╡
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ …           │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
└─────────────┘
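The column naming rule described above can be sketched as follows. This is an illustration of the convention as stated in this comment, not skpro's actual implementation; `flatten_column_name` is a hypothetical name, and multi-level columns are assumed to arrive as tuples, as pandas represents them.

```python
def flatten_column_name(col):
    """Map a pandas column label to a flat polars column name.

    Multi-level labels (tuples) get double underscores before, after,
    and between the levels; single-level labels are left unchanged.
    """
    if isinstance(col, tuple):
        return "__" + "__".join(str(level) for level in col) + "__"
    return str(col)


multi_level = [("target", 0.05), ("target", 0.10), ("target", 0.25)]
flat_names = [flatten_column_name(col) for col in multi_level]
single_name = flatten_column_name("target")
```

Note that the float level 0.10 is rendered by `str` as 0.1, which matches the `__target__0.1__` header in the table above.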

Hopefully that is more or less what's been designed in sktime? @pranavvp16 let me know what you think!

@pranavvp16
Contributor Author

@julian-fong yes, the index columns are converted to the naming convention __index__{index.name}. Keeping this convention also makes it easy to find which columns actually represent index columns in the polars DataFrame, which helps in back conversion to pandas. For multi-index columns, I haven't seen any such data container in sktime, having multiple values in a single column.

@pranavvp16
Contributor Author

I have made the requested changes and also added tests for DataFrames with a PeriodIndex, both in the Series case and with a multi-index. The conversion util now handles pandas.PeriodIndex internally by converting it to a DatetimeIndex with to_timestamp.

@pranavvp16 pranavvp16 requested a review from fkiraly June 26, 2024 21:23
@julian-fong
Contributor

julian-fong commented Jun 26, 2024

@julian-fong yes, the index columns are converted to the naming convention __index__{index.name}. Keeping this convention also makes it easy to find which columns actually represent index columns in the polars DataFrame, which helps in back conversion to pandas. For multi-index columns, I haven't seen any such data container in sktime, having multiple values in a single column.

It's a good idea; we could adopt the same approach in skpro to keep the "__index__{name}" column naming convention consistent.

Labels
enhancement, module:datatypes
Projects
Status: Under review
4 participants