[ENH] Series scitype support for Polars #6485

pranavvp16 · 2024-05-27T16:09:30Z

This PR adds Series scitype support for polars. #5423

sktime/datatypes/_adapter/polars.py

pranavvp16 · 2024-06-04T08:45:17Z

I'm confused about what the naming convention should be for the polars convert_dict and the overall polars mtype: should it be polars.DataFrame or pl.DataFrame? Which one is preferred?

fkiraly · 2024-06-04T17:32:11Z

> I'm confused about what the naming convention should be for the polars convert_dict and the overall polars mtype: should it be polars.DataFrame or pl.DataFrame? Which one is preferred?

There is no fixed naming convention; pl.DataFrame works.

fkiraly

Overall, this looks reasonable, nice! I think you added all necessary objects to add the new data specification.

One issue that is still valid is that you are modifying the logic for the same method used for Table. This seems to make some metadata fields used by the Table based polars mtype wrong, or at least some redundant computations are executed.

What is index_cols, in the newly added code, in the Table case? Should we not rather skip the new logic entirely in that case?

pranavvp16 · 2024-06-05T15:02:54Z

> Overall, this looks reasonable, nice! I think you added all necessary objects to add the new data specification.
>
> One issue that is still valid is that you are modifying the logic for the same method used for Table. This seems to make some metadata fields used by the Table based polars mtype wrong, or at least some redundant computations are executed.
>
> What is index_cols, in the newly added code, in the Table case? Should we not rather skip the new logic entirely in that case?

I updated the code so that index_cols is not computed if the scitype is Table. In that case index_cols remains an empty list, which does not affect the metadata dict; index columns are only searched for when the scitype is Series, Panel, or Hierarchical.
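
A minimal sketch of that branching, assuming index columns are recognised by an `__index__` prefix. `find_index_cols` is a hypothetical helper name used only for illustration, not the actual function in `polars.py`:

```python
def find_index_cols(columns, scitype):
    """Return the index-like column names, skipping the search for Table."""
    if scitype == "Table":
        # Table has no index semantics: index_cols stays an empty list
        return []
    # Series, Panel, Hierarchical: look for columns marked as index columns
    return [col for col in columns if col.startswith("__index__")]


cols = ["__index__", "a", "b"]
print(find_index_cols(cols, "Table"))   # []
print(find_index_cols(cols, "Series"))  # ['__index__']
```

With an empty list in the Table case, any metadata derived from the index columns is simply not touched.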

fkiraly

Looks good!

Can you please add polars to the all_extras dependency sets explicitly, so we also test these?

Another change request, for discussion: if no index column is present in the Series case, we should interpret as range index. Do you think that is a sound choice, or is this going to complicate things?

pranavvp16 · 2024-06-07T11:27:01Z

> Looks good!
>
> Can you please add polars to the all_extras dependency sets explicitly, so we also test these?
>
> Another change request, for discussion: if no index column is present in the Series case, we should interpret as range index. Do you think that is a sound choice, or is this going to complicate things?

The tests will fail for metadata_inference, as we don't have the is_equally_spaced key in the dict. I think I'll need to add it; should I add it in this PR?
Also, adding a range index is a good idea and won't cause any trouble.

fkiraly · 2024-06-10T16:16:10Z

> Also adding a range index is a good idea and won't cause any trouble.

Apologies, I did not spot this - what I meant is, if a polars.DataFrame is passed and it has no index, we consider it having a RangeIndex, but we do not add it physically.

This is related to the relation between abstract data type and implementation:
https://en.wikipedia.org/wiki/Abstract_data_type#Implementation

If you have not seen ADTs before, I recommend reading up quickly on them.

In that framework, the ADT is a DataFrame with an index. As polars DataFrames do not have a native index, we need to assign an interpretation of an index, but we do not need to assign the index physically.
I was suggesting that, in the absence of __index__ columns, the polars.DataFrame is interpreted as having a RangeIndex.

However, I can also see how it makes sense to make the internals uniform, in this case adding an actual range index column.
If you would like to do that, let's do it as follows:

- add this logic to the identity conversion function, polars.DataFrame to polars.DataFrame
- do not require an __index__ column to be present in the checks
- ensure to add tests
- do not add this in the Table type
- do not mutate the input table, also add tests for this
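
To illustrate the intended behaviour, here is a rough sketch. It uses a plain dict of column lists as a stand-in for a polars.DataFrame, and `add_range_index` is a made-up name, so treat it as a description of the requirement rather than the implementation:

```python
def add_range_index(frame, scitype="Series"):
    """Return a frame with an ``__index__`` column, without mutating the input.

    ``frame`` is a plain dict mapping column names to lists, used here only
    as a stand-in for a polars.DataFrame.
    """
    has_index = any(col.startswith("__index__") for col in frame)
    if scitype == "Table" or has_index:
        # nothing to add: return a new dict so the caller's dict is not shared
        return dict(frame)
    n_rows = len(next(iter(frame.values()), []))
    # build a new dict; the input dict itself is never modified
    return {"__index__": list(range(n_rows)), **frame}


X = {"a": [1.0, 2.0, 3.0]}
X_out = add_range_index(X)
print(X_out)  # {'__index__': [0, 1, 2], 'a': [1.0, 2.0, 3.0]}
print(X)      # {'a': [1.0, 2.0, 3.0]}
```

A test for the "do not mutate" requirement would compare the input before and after the call, as the last two lines do informally.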

pranavvp16 · 2024-06-10T21:02:04Z

What I infer is that I should create a new function and move the logic that adds __index__ into an identity function, i.e., polars_dataframe_to_polars_dataframe.
I'm unable to understand which checks should not require __index__ to be present; can you be more specific, or please comment on the check in the code?
Does "do not mutate the input table" mean creating a deep copy? That way the input table remains as it is, while internally it is assigned the __index__ columns if it doesn't contain any.
Last thing, regarding the failing tests: the pandas to polars conversion tests are passing locally in my setup. But looking at the failures, I think they arise from the conversion utilities directly renaming the input pandas columns, which adds __index__ to the original input pandas frame. The solution could be to create a deep copy of the pandas df so that the original columns are not renamed, or to check whether the __index__ naming convention already exists.

@fkiraly Let me know if I have the correct idea of what you are talking about, and I'll start working on it ASAP

fkiraly · 2024-06-12T08:28:21Z

> Do not mutate the input table means creating a deep copy?

Making a deepcopy is one way to not mutate inputs, but the requirement is more general and does not necessarily call for a deepcopy.

Suppose we have a function my_fun(x, y, z) with arguments x, y, z. "not mutating inputs" is asking for x, y, z to remain unchanged by calls of my_fun. This is especially relevant if the arguments can be mutable objects, and if a function looks like it should not mutate arguments, but in fact does.

In such a case, the mutation is called an unintended "side effect".

Functions that do not mutate arguments and are deterministic are also called "pure functions".

The above contains some keywords that you can search and which are covered extensively by internet sources: python mutable types, side effects, pure functions.
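
As a small illustration with a plain Python list (both function names are made up for this example):

```python
def rename_in_place(columns):
    """Impure: mutates the list passed in (an unintended side effect)."""
    columns[0] = "__index__"
    return columns


def rename_pure(columns):
    """Pure: builds a new list and leaves the argument untouched."""
    return ["__index__"] + columns[1:]


cols = ["time", "value"]
rename_pure(cols)
print(cols)  # ['time', 'value']
rename_in_place(cols)
print(cols)  # ['__index__', 'value']
```

Both functions return the same result, but only the second leaves the caller's object as it was, and it does so without any deepcopy.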

fkiraly · 2024-06-12T08:30:37Z

> I'm unable to understand which checks should not require __index__ to be present; can you be more specific, or please comment on the check in the code?

I think currently the Series checks do not require the index, so there is no code location to point to. I added this as part of a bullet point list which specifies how to deal with the RangeIndex-like case.

fkiraly · 2024-06-12T08:31:51Z

> Last thing, regarding the failing tests: the pandas to polars conversion tests are passing locally in my setup. But looking at the failures, I think they arise from the conversion utilities directly renaming the input pandas columns, which adds __index__ to the original input pandas frame.

Can you please give a reference: which tests are passing, and what are you running locally (preferably full code)?

My guess is that the issues go away once you ensure that all functions do not mutate inputs.

pranavvp16 · 2024-06-12T08:53:27Z

@fkiraly I fixed the mutation issue. The functions don't mutate the input dataframe anymore. I think I have addressed all the concerns mentioned above. Series no longer requires an index column.

sktime/datatypes/_adapter/polars.py

fkiraly

Great! Looks like the mutation issue was in code I wrote?

I have reviewed in detail, some change requests:

- skip is_monotonically in the lazy case, otherwise we need to compute the frame
- I think the check for n_instances is not correct. It should be the number of unique levels when you remove the time stamp level. In the lazy case, we also need to return NA whenever we would be forced to compute.
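
A rough sketch of what I mean for n_instances, using tuples as a stand-in for index rows and assuming the time stamp level comes last. `count_instances` and the `"NA"` placeholder are illustrative names only, not the actual metadata conventions:

```python
NA = "NA"  # placeholder for "not available"; the real sentinel may differ


def count_instances(index_rows, lazy=False):
    """Count unique index combinations after dropping the last (time) level."""
    if lazy:
        # in the lazy case, counting would force computation of the frame
        return NA
    return len({row[:-1] for row in index_rows})


rows = [("store_a", "x", 0), ("store_a", "x", 1), ("store_b", "x", 0)]
print(count_instances(rows))             # 2
print(count_instances(rows, lazy=True))  # NA
```

For a plain Series with only a time index, every row reduces to the empty tuple, so the count is 1.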

Style:

- give the scitypes dict a better name.
- Assign n_vars = len(index_cols)

yarnabrina

(I'm not actively monitoring this work, so only commenting instead of a thorough review.)

I think we should not add dependencies directly without any lower/upper bound specification.
Why is pyarrow getting added in polars support PR?

pranavvp16 · 2024-06-19T07:33:53Z

> (I'm not actively monitoring this work, so only commenting instead of a thorough review.)
>
> I think we should not add dependencies directly without any lower/upper bound specification.
>
> Why is pyarrow getting added in polars support PR?

Yes, adding dependency bounds is good practice, and I'm thinking of adding an upper bound for polars based on the current version.
As for pyarrow, it is required for the pandas to polars conversion (see the issue thread).

yarnabrina · 2024-06-19T18:58:29Z

> For pyarrow, it is required for the pandas to polars conversion issue thread

In that case, I feel it makes more sense to add the dependency as "polars[pandas]" instead of a separate specification of "pyarrow". That should be enough.

Ref. https://github.com/pola-rs/polars/blob/760067c75c51a1aa7d09f3194e1bad10e9275e01/py-polars/pyproject.toml#L56

pranavvp16 · 2024-06-19T19:43:00Z

@yarnabrina yes, this was a discussion topic on Discord. I was also planning to add the same. FYI @fkiraly

fkiraly

Have you checked what the failures mean? They seem to be systematic, and related to the change. I am not sure whether they are relevant, though. The test has a loop over mtypes.

fkiraly · 2024-06-22T17:26:50Z

How odd that this is specific to macOS.
The "monthly" frequency from pandas seems to be playing a role, perhaps related?

#6574
#6572
Here is the pandas bug report: pandas-dev/pandas#58974
Related: #6245

pranavvp16 · 2024-06-26T05:16:32Z

I have addressed the comments and requested changes. There's a small concern regarding polars behaviour with the string datetime format in pandas, as polars only supports datetime objects. As a result, the conversion util fails when the index is of type PeriodIndex; it needs to be converted into a DatetimeIndex using to_timestamp. Possible solutions:

- Raise an error asking the user to convert the index into a Datetime object
- Handle string datetime formats internally with to_timestamp

This seems out of scope for this PR, and needs to be tracked in a separate issue.

fkiraly · 2024-06-26T10:39:56Z

Hm, the current expectation is that all containers that pass the check are converted without error - it can be a lossy conversion, e.g., losing information about the original index type.

So, instead of leaving this with a breaking error, I would just do to_timestamp in case of PeriodIndex. I would also add a test for this case - seems not tested in the current set of tests.

sktime/datatypes/_adapter/polars.py

fkiraly

Seems fine to me!

Remaining questions:

- can you make sure with @julian-fong that the column name convention is aligned with skpro?
- could you make sure that no error is raised for PeriodIndex based pandas, and that that case is covered by a test? You can always add a separate test in one of the test_ files if the examples format is too rigid.

julian-fong · 2024-06-26T20:11:04Z

> Seems fine to me!
>
> Remaining questions:
>
> can you make sure with @julian-fong that the column name convention is aligned with skpro?

In skpro, the current convention, if the user wishes to retain the index column, is to store it in a new column named __index__. Since there are no multi-indices in skpro tables, it's usually just one column. In this PR, multi-index levels will be converted to __index__{name}, so that is pretty much in line with the convention skpro is using. As a side note: I've designed it so that if a user has already set a name via df.index.name, we copy that over as the column name in the polars DataFrame, without underscores or any "index" text.

As for the columns, that's where different functionality could potentially occur. Since sktime doesn't have multi-index columns, this won't apply, but for reference, in case it ever does: the current implementation in skpro uses double underscores before, after, and between the column levels. For example, a pandas DataFrame with multi-level columns that looks like this:

        target                        
          0.05        0.10        0.25
4    66.772658   78.870513   99.085499
63   22.517743   34.615598   54.830583

will be mapped to

┌──────────────────┬─────────────────┬──────────────────┐
│ __target__0.05__ ┆ __target__0.1__ ┆ __target__0.25__ │
│ ---              ┆ ---             ┆ ---              │
│ f64              ┆ f64             ┆ f64              │
╞══════════════════╪═════════════════╪══════════════════╡
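
Roughly, the mapping above amounts to the following. This is a simplified sketch with a made-up function name, not the actual skpro code:

```python
def flatten_column(levels):
    """Join column levels with double underscores before, between and after."""
    return "__" + "__".join(str(level) for level in levels) + "__"


columns = [("target", 0.05), ("target", 0.10), ("target", 0.25)]
print([flatten_column(col) for col in columns])
# ['__target__0.05__', '__target__0.1__', '__target__0.25__']
```

Note that 0.10 becomes 0.1 simply because of how Python prints floats.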

As for pandas DataFrames with single-level columns, we won't be modifying the names or adding any underscores, so polars outputs would look like:

┌─────────────┐
│ target      │
│ ---         │
│ f64         │
╞═════════════╡
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ …           │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
│ 1108.871039 │
└─────────────┘

Hopefully that is more or less what's been designed in sktime? @pranavvp16 let me know what you think!

pranavvp16 · 2024-06-26T21:20:03Z

@julian-fong yes, the index columns are converted to the naming convention __index__{index.name}. Keeping this convention also makes it easy to find which columns actually represent index columns in the polars DataFrame, which helps in the back conversion to pandas. As for multi-index columns, I haven't seen any such data container in sktime, having multiple values in a single column.
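
A simplified sketch of the idea, working on column names only. The helper names are illustrative, and mapping an unnamed index level to a bare `__index__` is an assumption of this sketch:

```python
INDEX_PREFIX = "__index__"


def index_col_names(index_names):
    """Map pandas index level names to polars column names."""
    # assumption: an unnamed level (None) becomes the bare prefix
    return [
        f"{INDEX_PREFIX}{name if name is not None else ''}"
        for name in index_names
    ]


def split_columns(columns):
    """Separate index columns from data columns by prefix, for back conversion."""
    index_cols = [c for c in columns if c.startswith(INDEX_PREFIX)]
    data_cols = [c for c in columns if not c.startswith(INDEX_PREFIX)]
    return index_cols, data_cols


names = index_col_names(["instances", "timepoints"])
print(names)  # ['__index__instances', '__index__timepoints']
print(split_columns(names + ["a", "b"]))
# (['__index__instances', '__index__timepoints'], ['a', 'b'])
```

On the way back to pandas, the columns found by the prefix would be used to rebuild the index, with the prefix stripped from their names.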

pranavvp16 · 2024-06-26T21:22:54Z

I have made the requested changes and also added tests for dataframe with PeriodIndex in Series as well as Multi-index. Now the conversion util internally handles pandas.PeriodIndex by converting it to Datetime with to_timestamp.

julian-fong · 2024-06-26T23:08:18Z

> @julian-fong yes, the index columns are converted to the naming convention __index__{index.name}. Keeping this convention also makes it easy to find which columns actually represent index columns in the polars DataFrame, which helps in the back conversion to pandas. As for multi-index columns, I haven't seen any such data container in sktime, having multiple values in a single column.

It's a good idea; we could adopt a similar approach in skpro to keep the "__index__{name}" column naming convention consistent.

Series mtype support

9263768

fkiraly added module:datatypes datatypes module: data containers, checkers & converters enhancement Adding new functionality labels May 27, 2024

fkiraly assigned pranavvp16 May 27, 2024

Add check for monotonically increasing index

aff6532

pranavvp16 marked this pull request as ready for review May 29, 2024 13:25

pranavvp16 requested review from achieveordie, benHeid, fkiraly and yarnabrina as code owners May 29, 2024 13:25

fkiraly reviewed May 29, 2024

fkiraly reviewed May 29, 2024

fkiraly reviewed May 29, 2024

pranavvp16 added 2 commits June 3, 2024 18:57

Enhance checks and metadata dict according to polars

f575436

Fix mtype check name and conversions for series

a908c13

pranavvp16 requested a review from fkiraly June 4, 2024 08:47

fkiraly requested changes Jun 5, 2024

remove unnecessary checks for Table

6a3c3a0

pranavvp16 requested a review from fkiraly June 6, 2024 05:16

pranavvp16 mentioned this pull request Jun 6, 2024

[ENH] Polars panel scitype support #6552

Open

fkiraly requested changes Jun 6, 2024

Add range index for Series and DataFrame with no index column

469638e

pranavvp16 requested a review from fkiraly June 10, 2024 15:33

pranavvp16 mentioned this pull request Jun 11, 2024

[ENH] Add test whether check and convert utilities mutate inputs #6577

Open

fkiraly reviewed Jun 13, 2024

fkiraly requested changes Jun 13, 2024

yarnabrina reviewed Jun 13, 2024

Check monotonically_increasing only for pl.DataFrame

5db4c5b

Add polars dependancy with pandas feature flag

6f352a3

pranavvp16 requested review from fkiraly and yarnabrina June 20, 2024 17:52

fkiraly requested changes Jun 21, 2024

pranavvp16 added 2 commits June 22, 2024 23:28

Merge branch 'sktime:main' into mtype_support

12243db

Merge branch 'sktime:main' into mtype_support

d4c551c

pranavvp16 requested a review from fkiraly June 26, 2024 05:02

fkiraly reviewed Jun 26, 2024

fkiraly requested changes Jun 26, 2024

support of PeriodIndex in pandas to polars conversion util

e147bbf

pranavvp16 requested a review from fkiraly June 26, 2024 21:23

pranavvp16 mentioned this pull request Jul 1, 2024

[ENH] Hierarchical scitype support for polars #6697

Open
