Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Allow object dtype in series #5886

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

yarnabrina
Copy link
Collaborator

Reference: #5867 (comment)

@yarnabrina
Copy link
Collaborator Author

@fkiraly I see a new failure already, but I don't know what part of my change is affecting this. Can you guide?

Also, I have a question. What is the difference between mtype and scitype?

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 3, 2024

scitype = abstract data type, e.g., time series, or collection of time series
mtype = full python specification

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 3, 2024

The failures are due to a secondary problem, namely that objects of mtype nested_univ now also pass the check for pd.DataFrame, in addition to the nested_univ check.

That means the two mtypes are no longer distinguishable by checks, which leads to further trouble as the data checking layer can no longer decide what mtype it shoud assume

To fix this, we need to ensure tha nested_univ does not pass the pd.DataFrame check.

An idea would be to check that, in object typed columns, the first row entry is not a pd.Series.

@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters enhancement Adding new functionality labels Feb 3, 2024
@yarnabrina yarnabrina marked this pull request as draft February 4, 2024 05:37
@fkiraly
Copy link
Collaborator

fkiraly commented Feb 5, 2024

should I try to do that?

@yarnabrina
Copy link
Collaborator Author

yarnabrina commented Feb 5, 2024 via email

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 5, 2024

@yarnabrina, your remote is rejecting push. You need to check the box "allow maintainers to make changes".

To fix, add following lines starting line 66 of _check._series:

    # check to delineate from nested_univ mtype (Panel)
    # pd.DataFrame mtype allows object dtype,
    # but if we allow object dytpe with pd.Series entries,
    # the mtype becomes ambiguous, i.e., non-delineable from nested_univ
    if isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame)):
        msg = f"{var_name} cannot contain nested pd.Series or pd.DataFrame"
        return ret(False, msg, None, return_metadata)

@yarnabrina
Copy link
Collaborator Author

You need to check the box "allow maintainers to make changes".

Done, can you please check now?

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 5, 2024

ok, works

@yarnabrina
Copy link
Collaborator Author

@fkiraly I'll try to get back to this PR on the weekend. Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 8, 2024

Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype.

tried to fix - failure is incomplete edge case treatment of empty data frames (zero rows or cols)

@yarnabrina yarnabrina marked this pull request as ready for review February 9, 2024 14:47
@yarnabrina
Copy link
Collaborator Author

@fkiraly making it as ready, all tests pass after that additional check you added.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, works now and I'd be happy with it

The change I made was the additional check in now lines 66-72 of _series._check, which allows the system to delineate nested_univ from pd.DataFrame mtype, at least in all test cases. Empty pd.DataFrame count as the latter.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I noticed multiple things that imo we need to address before this:

  • Panel and Hierarchical types still forbid object
  • we are not systematically testing this input type - from now, users will not receive clear warning.
  • possibly, many estimators will break if given object dtype-d data - we need to think about what to do. Options: some kind of coercion; having estimator tags which tell the user which dtypes are supported, and an informative message is raised based on that.
    • none seems fully satisfactory, as this is very pandas specific.

@yarnabrina
Copy link
Collaborator Author

Panel and Hierarchical types still forbid object

I actually skipped this one intentionally. I thought it may be better to allow separately for different types. Unless you think they are interlinked and can cause issues, I really have very little, if any, familiarity with how scitype/mtype/etc. works in sktime?

many estimators will break if given object dtype-d data

This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is being more and more popular. pandas has recently planning to use even a dedicated "string" type, I think I saw in their latest release notes.

we are not systematically testing this input type - from now, users will not receive clear warning.

I didn't understnad what you meant by this one. What is happening now, and what will change to confuse users?

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 10, 2024

Unless you think they are interlinked and can cause issues

Yes, they are linked through broadcasting.

One potential failure case that is not covered due to lack of examples for the contract expansion is as follows:

suppose we have a transformer that internally is implemented only for Panel (e.g., for efficiency reasons).
Upon receiving a Series, it will be upcast to Panel, and suddenly the object no longer satisfies the contract.

Similarly, there is now an inconsistency the other way round - even if an estimator would now accept an individual series, it would no longer accept it if it is found with others of similar type in a collection of series.

PS: I think for Panel, the trick with the iloc is not needed, as the delineation happens via the MultiIndex already.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 10, 2024

This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is being more and more popular

I do think ew should support it, but "properly".
Here's what I suggest:

  • extend the types for Panel and Hierarchical as well
  • in scenarios_forecasting and scenarios_transformers add coverage for this
    • if not too much breaks, hooray - fix the few bugs etc
    • if a lot breaks, we need to consider how we make the contract tight. A way would be introducing tags for the supported types.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 10, 2024

we are not systematically testing this input type - from now, users will not receive clear warning.

I didn't understnad what you meant by this one. What is happening now, and what will change to confuse users?

Consider the situation of an estimator that does not support the type genuinely, i.e., would break internally of pd.DataFrame with object dtype is passed.

Suppose a user passes a pd.DataFrame with object dtype to that estimator. Currently, on main, this will fail with an informative error message, namely "object dtype is not supported" or similar. This is because the check catches it, and also returns an informative error message together with the check.

However, if we widen the contract without testing and adding catches where they fail, then in this estimator something random will break, and the error will be uninformative, e.g., a ValueError or type error or similar. This is of course not good.

From a higher poin of view, by widening the contract, we may now be creating uncovered bugs. So either we have to move the contract boundary "further in" by one of the measures proposed above (e.g., tag and catch), or we have to ensure all estimators satisfy the new contract. In any case, it has to be tested.

@yarnabrina yarnabrina marked this pull request as draft February 15, 2024 18:08
yarnabrina added a commit to yarnabrina/sktime-fork that referenced this pull request Feb 16, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Feb 16, 2024

well, that's strange. One of the notebooks fails, while the test suite is fine.

We need to identify what object is being passed to the checkers, and why it is ambiguous. Perhaps it has mixed columns?

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 16, 2024

oh wait, the problem is actually Panel. I am reverting the change I made to the Series checks.

We just need to add the same trick in pd-multiindex checker.

For Hierarchical, this will not be needed.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 16, 2024

to avoid conflicts: do you want ot add the mini-check in Panel, or should I?

@yarnabrina
Copy link
Collaborator Author

yarnabrina commented Feb 16, 2024

Please do. I've very little experience and confidence in this datatypes modules.

I just faced weird set of conflicts. Fixed locally, so let me make one last attempt.

Copy link
Collaborator Author

@yarnabrina yarnabrina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a rebase from main and force push, so I think you will get conflicts. So, if possible can you delete your local branch and pull this?

Also, I do not know where are the other changes you said are necessary to do, so can you please apply that? I'll not modify this branch anymore to avoid future conflicts, deleting from my local.

Comment on lines +193 to +196
# check to delineate from nested_univ mtype (Hierarchical)
# pd.DataFrame mtype allows object dtype,
# but if we allow object dtype with Panel entries,
# the mtype becomes ambiguous, i.e., non-delineable from nested_univ
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copy pasted your comment and tried to edit two terms, but I'm certain this is wrong. So, please review and change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that should work though, let's see what the tests say

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:datatypes datatypes module: data containers, checkers & converters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants