[ENH] Allow object dtype in series #5886
@fkiraly I see a new failure already, but I don't know what part of my change is affecting this. Can you guide? Also, I have a question. What is the difference between |
scitype = abstract data type, e.g., time series, or collection of time series |
The failures are due to a secondary problem, namely that objects of mtype That means the two mtypes are no longer distinguishable by checks, which leads to further trouble as the data checking layer can no longer decide what mtype it should assume. To fix this, we need to ensure that An idea would be to check that, in |
should I try to do that? |
I'm unlikely to have any time this week, so please do if you have time. I'll
attempt over the weekend otherwise.
|
@yarnabrina, your remote is rejecting push. You need to check the box "allow maintainers to make changes". To fix, add the following lines starting at line 66 of
# check to delineate from nested_univ mtype (Panel)
# pd.DataFrame mtype allows object dtype,
# but if we allow object dtype with pd.Series entries,
# the mtype becomes ambiguous, i.e., non-delineable from nested_univ
if isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame)):
    msg = f"{var_name} cannot contain nested pd.Series or pd.DataFrame"
    return ret(False, msg, None, return_metadata)
|
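To illustrate why this extra check matters, here is a minimal sketch using only pandas and numpy. The helper name `has_nested_cells` is hypothetical and not part of sktime; it only mirrors the `isinstance` test on `obj.iloc[0, 0]` from the snippet above.

```python
import numpy as np
import pandas as pd


def has_nested_cells(obj: pd.DataFrame) -> bool:
    """Hypothetical helper: True if the first cell is a pd.Series or pd.DataFrame."""
    return isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame))


# A flat frame whose single column has object dtype (plain strings).
flat = pd.DataFrame({"a": pd.Series(["x", "y", "z"], dtype=object)})

# A nested frame: every cell holds a whole pd.Series (nested_univ-style layout).
cells = np.empty(2, dtype=object)
cells[0] = pd.Series([1.0, 2.0])
cells[1] = pd.Series([3.0, 4.0])
nested = pd.DataFrame({"a": cells})

# Both columns report object dtype, so dtype alone cannot tell them apart.
print(flat["a"].dtype, nested["a"].dtype)

# Inspecting the first cell distinguishes the two layouts.
print(has_nested_cells(flat), has_nested_cells(nested))
```

Only the first cell is inspected here, which matches the snippet above; frames with mixed cell types would need a more thorough check.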
Done, can you please check now? |
ok, works |
@fkiraly I'll try to get back to this PR on the weekend. Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype. |
tried to fix - failure is incomplete edge case treatment of empty data frames (zero rows or cols) |
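A rough sketch of the edge case, assuming the nested-cell test is wrapped in a standalone helper (the name `first_cell_is_nested` is hypothetical and not sktime API): a frame with zero rows or zero columns has no first cell, so the shape is checked before indexing.

```python
import pandas as pd


def first_cell_is_nested(obj: pd.DataFrame) -> bool:
    """Hypothetical helper: nested-cell test that tolerates empty frames."""
    # With zero rows or zero columns there is no cell at position (0, 0),
    # so return early and treat the frame as not nested.
    if obj.shape[0] == 0 or obj.shape[1] == 0:
        return False
    return isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame))


no_rows = pd.DataFrame({"a": []})      # shape (0, 1)
no_cols = pd.DataFrame(index=[0, 1])   # shape (2, 0)
regular = pd.DataFrame({"a": [1.0, 2.0]})

for frame in (no_rows, no_cols, regular):
    print(frame.shape, first_cell_is_nested(frame))
```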
@fkiraly marking it as ready, all tests pass after that additional check you added. |
So, works now and I'd be happy with it.
The change I made was the additional check in now lines 66-72 of _series._check, which allows the system to delineate nested_univ from pd.DataFrame mtype, at least in all test cases. Empty pd.DataFrame count as the latter.
actually, I noticed multiple things that imo we need to address before this:
- Panel and Hierarchical types still forbid object
- we are not systematically testing this input type - from now, users will not receive clear warning.
- possibly, many estimators will break if given object dtype-d data - we need to think about what to do. Options: some kind of coercion; having estimator tags which tell the user which dtypes are supported, and an informative message is raised based on that.
  - none seems fully satisfactory, as this is very pandas specific.
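One of the options listed above, estimator tags plus an informative message, could look roughly like the following. This is a sketch only: the class `ToyEstimator` and the tag name `capability:object_dtype` are invented for illustration and do not exist in sktime.

```python
import pandas as pd


class ToyEstimator:
    """Hypothetical estimator; the tag name below is an assumption."""

    # Assumed tag: whether the estimator can handle object dtype columns.
    _tags = {"capability:object_dtype": False}

    def fit(self, X: pd.DataFrame) -> "ToyEstimator":
        # Collect the columns stored with object dtype.
        object_cols = [col for col, dt in X.dtypes.items() if dt == object]
        if object_cols and not self._tags["capability:object_dtype"]:
            # Fail early with a clear message instead of breaking internally.
            raise TypeError(
                f"{type(self).__name__} does not support object dtype, "
                f"but found it in columns {object_cols}."
            )
        self.n_columns_ = X.shape[1]
        return self


numeric = pd.DataFrame({"a": [1.0, 2.0]})
with_objects = pd.DataFrame({"a": pd.Series(["x", "y"], dtype=object)})

ToyEstimator().fit(numeric)  # passes the dtype check
try:
    ToyEstimator().fit(with_objects)
except TypeError as err:
    print(err)
```

The point of the design is that the contract boundary is enforced at input time, so the user sees which columns are the problem rather than an unrelated internal error.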
I actually skipped this one intentionally. I thought it may be better to allow it separately for the different types, unless you think they are interlinked and can cause issues. I really have very little, if any, familiarity with how scitype/mtype/etc. work in sktime.
This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is becoming more and more popular.
I didn't understand what you meant by this one. What is happening now, and what will change to confuse users? |
Yes, they are linked through broadcasting. One potential failure case that is not covered due to lack of examples for the contract expansion is as follows: suppose we have a transformer that internally is implemented only for Similarly, there is now an inconsistency the other way round - even if an estimator would now accept an individual series, it would no longer accept it if it is found with others of similar type in a collection of series. PS: I think for |
I do think we should support it, but "properly".
|
Consider the situation of an estimator that does not support the type genuinely, i.e., would break internally of Suppose a user passes a However, if we widen the contract without testing and adding catches where they fail, then in this estimator something random will break, and the error will be uninformative, e.g., a From a higher point of view, by widening the contract, we may now be creating uncovered bugs. So either we have to move the contract boundary "further in" by one of the measures proposed above (e.g., tag and catch), or we have to ensure all estimators satisfy the new contract. In any case, it has to be tested. |
Force-pushed from 11a93c3 to 1477012.
well, that's strange. One of the notebooks fails, while the test suite is fine. We need to identify what object is being passed to the checkers, and why it is ambiguous. Perhaps it has mixed columns? |
oh wait, the problem is actually We just need to add the same trick in For |
to avoid conflicts: do you want to add the mini-check in |
I just faced a weird set of conflicts. Fixed locally, so let me make one last attempt. |
(cherry picked from commit cc3d056)
(cherry picked from commit 2de7d8c)
Ref. sktime#5886 (review) (cherry picked from commit 2cecd6b)
(cherry picked from commit 1477012)
Force-pushed from 15dddeb to 0eb9231.
I did a rebase from main and a force push, so I think you will get conflicts. So, if possible, can you delete your local branch and pull this?
Also, I do not know where the other changes you said are necessary should go, so can you please apply them? I'll not modify this branch anymore to avoid future conflicts, and I'm deleting it from my local.
# check to delineate from nested_univ mtype (Hierarchical)
# pd.DataFrame mtype allows object dtype,
# but if we allow object dtype with Panel entries,
# the mtype becomes ambiguous, i.e., non-delineable from nested_univ
I just copy pasted your comment and tried to edit two terms, but I'm certain this is wrong. So, please review and change.
that should work though, let's see what the tests say
Reference: #5867 (comment)