Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Improve vectorized metric calculation and deprecate VectorizedDF.__getitem__ and VectorizedDF.get_iloc_indexer #4228

Merged
merged 7 commits into from Mar 7, 2023

Conversation

hoesler
Copy link
Contributor

@hoesler hoesler commented Feb 13, 2023

Reference Issues/PRs

Followup to #4195
Contributes to #4139

What does this implement/fix? Explain your changes.

This PR implements BaseForecastingErrorMetric._evaluate_vectorized using VectorizedDF.vectorize_est.
Further, by removing with it the last reference to VectorizedDF.__getitem__, it deprecates this method. Random access is not needed, and developers should use __iter__ for iteration instead (implemented in #4195).
Also, unused method get_iloc_indexer is marked as deprecated for the same reason.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's an interesting question whether we should do this.
On one hand, good idea as a refactor step to move away from __getitem__, this would be the last step of your intended refactor, I suppose.
You have already isolated the less efficient interfaces and replaced them with the group based one.

On the other hand, I wonder whether partial or random access is sth we want later, if we add support for distributed computing, see, e.g., #4013

I would recommend to keep that in mind as the northstar for VectorizedDF, i.e., lazy computation which passes through distributed objects and inverts the application (estimator within/from dask frame rather than the other way round)

Do you have any thoughts on that?

(from my perspective: this is a blocker until discussed - not necessarily for actioning)

@hoesler
Copy link
Contributor Author

hoesler commented Feb 13, 2023

I think it’s good to remove unused code in order to prevent maintenance overhead until you really know the requirements.

Having said that, you raise an interesting question. Actually, I don’t think there is good use for for random access from the perspective of a consumer of a distributed data object. Besides ordering guarantees in such systems might be absent or costly, you don’t want to fetch the data, as you said, but tell the backend do do something with it and return the result, which again will be distributed data at first. In the end, when small enough, you convert it into a pandas frame, list of model (references) etc. and fetch it (iteratively).
That’s how I would do it with spark (which has no random access method for rows. And from what I read, this also is true for dask).

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I suppose you are right. dask would result in a delayed groupby object and a mapping applicate or similar, and we still would use sth similar to items.

So we are agreed about the end state, removing this.

Now, I recalled that VectorizedDF is public, so we should not make a change to its interfaces without deprecation. May I suggest the following alternative, hence?

  • we patch __getitem__ through to items, i.e., a full call and then the i-th element or similar
  • at this opportunity, I would also suggest to change items so that the third tuple element (the subsetted dataframe) is an optional return, e.g., via an additional arg. Not necessary for this PR, but this would cover functionality that is removed and may be useful later.

This does not need deprecation, as the external behaviour and interface does not change. Otherwise we would have start to give out warnings when __getitem__ is called, wait for version 0.18.0, etc.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 13, 2023

FYI, deprecation policy is here:
https://www.sktime.org/en/stable/developer_guide/deprecation.html
(affecting all public interfaces)

@hoesler
Copy link
Contributor Author

hoesler commented Feb 14, 2023

Now, I recalled that VectorizedDF is public

Uh, thanks for clarifying. I didn't realize it is getting exported in init.py.

we patch getitem through to items, i.e., a full call and then the i-th element or similar

Although just a very temporary workaround, patching __getitem__ through items would be very inefficient. I would use multiindex sclicing.

at this opportunity, I would also suggest to change items so that the third tuple element (the subsetted dataframe) is an optional return, e.g., via an additional arg. Not necessary for this PR, but this would cover functionality that is removed and
may be useful later.

I wouldn't squeeze to much use cases into one function. I'd rather add a keys() method.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 14, 2023

Although just a very temporary workaround, patching __getitem__ through items would be very inefficient. I would use multiindex sclicing.

I just thought it would be the quickest way to wrap this up, i.e., just call items.

I would use multiindex sclicing.

That's also fine with me.

I wouldn't squeeze to much use cases into one function. I'd rather add a keys() method.

Excellent idea! I infer that you want to move this to a dictionary-like design. Also an excellent (design) idea!

@hoesler hoesler force-pushed the improve-vectorized-evaluation branch from be0c8c2 to 4850eb2 Compare February 14, 2023 12:51
@hoesler
Copy link
Contributor Author

hoesler commented Feb 14, 2023

I just thought it would be the quickest way to wrap this up, i.e., just call items.

You were right. I finally decided a more sophisticated implementation is not worth the effort and delegating to __iter__ is safer from a consistency perspective.

Excellent idea! I infer that you want to move this to a dictionary-like design. Also an excellent (design) idea!

Yeah, sort of. But on second thought, this might not be a good idea. For example having a keys method might direct to a ... _getitem__ method, which I just deprecated. Maybe it's indeed best postpone that, until there is really the need for such a method, now knowing that it is public API (What's the reason for that, btw?).

@hoesler hoesler changed the title [ENH] Improve vectorized metric calculation and remove obsolete VectorizedDF.__getitem__ [ENH] Improve vectorized metric calculation and deprecate VectorizedDF.__getitem__ and VectorizedDF.get_iloc_indexer Feb 14, 2023
@hoesler hoesler requested review from fkiraly and removed request for RNKuhns February 14, 2023 13:18
@fkiraly
Copy link
Collaborator

fkiraly commented Feb 14, 2023

(What's the reason for that, btw?).

Well, you need one of __getitem__ or __iter__ or items for the general design. It's just the way I designed it originally - acknowledging that yours is probably the better way to go.

Or, do you mean why it is public?

Because it's used by the equally public base classes - imo it's "cleaner" to maintain a public dependency than a private one, even if it is just used internally by the base classes.

@hoesler
Copy link
Contributor Author

hoesler commented Feb 14, 2023

I meant the latter. Thanks for the explanation, although I must admit that I don't understand the reasoning. From my gut feeling I would say fewer public classes of functions mean less to maintain and therefore more freedom to change things.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 14, 2023

From my gut feeling I would say fewer public classes of functions mean less to maintain and therefore more freedom to change things.

Well, my gut feeling is about stability of interfaces. The base classes are quite important, and they depend on VectorizedDF. There are multiple base classes depending on it (with different sets of developers having an "affinity" to them), so it's a bit like a utility package. Making it stable also makes it more reliable in constructing new base classes which haven't been written yet.

@hoesler
Copy link
Contributor Author

hoesler commented Feb 14, 2023

Ah, okay. Yeah, makes sense now.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Could you kindly split out the work on the forecasting metrics in another PR? Same as before, keeping PR modular.
Also, good spot that we can use VectorizedDF there!

Regarding the deprecation, why are you suggesting to deprecate __getitem__? I'd say, let's keep it. get_iloc_indexer also doesn't hurt (may be useful if you have two vectorized objects).

@hoesler
Copy link
Contributor Author

hoesler commented Feb 14, 2023

Could you kindly split out the work on the forecasting metrics in another PR? Same as before, keeping PR modular.
Also, good spot that we can use VectorizedDF there!

Ok.

Regarding the deprecation, why are you suggesting to deprecate getitem?
I'd say, let's keep it. get_iloc_indexer also doesn't hurt (may be useful if you have two vectorized objects).

Fine for me. I was seeing this as a cleanup, with the reasoning I gave earlier: Removal of unused code to reduce maintenance overhead and a fear that people still might iterate in unintended ways (the eval implementation as an example of this). And I thought we agreed on this. But after rereading your message above, I noticed that I missed your advise against deprecation.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 14, 2023

And I thought we agreed on this. But after rereading your message above, I noticed that I missed your advise against deprecation.

Sorry, I wasn't really clear with my thoughts here.
Agreed to the maintenance burden.

I would recommend we keep __getitem__ and deprecate get_iloc_indexer.

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 4, 2023

@hoesler, haven't heard back, so I hope you would be ok with the following solution here:

  • I reverted the deprecation of __getitem__, but left the change in logic (to islice)
  • anything else is unchanged from your last state, i.e., we deprecate get_iloc_indexer
  • this gets merged as-is, except for that minor change

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above - as-is, minus one deprecation

@fkiraly fkiraly merged commit d85102f into sktime:main Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants