New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Improve vectorized metric calculation and deprecate VectorizedDF.__getitem__ and VectorizedDF.get_iloc_indexer #4228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's an interesting question whether we should do this.
On one hand, good idea as a refactor step to move away from __getitem__
, this would be the last step of your intended refactor, I suppose.
You have already isolated the less efficient interfaces and replaced them with the group based one.
On the other hand, I wonder whether partial or random access is sth we want later, if we add support for distributed computing, see, e.g., #4013
I would recommend to keep that in mind as the northstar for VectorizedDF
, i.e., lazy computation which passes through distributed objects and inverts the application (estimator within/from dask
frame rather than the other way round)
Do you have any thoughts on that?
(from my perspective: this is a blocker until discussed - not necessarily for actioning)
I think it’s good to remove unused code in order to prevent maintenance overhead until you really know the requirements. Having said that, you raise an interesting question. Actually, I don’t think there is good use for for random access from the perspective of a consumer of a distributed data object. Besides ordering guarantees in such systems might be absent or costly, you don’t want to fetch the data, as you said, but tell the backend do do something with it and return the result, which again will be distributed data at first. In the end, when small enough, you convert it into a pandas frame, list of model (references) etc. and fetch it (iteratively). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I suppose you are right. dask
would result in a delayed groupby
object and a mapping applicate or similar, and we still would use sth similar to items
.
So we are agreed about the end state, removing this.
Now, I recalled that VectorizedDF
is public, so we should not make a change to its interfaces without deprecation. May I suggest the following alternative, hence?
- we patch
__getitem__
through toitems
, i.e., a full call and then thei
-th element or similar - at this opportunity, I would also suggest to change
items
so that the third tuple element (the subsetted dataframe) is an optional return, e.g., via an additional arg. Not necessary for this PR, but this would cover functionality that is removed and may be useful later.
This does not need deprecation, as the external behaviour and interface does not change. Otherwise we would have start to give out warnings when __getitem__
is called, wait for version 0.18.0, etc.
FYI, deprecation policy is here: |
Uh, thanks for clarifying. I didn't realize it is getting exported in init.py.
Although just a very temporary workaround, patching
I wouldn't squeeze to much use cases into one function. I'd rather add a |
I just thought it would be the quickest way to wrap this up, i.e., just call
That's also fine with me.
Excellent idea! I infer that you want to move this to a dictionary-like design. Also an excellent (design) idea! |
be0c8c2
to
4850eb2
Compare
You were right. I finally decided a more sophisticated implementation is not worth the effort and delegating to
Yeah, sort of. But on second thought, this might not be a good idea. For example having a keys method might direct to a ... |
Well, you need one of Or, do you mean why it is public? Because it's used by the equally public base classes - imo it's "cleaner" to maintain a public dependency than a private one, even if it is just used internally by the base classes. |
I meant the latter. Thanks for the explanation, although I must admit that I don't understand the reasoning. From my gut feeling I would say fewer public classes of functions mean less to maintain and therefore more freedom to change things. |
Well, my gut feeling is about stability of interfaces. The base classes are quite important, and they depend on |
Ah, okay. Yeah, makes sense now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Could you kindly split out the work on the forecasting metrics in another PR? Same as before, keeping PR modular.
Also, good spot that we can use VectorizedDF
there!
Regarding the deprecation, why are you suggesting to deprecate __getitem__
? I'd say, let's keep it. get_iloc_indexer
also doesn't hurt (may be useful if you have two vectorized objects).
Ok.
Fine for me. I was seeing this as a cleanup, with the reasoning I gave earlier: Removal of unused code to reduce maintenance overhead and a fear that people still might iterate in unintended ways (the eval implementation as an example of this). And I thought we agreed on this. But after rereading your message above, I noticed that I missed your advise against deprecation. |
Sorry, I wasn't really clear with my thoughts here. I would recommend we keep |
@hoesler, haven't heard back, so I hope you would be ok with the following solution here:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above - as-is, minus one deprecation
Reference Issues/PRs
Followup to #4195
Contributes to #4139
What does this implement/fix? Explain your changes.
This PR implements
BaseForecastingErrorMetric._evaluate_vectorized
usingVectorizedDF.vectorize_est
.Further, by removing with it the last reference to
VectorizedDF.__getitem__
, it deprecates this method. Random access is not needed, and developers should use__iter__
for iteration instead (implemented in #4195).Also, unused method
get_iloc_indexer
is marked as deprecated for the same reason.Does your contribution introduce a new dependency? If yes, which one?
What should a reviewer concentrate their feedback on?
Did you add any tests for the change?
Any other comments?
PR checklist
For all contributions
For new estimators