New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Improve vectorization performance #4195
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting.
Some questions (not yet change requests):
- given that this is more performant, would it make sense to have
__getitem__
andas_list
rely on a private method that carries out creation of aGroupBy
object, across rows and potentially cols? In__getitem__
this would use theGroupBy
iteration, and inas_list
it would use the same aggregation strategy as your modification does? - it shouldn't be too hard to also address multiple columns? A second
groupby
? - covering the
Panel
case should also be possible, see my comment above
Taking your questions into account, I though it would make sense to implement |
I reimplemented |
Apologies for crossing over here - but I see you looked into it and ... clearly understood the design, made it faster, further refactored and improved it, fixed some bugs on the side ... AMAZING!!! |
|
||
Returns | ||
------- | ||
An iterator over all (row name, column name, instance) tuples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not 100% correct - it doesn't return an iterator, but a generator, no? Also, is this not going to get us into trouble when we, say, want to iterate through it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update, note to myself: I think no, because each call a new generator is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the cite python docs:
Python’s generators provide a convenient way to implement the iterator protocol. If a container object’s iter() method is implemented as a generator, it will automatically return an iterator object (technically, a generator object) supplying the iter() and next() methods. More information about generators can be found in the documentation for the yield expression.
https://docs.python.org/3/library/stdtypes.html#generator-types
And yes, each time a new generator is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, we should at least update the docstring then for it to be correct. It returns a generator. Which is an iterator, but iterator is not a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I didn't just end up confusing myself, so please correct me if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, you are correct. But a generator behaves like an iterator and in my opinion, this is an implementation detail, a consumer of the method shouldn't care about. Don't you think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do, but docstrings should be correct up to the type. A consumer will not care in most cases, but I think it's good practice to be technically correct.
How about saying sth like generator, iterates over all (row name, column name, instance) tuples. The i-th element returned by the generator is ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll merge it, and we can talk the docstring separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above - amazing improvement.
I'll leave it unmerged in case you want to action any of the comments.
This improves the docstringd for `VectorizedDF.items` and `.__iter__`, see discussion in #4195
…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228) Followup to #4195 Contributes to #4139 This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`. Removes the last reference to `VectorizedDF.__getitem__`. Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195). Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.
Reference Issues/PRs
Split out of #4140
Contributes to #4139
What does this implement/fix? Explain your changes.
This PR primarily improves instance iteration performance in
VectorizedDF
, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when_get_X_at_index
is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).Does your contribution introduce a new dependency? If yes, which one?
What should a reviewer concentrate their feedback on?
Did you add any tests for the change?
Any other comments?
PR checklist
For all contributions
For new estimators