Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Improve vectorization performance #4195

Merged
merged 8 commits into from Feb 12, 2023

Conversation

hoesler
Copy link
Contributor

@hoesler hoesler commented Feb 3, 2023

Reference Issues/PRs

Split out of #4140
Contributes to #4139

What does this implement/fix? Explain your changes.

This PR primarily improves instance iteration performance in VectorizedDF, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when _get_X_at_index is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@hoesler hoesler requested a review from fkiraly as a code owner February 3, 2023 09:44
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting.

Some questions (not yet change requests):

  • given that this is more performant, would it make sense to have __getitem__ and as_list rely on a private method that carries out creation of a GroupBy object, across rows and potentially cols? In __getitem__ this would use the GroupBy iteration, and in as_list it would use the same aggregation strategy as your modification does?
  • it shouldn't be too hard to also address multiple columns? A second groupby?
  • covering the Panel case should also be possible, see my comment above

@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters enhancement Adding new functionality module:base-framework BaseObject, registry, base framework labels Feb 3, 2023
@hoesler
Copy link
Contributor Author

hoesler commented Feb 9, 2023

Taking your questions into account, I though it would make sense to implement __iter__. So I did. But then I realized, that you replaced iteration for vectorization in 0.16 with a new vectorize_est method, which is using index slicing directly and cannot take advantage of my improved iteration. I need to look into that first, before we can see the performance effects again.

@hoesler
Copy link
Contributor Author

hoesler commented Feb 9, 2023

I reimplemented vectorize_est taking advantage of fast iteration.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 11, 2023

But then I realized, that you replaced iteration for vectorization in 0.16 with a new vectorize_est method, which is using index slicing directly and cannot take advantage of my improved iteration. I need to look into that first, before we can see the performance effects again.

Apologies for crossing over here - but I see you looked into it and ... clearly understood the design, made it faster, further refactored and improved it, fixed some bugs on the side ... AMAZING!!!

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 11, 2023

PS: when you're done reworking after a review, click this:

image

@fkiraly fkiraly self-requested a review February 11, 2023 22:17

Returns
-------
An iterator over all (row name, column name, instance) tuples.
Copy link
Collaborator

@fkiraly fkiraly Feb 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not 100% correct - it doesn't return an iterator, but a generator, no? Also, is this not going to get us into trouble when we, say, want to iterate through it twice?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update, note to myself: I think no, because each call a new generator is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the cite python docs:

Python’s generators provide a convenient way to implement the iterator protocol. If a container object’s iter() method is implemented as a generator, it will automatically return an iterator object (technically, a generator object) supplying the iter() and next() methods. More information about generators can be found in the documentation for the yield expression.

https://docs.python.org/3/library/stdtypes.html#generator-types

And yes, each time a new generator is returned.

Copy link
Collaborator

@fkiraly fkiraly Feb 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we should at least update the docstring then for it to be correct. It returns a generator. Which is an iterator, but iterator is not a type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I didn't just end up confusing myself, so please correct me if I'm wrong.

Copy link
Contributor Author

@hoesler hoesler Feb 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, you are correct. But a generator behaves like an iterator and in my opinion, this is an implementation detail, a consumer of the method shouldn't care about. Don't you think so?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do, but docstrings should be correct up to the type. A consumer will not care in most cases, but I think it's good practice to be technically correct.

How about saying sth like generator, iterates over all (row name, column name, instance) tuples. The i-th element returned by the generator is ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll merge it, and we can talk the docstring separately

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above - amazing improvement.

I'll leave it unmerged in case you want to action any of the comments.

@fkiraly fkiraly merged commit c41efee into sktime:main Feb 12, 2023
@hoesler hoesler deleted the improve-vectorization branch February 13, 2023 11:34
fkiraly added a commit that referenced this pull request Feb 22, 2023
This improves the docstringd for `VectorizedDF.items` and `.__iter__`, see discussion in #4195
fkiraly pushed a commit that referenced this pull request Mar 7, 2023
…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228)

Followup to #4195
Contributes to #4139

This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`.
Removes the last reference to `VectorizedDF.__getitem__`.
Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195).
Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:base-framework BaseObject, registry, base framework module:datatypes datatypes module: data containers, checkers & converters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants