[ENH] Improve vectorization performance #4195

hoesler · 2023-02-03T09:44:37Z

Reference Issues/PRs

Split out of #4140
Contributes to #4139

What does this implement/fix? Explain your changes.

This PR primarily improves instance iteration performance in VectorizedDF, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when _get_X_at_index is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).
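To illustrate the difference between the two iteration strategies, here is a minimal sketch on a toy MultiIndex frame. The frame layout, the level name `instance`, and the helper names `iter_with_loc` and `iter_with_groupby` are assumptions made for illustration; they are not taken from `VectorizedDF`.

```python
import numpy as np
import pandas as pd

# Toy panel: 3 instances x 4 time points (layout assumed for illustration).
index = pd.MultiIndex.from_product(
    [["a", "b", "c"], range(4)], names=["instance", "time"]
)
df = pd.DataFrame({"value": np.arange(12.0)}, index=index)


def iter_with_loc(frame):
    """Look up every instance separately with .loc (one index lookup each)."""
    for key in frame.index.get_level_values("instance").unique():
        # .loc[[key]] keeps the instance level in the result
        yield key, frame.loc[[key]]


def iter_with_groupby(frame):
    """Split the frame once with groupby and iterate over the groups."""
    # sort=False keeps the groups in order of first appearance
    for key, group in frame.groupby(level="instance", sort=False):
        yield key, group


loc_result = dict(iter_with_loc(df))
groupby_result = dict(iter_with_groupby(df))
```

Both helpers yield the same sub-frames; the groupby variant splits the frame in a single pass instead of performing one index lookup per instance.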

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions

I've added myself to the list of contributors.
Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.

For new estimators

I've added the estimator to the online documentation.
I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

sktime/datatypes/_vectorize.py

fkiraly

Hm, interesting.

Some questions (not yet change requests):

given that this is more performant, would it make sense to have __getitem__ and as_list rely on a private method that carries out creation of a GroupBy object, across rows and potentially cols? In __getitem__ this would use the GroupBy iteration, and in as_list it would use the same aggregation strategy as your modification does?
it shouldn't be too hard to also address multiple columns? A second groupby?
covering the Panel case should also be possible, see my comment above
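A rough sketch of what iterating across both rows and columns could look like, assuming a toy frame with an `instance` index level. The function name `iter_row_col_groups` is hypothetical and not part of sktime. The inner loop uses plain column selection rather than a second groupby, since column-wise `groupby(axis=1)` is deprecated in recent pandas versions.

```python
import pandas as pd

# Toy panel with two instances and two columns (assumed layout).
index = pd.MultiIndex.from_product(
    [["a", "b"], range(3)], names=["instance", "time"]
)
df = pd.DataFrame({"x": range(6), "y": range(6, 12)}, index=index)


def iter_row_col_groups(frame, row_level="instance"):
    """Yield (row name, column name, sub-frame) for each instance/column pair."""
    # One groupby over the rows; each group is then split by column.
    for row_name, row_group in frame.groupby(level=row_level, sort=False):
        for col_name in row_group.columns:
            # Double brackets keep the result a single-column DataFrame.
            yield row_name, col_name, row_group[[col_name]]


triples = list(iter_row_col_groups(df))
```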

hoesler · 2023-02-09T11:36:21Z

Taking your questions into account, I thought it would make sense to implement __iter__, so I did. But then I realized that you replaced iteration for vectorization in 0.16 with a new vectorize_est method, which uses index slicing directly and cannot take advantage of my improved iteration. I need to look into that first, before we can see the performance effects again.

hoesler · 2023-02-09T20:13:39Z

I reimplemented vectorize_est taking advantage of fast iteration.

fkiraly · 2023-02-11T22:16:20Z

> But then I realized, that you replaced iteration for vectorization in 0.16 with a new vectorize_est method, which is using index slicing directly and cannot take advantage of my improved iteration. I need to look into that first, before we can see the performance effects again.

Apologies for crossing over here - but I see you looked into it and ... clearly understood the design, made it faster, further refactored and improved it, fixed some bugs on the side ... AMAZING!!!

fkiraly · 2023-02-11T22:17:34Z

PS: when you're done reworking after a review, click this:

fkiraly · 2023-02-11T22:23:11Z

sktime/datatypes/_vectorize.py

+
+        Returns
+        -------
+        An iterator over all (row name, column name, instance) tuples.


this is not 100% correct - it doesn't return an iterator, but a generator, no? Also, is this not going to get us into trouble when we, say, want to iterate through it twice?

Update, note to myself: I think no, because a new generator is returned on each call.

To cite the Python docs:

Python’s generators provide a convenient way to implement the iterator protocol. If a container object’s __iter__() method is implemented as a generator, it will automatically return an iterator object (technically, a generator object) supplying the __iter__() and __next__() methods. More information about generators can be found in the documentation for the yield expression.

https://docs.python.org/3/library/stdtypes.html#generator-types

And yes, each time a new generator is returned.
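The point about repeated iteration can be checked with a small sketch. The class `Container` below is a hypothetical stand-in, not `VectorizedDF`; it only assumes that `__iter__` is written as a generator function.

```python
from collections.abc import Generator, Iterator


class Container:
    """Hypothetical container whose __iter__ is a generator function."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        # Because of `yield`, every call creates a fresh generator object.
        for element in self._data:
            yield element


c = Container([1, 2, 3])
first_pass = list(c)
second_pass = list(c)  # a second pass works, since iter(c) is a new generator

gen = iter(c)
is_generator = isinstance(gen, Generator)
is_iterator = isinstance(gen, Iterator)
```

Here the object returned by `iter(c)` is a generator and also satisfies the iterator protocol, and iterating over `c` twice gives the same elements both times.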

well, we should at least update the docstring then for it to be correct. It returns a generator. Which is an iterator, but iterator is not a type.

I hope I didn't just end up confusing myself, so please correct me if I'm wrong.

Technically, you are correct. But a generator behaves like an iterator and in my opinion, this is an implementation detail, a consumer of the method shouldn't care about. Don't you think so?

I do, but docstrings should be correct up to the type. A consumer will not care in most cases, but I think it's good practice to be technically correct.

How about saying something like: generator, iterates over all (row name, column name, instance) tuples. The i-th element returned by the generator is ...

ok, I'll merge it, and we can talk about the docstring separately

sktime/datatypes/_vectorize.py

fkiraly

See above - amazing improvement.

I'll leave it unmerged in case you want to action any of the comments.

This improves the docstrings for `VectorizedDF.items` and `.__iter__`, see discussion in #4195

…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228) Followup to #4195 Contributes to #4139 This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`. Removes the last reference to `VectorizedDF.__getitem__`. Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195). Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.

hoesler added 4 commits February 3, 2023 10:34

fix memory issue

1cac6ba

improve VectorizedDF.as_list

469366a

fix test errors

58b11a3

fix black formatting errors

fdfd79b

hoesler requested a review from fkiraly as a code owner February 3, 2023 09:44

fkiraly reviewed Feb 3, 2023

sktime/datatypes/_vectorize.py

fkiraly reviewed Feb 3, 2023

sktime/datatypes/_vectorize.py (outdated)

fkiraly requested changes Feb 3, 2023

fkiraly added the module:datatypes, enhancement, and module:base-framework labels Feb 3, 2023

implement __iter__

09ba6b6

hoesler added 2 commits February 9, 2023 15:37

add items method

acf132d

reimplement vectorize_est using items iteration

52a7b09

fix iteration over args_rowvec and include estimators

e023c12

fkiraly self-requested a review February 11, 2023 22:17

fkiraly reviewed Feb 11, 2023

sktime/datatypes/_vectorize.py

fkiraly approved these changes Feb 11, 2023

fkiraly merged commit c41efee into sktime:main Feb 12, 2023

This was referenced Feb 12, 2023

[DOC] improve docstring for VectorizedDF.items and .__iter__ #4223

Merged

[ENH] Improve panel mtype check performance #4196

Merged

hoesler mentioned this pull request Feb 13, 2023

[ENH] Improve vectorized metric calculation and deprecate VectorizedDF.__getitem__ and VectorizedDF.get_iloc_indexer #4228

Merged

5 tasks

hoesler deleted the improve-vectorization branch February 13, 2023 11:34

fkiraly added a commit that referenced this pull request Feb 22, 2023

[DOC] improve docstring for VectorizedDF.items and .__iter__ (#4223)

073d7f0

This improves the docstrings for `VectorizedDF.items` and `.__iter__`, see discussion in #4195

fkiraly mentioned this pull request Jun 7, 2023

[BUG] column-vectorized forecaster predictions are in wrong order if input is DataFrame with non-lexicographically ordered columns #4683

Closed
