
[ENH] sktime-pywatts integration - __call__ dunder design #2654

Closed

fkiraly opened this issue May 17, 2022 · 19 comments
Labels
API design (API design & software architecture), enhancement (Adding new functionality), module:forecasting (forecasting module: incl probabilistic and hierarchical forecasting), module:transformations (transformations module: time series transformation, feature extraction, pre-/post-processing)

Comments

@fkiraly
Collaborator

fkiraly commented May 17, 2022

Part of pywatts integration design questions.

pywatts has a nice keras-like __call__ dunder syntax for creating pipelines.
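
To make the pattern concrete, here is a self-contained toy illustration of the keras-like __call__ construction style (all names are invented for illustration; this is not the actual pywatts API):

class Step:
    """Node in a pipeline graph; records its module and its inputs."""
    def __init__(self, module, inputs):
        self.module = module
        self.inputs = inputs

class Module:
    """Calling a module on the outputs of earlier steps adds a graph node."""
    def __call__(self, **inputs):
        return Step(self, inputs)

class Scaler(Module):
    pass

class Forecaster(Module):
    pass

# building the graph reads like keras' functional API:
scaled = Scaler()(x="load")        # leaf input, referenced by column name
forecast = Forecaster()(x=scaled)  # downstream step consumes the handle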

It is, however, not entirely clear:

  • which design we should follow - single layer (modules and estimators "fused") or dual layer (pywatts modules vs sktime estimators)
  • whether __call__ should live directly in sktime, or in pywatts. If not in sktime, how could it possibly work after import?

Some preliminary thoughts are in this document: https://github.com/sktime/community-org/blob/main/meetups/previous_meetings/20220506-pywatts.md

FYI @aiwalter

@fkiraly added the API design, module:forecasting, module:transformations, and enhancement labels May 17, 2022
@fkiraly added this to In progress in Graphpipeline (sktime-pywatts) May 17, 2022
@fkiraly
Collaborator Author

fkiraly commented May 17, 2022

FYI, this may interest you, @miraep8, due to connection with dunders/pipelines.

@benHeid
Contributor

benHeid commented May 18, 2022

Concerning the second bullet point, I just had an idea to make the __call__ dunder import-dependent. However, I suppose this solution is a bit hacky... and also confusing to users, because we probably could not customize the exception message if a user tries to use the __call__ dunder without having pyWATTS installed.

# CASE 1: bases contains JointAPIBase, since the import succeeds
# (os stands in for an installed optional package)

class OtherBases:
    def print_1(self):
        print(1)

class JointAPIBase:
    def print_2(self):
        print(2)

try:
    import os  # stand-in for the optional package; this import succeeds
    bases = [JointAPIBase, OtherBases]
except ImportError:
    bases = [OtherBases]

class BaseSKtime(*bases):
    pass

bar = BaseSKtime()
bar.print_1()
bar.print_2()  # works here, since JointAPIBase is among the bases


# CASE 2: bases does not contain JointAPIBase, since the package
# some_non_installed is not installed and the import fails

try:
    import some_non_installed
    bases = [JointAPIBase, OtherBases]
except ImportError:
    bases = [OtherBases]

class BaseSKtime(*bases):
    pass

bar = BaseSKtime()
bar.print_1()
try:
    bar.print_2()
except AttributeError:
    print("print_2 does not exist")

In the real integration, some_non_installed / os would be replaced by pyWATTS, and print_2 by __call__. Depending on whether the user has pyWATTS installed, the sktime module would then have a __call__ dunder or not.
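
One alternative that would allow a customized error message (a sketch only, not an agreed design): define __call__ unconditionally in sktime and defer the import to call time. The module name pyWATTS_pipeline and the function add_step are placeholders:

class BaseSKtimeAlternative:
    def __call__(self, **inputs):
        # deferring the optional import to call time lets us control the message
        try:
            import pyWATTS_pipeline  # placeholder for the real pyWATTS import
        except ImportError:
            raise ImportError(
                "the __call__ pipeline syntax requires pyWATTS to be installed"
            )
        return pyWATTS_pipeline.add_step(self, inputs)  # placeholder call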

@fkiraly
Collaborator Author

fkiraly commented Jun 22, 2022

Added material:
some designs are in the STEP: sktime/enhancement-proposals#20

from hackmd on June 14, we extracted four design decision elements:

  • (a) __call__ return: StepInformation vs GraphPipeline
  • (b) if (a) is GraphPipeline, does StepInformation still exist under the hood?
  • (c) input subsetting and output subsetting - what is the syntax? (see the sketch after this list)
    • input subsetting syntax options:
      • there is no input subsetting; instead, an extra "Pipeline" object used as the first element acts as an IdentityTransformer, which allows input subsetting to be treated as output subsetting of a preceding identity element
      • Column object passed to __call__
      • ColumnSelectorTransformer used as the first element
    • output subsetting syntax options:
      • ColumnSelectorTransformer used at the end
      • square bracket dunder, equivalent to post-concatenating a ColumnSelectorTransformer
      • output column selection argument in __call__
    • all take as args a string/int, an index, or a list of string/int (columns to subset to)
      • current pywatts allows only a single variable (string)
  • (d) does passing pipelines through call dunders modify the state/composition of the pipelines? I.e., is the interface "pure" or "impure"?
    • yes, __call__ modifies pipelines inside: "build-by-modification"
    • no, __call__ does not modify the pipeline: "build-by-clone"
    • no, __call__ does not modify the pipeline: "build-by-reference"
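
To compare the (c) input subsetting options side by side, here is a schematic sketch with runnable toy stubs (Scaler, Column, ColumnSelectorTransformer are placeholder names from the list above, not existing sktime or pywatts classes):

class Column:
    """Dummy object naming an input column."""
    def __init__(self, name):
        self.name = name

class Module:
    """Toy stand-in for a pipeline element; __call__ just records inputs."""
    def __call__(self, **inputs):
        self.inputs = inputs
        return self

class Scaler(Module):
    pass

class ColumnSelectorTransformer(Module):
    def __init__(self, columns):
        self.columns = columns

pipeline = {"load": "stand-in for the raw input column"}

# option 1: no explicit input subsetting; the pipeline object is indexed directly
scaled = Scaler()(x=pipeline["load"])

# option 2: a Column dummy object is passed to __call__
scaled = Scaler()(x=Column("load"))

# option 3: an explicit ColumnSelectorTransformer as the first element
scaled = Scaler()(x=ColumnSelectorTransformer("load")(x=pipeline))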

@fkiraly
Collaborator Author

fkiraly commented Jun 22, 2022

Some incomplete opinions on pros/cons:

  • modifying the structure of estimators is against user expectation in sklearn. I would go with (d) = no
  • sklearn has (d) = no, by clone - this is the user expectation (see the illustration below). pywatts does it by reference.
    • Benedikt: we may want to use pretrained components in a pipeline
    • FK: I would do this with a Pretrained wrapper, which does nothing in fit except retrieve the pretrained model
    • pywatts does that via a computation_mode argument in the call dunder. If the value is "transform", fitting is skipped. Similarly "refit"
    • FK: sktime has some similar concepts - FitInTransform, or RefitEvery - but these are not call args, they are wrappers
  • if (a) = GraphPipeline, we could lose the internal StepInformation, but it probably still makes sense to have two layers
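
For reference, a minimal runnable illustration of the build-by-clone convention that (d) = no would follow; clone is the actual sklearn function, the composition itself is schematic:

from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
pipe = make_pipeline(scaler, LinearRegression())

# composition keeps a reference to the original estimator object ...
assert pipe.steps[0][1] is scaler

# ... but fitting workflows (e.g. grid search) operate on clones, so the
# user-held object is never mutated as a side effect; "build-by-clone"
# would apply the same convention to the __call__ pipeline syntax
scaler_clone = clone(scaler)
assert scaler_clone is not scaler
assert scaler_clone.get_params() == scaler.get_params()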

@SMEISEN

SMEISEN commented Jun 28, 2022

Ideas

  • (a) __call__ return: StepInformation vs GraphPipeline

    • StepInformation is an object with the attributes Step and Pipeline;
      connecting steps is performed as in the current pyWATTS implementation, but StepInformation.Pipeline.train(data) enables training the (sub-)pipeline
  • (c) input subsetting syntax option: directly pass the data column, i.e. interpolated = LinearInterpolation()(x=data["bar"])

    advantages:

    • kind of intuitive handling of inputs
    • no additional imports required

    disadvantages:

    • pipeline.train() would be empty? Perhaps unintuitive.

@SMEISEN

SMEISEN commented Jun 28, 2022

Thoughts

  • would prefer input and output subsetting without dummy objects (Column(), ColumnSelectorTransformer, etc.) to simplify the API and avoid importing these additional objects
  • my preferred option for subsetting is "all take as args a string/int, an index, or a list of string/int (columns to subset to)"
  • output subsetting by using the square bracket dunder (__getitem__): selecting the output of a step with multiple outputs. What is the naming convention for these outputs?

@benHeid
Contributor

benHeid commented Jun 28, 2022

  • (a)

    • Drawbacks of returning GraphPipeline:
      • each pipeline has only one output or last step (this could probably depend on the choice at decision point (d))
    • Benefits of returning GraphPipeline:
      • more intuitive to use?
  • (b) Input Subsetting

    • Personal opinion: I would prefer to use a dummy object (Column()) or a string, because:
      • this is probably an operation which is used often. Thus, using a ColumnSelectorTransformer could cause a lot of code in the pipeline definition and could make the pipeline definition confusing
      • the square bracket dunder (__getitem__) is used in pandas to select a column of a DataFrame. Transferred to the pipeline, it would mean that the square brackets select the values the pipeline stands for, and for me these are the output data.
      • using a string or a dummy object is in some way analogous to the Keras API
  • (c) Output Subsetting:

    • Personal opinion: I would prefer the square bracket dunder because:
      • I think there will be a lot of arguments in the call dunder for controlling step execution, and we should pay attention to use as few arguments as possible
      • I think it is in some way analogous to pandas here, since the Pipeline/StepInformation stands for the data of the corresponding Transformer or Pipeline
      • similar to input subsetting, I could imagine that this functionality is used often
  • (d)

    • I suppose "build-by-reference" or "build-by-clone" has fewer side effects.
    • However, a solution for pretrained models has to be found:
      • for possible solutions, see FK's comment above

@miraep8
Collaborator

miraep8 commented Jun 29, 2022

(c) regarding subsetting:

  • Personally, I would agree with @benHeid that it makes sense to use a dummy variable for input subsetting. I just like the idea of separating the specification of a model from the data it is being used to model.
  • I think there is an additional advantage here of making it easier to specify and share models across problems as well.

(d) Here is an idea:

  • A potentially interesting way to use "build-by-clone" but still get the advantages of a pre-trained model could be to add the ability to save and share fitted parameters in each of our pipelines and models. This approach obviously has some traction already in the neural nets world (especially with regard to Transformers), but I think we could benefit from it here as well. Ultimately, I think it could help us facilitate the "load pre-trained parameters" functionality, and make it easier for people to share models and results with each other! It could also be a helpful practice to put into place now, since it seems like sktime will soon have more deep learning components, and parameter saving and sharing seems extra useful in that context.

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

thoughts on "pipelining fitted models".

this could be done using a FittedModel class, which:

  • as init arg, has the location of a serialized trained model
  • in fit, unpacks the serialized model, and ignores the data
  • in predict and other methods, behaves as usual

The serialized model arg to the init is polymorphic, and could be:

  • a fitted model object
  • a pickled fitted model object
  • serialized fitted object at a HD location

So, I would stick to the sklearn build-by-clone design, but combine it with the FittedModel paradigm.
Why that works to mimick build-by-reference: the reference is now in the init arg, which is not reset by a clone!
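
A minimal sketch of such a FittedModel wrapper (the class and argument names are illustrative; the actual design study is in #2903):

import pickle
from pathlib import Path

class FittedModel:
    """Wrapper that 'fits' by loading an already fitted model.

    Sketch only. The model init arg is polymorphic: a fitted estimator,
    a pickle bytestring, or a file path to a serialized model.
    """

    def __init__(self, model):
        self.model = model  # kept as-is, so a clone reconstructs the wrapper

    def fit(self, X=None, y=None):
        # unpack the serialized model; the training data is ignored
        if isinstance(self.model, (str, Path)):
            with open(self.model, "rb") as f:
                self.model_ = pickle.load(f)
        elif isinstance(self.model, bytes):
            self.model_ = pickle.loads(self.model)
        else:
            self.model_ = self.model
        return self

    def predict(self, X):
        # predict and other methods just delegate to the unpacked model
        return self.model_.predict(X)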

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

@miraep8, could you be more concrete about how your implementation of "load fitted parameters" would work?
Were you thinking of an extra method, a feature of the call dunder, or something else?

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

design study of FittedModel here: #2903

@miraep8
Collaborator

miraep8 commented Jul 1, 2022

I checked out the FittedModel PR and I think this approach seems reasonable.

I guess one question is: I suppose the 'model' that we are fitting/reloading will be a GraphPipeline estimator/object? So then the question for me is: do we need to create a new class that is this FittedModel, or can we just use the GraphPipeline class, but with the option of generating it by unpacking a pickle?

@miraep8
Collaborator

miraep8 commented Jul 1, 2022

Ahhh, I see. The advantage of having FittedModel as a separate class is that in fit we can skip fitting the parameters, which does not happen if we just treat it as a normal GraphPipeline object.

I guess this could get us to a separate question: if a part of the model is already fitted, we could skip re-fitting it, and instead only update the fitted parameters when update is called.

But I understand the design choice better now of making a new class.

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

I guess one question is - I suppose the 'model' that we are fitting/reloading will be a GraphPipeline estimator/object?

Not necessarily - it could simply be a neural network with a tensorflow back-end.

do we need to create a new class that is this FittedModel, or can we just use the GraphPipeline class, but have the option of generating this from unpacking a pickle?

We could, but that feels a bit like "a class that does too many things". The pattern also does not seem GraphPipeline-specific - see above, neural networks are another example. The other extreme would be "every class has it", i.e., fusing it not with a particular class but with the base class. I also don't like that, for reasons of separation of concerns.

I guess this could get us to a separate question of - if a part of the model is already fitted, we could skip re-fitting it, and instead only update the fitted parameters if update is called.

In the class I wrote, fit also updates (depending on a hyperparameter).

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

Note from @kalebphipps

  • simplicity is important for users - e.g., dunders, square brackets, dummy objects
  • putting things "under the hood" where this is more intuitive is good; it should not require more complex workflows like "build StepInformation and then wrap it", etc.

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

@miraep8:

  • prefers "everything is GraphPipeline"
  • sees that "build model" might be necessary and interim objects that have "multiple ends"
  • could GraphPipeline also have "multiple ends" in scope and construct only when final object has a "type"
  • do GraphPipeline need to be constructed in "forward" direction?
  • maybe need separate assembly/build process

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

from the entire discussion, an identified design point not yet dealt with:

if GraphPipeline, what is the status of an "incomplete network" with multiple ends - corresponding to an incomplete keras network pre-build?

  • it is a GraphPipeline, and the scope can be larger
  • it is a different object, and needs to be "built"
  • do we always need to build sequentially?

possible choices for dealing with "intermediate" pipelines:

  • return only the output of the element which was added last
    • that, plus an output_component parameter like in the multiplexer
  • polymorphic: incomplete pipelines are like StepInformation

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

Here's a suggestion on this question:

whether __call__ should live directly in sktime, or in pywatts. If not in sktime, how could it possibly work after import?

I think a good solution would be the following:

  • GraphPipeline lives in pywatts, but is sktime compatible
  • pywatts is an sktime soft dependency, localized to __call__
  • sktime is a pywatts soft dependency, for the base class (parent of GraphPipeline?) and testing
  • testing for GraphPipeline uses sktime check_estimator
  • sktime base classes implement the high-level __call__ logic, i.e., constructing the GraphPipeline
  • sktime base classes import the pywatts GraphPipeline internally in __call__; errors are raised via _check_soft_dependencies. This is similar to how the other dunders import a specific compositor class, e.g., * of BaseTransformer importing TransformerPipeline. A rough sketch follows below.
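
A rough sketch of what the last two points could look like; _check_soft_dependencies is the existing sktime soft-dependency checker, while the pywatts-side names (GraphPipeline, add_step) are assumptions about the integration, not existing API:

class BaseObject:
    def __call__(self, **inputs):
        # raises an informative error message if pywatts is not installed
        from sktime.utils.validation._dependencies import _check_soft_dependencies

        _check_soft_dependencies("pywatts")

        # the import is localized to __call__, keeping pywatts a soft dependency
        from pywatts import GraphPipeline  # assumed name

        # delegate the graph construction logic to the pywatts side (assumed method)
        return GraphPipeline.add_step(module=self, inputs=inputs)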

fkiraly added a commit that referenced this issue Jul 26, 2022
…ing (#2907)

This PR adds column subsetting functionality to transformers via the `[ ]` (`__getitem__`) dunder.

Similar to how this works in `pywatts`, using the `[ ]` dunder changes any transformer to additionally subset columns of inputs (first key) and/or outputs (second key).
This is done by pipelining with the existing `ColumnSelect` transformer, using the existing `TransformerPipeline`.

Examples:
`my_trafo["a"]` is shorthand for `ColumnSelect("a") * my_trafo`, which in turn is shorthand for `TransformerPipeline([ColumnSelect("a"), my_trafo])`
`my_trafo[idx1, idx2]` is shorthand for `ColumnSelect(idx1) * my_trafo * ColumnSelect(idx2)`, which in turn is shorthand for `TransformerPipeline([ColumnSelect(idx1), my_trafo, ColumnSelect(idx2)])`

Includes tests.

Relies on:
* #2906, so subsetting like `my_trafo["a"]` is valid (same as `my_trafo[["a"]]`)
* #2958, for an example where input subsetting and output subsetting do not commute
* #2959, since vectorization for the `ThetaLinesTransformer` in a pipeline was not working

Credit to `pywatts` for the idea - thanks, @benHeid, @SMEISEN, @kalebphipps!

Also see discussion here: #2654
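
A small usage sketch of the merged `[ ]` dunder; ExponentTransformer is an existing sktime transformer, the data frame is made up:

import pandas as pd
from sktime.transformations.series.exponent import ExponentTransformer

X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# subset input to column "a", then square; per the shorthand above this is
# equivalent to ColumnSelect("a") * ExponentTransformer(power=2)
trafo = ExponentTransformer(power=2)["a"]

Xt = trafo.fit_transform(X)  # single column "a" with values 1, 4, 9
print(Xt)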
@fkiraly moved this from In progress to To do in Graphpipeline (sktime-pywatts) Mar 3, 2023
@benHeid
Contributor

benHeid commented Jan 25, 2024

We decided to use a design without the __call__ dunder. Thus, I am closing this issue.

Feel free to reopen it, if you disagree.

@benHeid closed this as completed Jan 25, 2024
@benHeid moved this from To do to Done in Graphpipeline (sktime-pywatts) Jan 25, 2024