
[ENH] sktime-pywatts integration - __call__ dunder design #2654

Closed

fkiraly opened this issue May 17, 2022 · 19 comments
Labels
API design (API design & software architecture), enhancement (Adding new functionality), module:forecasting (forecasting module: incl probabilistic and hierarchical forecasting), module:transformations (transformations module: time series transformation, feature extraction, pre-/post-processing)

Comments

@fkiraly
Collaborator

fkiraly commented May 17, 2022

Part of pywatts integration design questions.

pywatts has a nice keras-like __call__ dunder syntax for creating pipelines.
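
To make the pattern concrete, here is a self-contained toy illustration of the keras-like __call__ construction style (all names are invented for illustration; this is not the actual pywatts API):

class Step:
    """Node in a pipeline graph; records its module and its inputs."""
    def __init__(self, module, inputs):
        self.module = module
        self.inputs = inputs

class Module:
    """Calling a module on the outputs of earlier steps adds a graph node."""
    def __call__(self, **inputs):
        return Step(self, inputs)

class Scaler(Module):
    pass

class Forecaster(Module):
    pass

# building the graph reads like keras' functional API:
scaled = Scaler()(x="load")        # leaf input, referenced by column name
forecast = Forecaster()(x=scaled)  # downstream step consumes the handle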

It is, however, not entirely clear:

  • which design we should follow - single layer (modules and estimators "fused") or dual layer (pywatts modules vs sktime estimators)
  • whether __call__ should live directly in sktime, or in pywatts. If not in sktime, how could it possibly work after import?

Some preliminary thoughts are in this document: https://github.com/sktime/community-org/blob/main/meetups/previous_meetings/20220506-pywatts.md

FYI @aiwalter

@fkiraly added the API design, module:forecasting, module:transformations, and enhancement labels May 17, 2022
@fkiraly added this to In progress in Graphpipeline (sktime-pywatts) May 17, 2022
@fkiraly
Collaborator Author

fkiraly commented May 17, 2022

FYI, this may interest you, @miraep8, due to connection with dunders/pipelines.

@benHeid
Contributor

benHeid commented May 18, 2022

Concerning the second bullet point, I just had an idea to make the __call__ dunder import-dependent. However, I suppose this solution is a bit hacky... and also confusing to users, because we probably could not customize the exception message if a user tries to use the __call__ dunder without having pyWATTS installed.

# CASE 1: bases contains JointAPIBase, since the import succeeds
# (os stands in for an installed optional package)

class OtherBases:
    def print_1(self):
        print(1)

class JointAPIBase:
    def print_2(self):
        print(2)

try:
    import os  # stand-in for the optional package; this import succeeds
    bases = [JointAPIBase, OtherBases]
except ImportError:
    bases = [OtherBases]

class BaseSKtime(*bases):
    pass

bar = BaseSKtime()
bar.print_1()
bar.print_2()  # works here, since JointAPIBase is among the bases


# CASE 2: bases does not contain JointAPIBase, since the package
# some_non_installed is not installed and the import fails

try:
    import some_non_installed
    bases = [JointAPIBase, OtherBases]
except ImportError:
    bases = [OtherBases]

class BaseSKtime(*bases):
    pass

bar = BaseSKtime()
bar.print_1()
try:
    bar.print_2()
except AttributeError:
    print("print_2 does not exist")

In the real integration, some_non_installed / os would be replaced by pyWATTS, and print_2 by __call__. Depending on whether the user has pyWATTS installed, the sktime module would then have a __call__ dunder or not.
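
One alternative that would allow a customized error message (a sketch only, not an agreed design): define __call__ unconditionally in sktime and defer the import to call time. The module name pyWATTS_pipeline and the function add_step are placeholders:

class BaseSKtimeAlternative:
    def __call__(self, **inputs):
        # deferring the optional import to call time lets us control the message
        try:
            import pyWATTS_pipeline  # placeholder for the real pyWATTS import
        except ImportError:
            raise ImportError(
                "the __call__ pipeline syntax requires pyWATTS to be installed"
            )
        return pyWATTS_pipeline.add_step(self, inputs)  # placeholder call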

@fkiraly
Collaborator Author

fkiraly commented Jun 22, 2022

Added material:
some designs are in the STEP: sktime/enhancement-proposals#20

from hackmd on June 14, we extracted four design decision elements:

  • (a) __call__ return: StepInformation vs GraphPipeline
  • (b) if (a) is GraphPipeline, does StepInformation still exist under the hood?
  • (c) input subsetting and output subsetting - what is the syntax? (see the sketch after this list)
    • input subsetting syntax options:
      • there is no input subsetting; instead, an extra "Pipeline" object used as the first element acts as an IdentityTransformer, which allows input subsetting to be treated as output subsetting of a preceding identity element
      • Column object passed to __call__
      • ColumnSelectorTransformer used as the first element
    • output subsetting syntax options:
      • ColumnSelectorTransformer used at the end
      • square bracket dunder, equivalent to post-concatenating a ColumnSelectorTransformer
      • output column selection argument in __call__
    • all take as args a string/int, an index, or a list of string/int (columns to subset to)
      • current pywatts allows only a single variable (string)
  • (d) does passing pipelines through call dunders modify the state/composition of the pipelines? I.e., is the interface "pure" or "impure"?
    • yes, __call__ modifies pipelines inside: "build-by-modification"
    • no, __call__ does not modify the pipeline: "build-by-clone"
    • no, __call__ does not modify the pipeline: "build-by-reference"
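
To compare the (c) input subsetting options side by side, here is a schematic sketch with runnable toy stubs (Scaler, Column, ColumnSelectorTransformer are placeholder names from the list above, not existing sktime or pywatts classes):

class Column:
    """Dummy object naming an input column."""
    def __init__(self, name):
        self.name = name

class Module:
    """Toy stand-in for a pipeline element; __call__ just records inputs."""
    def __call__(self, **inputs):
        self.inputs = inputs
        return self

class Scaler(Module):
    pass

class ColumnSelectorTransformer(Module):
    def __init__(self, columns):
        self.columns = columns

pipeline = {"load": "stand-in for the raw input column"}

# option 1: no explicit input subsetting; the pipeline object is indexed directly
scaled = Scaler()(x=pipeline["load"])

# option 2: a Column dummy object is passed to __call__
scaled = Scaler()(x=Column("load"))

# option 3: an explicit ColumnSelectorTransformer as the first element
scaled = Scaler()(x=ColumnSelectorTransformer("load")(x=pipeline))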

@fkiraly
Collaborator Author

fkiraly commented Jun 22, 2022

Some incomplete opinions on pros/cons:

  • modifying the structure of estimators is against user expectation in sklearn. I would go with (d) = no
  • sklearn has (d) = no, by clone - this is the user expectation (see the illustration below). pywatts does it by reference.
    • Benedikt: we may want to use pretrained components in a pipeline
    • FK: I would do this with a Pretrained wrapper, which does nothing in fit except retrieve the pretrained model
    • pywatts does that via a computation_mode argument in the call dunder. If the value is "transform", fitting is skipped. Similarly "refit"
    • FK: sktime has some similar concepts - FitInTransform, or RefitEvery - but these are not call args, they are wrappers
  • if (a) = GraphPipeline, we could lose the internal StepInformation, but it probably still makes sense to have two layers
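
For reference, a minimal runnable illustration of the build-by-clone convention that (d) = no would follow; clone is the actual sklearn function, the composition itself is schematic:

from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
pipe = make_pipeline(scaler, LinearRegression())

# composition keeps a reference to the original estimator object ...
assert pipe.steps[0][1] is scaler

# ... but fitting workflows (e.g. grid search) operate on clones, so the
# user-held object is never mutated as a side effect; "build-by-clone"
# would apply the same convention to the __call__ pipeline syntax
scaler_clone = clone(scaler)
assert scaler_clone is not scaler
assert scaler_clone.get_params() == scaler.get_params()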

@SMEISEN

SMEISEN commented Jun 28, 2022

Ideas

  • (a) __call__ return: StepInformation vs GraphPipeline

    • StepInformation is an object with the attributes Step and Pipeline;
      connecting steps is performed as in the current pyWATTS implementation, but StepInformation.Pipeline.train(data) enables training the (sub-)pipeline
  • (c) input subsetting syntax option: directly pass the data column, i.e. interpolated = LinearInterpolation()(x=data["bar"])

    advantages:

    • kind of intuitive handling of inputs
    • no additional imports required

    disadvantages:

    • pipeline.train() would be empty? Perhaps unintuitive.

@SMEISEN

SMEISEN commented Jun 28, 2022

Thoughts

  • would prefer input and output subsetting without dummy objects (Column(), ColumnSelectorTransformer, etc.) to simplify the API and avoid importing these additional objects
  • my preferred option for subsetting is "all take as args a string/int, an index, or a list of string/int (columns to subset to)"
  • output subsetting by using the square bracket dunder (__getitem__): selecting the output of a step with multiple outputs. What is the naming convention for these outputs?

@benHeid
Contributor

benHeid commented Jun 28, 2022

  • (a)

    • Drawbacks of returning GraphPipeline:
      • each pipeline has only one output or last step (this could probably depend on the choice at decision point (d))
    • Benefits of returning GraphPipeline:
      • more intuitive to use?
  • (b) Input Subsetting

    • Personal opinion: I would prefer to use a dummy object (Column()) or a string, because:
      • this is probably an operation which is used often. Thus, using a ColumnSelectorTransformer could cause a lot of code in the pipeline definition and could make the pipeline definition confusing
      • the square bracket dunder (__getitem__) is used in pandas to select a column of a DataFrame. Transferred to the pipeline, it would mean that the square brackets select the values the pipeline stands for, and for me these are the output data.
      • using a string or a dummy object is in some way analogous to the Keras API
  • (c) Output Subsetting:

    • Personal opinion: I would prefer the square bracket dunder because:
      • I think there will be a lot of arguments in the call dunder for controlling step execution, and we should pay attention to use as few arguments as possible
      • I think it is in some way analogous to pandas here, since the Pipeline/StepInformation stands for the data of the corresponding Transformer or Pipeline
      • similar to input subsetting, I could imagine that this functionality is used often
  • (d)

    • I suppose "build-by-reference" or "build-by-clone" has fewer side effects.
    • However, a solution for pretrained models has to be found:
      • for possible solutions, see FK's comment above

@miraep8
Collaborator

miraep8 commented Jun 29, 2022

(c) regarding subsetting:

  • Personally, I would agree with @benHeid that it makes sense to use a dummy variable for input subsetting. I just like the idea of separating the specification of a model from the data it is being used to model.
  • I think there is an additional advantage here of making it easier to specify and share models across problems as well.

(d) Here is an idea:

  • A potentially interesting way to use "build-by-clone" but still get the advantages of a pre-trained model could be to add the ability to save and share fitted parameters in each of our pipelines and models. This approach obviously has some traction already in the neural nets world (especially with regard to Transformers), but I think we could benefit from it here as well. Ultimately, I think it could help us facilitate the "load pre-trained parameters" functionality, and make it easier for people to share models and results with each other! It could also be a helpful practice to put into place now, since it seems like sktime will soon have more deep learning components, and parameter saving and sharing seems extra useful in that context.

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

thoughts on "pipelining fitted models".

this could be done using a FittedModel class, which:

  • as init arg, has the location of a serialized trained model
  • in fit, unpacks the serialized model, and ignores the data
  • in predict and other methods, behaves as usual

The serialized model arg to the init is polymorphic, and could be:

  • a fitted model object
  • a pickled fitted model object
  • serialized fitted object at a HD location

So, I would stick to the sklearn build-by-clone design, but combine it with the FittedModel paradigm.
Why that works to mimick build-by-reference: the reference is now in the init arg, which is not reset by a clone!
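
A minimal sketch of such a FittedModel wrapper (the class and argument names are illustrative; the actual design study is in #2903):

import pickle
from pathlib import Path

class FittedModel:
    """Wrapper that 'fits' by loading an already fitted model.

    Sketch only. The model init arg is polymorphic: a fitted estimator,
    a pickle bytestring, or a file path to a serialized model.
    """

    def __init__(self, model):
        self.model = model  # kept as-is, so a clone reconstructs the wrapper

    def fit(self, X=None, y=None):
        # unpack the serialized model; the training data is ignored
        if isinstance(self.model, (str, Path)):
            with open(self.model, "rb") as f:
                self.model_ = pickle.load(f)
        elif isinstance(self.model, bytes):
            self.model_ = pickle.loads(self.model)
        else:
            self.model_ = self.model
        return self

    def predict(self, X):
        # predict and other methods just delegate to the unpacked model
        return self.model_.predict(X)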

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

@miraep8, could you be more concrete about how your implementation of "load fitted parameters" would work?
Were you thinking of an extra method, a feature of the call dunder, or something else?

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2022

design study of FittedModel here: #2903

@miraep8
Collaborator

miraep8 commented Jul 1, 2022

I checked out the FittedModel PR and I think this approach seems reasonable.

I guess one question is: I suppose the 'model' that we are fitting/reloading will be a GraphPipeline estimator/object? So then the question for me is: do we need to create a new class that is this FittedModel, or can we just use the GraphPipeline class, but with the option of generating it by unpacking a pickle?

@miraep8
Collaborator

miraep8 commented Jul 1, 2022

Ahhh, I see. The advantage of having FittedModel as a separate class is that in fit we can skip fitting the parameters, which does not happen if we just treat it as a normal GraphPipeline object.

I guess this could get us to a separate question: if a part of the model is already fitted, we could skip re-fitting it, and instead only update the fitted parameters when update is called.

But I understand the design choice better now of making a new class.

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

I guess one question is - I suppose the 'model' that we are fitting/reloading will be a GraphPipeline estimator/object?

Not necessarily - it could simply be a neural network with a tensorflow back-end.

do we need to create a new class that is this FittedModel, or can we just use the GraphPipeline class, but have the option of generating this from unpacking a pickle?

We could, but that feels a bit like "a class that does too many things". The pattern also does not seem GraphPipeline-specific - see above, neural networks are another example. The other extreme would be "every class has it", i.e., fusing it not with a particular class but with the base class. I also don't like that, for reasons of separation of concerns.

I guess this could get us to a separate question of - if a part of the model is already fitted, we could skip re-fitting it, and instead only update the fitted parameters if update is called.

In the class I wrote, fit also updates (depending on a hyperparameter).

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

Note from @kalebphipps

  • simplicity is important for users - e.g., dunders, square brackets, dummy objects
  • putting things "under the hood" where this is more intuitive is good; it should not require more complex workflows like "build StepInformation and then wrap it", etc.

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

@miraep8:

  • prefers "everything is GraphPipeline"
  • sees that "build model" might be necessary and interim objects that have "multiple ends"
  • could GraphPipeline also have "multiple ends" in scope and construct only when final object has a "type"
  • do GraphPipeline need to be constructed in "forward" direction?
  • maybe need separate assembly/build process

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

from the entire discussion, an identified design point not yet dealt with:

if GraphPipeline, what is the status of an "incomplete network" with multiple ends - corresponding to an incomplete keras network pre-build?

  • it is a GraphPipeline, and the scope can be larger
  • it is a different object, and needs to be "built"
  • do we always need to build sequentially?

possible choices for dealing with "intermediate" pipelines:

  • return only the output of the element which was added last
    • that, plus an output_component parameter like in the multiplexer
  • polymorphic: incomplete pipelines are like StepInformation

@fkiraly
Collaborator Author

fkiraly commented Jul 1, 2022

Here's a suggestion on this question:

whether __call__ should live directly in sktime, or in pywatts. If not in sktime, how could it possibly work after import?

I think a good solution would be the following:

  • GraphPipeline lives in pywatts, but is sktime compatible
  • pywatts is an sktime soft dependency, localized to __call__
  • sktime is a pywatts soft dependency, for the base class (parent of GraphPipeline?) and testing
  • testing for GraphPipeline uses sktime check_estimator
  • sktime base classes implement the high-level __call__ logic, i.e., constructing the GraphPipeline
  • sktime base classes import the pywatts GraphPipeline internally in __call__; errors are raised via _check_soft_dependencies. This is similar to how the other dunders import a specific compositor class, e.g., * of BaseTransformer importing TransformerPipeline. A rough sketch follows below.
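
A rough sketch of what the last two points could look like; _check_soft_dependencies is the existing sktime soft-dependency checker, while the pywatts-side names (GraphPipeline, add_step) are assumptions about the integration, not existing API:

class BaseObject:
    def __call__(self, **inputs):
        # raises an informative error message if pywatts is not installed
        from sktime.utils.validation._dependencies import _check_soft_dependencies

        _check_soft_dependencies("pywatts")

        # the import is localized to __call__, keeping pywatts a soft dependency
        from pywatts import GraphPipeline  # assumed name

        # delegate the graph construction logic to the pywatts side (assumed method)
        return GraphPipeline.add_step(module=self, inputs=inputs)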

fkiraly added a commit that referenced this issue Jul 26, 2022
…ing (#2907)

This PR adds column subsetting functionality to transformers via the `[ ]` (`__getitem__`) dunder.

Similar to how this works in `pywatts`, using the `[ ]` dunder changes any transformer to additionally subset columns of inputs (first key) and/or outputs (second key).
This is done by pipelining with the existing `ColumnSelect` transformer, using the existing `TransformerPipeline`.

Examples:
`my_trafo["a"]` is shorthand for `ColumnSelect("a") * my_trafo`, which in turn is shorthand for `TransformerPipeline([ColumnSelect("a"), my_trafo])`
`my_trafo[idx1, idx2]` is shorthand for `ColumnSelect(idx1) * my_trafo * ColumnSelect(idx2)`, which in turn is shorthand for `TransformerPipeline([ColumnSelect(idx1), my_trafo, ColumnSelect(idx2)])`

Includes tests.

Relies on:
* #2906, so subsetting like `my_trafo["a"]` is valid (same as `my_trafo[["a"]]`)
* #2958, for an example where input subsetting and output subsetting do not commute
* #2959, since vectorization for the `ThetaLinesTransformer` in a pipeline was not working

Credit to `pywatts` for the idea - thanks, @benHeid, @SMEISEN, @kalebphipps!

Also see discussion here: #2654
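
A small usage sketch of the merged `[ ]` dunder; ExponentTransformer is an existing sktime transformer, the data frame is made up:

import pandas as pd
from sktime.transformations.series.exponent import ExponentTransformer

X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# subset input to column "a", then square; per the shorthand above this is
# equivalent to ColumnSelect("a") * ExponentTransformer(power=2)
trafo = ExponentTransformer(power=2)["a"]

Xt = trafo.fit_transform(X)  # single column "a" with values 1, 4, 9
print(Xt)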
@fkiraly moved this from In progress to To do in Graphpipeline (sktime-pywatts) Mar 3, 2023
@benHeid
Contributor

benHeid commented Jan 25, 2024

We decided to use a design without the __call__ dunder. Thus, I am closing this issue.

Feel free to reopen it, if you disagree.

@benHeid closed this as completed Jan 25, 2024
@benHeid moved this from To do to Done in Graphpipeline (sktime-pywatts) Jan 25, 2024