
[ENH] design issue - mapping weight handling and fine-tuning for FM on forecaster interface #6580

Open
fkiraly opened this issue Jun 12, 2024 · 17 comments
Labels
API design (API design & software architecture) · enhancement (Adding new functionality) · module:forecasting (forecasting module, incl. probabilistic and hierarchical forecasting)

Comments

@fkiraly
Collaborator

fkiraly commented Jun 12, 2024

Design issue for consolidating thoughts on how to map weight handling and fine-tuning for foundation models (FM) onto the forecaster interface.

I've summarized the conceptual model involving fitting and fine-tuning, and the various interface points that we need to match:

[diagram: conceptual model of FM fitting and fine-tuning, with the interface points to match]

Key observations:

  • we need to match fine-tuning and context separately
  • this is one more "fitting" stage than vanilla forecasters usually have
  • an option is the "global forecasting" interface, for fine-tuning
  • added challenge: allowing users to serialize fine-tuned weights

The above applies to foundation models whose vendors do not give users access to the original training algorithm or corpus.

@shlok191
Contributor

Hello Franz! I have made good progress on LagLlama and should be done soon. Can I work on this next? Thank you!

@fkiraly
Collaborator Author

fkiraly commented Jun 24, 2024

Sure! This is a design issue, so the way to work on it would be to propose interface patterns, starting with speculative code snippets, or to point out design decisions.

@julian-fong
Contributor

[diagram: edited version of the original conceptual diagram]

I would like to clarify my understanding of the original diagram - I have created an edited version, shown above. Is this diagram more or less correct?

  • The 'circle' in red highlights the areas which exist as part of the sktime interface but come from an outside source. The outside source presumably designs a completely new pristine model architecture, trains it on a large training corpus, and uploads it onto the model hub, allowing any user to download it for their own use or for any further interfacing.

    • This means the deep learning interface already comes with a pre-trained model (i.e., a global forecasting model with weights) that the sktime user has access to.
  • The 'circle' in green represents what the sktime user can do with the interface: either fine-tune the model weights via a smaller fine-tuning corpus, hence creating the ft model, and then upload their model to huggingface if they want, or directly put their model into production for inference/forecasting.

    • The interface for the deep learning (global) forecaster should be able to save and load models quickly via sktime/huggingface functions, so that the user can leverage the interface's predict function for any inference they need.

There are 3 main use cases that I can think of right now for deep-learning-interfaced global forecasters (a speculative call-level sketch follows the list):

  1. Model can do zero-shot predictions/forecasting, meaning that the user does not need to call any fit functions in order to use the predict functions. The user should have the ability to load the pre-trained model directly and begin using it for forecasting. Maybe some refactoring will be required, since self._is_fitted usually needs to be set to True via a fit function in order to do predictions.
  2. Model requires fine-tuning on a smaller dataset. In this case the user will load the pre-trained model and then fine-tune it on a smaller dataset of their choice. Would using the fit function be the best idea here? Or would a secondary 'fit' function be required, or perhaps an 'update'? I think during a tech session it was described that an update function would be useful for further small fine-tuning that updates a small portion of the weights.
  3. Model is set as pristine and is fitted on a dataset. This scenario would be quite rare/unlikely, since training would have to start again from scratch and would likely be computationally intensive due to the large number of parameters of most transformer models.
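A speculative call-level sketch of the three cases, assuming the current sktime forecaster API; `PretrainedFM` and its constructor parameters are placeholders for illustration, not an existing estimator:

```python
from sktime.datasets import load_airline

y = load_airline()

# 1. zero-shot: load pre-trained weights and forecast directly
fm = PretrainedFM()                  # placeholder for an interfaced foundation model
fm.fit(y, fh=[1, 2, 3])              # no weight updates; where the context goes is the open question below
y_pred = fm.predict()

# 2. fine-tuning: fit adapts the pre-trained weights on the user's smaller dataset
fm = PretrainedFM(model_path="some/pretrained-checkpoint")  # hypothetical parameter
fm.fit(y, fh=[1, 2, 3])              # fine-tunes, possibly via a second "fit" or update
y_pred = fm.predict()

# 3. pristine training from scratch: rare and computationally expensive
fm = PretrainedFM(pretrained=False)  # hypothetical parameter
fm.fit(y, fh=[1, 2, 3])
y_pred = fm.predict()
```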

@fkiraly
Collaborator Author

fkiraly commented Jun 29, 2024

> Is this diagram more or less correct?

"correct" being a subjective term, it aligns with my understanding though. Minor comments:

  • huggingface hub is one common way for companies/groups to share weights. As we have discovered, it is actually not the norm in forecasting (unlike in language/image learning) - it seems more common that weights are put in a random location on the internet.
  • the diagram highlights the most common case for FM. In the rarer case where the full code for the architecture and its training is available, users also have access to the pristine model, and may opt to train it to obtain their own FM. Ultimately, I think we want to cover both scenarios, but currently I feel we ought to focus on 3rd party architectures to build a consistent interface.

Regarding your use cases:

Re 1. yes, "zero shot" predictions are indeed a key subcase. In this case, the question is: how do we map the context? For zero-shot models, the context could be mapped into fit - this would not necessitate the "global" interface; or, we could map the context to predict. This is FK design 1 vs FK design 2 below; I will paste some thoughts.

Re 2. Indeed, I have been thinking about a design with "two fits" as well, or using update. These are further options, not using the "global forecasting" interface but mapping onto the current one. I think a mild extension of predict, as done by @Xinyu-Wu-0000, results in a more natural user experience.

Re 3. yes, I would park this use case. We should keep it in mind though.

There is a use case no. 4, where a user fine-tunes their model and wants to make the weights available as a serialized model, or as weights on Hugging Face. That is something we should, imo, also try to support, and before use case 3.

@fkiraly
Collaborator Author

fkiraly commented Jun 29, 2024

I'm pasting two designs which differ in where the context is passed. They are both applicable to zero-shot learning, but only design 2 is applicable to fine-tuning.

The unpleasant point about the two designs is that they are discrepant in the case of zero-shot learning.

Design 1

[diagram: design 1 - context passed in fit, no fine-tuning]

  • there is no fine-tuning
  • the context is passed in fit, as y
  • the forecast is obtained in predict. No y is passed in predict
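A minimal usage sketch for design 1, assuming a hypothetical zero-shot foundation model forecaster `ZeroShotFM`:

```python
# design 1: the context is passed in fit as y; predict takes no y
forecaster = ZeroShotFM()                  # hypothetical interfaced foundation model
forecaster.fit(y=y_context, fh=[1, 2, 3])  # stores the context, no weight updates
y_pred = forecaster.predict()              # forecast from pre-trained weights + context
```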

Design 2

[diagram: design 2 - fine-tuning in fit, context passed in predict]

  • fine-tuning happens in fit, the corpus is passed as y
  • the context is passed in predict, as y - this argument is new and taken from the "global forecasting" design implemented by @Xinyu-Wu-0000, see [ENH] DeepAR and NHiTS and refinements for pytorch-forecasting interface #6551
  • there are two sub-options on how to model zero-shot learning here:
    • 2A. zero-shot is if no y is passed in predict. Then the y from fit is taken as context. This is consistent with design 1, and current main usage of sktime forecasters.
    • 2B. zero-shot is if no y is passed in fit. Then the y from predict is taken as context. This is consistent with the fine-tuning case, as the context maps on the same argument.
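A minimal usage sketch for design 2, assuming a hypothetical fine-tunable forecaster `FineTunableFM` with the extended predict(y=...) from #6551:

```python
# design 2: the fine-tuning corpus goes into fit, the context goes into predict
forecaster = FineTunableFM()               # hypothetical interfaced foundation model
forecaster.fit(y=y_corpus, fh=[1, 2, 3])   # fine-tunes the pre-trained weights
y_pred = forecaster.predict(y=y_context)   # context passed at predict time

# 2A: zero-shot = no y in predict -> the y from fit is used as context
# 2B: zero-shot = no y in fit     -> the y from predict is used as context
```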

It is especially vexing that the difference between 2A and 2B shows that it might not be possible to remain consistent with both the current main usage and the fine-tuning usage in design 2.

@julian-fong
Contributor

julian-fong commented Jun 29, 2024

Since every deep learning model could be different, it may be useful to leverage specific deep learning _tags in order to let sktime and the user 'know' what kind of functions the interface should be capable of. The _tags could also inform some of the code inside the BaseForecaster classes, so that the deep learning estimator is flexible enough to accommodate any changes.

For example, consider:

PytorchForecastingNBeats

  • is not a pre-trained model but a model architecture, so users would typically train it on their own dataset and use the corresponding predict function to do inference. This mirrors the typical sktime model implementation, so not many edits to fit or predict would be required. A tag is_pristine could be used to let sktime/users know that this model must be trained by the user and does not offer any fine-tuning or zero-shot capabilities.

TinyTimeMixers (see https://huggingface.co/ibm-granite/granite-timeseries-ttm-v1)

  • is a pre-trained model that can do zero-shot forecasting. With a _tag can_zero_shot, users can directly use predict without calling fit beforehand, and self._is_fitted is set to True
  • the model can also be fine-tuned on a smaller dataset to improve forecast accuracy. In that case, a _tag can_fine_tune = True can be introduced to let sktime know that the user has the capability to specify a pre-trained model to fine-tune via a parameter in the __init__() function. An attribute self.model_path can be set so that the fit function can then fine-tune that pre-trained model on the new dataset.

momentfm.forecasting

  • This model requires fine-tuning (or few-shotting) on a smaller dataset, so fit must be called in order to use predict. However, this use case is different from the first, as the recommended usage is to load the pre-trained model for fine-tuning instead of re-initializing all the weights (thus making it pristine) and fitting it on the dataset. A _tag requires_fine_tune = True can be set in order to let the user know that fine-tuning is required to make predictions.

Having tags could be extremely helpful for new users who don't know what model capabilities are available. For example, for momentfm.forecasting, it took a bit of digging just to realize that zero-shot capabilities were not available, and it is still unknown whether the model could be re-trained from scratch on a new dataset.

A con of this idea is that having too many new _tags could prove a bit of an annoyance to set for every single deep learning forecasting model.
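To illustrate, a speculative sketch of how such tags might look on an interfaced estimator; the tag names follow the suggestions above and are not existing sktime tags:

```python
class TinyTimeMixerForecaster(BaseGlobalForecaster):
    """Hypothetical interface class for TinyTimeMixers, for illustration only."""

    _tags = {
        "is_pristine": False,         # ships with pre-trained weights
        "can_zero_shot": True,        # predict works without a prior fine-tuning fit
        "can_fine_tune": True,        # fit may fine-tune the pre-trained weights
        "requires_fine_tune": False,  # fine-tuning is optional, not mandatory
    }

    def __init__(self, model_path="ibm-granite/granite-timeseries-ttm-v1"):
        self.model_path = model_path
        super().__init__()
```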

@benHeid
Contributor

benHeid commented Jun 29, 2024

Mhm... Perhaps we can even do something like a compositor-based structure here. E.g., if you take a look at the PEFT library, it just wraps around the original model and behaves like a normal model. It is even possible to merge the weights into the original model to get the original model structure back.

So I wonder whether it would be possible for the foundation models to be GlobalForecasters (perhaps a special type of global forecaster), and for the fine-tuning module to be of the same type as the foundation models. Interface-wise, it would then probably be built as follows:

from peft import get_peft_model


class PEFTTunedModel(BaseGlobalForecaster):

    def __init__(self, foundation_model, peft_config):
        self.foundation_model = foundation_model
        self.peft_config = peft_config

    def fit(self, ...):
        self._model = get_peft_model(self.foundation_model, self.peft_config)
        # do global fit stuff

    def predict(self, ...):
        # do global predict stuff

    def merge_weights(self, ...):
        # PEFT allows merging of the adapter weights into the base model.
        # The merged model would have the same nn structure as type(self.foundation_model).
        # Thus, the merge_weights method might even return a new instance of type(self.foundation_model).

    def upload_weights(self, ...):
        # Upload or store the weights of the model, including the adapters
        # (or the merged weights).

Drawbacks:

  • We need to get the original nn model from an sktime class

Benefits:

  • Strong alignment with how PEFT works from a methodological point of view.
  • Interface-wise, I suppose it is almost completely covered by the current code base, so few if any API changes are needed. However, deprecation cycles for some changes would be needed.
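Usage of the sketched compositor might then look roughly like this, assuming a LoraConfig from the PEFT library and a placeholder foundation forecaster:

```python
from peft import LoraConfig

peft_forecaster = PEFTTunedModel(
    foundation_model=SomeFoundationForecaster(),  # placeholder estimator
    peft_config=LoraConfig(r=8, lora_alpha=16),
)
peft_forecaster.fit(y=y_finetune, fh=[1, 2, 3])  # trains only the adapter weights
y_pred = peft_forecaster.predict(y=y_context)    # global-forecaster style predict
merged = peft_forecaster.merge_weights()         # back to the original model type
```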

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2024

@julian-fong, very valid points!

I think we should write down speculative code for the use cases you have specified.

I think most of your suggestions make sense, including for tags. However, we should:

  • make sure that we do not proliferate interface points and tags; of course we can consider alternate designs at the moment
  • make sure that suggestions "add up" in the sense of all information is passed in the right order
    • for instance, if, in the zero-shot case, we skip fit, where do we pass the context, i.e., y?

@benHeid, I also think we need to think about the serialization pathways - we may not want to tie ourselves to one vendor (e.g., Hugging Face here).

For the PEFT compositor to work, do the deep learning based models need some special interface point?

@benHeid
Contributor

benHeid commented Jun 30, 2024

> @benHeid, I also think we need to think about the serialization pathways - we may not want to tie ourselves to one vendor (e.g., Hugging Face here).

Mhm, not sure what you are referring to - binding to a vendor would only be the case during upload, or? If we have HF models, we can also store the weights locally. Regarding the usage of PEFT, true, this is from HF; however, I am not aware of any alternative. Furthermore, in my opinion, the compositor approach should also be applicable to other (perhaps existing) fine-tuning libraries.

> For the PEFT compositor to work, do the deep learning based models need some special interface point?

No, no special interface point is needed in general. However, in specific cases there might be restrictions, e.g., the underlying model needs to be a transformer nn. E.g., according to the PEFT library, their LoRA implementation is applicable to all neural networks (perhaps only to all torch models, not sure about tensorflow support). E.g., here is an example for using LoRA with a simple MLP: https://huggingface.co/docs/peft/developer_guides/custom_models

Thus, I think it would be very cool if we were able to implement it as a compositor.
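For reference, a minimal example along the lines of the linked guide, applying LoRA to a plain torch module (the names in target_modules refer to the Sequential defined below):

```python
import torch
from torch import nn
from peft import LoraConfig, get_peft_model


class MLP(nn.Module):
    """Plain torch MLP, as in the PEFT 'custom models' guide."""

    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.seq(x)


# target_modules names the Linear layers that receive LoRA adapters
config = LoraConfig(target_modules=["seq.0", "seq.2"], r=8)
peft_mlp = get_peft_model(MLP(), config)
peft_mlp.print_trainable_parameters()  # only the adapter weights are trainable
```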

@fkiraly
Collaborator Author

fkiraly commented Jun 30, 2024

> Mhm, not sure what you are referring to - binding to a vendor would only be the case during upload, or?

I meant the format or location of serialized models. HF is a good place to share them, but we should also think about users who may want this entirely "off the grid", e.g., in a closed code base.

> No, no special interface point is needed in general.

Are you sure? Does it not at least need to be a neural network?

@benHeid
Contributor

benHeid commented Jul 3, 2024

> Mhm, not sure what you are referring to - binding to a vendor would only be the case during upload, or?

> I meant the format or location of serialized models. HF is a good place to share them, but we should also think about users who may want this entirely "off the grid", e.g., in a closed code base.

Yes, upload_weights should not only support HF; it should also enable local model storing or pushing weights into other repositories. Regarding storing the weights, we can also support different formats. However, I think safetensors should be supported directly.
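A minimal sketch of the local, "off the grid" path using safetensors directly; `forecaster._model` stands for the wrapped torch module and is an assumption about the internal attribute name:

```python
from safetensors.torch import load_file, save_file

# store the fine-tuned (unmerged or merged) weights locally
save_file(forecaster._model.state_dict(), "finetuned_weights.safetensors")

# later, possibly on another machine or in a closed code base
state_dict = load_file("finetuned_weights.safetensors")
forecaster._model.load_state_dict(state_dict)
```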

> No, no special interface point is needed in general.

> Are you sure? Does it not at least need to be a neural network?

Oh, yes I misunderstood it. Yes, it needs to be a neural network. Perhaps even a torch model.

@fkiraly
Collaborator Author

fkiraly commented Jul 4, 2024

> Yes, upload_weights should not only support HF; it should also enable local model storing or pushing weights into other repositories.

How about using save and load here? We have been silently drifting anyway towards allowing multiple serialization backends. More high-level, I am in favour of reusing and unifying interface points if they are semantically similar or identical (only differing in concretion, not in abstract object or procedure).

@benHeid
Contributor

benHeid commented Jul 4, 2024

> How about using save and load here? We have been silently drifting anyway towards allowing multiple serialization backends. More high-level, I am in favour of reusing and unifying interface points if they are semantically similar or identical (only differing in concretion, not in abstract object or procedure).

Makes sense. So the revised version would look like:

from peft import get_peft_model


class PEFTTunedModel(BaseGlobalForecaster):

    def __init__(self, foundation_model, peft_config):
        self.foundation_model = foundation_model
        self.peft_config = peft_config

    def fit(self, ...):
        self._model = get_peft_model(self.foundation_model, self.peft_config)
        # do global fit stuff

    def predict(self, ...):
        # do global predict stuff

    def merge_weights(self, ...):
        # PEFT allows merging of the adapter weights into the base model.
        # The merged model would have the same nn structure as type(self.foundation_model).
        # Thus, the merge_weights method might even return a new instance of type(self.foundation_model).

    def save(self, ...):
        # Save the unmerged weights.
        # To save merged weights, the user needs to use: model.merge_weights().save()
        # TODO: How to control the serialization method? Via an argument of save, or a class parameter?

    def load(self, ...):
        # Load unmerged weights.

A further open TODO is that we must define an interface for getting the neural network to which we are applying the PEFT methods. E.g., we could let all deep learning methods that should be fine-tunable inherit from a common base class that enforces that a get_nn_model method is implemented, returning the internally used nn model.
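A rough sketch of such a base class; the class and method names are merely the ones proposed above, not existing sktime classes:

```python
from abc import abstractmethod


class BaseFineTunableForecaster(BaseGlobalForecaster):
    """Hypothetical base class for DL forecasters exposing their internal nn model."""

    @abstractmethod
    def get_nn_model(self):
        """Return the internally used (torch) nn model, e.g., for PEFT wrapping."""


# inside the PEFT compositor's fit, the interface point would then be used as:
#   nn_model = self.foundation_model.get_nn_model()
#   self._model = get_peft_model(nn_model, self.peft_config)
```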

@pranavvp16
Contributor

> Mhm... Perhaps we can even do something like a compositor-based structure here. E.g., if you take a look at the PEFT library, it just wraps around the original model and behaves like a normal model. It is even possible to merge the weights into the original model to get the original model structure back.

Yes, this would be a good idea to implement, as PEFT generalizes across transformers and most neural network architectures, providing a good level of abstraction for sktime users to fine-tune foundation models and get their own model.

What I'm concerned about is that every foundation model has a different data design on which it was trained, which directly corresponds to the architecture of the model, i.e., the number of input parameters of the forward method. So how does that work for PEFT models? I think the data needs to be preprocessed to make it compatible with the input layers of the model. The data format also needs to be a Hugging Face dataset, if I'm not wrong. In this scenario, what I see as a solution would be that in the fit method we also obtain the special datatype, something like the TimeSeriesDataset from the tsfm library, as self.foundation_model will also correspond to the data preprocessing function that is used in the sktime class.

So the workflow would be something like (a rough sketch follows the list):

  1. The sktime user passes an sktime-compatible data container.
  2. Get the nn model as well as the preprocessing function from the sktime class.
  3. Use the preprocessing function to make the data compatible with the input layers of the model.
  4. fit, i.e., fine-tune.
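A rough sketch of that workflow inside the compositor's fit; get_nn_model, get_preprocessor, and _finetune are assumed interface points, not existing methods:

```python
from peft import get_peft_model


def fit(self, y, X=None, fh=None):
    # 1. y is the sktime-compatible data container passed by the user
    nn_model = self.foundation_model.get_nn_model()        # 2. underlying neural network
    preprocess = self.foundation_model.get_preprocessor()  # 2. model-specific preprocessing
    train_dataset = preprocess(y, X)                       # 3. match the model's input layers
    self._model = get_peft_model(nn_model, self.peft_config)
    self._finetune(self._model, train_dataset)             # 4. fine-tune the adapter weights
    return self
```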

Other functions would be easy to implement if we at least get this working. For load and save, other widely used formats could be .ckpt, .h5, and .pt, while .safetensors comes by default. We should leave functions like upload_weights, i.e., push_to_hub, to users, as it would also require uploading the config.json for model initialisation along with the weights. Users could just save weights locally and upload them wherever they like, e.g., Kaggle, HF, or GitHub.

@julian-fong
Contributor

julian-fong commented Jul 9, 2024

You would probably have to design conversion methods from sktime pandas DataFrames into Hugging Face's DatasetDicts if the design involves wrapping around Hugging Face transformers models. Since time series datasets are pretty much tabular, I wonder how that would work.

edit: now that we have our own Hugging Face org, why not try to upload an open source dataset onto it and see how that works? 😆
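For illustration, converting a tabular (long-format) pandas frame into a Hugging Face Dataset is straightforward; the column names below are just an example, not a proposed sktime convention:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

y_long = pd.DataFrame(
    {
        "series_id": ["a", "a", "a", "b", "b", "b"],
        "timestamp": list(pd.date_range("2024-01-01", periods=3)) * 2,
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

hf_dataset = Dataset.from_pandas(y_long, preserve_index=False)
dataset_dict = DatasetDict({"train": hf_dataset})  # how to define splits is an open question
```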

@benHeid
Contributor

benHeid commented Jul 10, 2024

The underlying model has to implement it too, e.g., each of the foundation models. Thus, I would aim for building a proper interface in the DL-based forecaster to get the dataset. We have to adapt it anyway to get the underlying neural network.

This evening, I will try to post an extended interface design, also including the interface for models that can be passed to the PEFTForecaster.

@pranavvp16
Contributor

pranavvp16 commented Jul 10, 2024

> The underlying model has to implement it too, e.g., each of the foundation models. Thus, I would aim for building a proper interface in the DL-based forecaster to get the dataset. We have to adapt it anyway to get the underlying neural network.

Yes, that's where I'm pointing to. Currently, what I see in sktime is that models do preprocessing in the fit method; maybe BaseGlobalForecaster should also have a build_dataset function/interface that every foundation model should implement. This would make preprocessing a component of the sktime foundation model class.
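A minimal sketch of that interface point; build_dataset is the name proposed above, not an existing sktime method:

```python
from abc import abstractmethod


class BaseGlobalForecaster(BaseForecaster):
    """Sketch only: adds a dataset-building interface point for foundation models."""

    @abstractmethod
    def build_dataset(self, y, X=None):
        """Convert sktime data containers into the model-specific dataset format."""

    def _fit(self, y, X=None, fh=None):
        dataset = self.build_dataset(y, X)  # model-specific preprocessing
        # ... fine-tune / fit the underlying nn model on `dataset`
        return self
```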
