
Add incremental training option to DAGMM model #65

Merged: 9 commits, Feb 4, 2022

Conversation

@isenilov (Contributor) commented Jan 27, 2022

Currently, the DAGMM model object is created from scratch every time the train method is called:

self.dagmm = self._build_model(X.shape[1]).to(self.device)

This makes it impossible to:

  • update the model by training it on a new dataset
  • train the model on several time series with the same timestamps and column names

The proposed change adds an option to perform incremental training by passing a corresponding dictionary to the train method, which disables model re-creation. The change does not affect the existing API, so the behavior stays the same if no train_config is passed.
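For illustration, here is a minimal sketch of the kind of guard this change introduces; the "incremental" key name and the surrounding method body are assumptions for the example, not the exact PR code:

def _train(self, X, train_config=None):
    train_config = {} if train_config is None else train_config
    # only rebuild the network from scratch when we are not continuing
    # from an already trained model
    if getattr(self, "dagmm", None) is None or not train_config.get("incremental", False):
        self.dagmm = self._build_model(X.shape[1]).to(self.device)
    # ... continue with the usual optimization loop on X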

The same change can also be applied to some of the other models.

@aadyotb, I would love to hear your opinion.

@aadyotb (Contributor) commented Jan 28, 2022

Thanks for the contribution. I'm hesitant to merge this as is because it's a pretty ad-hoc way of handling multi-series settings (a feature requested by #52). The main downsides are:

  1. It requires incremental calls to train() and doesn't provide scope for a single model to train on multiple time series in a single go. I can envision settings where batch training across time series may be called for.
  2. We would need to add a lot of boilerplate to individual models in order to support the current implementation.

If this is a direction you're interested in pursuing, I think a better approach would be to design a base class / mixin which implements common features of multi-series training (incremental training would be a good default option). Then, any model which supports such behavior can simply inherit from this base class. If a model has specific batch training which handles all time series together, it can override the default incremental multi-series training. I'm open to further discussion on this topic though.

As an aside: the docs job is currently failing because the docstring for train_config is not formatted as proper ReST. You'd probably want to create a bulleted list for this, with the syntax described here.
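For reference, a ReST bulleted list inside a docstring looks something like the following; the parameter descriptions here are illustrative, not the final wording:

def train(self, train_data, train_config=None):
    """
    Train the model.

    :param train_config: optional dict of additional training options, for example:

        - ``incremental``: if ``True``, keep the existing model and continue training on the new data
        - ``n_epochs``: number of passes over the training data
    """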

@isenilov (Contributor, Author) commented Jan 29, 2022

Thank you for the detailed comment!

> it's a pretty ad-hoc way of handling multi-series

Agree :) That is why it is still a draft created for further discussion.

What you propose definitely makes sense. I guess changing DetectorBase and adding a new method like train_incremental is not optimal, as it would require changing all the models that inherit from it.

Even though I am not a fan of multiple inheritance, this seems like a good use case for it.
So, getting into the details, I think you are proposing something like the following model to support incremental training:

# base.py
from abc import abstractmethod
from typing import Iterable, Union

class IncrementalTrainingMixin:
    @abstractmethod
    def train(self, train_data: Union[TimeSeries, Iterable[TimeSeries]], **kwargs):
        # the rest of the arguments are the same as in `DetectorBase`
        pass

# dagmm.py
class DAGMM(IncrementalTrainingMixin, DetectorBase):
    # the MRO means the `train` method from `IncrementalTrainingMixin`
    # overrides the one from `DetectorBase`
    def train(self, train_data: Union[TimeSeries, Iterable[TimeSeries]], **kwargs):
        if isinstance(train_data, TimeSeries):
            # run the existing training method on a single series
            self._train(X=train_data)
        else:
            # `isinstance` cannot check a subscripted generic like Iterable[TimeSeries],
            # so treat any non-TimeSeries input as an iterable of series and
            # incrementally train the model as proposed in the PR
            for data in train_data:
                self._train(X=data, incremental=True)

What do you think?
If this is an acceptable direction, two further questions:

  1. Should we override the train method or just add something like train_multiple to IncrementalTrainingMixin?
  2. Data shuffling has to be handled outside the train method in this implementation. Maybe we should handle it inside, with a shuffle: bool argument? It will be tricky to handle if an iterator/generator is passed, though (see the short sketch below).
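A quick, purely generic Python illustration of why a generator is awkward here (the series are placeholder strings, not real TimeSeries objects):

import random

# purely illustrative: pretend each string is a TimeSeries
train_data_iterable = (ts for ts in ["series_a", "series_b", "series_c"])

# shuffling needs a materialized sequence, so a lazy iterator/generator has to be
# drained first, which defeats the purpose of lazy loading
series_list = list(train_data_iterable)
random.shuffle(series_list)
print(series_list)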

@aadyotb (Contributor) commented Jan 29, 2022

This direction is more or less what I had in mind. I think it makes the most sense to add a new train_multiple method, to avoid adding a lot of boilerplate code such as the if/else blocks. Additionally, it's a clear way to signal which models do or don't support the feature.

For data shuffling, I think this can be quite model-specific. But some common behaviors (across both anomaly detection and forecasting) include:

  • training on one time series at a time and possibly shuffling the order
  • training for multiple epochs (i.e. multiple passes over the training data)
  • creating a dataset of (lookback, lookahead) slices from each time series and shuffling them (the file models/forecast/seq_ar_common.py contains helper code for this and is used for tree models; however, we may wish to revisit this behavior in a second PR and abstract it out a little more; a rough sketch of the idea follows this list)
  • any others you can think of?
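To make the slicing bullet concrete, here is a generic sketch of turning one series into (lookback, lookahead) training pairs; it illustrates the idea only and is not the seq_ar_common.py implementation:

import numpy as np

def make_slices(values, lookback, lookahead):
    # turn one series into (lookback, lookahead) training pairs
    X, y = [], []
    for i in range(len(values) - lookback - lookahead + 1):
        X.append(values[i:i + lookback])
        y.append(values[i + lookback:i + lookback + lookahead])
    return np.array(X), np.array(y)

# a 10-point series with lookback=3 and lookahead=2 yields 6 pairs;
# pairs from multiple series can then be pooled and shuffled together
X, y = make_slices(np.arange(10), lookback=3, lookahead=2)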

Of course, some of these may be complicated if we are using a lazy iterator of time series. So maybe we could start by assuming we have a List[TimeSeries] and then introduce a custom lazy loader at a later time, if the need arises.

I suspect that the cleanest implementation will be to include params shuffle and n_epochs in a train config dict.

What do you think?

@isenilov (Contributor, Author) commented Jan 31, 2022

Thanks @aadyotb for the comment! This makes sense, let me draft something tomorrow.

@isenilov (Contributor, Author) commented Feb 1, 2022

I added an initial (draft) implementation of multiple-series training.
It has several limitations/TODOs/points for discussion:

  • It is implemented only for the DAGMM model, via a chain of abstract mixins, MultipleTimeseriesDetectorMixin and MultipleTimeseriesModelMixin. Please advise whether this conforms to the Merlion architecture or whether part of the logic can be shared further up the inheritance chain.
  • The model is no longer re-created in the train method if it already exists.
  • Very basic shuffling implementation.
  • post_rule_train_config is just propagated to the train method as is and runs the same number of times. Not sure this is expected behavior.
  • n_epochs and shuffle have default values of 1 and False in the train_multiple method; there might be a better place to define them, and better default values (see the sketch after this list).
  • No unit tests for the method yet (TODO).
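For context, a minimal sketch of what such a train_multiple could look like, pieced together from the snippets quoted in the review below; the exact loop structure and label handling are assumptions for illustration, not the PR's code:

import random
from typing import List

class MultipleTimeseriesDetectorMixin:
    # meant to be mixed into a detector that already defines `train`
    def train_multiple(self, multiple_train_data: List["TimeSeries"], anomaly_labels=None,
                       train_config=None, post_rule_train_config=None):
        train_config = {} if train_config is None else train_config
        if anomaly_labels is None:
            anomaly_labels = [None] * len(multiple_train_data)
        n_epochs = train_config.pop("n_epochs", 1)
        shuffle = train_config.pop("shuffle", False)
        train_scores_list = []
        for _ in range(n_epochs):
            if shuffle:
                # very basic shuffling: reorder the series (and their labels) each epoch
                order = list(range(len(multiple_train_data)))
                random.shuffle(order)
                multiple_train_data = [multiple_train_data[i] for i in order]
                anomaly_labels = [anomaly_labels[i] for i in order]
            for data, labels in zip(multiple_train_data, anomaly_labels):
                train_scores_list.append(
                    self.train(data, anomaly_labels=labels, train_config=train_config,
                               post_rule_train_config=post_rule_train_config)
                )
        return train_scores_list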

PS: not sure why the docs build fails...

@aadyotb (Contributor) commented Feb 2, 2022

> PS: not sure why the docs build fails...

You should run `pip install -r docs/requirements.txt` and `sphinx-build -b html docs/source docs/build/html` from the root directory of Merlion. The docs CI is configured to fail if there are any warnings, so you should figure out where the warning is coming from and fix it accordingly.

@aadyotb (Contributor) left a comment

Thanks for the contribution! This looks mostly good to me, barring a couple of small issues I noted inline. Happy to approve once these are fixed. Also, see my comment above on how to diagnose why the docs build is failing.

On the lines:

else:
    anomaly_labels = [None] * len(train_data)
n_epochs = train_config.pop("n_epochs", 1)
shuffle = train_config.pop("shuffle", False)
@aadyotb (Contributor):

Maybe you can make this shuffle = train_config.pop("shuffle", n_epochs > 1)?

@isenilov (Contributor, Author):

Makes sense. I will also note in the docstring that shuffling is turned on by default when n_epochs > 1.

On the lines:

train_config=train_config, post_rule_train_config=post_rule_train_config
)
)
return train_scores_list
@aadyotb (Contributor):

There is a potential issue here, where the post-rule (calibrator and threshold) is trained individually on each time series. I think this is fine for the time being, but can you add a #FIXME comment here saying that the post-rule needs to be re-trained on the train_scores from all the models?

@isenilov (Contributor, Author) commented Feb 3, 2022:

Yes, I was not sure how this incremental training would affect the post_rule...

> on the train_scores from all the models

Do you mean from all the epochs?

@aadyotb (Contributor):

Yes, from all the epochs.

@isenilov (Contributor, Author) commented Feb 3, 2022

Thank you for the feedback! I will address the remarks and add a unit test for the method as well to make it complete.
UPD: in case this gets merged, when do you think the next version with these changes will be released?

@isenilov marked this pull request as ready for review on February 3, 2022 at 13:02.
On the lines (old signature vs. new signature):

- self, train_data: List[TimeSeries], anomaly_labels: List[TimeSeries] = None,
-     train_config=None, post_rule_train_config=None
+ self, multiple_train_data: List[TimeSeries], anomaly_labels: List[TimeSeries] = None,
+     train_config=dict(), post_rule_train_config=None
@aadyotb (Contributor):

Can you change the default value of train_config back to None? See here for why train_config=dict() can be a problem. The preferred pattern would be train_config = {} if train_config is None else train_config. Looks good to me once this is addressed.
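For anyone unfamiliar with the pitfall being referenced, a generic Python illustration (not Merlion code) of why a mutable default argument is risky:

def bad(train_config={}):
    train_config["seen"] = True      # mutates the single dict shared across all calls
    return train_config

def good(train_config=None):
    train_config = {} if train_config is None else train_config
    train_config["seen"] = True      # mutates a fresh dict (or the caller's own)
    return train_config

assert bad() is bad()                # same dict object on every call
assert good() is not good()          # a new dict each time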

@isenilov (Contributor, Author):

😅 Spending time with other languages makes me forget some of the Python pitfalls.
I agree that it is safer to avoid this altogether, even if no mutation actually happens.
I changed the default back to None and added the extra if.

@aadyotb (Contributor) commented Feb 3, 2022

> UPD: in case this gets merged, when do you think the next version with these changes will be released?

I can push out v1.1.2 once this is merged, along with one other PR in the works by me.

@aadyotb (Contributor) commented Mar 3, 2022

@isenilov v1.1.2 is now out with this feature.
