Preprocessing non-contiguous segments #171

Open
sarahmish opened this issue Feb 2, 2021 · 6 comments
Labels: new feature

Comments

@sarahmish
Collaborator

Currently, most pipelines share the same preprocessing primitives, applied in the following order:

  1. mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate
    makes the signal equi-spaced based on the specified interval.

  2. sklearn.impute.SimpleImputer
    imputes missing values.

  3. sklearn.preprocessing.MinMaxScaler
    normalizes the data to a specified range.

  4. mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences
    creates multiple training window examples based on the window_size and step_size.

However, we do not always want to make the signal equi-spaced; sometimes we want to retain the gaps within the signal. Supporting this requires two changes:

  1. Normalize the data first, so that the specified range is maintained across all segments.
  2. Create segments based on the suggested max_gap, apply primitives 1, 2 and 4 above to each segment, then concatenate the results.

The sequence of preprocessing primitives would be:

"sklearn.preprocessing.MinMaxScaler",
"orion.primitives.timeseries_preprocessing.segment", # suggested
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
"sklearn.impute.SimpleImputer",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.timeseries_preprocessing.concatenate" # suggested
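
To illustrate the first two steps of this sequence, here is a minimal sketch using plain NumPy. The helpers `min_max_scale` and `split_on_gaps` are hypothetical names chosen for this example; they are not existing Orion or mlprimitives primitives, and the suggested `segment` primitive may end up looking quite different. The point is only that scaling happens once over the whole signal, before it is cut wherever two consecutive timestamps are more than `max_gap` apart.

```python
import numpy as np


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Scale values to feature_range using the min and max of the whole signal."""
    values = np.asarray(values, dtype=float)
    lo, hi = feature_range
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # A constant signal has no spread; map everything to the lower bound.
        return np.full_like(values, lo)
    return lo + (values - v_min) * (hi - lo) / (v_max - v_min)


def split_on_gaps(timestamps, values, max_gap):
    """Split a signal wherever consecutive timestamps differ by more than max_gap.

    Returns a list of (timestamps, values) pairs, one per contiguous segment.
    """
    timestamps = np.asarray(timestamps)
    values = np.asarray(values)
    # A break at position i means a new segment starts at element i.
    breaks = np.where(np.diff(timestamps) > max_gap)[0] + 1
    return list(zip(np.split(timestamps, breaks), np.split(values, breaks)))


# Hypothetical example signal with two gaps larger than max_gap=2.
timestamps = [0, 1, 2, 10, 11, 30]
values = [5, 6, 7, 8, 9, 10]

scaled = min_max_scale(values)                        # normalize first, globally
segments = split_on_gaps(timestamps, scaled, max_gap=2)
```

Because the scaling is done before segmentation, every segment shares the same range, which is what the first consideration above asks for.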
@kb1ooo

kb1ooo commented Oct 26, 2021

I don't see any activity here, but I'm wondering if this may have been addressed since Feb?

@sarahmish
Collaborator Author

Hi @kb1ooo! It's still in the works.

@kb1ooo

kb1ooo commented Oct 29, 2021

@sarahmish thanks. Is there some work on it checked into a branch?

@sarahmish
Collaborator Author

There isn't an active branch for this. The primary change for this feature is in the rolling_window_sequences primitive, which currently works by slicing based on indexes. To make this change, we need to introduce slicing by timestamps, with a max_gap parameter indicating the maximum gap allowed between one element and the next.
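
As a rough illustration of that idea, the sketch below builds rolling windows by position but checks the timestamps, dropping any window that would straddle a jump larger than `max_gap`. The function name and its simplified signature are assumptions made for this example; it does not reproduce the actual rolling_window_sequences signature or its return values.

```python
import numpy as np


def rolling_windows_with_max_gap(timestamps, values, window_size, step_size, max_gap):
    """Build rolling windows, skipping any window that crosses a gap > max_gap.

    Returns (windows, start_timestamps).
    """
    timestamps = np.asarray(timestamps)
    values = np.asarray(values)
    # gap_after[i] is True when the jump from element i to i + 1 exceeds max_gap.
    gap_after = np.diff(timestamps) > max_gap

    windows, starts = [], []
    for start in range(0, len(values) - window_size + 1, step_size):
        end = start + window_size
        # The window holds elements start..end-1, so it spans jumps start..end-2.
        if gap_after[start:end - 1].any():
            continue
        windows.append(values[start:end])
        starts.append(timestamps[start])
    return np.array(windows), np.array(starts)


# Hypothetical signal with one large gap between t=3 and t=10.
ts = [0, 1, 2, 3, 10, 11, 12]
vals = [0, 1, 2, 3, 4, 5, 6]
windows, window_starts = rolling_windows_with_max_gap(
    ts, vals, window_size=3, step_size=1, max_gap=1
)
```

Here the two windows that would mix values from both sides of the gap are discarded, so no training example pretends the signal was continuous across it.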

@kb1ooo

kb1ooo commented Nov 1, 2021

@sarahmish ok right. Is there a simpler intermediate version where basically the data is pre-segmented (i.e. don't delegate the segmentation logic to orion, let it be the responsibility of the caller), and you would pass the data as say a list of dataframes instead of one dataframe? Then just iterate through the list, applying the same pipeline, and concatenate the rolling_window_sequences.
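
A minimal sketch of this intermediate version, assuming the caller has already split the signal into contiguous segments (plain arrays here rather than dataframes, to keep the example self-contained). The helper names are hypothetical and nothing below calls Orion itself; it only shows windows being built per segment and then concatenated, so that no window spans two segments.

```python
import numpy as np


def windows_for_segment(values, window_size, step_size=1):
    """Return all rolling windows of one contiguous segment."""
    values = np.asarray(values)
    starts = range(0, len(values) - window_size + 1, step_size)
    return [values[s:s + window_size] for s in starts]


def windows_for_segments(segments, window_size, step_size=1):
    """Window each pre-segmented piece separately, then stack the results."""
    all_windows = []
    for segment in segments:
        # Segments shorter than window_size simply contribute no windows.
        all_windows.extend(windows_for_segment(segment, window_size, step_size))
    if not all_windows:
        return np.empty((0, window_size))
    return np.stack(all_windows)


# Hypothetical pre-segmented input supplied by the caller.
segments = [[0, 1, 2, 3], [10, 11], [20, 21, 22]]
stacked = windows_for_segments(segments, window_size=3)
```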

@sarahmish
Collaborator Author

@kb1ooo that's definitely possible. Mechanically, you can just iterate over each dataframe and call orion.fit as a simple workaround. My only concern is that the ML model will be trained on epochs with different batches each time, and I don't know how that will affect the learning of the underlying model.
