TFX Use Case - Windowing Raw Data #210

Closed
KamalAman opened this issue Jun 5, 2019 · 17 comments

@KamalAman

KamalAman commented Jun 5, 2019

Windowing is one of the cornerstones of processing large amounts of data.

Will TFX support consuming raw data, windowing it with Beam, running statsGen on both the windows and the raw data, and finally feeding the windowed or raw data to TensorFlow for training?

In addition, this raises questions about serving once the model is deployed. Which version of the data should you send to the serving model: the raw data or the windowed data?

@1025KB
Collaborator

1025KB commented Jun 6, 2019

Hi Kamal, currently we support one-shot pipelines, which can be scheduled daily with a different input path for each data window. Long term, we plan to support continuous training with data spans. Keeping this as a feature request, thanks!

@htahir1

htahir1 commented Aug 5, 2019

In the meantime, could someone explain how to mimic even one-shot batch learning with windowed data on TFX? I would like to train an LSTM using my pipeline, but I'm not sure where and how to put the windowing. Should it be part of Transform? Should it be part of Trainer? Or should it be in a custom ExampleGen executor?

@bcdurak

bcdurak commented Aug 8, 2019

I am also currently trying to build up a similar pipeline which deals with time-series data. Even in the case of one-shot training as opposed to continuous training, it is necessary to integrate some form of windowing algorithm on the time domain before feeding the data into the model.

As far as I can see, TFX does not support this at the moment. I believe there are multiple ways to implement it, but I am not sure where it needs to come into play.

Is it supposed to be inside a CustomExampleGen when reading the data? Is it supposed to be in the preprocessing_fn inside the Transform component or even in the input_fn inside the Trainer? Should it be a part of the Graph which will be used during Serving?

Additionally, will TFX support windowing over timestamps in the future for applications dealing with time series?

@1025KB
Collaborator

1025KB commented Aug 8, 2019

Currently there are two ways to use one-shot batch runs for multiple data spans:

  1. Prepare the data spans in different paths, e.g., /tmp/YYYY-MM-DD/data,
    and trigger the pipeline with a different input_config split pattern each time (just a single split as input; the output will be a train/eval split).

  2. Just put the updated data in the same input_base folder and retrigger the pipeline. We will detect the change and treat it as a new run instead of a cached run.

In the long term, we plan to support data spans in a way similar to 1), with some automatic span detection logic, and all spans will be tracked in metadata.
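
For readers following option 1, here is a minimal sketch of how a scheduler might derive the per-day split pattern. It only builds strings; the helper names `span_pattern` and `rolling_patterns` are hypothetical and not part of TFX, and the layout assumes the `/tmp/YYYY-MM-DD/data` convention from the example above, with `/tmp` as the input base.

```python
from datetime import date, timedelta


def span_pattern(day: date) -> str:
    """Glob pattern, relative to the input base, for one day's span."""
    return f"{day.isoformat()}/data/*"


def rolling_patterns(last_day: date, num_days: int):
    """Patterns for the num_days spans ending at last_day, oldest first."""
    return [
        span_pattern(last_day - timedelta(days=offset))
        for offset in range(num_days - 1, -1, -1)
    ]


# One pattern per scheduled run (option 1), or several for a rolling window.
todays_pattern = span_pattern(date(2019, 8, 8))
patterns = rolling_patterns(date(2019, 8, 8), 3)
```

The resulting string would then be passed as the pattern of the single input split when the pipeline is triggered for that day.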

@htahir1

htahir1 commented Aug 8, 2019

@1025KB thanks for your reply! However, I think the question is still a bit unanswered: as TFX is supposed to be an end-to-end framework, the goal is to get a model at the end which can be used in production.

If we prepare the windowed (more specifically, aggregated) data before running the pipeline, as you suggest, what do we do when we are getting live data in production, to serve predictions?

So let's say I want to make a model from a batch of raw data, but I want to aggregate/resample it to 5-minute intervals (for various time-series reasons). I somehow aggregate the raw data, put it in a folder as you suggest, and then run the TFX pipeline, create a model, and deploy it.

However, when I now want to do serving in production, how can I make sure the incoming data is also automatically aggregated at 5-minute intervals?

Ideally, this aggregation would happen in TF Transform (so it can be exported in the graph), or maybe even in the serving_receiver_fn in the Trainer component, as @bcdurak mentioned already.
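
To make the concern concrete, here is a minimal sketch of the kind of 5-minute aggregation in question, written in plain Python so that the same function could be called both when preparing training data and in front of the serving endpoint. It is not TFX or TF Transform code; the function names and the choice of the mean as the aggregate are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

BUCKET_SECONDS = 5 * 60  # 5-minute intervals, as in the example above


def bucket_start(ts: datetime, bucket_seconds: int = BUCKET_SECONDS) -> datetime:
    """Floor a timezone-aware timestamp to the start of its bucket (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


def aggregate(readings, bucket_seconds: int = BUCKET_SECONDS):
    """Average (timestamp, value) readings per bucket, ordered by bucket start."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[bucket_start(ts, bucket_seconds)].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = [
    (datetime(2019, 8, 8, 12, 0, 30, tzinfo=timezone.utc), 1.0),
    (datetime(2019, 8, 8, 12, 4, 59, tzinfo=timezone.utc), 3.0),
    (datetime(2019, 8, 8, 12, 5, 0, tzinfo=timezone.utc), 10.0),
]
aggregated = aggregate(raw)
```

Keeping the aggregation in one shared function is only one way to avoid training/serving skew; it still has to run outside the exported graph.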

Any help on this would be appreciated :-)

@1025KB
Collaborator

1025KB commented Aug 8, 2019

Oh, I see what you mean by windowing data.

There are two separate problems here:

  1. going from raw data to tf.Examples (or sequence examples, which we don't support yet)
  2. grouping multiple tf.Examples into a dataset (aka data span) for continuous training

All my comments above are about the second problem.

The starting point (RPC request) of serving is a tf.Example, the same as for the TFX pipeline components after ExampleGen. The tf.Examples sent to serving should be the same (in format/structure/meaning, not value) as ExampleGen's output tf.Examples for serving to work properly, so whatever conversion logic is applied in ExampleGen needs to be applied to the data before sending it to TF Serving.

ExampleGen can either import external tf.Examples directly, or do some customization through a custom ExampleGen.

Normally in production, the first stage is log processing. This raw-data-to-tf.Example conversion might contain sessionization, which varies between use cases, so we didn't provide an abstract way of handling it other than a custom ExampleGen.

A real use case might look like this:

server side or client side raw logs -> log processing which contains windowing or sessionization -> structured data -> example gen -> tf.examples (e.g., feature: how many clicks in this session, predict: how long this user stays) -> following tfx components

When serving, based on the information of a certain session, create tf.Examples (how many clicks in this session), and TF Serving will predict how long this user will stay.
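
The log-processing stage in the flow above could be sketched roughly as follows. This is plain Python rather than TFX code, and the 30-minute inactivity gap, the `Session` class, and the feature names are assumptions made for illustration; a real pipeline would define sessions according to its own use case.

```python
from dataclasses import dataclass

SESSION_GAP_SECONDS = 30 * 60  # hypothetical inactivity threshold


@dataclass
class Session:
    user: str
    start: float
    end: float
    clicks: int


def sessionize(events, gap_seconds=SESSION_GAP_SECONDS):
    """Group (user, timestamp_seconds) click events into sessions.

    A new session starts whenever a user's next click arrives more than
    gap_seconds after that user's previous click.
    """
    sessions = []
    open_sessions = {}  # user -> Session currently being extended
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        current = open_sessions.get(user)
        if current is not None and ts - current.end <= gap_seconds:
            current.end = ts
            current.clicks += 1
        else:
            current = Session(user=user, start=ts, end=ts, clicks=1)
            open_sessions[user] = current
            sessions.append(current)
    return sessions


def to_features(session):
    """Structured record: click count as feature, session duration as label."""
    return {"clicks": session.clicks, "duration_seconds": session.end - session.start}


events = [("a", 0.0), ("a", 60.0), ("a", 5000.0), ("b", 10.0)]
sessions = sessionize(events)
features = [to_features(s) for s in sessions]
```

At serving time the same `sessionize` step would run on live events, and only the `clicks` feature would be sent to the model, since the duration is what it predicts.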

@gowthamkpr

@KamalAman @htahir1 Did the above comment answer your question?

@rmothukuru
Contributor

@KamalAman,
Can you please let us know if @1025KB's comment has addressed your use case, so that we can close this issue? Thanks!

@1025KB
Collaborator

1025KB commented Nov 2, 2020

Let's keep this as a feature request.

@mlopsbegineer

Hi, is there any update on the windowing features in TFX?
I want to feed time-series data as rolling-window inputs to a model with LSTM layers.

@1025KB
Collaborator

1025KB commented May 17, 2022

We don't support streaming data processing. We do support training based on a rolling window of spans.

But I guess you want to sessionize some raw logs into training examples; consider using a custom component or a custom ExampleGen.

@Asphalt2017

That was disappointing!

  1. Are there any other alternatives for loading the data in the rolling windows needed for time series? With a rolling window of spans, suppose I want to feed the last ten hours of data every hour. Do I have to create, every hour, a TFRecord containing 10 hours of data, and then feed them together? Is my understanding correct?
  2. Moreover, TFRecord doesn't support multi-dimensional arrays, does it?
  3. How do you suggest we do time series with TFX? The broader question is: what can TFX do? Where can it be used? It is kind of misleading to say it is built for production when it can't be used for a whole subset of ML.
  4. Can't the input_fn() inside Trainer be modified to create windows or apply custom functions?

@1025KB
Collaborator

1025KB commented May 18, 2022

Do you want your model to be updated in a streaming fashion, or do you want to process your logs as a stream?
The log processing pipeline can be separate from the training pipeline; the training part takes training examples formed according to your modeling case.

@Asphalt2017

Thanks for the reply.

  1. Could you elaborate on the log processing pipeline?
  2. I have to predict the next hour of data from the last 10 hours of data. A batch consists of multiple inputs, and each input consists of 10 hours of data.
  3. The solution is either a rolling-window function or a multi-dimensional array as input. Neither seems to work for me, because 1. I don't know where and how to apply windowing, and 2. I can't write a TFRecord with a multi-dimensional array (unlike JSON, where just another set of brackets represents another dimension).
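
For illustration, the two pieces described in points 2 and 3 can be sketched in plain Python: building (last 10 hours, next hour) pairs with a sliding window, and flattening each 2-D window into a flat list plus its shape. The flattening matters because a tf.train.Example feature holds a flat list of values, so a common workaround is to store the flattened values and restore the shape when parsing. The function names are hypothetical, and where this step should live in a TFX pipeline (for example a custom ExampleGen, as suggested above) is left open.

```python
def sliding_windows(series, input_len=10, horizon=1):
    """Yield (inputs, targets) pairs from a list of per-hour rows.

    Each row is a list of feature values for one hour. inputs holds
    input_len consecutive rows and targets holds the next horizon rows.
    """
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        split = start + input_len
        yield series[start:split], series[split:split + horizon]


def flatten(window):
    """Flatten a 2-D window row by row into a flat list plus its shape."""
    shape = [len(window), len(window[0])]
    return [value for row in window for value in row], shape


def unflatten(values, shape):
    """Rebuild the 2-D window from the flat list and its shape."""
    rows, cols = shape
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


# 13 hours of data with 2 features per hour give 3 training pairs.
hourly = [[float(h), float(h) * 10.0] for h in range(13)]
pairs = list(sliding_windows(hourly))
flat, shape = flatten(pairs[0][0])
```

Each flattened window could then be written as one float-list feature per example, with the fixed shape `[10, 2]` declared on the reading side.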

@singhniraj08
Contributor

Are you still looking for a resolution? We are planning on prioritising the issues based on the community interests. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.

@github-actions

github-actions bot commented Jun 1, 2023

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Jun 1, 2023
@github-actions

github-actions bot commented Jun 9, 2023

This issue was closed due to lack of activity after being marked stale for past 7 days.

@github-actions github-actions bot closed this as completed Jun 9, 2023