TFX Use Case - Windowing Raw Data #210

KamalAman · 2019-06-05T21:35:35Z

Windowing data is one of the key cornerstones of processing large amounts of data.

Will TFX support the use case of consuming raw data, windowing the data with Beam, running StatisticsGen on the windows as well as the raw data, and finally consuming the windowed/raw data in TensorFlow to train on?

In addition, this brings up questions about when the model is deployed. Which version of the data should you call the serving model with: the raw data or the windowed data?

1025KB · 2019-06-06T17:28:49Z

Hi Kamal, currently we support a one-shot pipeline, which can be scheduled daily with a different input path for each data window. Long term, we plan to support continuous training with data spans. Keeping this as a feature request, thanks!

htahir1 · 2019-08-05T06:21:22Z

In the meantime, could someone help out and explain how to mimic even one-shot batch learning with windowed data on TFX? I would like to train an LSTM using my pipeline, but I'm not sure where and how to put the windowing. Should it be part of Transform? Should it be part of the Trainer? Or should it be in a custom ExampleGen executor?

bcdurak · 2019-08-08T13:37:59Z

I am also currently trying to build a similar pipeline that deals with time-series data. Even in the case of one-shot training as opposed to continuous training, it is necessary to integrate some form of windowing algorithm on the time domain before feeding the data into the model.

As far as I can see, TFX does not support this behavior at the moment. I believe there are multiple ways to implement it, but I am not quite sure where it needs to come into play.

Is it supposed to be inside a custom ExampleGen when reading the data? Is it supposed to be in the preprocessing_fn inside the Transform component, or even in the input_fn inside the Trainer? Should it be part of the graph that will be used during serving?

Additionally, will TFX support windowing over timestamps in the future for applications dealing with time series?

1025KB · 2019-08-08T17:30:59Z

Currently there are two ways to use one-shot batch runs for multiple data spans:

1. Prepare each data span in a different path, e.g., /tmp/YYYY-MM-DD/data, and trigger the pipeline with a different input_config split pattern (just a single split as input; the output will be a train/eval split).
2. Just put the updated data in the same input_base folder and retrigger the pipeline; we will detect the change and treat it as a new run instead of a cached run.

In the long term, we plan to support data spans in a way similar to 1), with some span auto-detection logic, and all spans will be tracked in metadata.

htahir1 · 2019-08-08T18:44:57Z

@1025KB thanks for your reply! However, I think the question is still a bit unanswered: as TFX is supposed to be an end-to-end framework, the goal is to get a model at the end which can be used in production.

If we prepare the windowed (more specifically, aggregated) data before running the pipeline, as you suggest, what do we do when we are getting live data in production and need to serve predictions?

So let's say I want to make a model from a batch of raw data, but I want to aggregate/resample it to 5-minute intervals (for many time-series reasons). I somehow aggregate the raw data, put it in a folder as you suggest, and then run the TFX pipeline, create a model, and deploy it.

However, when I want to do serving in production, how can I make sure the incoming data is also automatically aggregated at 5-minute intervals?

Ideally, this aggregation would happen in TF Transform (so it can be exported in the graph), or maybe even in the serving_receiver_fn in the Trainer component, as @bcdurak mentioned already.

Any help on this would be appreciated :-)

1025KB · 2019-08-08T20:20:53Z

Oh, I see what you mean by windowing data.

Here are two separate problems:

1. Going from raw data to tf.Examples (or sequence examples, which we don't support yet).
2. Grouping multiple tf.Examples into a dataset (aka data span) for continuous training.

All my comments above are about the second problem.

The starting point of serving (the RPC request) is a tf.Example, the same as for the TFX pipeline components after ExampleGen. The tf.Examples sent to serving should match ExampleGen's output tf.Examples in format, structure, and meaning (not value) for serving to work properly. So whatever conversion logic is applied in ExampleGen needs to be applied to the data before sending it to TF Serving.

ExampleGen can either import external tf.Examples directly or do some customization through a custom ExampleGen.

Normally in production, the first stage is log processing. This raw-data-to-tf.Example conversion might include sessionization, which varies between use cases, so we don't provide an abstract way of handling it other than a custom ExampleGen.

A real use case might look like this:

server side or client side raw logs -> log processing which contains windowing or sessionization -> structured data -> example gen -> tf.examples (e.g., feature: how many clicks in this session, predict: how long this user stays) -> following tfx components

When serving, create tf.Examples from the information of a given session (how many clicks in this session), and TF Serving will predict how long this user will stay.
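The point above, that the same conversion has to run on the training data and on the data sent to serving, can be sketched as one shared function that both sides call. This is a minimal pure-Python sketch, not TFX API: the function name `aggregate_to_intervals`, the `(timestamp, value)` record shape, and the sample readings are assumptions for illustration. The 300-second default mirrors the 5-minute intervals asked about earlier in the thread.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_intervals(records, interval_seconds=300):
    """Average (timestamp_seconds, value) pairs into fixed-width buckets.

    Returns a list of (bucket_start_seconds, mean_value), sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        # Align each reading to the start of its interval.
        bucket_start = int(timestamp // interval_seconds) * interval_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw readings: (seconds since some origin, sensor value).
raw = [(0, 1.0), (120, 3.0), (299, 5.0), (300, 10.0), (450, 20.0)]

# Training side: run once over the historical batch, before ExampleGen.
training_rows = aggregate_to_intervals(raw)

# Serving side: run the same function over the live buffer before building
# the request, so both sides produce features with the same meaning.
serving_rows = aggregate_to_intervals(raw[-2:])
```

Keeping the aggregation in one function that lives outside the model is only one option; it trades the convenience of an exported graph for the guarantee that training and serving cannot drift apart as long as both import the same code.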

gowthamkpr · 2020-01-09T00:34:39Z

@KamalAman @htahir1 Did the above comment answer your question?

rmothukuru · 2020-11-02T07:48:52Z

@KamalAman,
Can you please let us know if @1025KB's comment has addressed your use case, so that we can close this issue? Thanks!

1025KB · 2020-11-02T17:48:37Z

let's keep this as a feature request

mlopsbegineer · 2022-05-17T08:30:42Z

Hi, is there any update on the windowing features in TFX?
I want to feed time-series data as rolling-window input and use LSTM layers in my model.

1025KB · 2022-05-17T18:20:13Z

We don't support streaming data processing. We support training based on a rolling window of spans.

But I guess you want to sessionize some raw logs into training examples; consider using a custom component or a custom ExampleGen.

Asphalt2017 · 2022-05-18T06:39:11Z

That was disappointing!

Are there any other alternatives for loading data in the rolling windows needed for time series? With a rolling window of spans, suppose I want to feed in the last ten hours of data every hour. Do I have to create a TFRecord of 10 hours of data every hour and then feed them together? Is my understanding correct?
Moreover, TFRecord doesn't support multi-dimensional arrays, does it?
How do you suggest we do time series with TFX? The broader question is: what can TFX do? Where can it be used? It is kind of misleading to say it is built for production when it can't be used for a subset of ML.
Can't the input_fn() inside the Trainer be modified to create windows or call custom functions?

1025KB · 2022-05-18T18:12:21Z

Do you want your model to be updated in a streaming fashion, or do you want to process your logs as a stream?
The log processing pipeline can be separate from the training pipeline, while the training part takes training examples formed according to your modeling case.

Asphalt2017 · 2022-05-19T06:06:44Z

Thanks for the reply.

Could you elaborate on the log processing pipeline?
I have to predict the next hour of data from the last 10 hours of data. A batch consists of multiple inputs, and each input consists of 10 hours of data.
The solution is either a rolling window function or a multi-dimensional array as input. I can't get either of them working because 1. I don't know where and how to apply windowing, and 2. I can't write a TFRecord with a multi-dimensional array (unlike JSON, where just another set of brackets represents another dimension).
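Both parts of this question can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a TFX feature: `make_windows`, `flatten`, and `unflatten` are hypothetical helpers, and the 12-step series is made-up data. The windowing step would typically sit in a preprocessing job or custom ExampleGen, as suggested earlier in the thread. On the second point, the features of a tf.train.Example are flat lists (bytes, float, or int64), so one common workaround is to store a window as a flat list together with its shape and restore the layout after parsing.

```python
def make_windows(series, input_len=10, horizon=1):
    """Slide over `series` and return (inputs, targets) pairs.

    Each element of `series` is one time step (a list of feature values).
    """
    examples = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + input_len]
        targets = series[start + input_len:start + input_len + horizon]
        examples.append((inputs, targets))
    return examples


def flatten(window):
    """Flatten a [time, feature] window into one flat list plus its shape."""
    shape = (len(window), len(window[0]))
    flat = [value for step in window for value in step]
    return flat, shape


def unflatten(flat, shape):
    """Restore the [time, feature] layout from a flat list."""
    steps, features = shape
    return [flat[i * features:(i + 1) * features] for i in range(steps)]


# Hypothetical data: 12 hourly steps with 2 features each.
series = [[float(h), float(h) * 10.0] for h in range(12)]

# Ten hours of input, one hour to predict.
windows = make_windows(series, input_len=10, horizon=1)

# Flat representation that fits a flat float feature, plus the shape
# needed to rebuild the 2-D window on the reading side.
flat, shape = flatten(windows[0][0])
```

With a fixed window length and feature count, the shape does not need to be stored per example; the reader can reshape with known constants instead.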

singhniraj08 · 2023-05-24T05:35:06Z

Are you still looking for a resolution? We are planning on prioritising the issues based on the community interests. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.

github-actions · 2023-06-01T02:15:00Z

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

github-actions · 2023-06-09T02:06:51Z

This issue was closed due to lack of activity after being marked stale for past 7 days.

gowthamkpr self-assigned this Jun 5, 2019

gowthamkpr added the type:support label Jun 5, 2019

1025KB added the type:feature label Jun 6, 2019

krazyhaas removed the type:support label Jun 7, 2019

rmothukuru added the stat:awaiting tensorflower label Jul 16, 2019

krazyhaas removed the stat:awaiting tensorflower label Jul 16, 2019

gowthamkpr assigned krazyhaas and unassigned gowthamkpr Jul 25, 2019

krazyhaas removed their assignment Jul 26, 2019

1025KB mentioned this issue Aug 3, 2019

Create a taxicab demo that runs in "Continuous Training" mode? #421

Closed

1025KB assigned ruoyu90 Aug 5, 2019

1025KB assigned zhitaoli Aug 8, 2019

gowthamkpr added the stat:awaiting response label Jan 9, 2020

iprapas mentioned this issue Mar 25, 2020

Status of continuous pipelines #1540

Closed

Mar-ai mentioned this issue Nov 11, 2020

Shuffling in ExampleGen should be optional #2764

Closed

gowthamkpr removed the stat:awaiting response label Oct 5, 2022

gowthamkpr added the stat:awaiting tensorflower label Oct 5, 2022

zhitaoli assigned 1025KB and unassigned zhitaoli Oct 10, 2022

singhniraj08 self-assigned this May 24, 2023

singhniraj08 added stat:awaiting response and removed stat:awaiting tensorflower labels May 24, 2023

github-actions bot added the stale label Jun 1, 2023

github-actions bot closed this as completed Jun 9, 2023
