TFX Use Case - Windowing Raw Data #210
Comments
Hi Kamal, currently we support a one-shot pipeline, which can be scheduled daily with a different input path as a different data window. Long term, we have a plan for continuous training with data span support. Keeping this as a feature request, thanks!
In the meantime, could someone help out and explain how to mimic even one-shot batch learning with windowed data on TFX? I would like to train an LSTM using my pipeline, but I'm not sure where and how to put the windowing. Should it be part of Transform? Should it be part of the Trainer? Or should it be in a custom ExampleGen executor?
I am also currently trying to build a similar pipeline that deals with time-series data. Even in the case of one-shot training, as opposed to continuous training, it is necessary to integrate some form of windowing algorithm on the time domain before feeding the data into the model. As far as I can see, TFX does not support this behavior at the moment. Even so, I believe there are multiple ways to implement it; however, I am not quite sure where it needs to come into play. Is it supposed to be inside a custom ExampleGen when reading the data? Is it supposed to be in the preprocessing_fn inside the Transform component, or even in the input_fn inside the Trainer? Should it be part of the graph that will be used during serving? Additionally, will TFX support windowing over timestamps in the future for applications dealing with time-series?
Currently there are two ways to use a one-shot batch pipeline for multiple data spans.
In the long term, we plan to support data spans in a similar way as 1), with some span auto-detection logic, and all spans will be tracked in metadata.
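The "different input path per scheduled run" idea above can be sketched in a few lines. This is only an illustration with a hypothetical bucket and directory layout (`span-YYYY-MM-DD`), not an actual TFX API: each daily run of the one-shot pipeline would compute a path like this and hand it to ExampleGen.

```python
from datetime import date, timedelta

def span_input_path(base_dir: str, day: date) -> str:
    """Build a per-day input path so each scheduled run of the one-shot
    pipeline reads a different data span (hypothetical file layout)."""
    return f"{base_dir}/span-{day.isoformat()}/*.tfrecord"

# A run on 2020-03-02 would point ExampleGen at the previous day's span:
yesterday = date(2020, 3, 2) - timedelta(days=1)
path = span_input_path("gs://my-bucket/logs", yesterday)
# path == "gs://my-bucket/logs/span-2020-03-01/*.tfrecord"
```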
@1025KB thanks for your reply! However, I think the question is still a bit unanswered: as TFX is supposed to be an end-to-end framework, the goal is to get a model at the end that can be used in production. If we prepare the windowed (more specifically, aggregated) data before running the pipeline, as you suggest, what do we do when we are getting live data in production, to serve predictions?

So let's say I want to make a model from a batch of raw data, but I want to aggregate/resample it to 5-minute intervals (for many time-series reasons). I somehow aggregate the raw data, put it in a folder as you suggest, then run the TFX pipeline, create a model, and deploy it. However, when I now want to do serving in production, how can I make sure the incoming data is also automatically aggregated into 5-minute intervals? Ideally, this aggregation would happen in TF Transform (so it can be exported in the graph), or maybe even in the serving_receiver_fn of the Trainer component, as @bcdurak already mentioned. Any help on this would be appreciated :-)
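For concreteness, the "aggregate to 5-minute intervals" step discussed above could look like the sketch below. This is a plain-Python stand-in (function name and mean-aggregation choice are my own), not TFX code; the point is that, as the thread concludes, whatever function you run before ExampleGen must also be run on live data before sending requests to serving:

```python
from collections import defaultdict

def resample_5min(records):
    """Aggregate (timestamp_seconds, value) pairs into 5-minute buckets
    by mean. Applied both to the raw training batch before ExampleGen
    and to live data before serving, so the two stay consistent."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[(ts // 300) * 300].append(value)  # 300 s = 5 minutes
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

raw = [(0, 1.0), (60, 3.0), (300, 10.0)]
agg = resample_5min(raw)
# agg == {0: 2.0, 300: 10.0}
```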
Oh, I see what you mean by windowing data. There are two separate problems.
All my comments above are talking about the second problem. The starting point (RPC request) of serving is a tf.Example, the same as for the TFX pipeline components after ExampleGen. The tf.Examples sent to serving should be the same (in format/structure/meaning, not value) as ExampleGen's output tf.Examples for serving to work properly, so whatever conversion logic is applied in ExampleGen needs to be applied to the data before sending it to TF Serving. ExampleGen can either import external tf.Examples directly, or do some customization via a custom ExampleGen.

Normally in production, the first stage is log processing. This raw-data-to-tf.Example conversion might contain sessionization, which varies across use cases, so we didn't provide an abstract way of handling it beyond a custom ExampleGen. A real use case might look like this: server-side or client-side raw logs -> log processing, which contains windowing or sessionization -> structured data -> ExampleGen -> tf.Examples (e.g., feature: how many clicks in this session; label: how long this user stays) -> downstream TFX components. When serving, based on the information of a certain session, create tf.Examples (how many clicks in this session) and TF Serving will predict how long this user will stay.
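The "log processing which contains sessionization" stage described above can be sketched with a simple gap-based rule: a new session starts whenever the time between consecutive events exceeds a threshold. This is a minimal illustration (the 30-minute gap and the function shape are assumptions, not TFX behavior); real sessionization logic is use-case specific, which is exactly why the thread recommends a custom ExampleGen:

```python
def sessionize(events, gap_seconds=1800):
    """Group a time-sorted list of (timestamp, payload) events into sessions,
    starting a new session whenever the gap between consecutive events
    exceeds gap_seconds (default 30 minutes, an arbitrary choice here)."""
    sessions, current = [], []
    last_ts = None
    for ts, payload in events:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)
            current = []
        current.append((ts, payload))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Each resulting session would then be turned into structured features (e.g. click count per session) before ExampleGen, and the same function would be applied when assembling a serving request.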
@KamalAman @htahir1 Did the above comment answer your question?
@KamalAman,
Let's keep this as a feature request.
Hi, is there any update on the windowing features in TFX?
We don't support streaming data processing. We do support training based on a rolling window of spans. But I guess you want to sessionize some raw logs into training examples; consider using a custom component or a custom ExampleGen.
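The "training based on a rolling window of spans" idea can be sketched as plain span-selection logic. This toy sketch is my own illustration of the concept (it is not the TFX span-selection API): given the span numbers available on disk, a run trains on only the most recent few.

```python
def rolling_window(spans, window_size):
    """Select the most recent `window_size` spans (by span number)
    for a training run -- a toy sketch of rolling-window training."""
    return sorted(spans)[-window_size:]

# A run that sees spans 1..5 but trains on the latest 3:
train_spans = rolling_window([3, 1, 4, 2, 5], 3)
# train_spans == [3, 4, 5]
```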
That was disappointing!
Do you want your model to be updated in a streaming fashion, or do you want to process your logs as a stream?
Thanks for the reply.
Are you still looking for a resolution? We are planning to prioritise issues based on community interest. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.
This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.
Windowing data is one of the key cornerstones of processing large amounts of data.
Will TFX support the use case of consuming raw data, windowing it with Beam, performing StatisticsGen on the windows as well as on the raw data, and finally consuming the windowed/raw data in TensorFlow for training?
In addition, this raises questions about deployment: which version of the data should you call the serving model with, the raw data or the windowed data?
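To make the "window with Beam, then compute statistics per window" request concrete, here is a minimal pure-Python stand-in for that pipeline stage (it mimics fixed-size windows rather than using actual Beam or StatisticsGen code; the function name and the count/mean/min/max summary are assumptions for illustration):

```python
from collections import defaultdict

def window_stats(events, window_size=300):
    """Assign (timestamp, value) events to fixed windows of `window_size`
    seconds and compute per-window summary statistics, sketching what a
    Beam windowing step followed by statistics generation might produce."""
    buckets = defaultdict(list)
    for ts, v in events:
        buckets[ts // window_size].append(v)
    return {w: {"count": len(vs), "mean": sum(vs) / len(vs),
                "min": min(vs), "max": max(vs)}
            for w, vs in sorted(buckets.items())}

stats = window_stats([(0, 1.0), (100, 3.0), (400, 5.0)])
# Window 0 covers [0, 300), window 1 covers [300, 600).
```

Note the serving-consistency point from earlier in the thread: if the model is trained on per-window features like these, serving requests must carry the same windowed features, computed by the same logic.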