Caching is not Useful: {Kube,Air}flow Retries do not work and ExampleGen's cache isn't usefull #4226
Thanks for the advice!!
Thanks, Jiayi.
Are there any more detailed examples/docs on how a query-based get_input_fingerprint should be implemented?
I'm using a MySQL ExampleGen based on the Presto example.
On 3 Sep 2021, at 01:26, Jiayi Zhao wrote:
TFX itself currently doesn't support retry, and external retry might mess up the TFX MLMD state (e.g., 2 output artifacts instead of 1: the downstream component finds two artifacts generated by the retry in its input channel, but it only requests one artifact), so I would suggest turning off external retry for now.
Currently QueryBasedExampleGen only checks whether the query is the same or not; to support more advanced logic, you need a custom driver to implement the fingerprint logic based on your database.
Currently the fingerprint is generated for the full input dataset based on file size and creation time, so if there is any change in the dataset, TFX will treat it as a new dataset.
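The size-plus-timestamp scheme described above can be sketched in a few lines. This is purely an illustration of the idea, not TFX's actual implementation, and the helper name `compute_fingerprint` is made up:

```python
import os
import tempfile

def compute_fingerprint(paths):
    # Illustrative only: combine each file's size and modification time
    # into one fingerprint string, mimicking the "file size + timestamp"
    # scheme described above (not TFX's real fingerprinting code).
    parts = []
    for path in sorted(paths):
        st = os.stat(path)
        parts.append(f"{path}:{st.st_size}:{int(st.st_mtime)}")
    return "|".join(parts)

# Any change to a file's size or timestamp yields a different fingerprint,
# so the dataset is treated as "new" and the cache misses.
```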
If the data changes incrementally, I wonder if the span concept can be used, so you don't need to reprocess already-processed spans; here is an e2e example.
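The span idea above can be illustrated with a small sketch that resolves the latest span from directory names like `span-1/`, `span-2/`. This is only an illustration of the concept (in TFX itself, spans are resolved via `{SPAN}` patterns in the ExampleGen input config); the helper is hypothetical:

```python
import re

def latest_span(dir_names):
    # Hypothetical helper: pick the highest span number from directory
    # names like "span-1", "span-2". With spans, already-processed spans
    # need not be reprocessed; only the newest span is read.
    spans = []
    for name in dir_names:
        m = re.fullmatch(r"span-(\d+)", name)
        if m:
            spans.append(int(m.group(1)))
    return max(spans) if spans else None
```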
Unless I am missing something, couldn't this be solved by looking at the newest artifact based on its timestamp? We currently set up external retry with:

```python
def pod_retry_on_error_failure():
    def _set_retry(container_op):
        # Argo retry: https://argoproj.github.io/argo-workflows/examples/#retrying-failed-or-errored-steps
        # Default policy: OnFailure.
        # May not work with TFX where the `tfx.utils.get_only_uri_in_dir` function is used.
        container_op.set_retry(
            num_retries=2,
        )

    return _set_retry
```

and append this function to the …
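The "look at the newest artifact" suggestion above could be sketched like this; a hypothetical resolution step on the consumer side, not anything TFX currently does, with made-up artifact dicts standing in for MLMD records:

```python
def newest_artifact(artifacts):
    # Hypothetical de-duplication: when a retry leaves two output artifacts
    # in a channel, keep only the most recently created one instead of
    # failing because two were found where one was requested.
    return max(artifacts, key=lambda a: a["create_time"])

arts = [
    {"id": 1, "create_time": 100},  # original (failed) attempt
    {"id": 2, "create_time": 200},  # retried attempt
]
# newest_artifact(arts) keeps only the retried attempt
```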
You can refer to Reading data from BigQuery with TFX and Vertex Pipelines for a QueryBasedExampleGen example. Thank you!
This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.
Over the past year and a half I've been fighting with the various forms of caching provided by Airflow/Kubeflow and TFX.
As it stands, caching is useful only in one very particular scenario, and I think it needs a big rethink. At present it doesn't make much sense to have caching enabled at all, and useful caching does not exist.
Issue 1: Retrying doesn't work
Airflow and Kubeflow both provide facilities for retrying and caching. Caching in Kubeflow seems to conflict with TFX caching (#3049) and appears better left disabled when using TFX.
As far as I can see, "retrying" with the orchestrator is pretty much impossible, even with both the orchestrator and TFX caches disabled, because the step after the one being retried will fail after seeing 2 output artifacts instead of 1 (#2805).
The only way to make use of caching, and the only use case as far as I can see, is to create a new pipeline run with the same components and arguments. This is basically an unintuitive "retry", and it relies on your ExampleGen split arguments remaining the same. This leads me to my second problem with caching.
Issue 2: ExampleGen is cached ineffectively
I predominantly use query-based ExampleGen, with queries as simple as

```sql
SELECT * FROM my_data
```

If caching is enabled, no pipeline run ever pulls data, so runs produce nothing: no new models from new data. This leaves you with two options if you want to use TFX to run your pipelines on a schedule: add a date predicate such as

```sql
WHERE $date > 1970-01-01
```

and leave the cache enabled. This basically lets you "retry" pipelines when you want to rerun later steps (either due to failure or changes) by making a new pipeline and relying on the TFX cache, but it is annoying to implement: if a pipeline runs for longer than one day, the cache is invalidated. In general this may help, but it isn't very useful either.

Issue 3: ExampleGen doesn't cache with new data, and doesn't lend itself to fine tuning
If you want new data, there is basically no way to use the cache as far as I can see. If you train with 100s of gigabytes, you're:
How caching should work:
This would make reruns actually quick, incurring only the transform/training time. I'm not sure to what extent Transform can be re-cached, but that should also be possible in cases where a full pass is not required.
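The rerun behavior wished for here could be sketched as a cache keyed on component name plus input fingerprint, so steps whose inputs are unchanged are skipped and only the changed steps re-run. This is entirely illustrative (an in-memory dict, not TFX's MLMD-backed cache):

```python
cache = {}

def run_cached(component_name, input_fingerprint, run_fn):
    # Illustrative cache: reuse a prior output when the same component
    # has already run over inputs with the same fingerprint; otherwise
    # execute run_fn once and remember its result.
    key = (component_name, input_fingerprint)
    if key not in cache:
        cache[key] = run_fn()
    return cache[key]

calls = []
out1 = run_cached("ExampleGen", "fp-1", lambda: calls.append(1) or "examples-v1")
out2 = run_cached("ExampleGen", "fp-1", lambda: calls.append(1) or "examples-v1")
# The second call is served from the cache; run_fn executed only once.
```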
TLDR: