Components not cached in Airflow? #43

Closed
loiccordone opened this issue Apr 17, 2019 · 9 comments

Comments

@loiccordone

Hello,

I'm running a TFX pipeline in Airflow, with a Transform component. Each time I trigger my DAG, the CsvExampleGen, StatisticsGen, SchemaGen and ExampleValidator components all use their cached outputs since no changes occurred; each one takes ~20s on my machine. On the other hand, the Transform component re-computes its outputs every time (taking ~10 min) even though nothing changed.
My preprocessing_fn only calculates z-scores on dense float features (plus filling in missing values).
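For reference, it looks roughly like this (the feature names here are placeholders, not my real schema):

```python
import tensorflow as tf
import tensorflow_transform as tft

# Placeholder feature names; the real pipeline reads them from its schema.
DENSE_FLOAT_FEATURES = ["feature_a", "feature_b"]


def _fill_in_missing(x):
  """Densifies a SparseTensor, filling missing values with 0.0."""
  if not isinstance(x, tf.SparseTensor):
    return x
  return tf.squeeze(
      tf.sparse.to_dense(
          tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
          default_value=0.0),
      axis=1)


def preprocessing_fn(inputs):
  """Scales each dense float feature to a z-score (mean 0, stddev 1)."""
  outputs = {}
  for key in DENSE_FLOAT_FEATURES:
    outputs[key + "_xf"] = tft.scale_to_z_score(_fill_in_missing(inputs[key]))
  return outputs
```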

I also have problems with the Evaluator component, it uses its cached results even when the Trainer outputs are different. I am forced to manually delete the Evaluator output folder before running my DAG.

Is this normal behavior?
Thanks

@krazyhaas

That is definitely not the expected behavior. Transform should also be caching its result if the inputs haven't changed, and Evaluator should not be using cached results if the Trainer outputs have changed. I'll try to recreate this here in the lab.

@krazyhaas

I wasn't able to recreate this. I used the simple pipeline and ran it a few times; except for the first run, or whenever I deleted the metadata DB, it always used the cached artifacts. I also didn't see any difference in caching behavior between Transform and Evaluator. So, a few follow-up questions:

  • are you working with TFX from 0.12.0 or cloned from github?
  • what is the behavior when you run with the example?
  • can you share your pipeline config?

@1025KB
Collaborator

1025KB commented Apr 17, 2019

One thing to mention: if taxi_utils.py is changed, the cache won't hit for Transform and Trainer.

For Evaluator, it takes the ExampleGen output and the Trainer output; if those have changed, it shouldn't hit the cache. Could you check the pipeline output folder to see if there is new output (a new numbered subfolder under example_gen and trainer) generated from ExampleGen or Trainer?
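
Something like this (the pipeline root path is a placeholder) will list the numbered execution folders, so you can see whether a new one appears after each run:

```python
import os

# Placeholder; point this at your pipeline's output root.
PIPELINE_ROOT = "/home/user/tfx/pipelines/my_pipeline"

# Each numbered subfolder corresponds to one execution's output; a new number
# after a run means the component actually re-executed instead of hitting cache.
for dirpath, dirnames, _ in os.walk(PIPELINE_ROOT):
    numbered = sorted((d for d in dirnames if d.isdigit()), key=int)
    if numbered:
        print(dirpath, "->", numbered)
```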

@loiccordone
Author

are you working with TFX from 0.12.0 or cloned from github?
what is the behavior when you run with the example?

I'm working with TFX 0.12.0. No problem when I run the example. I'll try to modify the Trainer in the example to see if the Evaluator uses its cached results or not.

One thing to mention: if taxi_utils.py is changed, the cache won't hit for Transform and Trainer.

Thanks, that's exactly what's happening, I've been modifying my model in the utils.py file. I separated the transformation and the model into 2 files plus 1 utils file, and now it works, although I had to add "sys.path.append(path_to_utils_folder)" in my pipeline definition to avoid a "no module named xxx" error. Is that why you made a single taxi_utils file in the example?
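
Concretely, the top of my pipeline definition now looks roughly like this (paths and file names are illustrative, not my actual layout):

```python
import os
import sys

# Illustrative project location; the real pipeline uses its own layout.
PROJECT_DIR = "/home/user/my_pipeline"

# Without this, Airflow can't import the shared utils module
# ("no module named xxx") when it loads the DAG.
sys.path.append(os.path.join(PROJECT_DIR, "utils"))

# Separate module files for the Transform and Trainer components.
transform_module_file = os.path.join(PROJECT_DIR, "transform_module.py")
trainer_module_file = os.path.join(PROJECT_DIR, "trainer_module.py")
```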

For Evaluator, it takes the ExampleGen output and the Trainer output; if those have changed, it shouldn't hit the cache. Could you check the pipeline output folder to see if there is new output (a new numbered subfolder under example_gen and trainer) generated from ExampleGen or Trainer?

My ExampleGen doesn't change (I don't generate new examples), but my Trainer output changes at each modification (IDs 44-47-49-51-53-59) while the Evaluator doesn't produce new output (ID 45).

I'll try with the Taxi example to see if it comes from my pipeline.
Thanks!

@krazyhaas

One thing to mention: if taxi_utils.py is changed, the cache won't hit for Transform and Trainer.

Thanks, that's exactly what's happening, I've been modifying my model in the utils.py file. I separated the transformation and the model into 2 files plus 1 utils file, and now it works, although I had to add "sys.path.append(path_to_utils_folder)" in my pipeline definition to avoid a "no module named xxx" error. Is that why you made a single taxi_utils file in the example?

Ok, that sounds like the issue then. If anything about the component's input changes -- including the injected file's checksum -- then the component is rerun. If you're modifying the utils.py file and then running the pipeline, it will trigger a new execution of both transform and trainer.
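
Conceptually it's just a content hash over the injected file; the exact hashing TFX uses internally doesn't matter for the argument, and the path below is illustrative:

```python
import hashlib
import os


def file_checksum(path):
    """Hash the module file's bytes; any edit (even whitespace) changes the result."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Illustrative path: if this checksum differs from the one recorded for the
# previous execution of Transform or Trainer, the cache lookup misses and the
# component re-runs; components that don't take the file stay cached.
module_file = "/home/user/my_pipeline/transform_module.py"
if os.path.exists(module_file):
    print(file_checksum(module_file))
```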

My ExampleGen doesn't change (I don't generate new examples), but my Trainer output changes at each modification (IDs 44-47-49-51-53-59) while the Evaluator doesn't produce new output (ID 45).

Evaluator should trigger a new execution given the Trainer is probably producing a new model on each iteration. Can you attach the evaluator logs for the more recent runs?

@loiccordone
Author

Ok, that sounds like the issue then. If anything about the component's input changes -- including the injected file's checksum -- then the component is rerun. If you're modifying the utils.py file and then running the pipeline, it will trigger a new execution of both transform and trainer.

Yes it's clearer now, thanks for this clarification.

Can you attach the evaluator logs for the more recent runs?

The log is attached. There are indeed some weird lines in it (lines 64-65), but I can still get the results in a Jupyter notebook and display them.
1.log

@1025KB
Collaborator

1025KB commented Apr 22, 2019

Hi, would it be possible to send us the entire pipeline log folder in a zip file? We need to check the caching logs and executor logs for each run.

@loiccordone
Author

The logs are attached.
The Evaluator checkcache log detects a new Trainer output folder on each run but says the artifacts are the same.
Thanks for the help!
logs.zip

@ruoyu90
Contributor

ruoyu90 commented Apr 23, 2019

Thanks @loiccordone! This helped us detect a bug in our codebase. I will have a PR out soon to address this.

tfx-copybara pushed a commit that referenced this issue Apr 24, 2019
ruoyu90 closed this as completed Apr 26, 2019
aaronlelevier pushed a commit to aaronlelevier/tfx that referenced this issue May 10, 2019
ruoyu90 pushed a commit to ruoyu90/tfx that referenced this issue Aug 28, 2019