Transform module gets a GPU "resource exhausted" error #3343

Closed

axelning opened this issue Mar 8, 2021 · 7 comments

axelning commented Mar 8, 2021

If the bug is related to a specific library below, please raise an issue in the
respective repo directly:

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes/No): yes
  • Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc.):
  • TensorFlow version: 2.3.2
  • TFX version: 0.26.1
  • Python version: 3.6.7

Describe the current behavior
In the TFX Transform module, tensorflow_transform/beam/impl.py:1058 calls:

schema = schema_inference.infer_feature_schema_v2(
      structured_outputs,
      metadata_fn.get_concrete_function(),
      evaluate_schema_overrides=False)

which calls infer_feature_schema_v2 in schema_inference.py:163.

In this function, tf2_utils.supply_missing_inputs(structured_inputs, batch_size=1) at line 195 tries to convert the inputs to tensors and does not release the GPU memory when it is finished. By default this operation takes 7715 MB on my single Tesla P40.

I then run into OOM because the subsequent training also starts to claim GPU memory. If I stop the whole process and rerun it, the transform output has already been saved and training goes through successfully, which means this part does not need to stay on the GPU once it has finished.

So could you please offer a way to release this part of the memory? I tried using a multiprocessing.Process(), but it just got stuck.

My temporary workaround is to enable GPU memory growth in the main TFX pipeline (a minimal sketch is below), but this leads to another problem: the Trainer won't release the resource after training is done.
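
A minimal sketch of that workaround, assuming it runs before any other TensorFlow code touches the GPU:

import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of reserving (almost) the whole device up front. This must
# run before the first op initializes the GPU.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)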

And by the way, when the Trainer completes training it should release the GPU as well. So far I can't get this to work: if I start it via multiprocessing, the training process can't find the CUDA device; otherwise, the memory just gets drained by the Transform step. A rough sketch of what I was attempting is below.
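
For reference, a rough sketch of the subprocess approach I was attempting, assuming a "spawn" start method so the child gets a fresh CUDA context; gpu_work is just a placeholder for the real Transform/Trainer call:

import multiprocessing as mp

def gpu_work():
    # Import TensorFlow only inside the child so the parent process never
    # initializes CUDA; all GPU memory held here is freed when the child exits.
    import tensorflow as tf
    x = tf.random.uniform([1024, 1024])
    print(tf.reduce_sum(tf.matmul(x, x)).numpy())

if __name__ == "__main__":
    # "spawn" starts a clean interpreter; "fork" inherits the parent's
    # CUDA state and is what tends to hang or lose the device.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=gpu_work)
    p.start()
    p.join()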

Describe the expected behavior
Please treat GPU memory as a scarce resource when shipping your product to us; some of us only have a single GPU in our environment.

Standalone code to reproduce the issue
Providing a bare minimum test case or step(s) to reproduce the problem will greatly help us to debug the issue. If possible, please share a link to Colab/Jupyter/any notebook.

Name of your Organization (Optional)

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
""

1025KB (Collaborator) commented Mar 9, 2021

can you try replacing --direct_num_workers=0 in the beam arguments with --direct_num_workers=1?

It's a known issue. A workaround would be to try the "TF_FORCE_GPU_ALLOW_GROWTH=true" env var; if it still OOMs, you can set direct_num_workers=n to use a certain number of processes (a rough sketch of both is below).
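
Roughly, assuming the standard beam_pipeline_args mechanism; the flag names are real DirectRunner options, everything else is just a sketch:

import os

# Ask TensorFlow to grow GPU allocations on demand rather than
# reserving the whole device when the first op runs.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Beam DirectRunner options, passed to the TFX pipeline via its
# beam_pipeline_args parameter.
beam_pipeline_args = [
    "--direct_running_mode=multi_processing",
    "--direct_num_workers=1",
]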

axelning (Author) commented

Actually I have already set the worker number to 1.

The crux is, as I mentioned before, that with allow_growth set the Transform module takes only a small portion of memory, but the Trainer holds its memory until the whole TFX process is done. This leads the Evaluator into OOM once training has finished, because the trained model takes up almost the whole device (e.g. multilingual BERT).

The Trainer does not need any GPU memory once it has finished exporting the model, so why keep it allocated? (A possible per-component cap is sketched below.)
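
As a possible mitigation (not something TFX exposes directly, just an idea), each GPU-using step could cap its own allocation with a virtual device so that later components still have room; the 8192 MB limit is only an example value:

import tensorflow as tf

# Cap how much of the GPU this process may claim (example: 8192 MB),
# so a later component such as the Evaluator still has room.
# Must run before the GPU is initialized.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=8192)],
    )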

krislc commented May 12, 2021

Any update?

singhniraj08 (Contributor) commented
@axelning,

Following the Preprocessing options summary, for full-pass transformations during training and instance-level transformations during serving, it is recommended to use Dataflow (Apache Beam + TFT). Please try using Dataflow, as the transformation logic and the statistics computed during training are stored as a TensorFlow graph that is attached to the exported model for serving (a rough sketch of the Beam options is below).
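
For reference, a rough sketch of the Beam options for running on Dataflow instead of the local DirectRunner; the project, region and bucket values are placeholders:

# Passed to the TFX pipeline via beam_pipeline_args.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",            # placeholder GCP project
    "--region=us-central1",                # placeholder region
    "--temp_location=gs://my-bucket/tmp",  # placeholder staging bucket
]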

Thank you!

github-actions bot commented
This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions bot added the stale label Apr 27, 2023
github-actions bot commented May 5, 2023

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions bot closed this as completed May 5, 2023