tf2 stops working under gpu mode with 3rd order derivative included in loss function #53410

UsherWang · 2021-12-14T00:35:39Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): custom code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): windows 10 20H2, OS build 19042.1348
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
TensorFlow installed from (source or binary): anaconda pip install
TensorFlow version (use command below): v2.7.0-rc1-69-gc256c071bb2 2.7.0
Python version: Python 3.7.4
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: cuda10.0/cudnn11.5
GPU model and memory: 2080TI & 1080TI

Describe the current behavior
In this case, the custom loss function includes a third-order derivative. The attached zip archive contains the Python script NS_tf2.py.
In NS_tf2.py this term has the form
$\frac{\partial^{3} (pred)}{\partial (var1)\partial (var2)\partial (var2)}$
where pred is the output of a sequential network and var1 and var2 are tensors.
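For readers without the attachment, a mixed third-order derivative of this form can be computed with nested `tf.GradientTape` contexts. The sketch below is only an illustration and is not taken from NS_tf2.py; the helper name `mixed_third_derivative` and the toy function are hypothetical stand-ins for the network.

```python
import tensorflow as tf


def mixed_third_derivative(f, var1, var2):
    """Return d^3 f / (d var1 d var2 d var2) using three nested tapes.

    f is any differentiable function of (var1, var2); in the issue it
    would be the sequential network, which is not reproduced here.
    """
    with tf.GradientTape() as tape3:
        tape3.watch(var2)
        with tf.GradientTape() as tape2:
            tape2.watch(var2)
            with tf.GradientTape() as tape1:
                tape1.watch(var1)
                pred = f(var1, var2)
            # First derivative w.r.t. var1, recorded by the outer tapes.
            d1 = tape1.gradient(pred, var1)
        # Second derivative: d/d var2 of d1, recorded by tape3.
        d2 = tape2.gradient(d1, var2)
    # Third derivative: d/d var2 of d2.
    return tape3.gradient(d2, var2)


# Toy stand-in for the network: f = var1 * var2^3, so the result
# is analytically 6 * var2.
x = tf.constant([3.0, 4.0])
y = tf.constant([1.0, 2.0])
print(mixed_third_derivative(lambda a, b: a * b * b * b, x, y))
```

Each inner gradient has to be taken inside the enclosing tape's context, otherwise the outer tape does not record it and the higher-order gradient comes back as `None`.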

At the beginning of the Python script, the following lines choose between the CPU and the GPUs:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
The script works well in CPU mode:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
but stalls when the device is set to either the 1080 Ti or the 2080 Ti.
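The device switch described above can be wrapped in a small helper. This is a sketch under assumptions: the function name `select_device` is hypothetical, and which index maps to the 1080 Ti or the 2080 Ti depends on the machine.

```python
import os


def select_device(mode):
    """Set CUDA_VISIBLE_DEVICES for the chosen mode and return its value.

    mode == "cpu" hides all GPUs ('-1'); any other value is treated as a
    GPU index such as 0 or 1. Call this before importing TensorFlow so
    the setting is in place when the GPU devices are initialized.
    """
    value = "-1" if mode == "cpu" else str(mode)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_device("cpu")  # or select_device(0) for the first GPU

# After this point TensorFlow can be imported and the visible devices checked:
# import tensorflow as tf
# print(tf.config.list_physical_devices("GPU"))
```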

Describe the expected behavior
The code should run in GPU mode as it does in CPU mode.

Standalone code to reproduce the issue
Please find the code in the attached archive. Change to the scripts' directory before running the scripts.

Other info / logs

Please find the code in the attached archive. Change to the scripts' directory before running the scripts.

tf_bug.zip

sushreebarsa · 2021-12-16T07:14:11Z

@sanatmpa1 I was able to reproduce the issue on Colab using TF v2.7.0 and tf-nightly (2.8.0.dev20211215). Please find the attached gists for reference. Thanks!

JW1992 · 2022-01-06T19:36:40Z

@UsherWang can you please clarify what "stalled" means? I opened the v2.7.0 script from @sushreebarsa and saw that it ran until epoch 840 and then hit a KeyboardInterrupt.

UsherWang · 2022-01-06T20:04:58Z

Hi @JW1992 ,

Thank you for your reply. As mentioned in the description, the code works well in CPU mode but stalls in GPU mode with my environment setup. It is not clear which platform (CPU or GPU) @sushreebarsa used in that test. Here, "stall" means that TF gets stuck at a specific iteration step: it keeps running but produces no results for an abnormally long time.

JW1992 · 2022-01-06T20:15:37Z

Thanks Usher, do you mean that on GPU the code gets blocked at a specific epoch? Is this repeatable (does it stall at the same epoch every time)?

It would be really helpful if you could find a simpler reproduction, so that we can find the root cause much more easily.
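One possible way to answer the repeatability question is to log the start and end of every training step, so that the last line printed identifies the step where the run hangs. The following is a minimal sketch, not part of the attached script; `run_logged_steps` and `step_fn` are hypothetical names, and `step_fn` is assumed to run one training step and fetch the loss back to the host (for example via `.numpy()`) so that the measured time includes the GPU work.

```python
import time


def _flush_print(message):
    # Flush immediately so the last message is visible even if the run hangs.
    print(message, flush=True)


def run_logged_steps(step_fn, num_steps, log=_flush_print):
    """Call step_fn(step) for each step and log when it starts and finishes.

    Returns the list of per-step durations in seconds.
    """
    durations = []
    for step in range(num_steps):
        log(f"step {step} start")
        start = time.perf_counter()
        step_fn(step)
        durations.append(time.perf_counter() - start)
        log(f"step {step} done in {durations[-1]:.3f}s")
    return durations
```

Comparing the last logged step across several runs would show whether the stall always happens at the same epoch.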

UsherWang · 2022-01-07T08:00:59Z

Thank you @JW1992 for your reply. As far as I remember, the behavior is not deterministic on my system: the epoch at which it stops varies between attempts. By the way, I just tested the code on my Linux system, and the problem does not seem to occur there; it only appears on Windows.

UsherWang added the type:bug Bug label Dec 14, 2021

google-ml-butler bot assigned sushreebarsa Dec 14, 2021

sushreebarsa added TF 2.7 Issues related to TF 2.7.0 comp:gpu GPU related issues type:support Support issues and removed type:bug Bug labels Dec 14, 2021

sushreebarsa assigned sanatmpa1 and unassigned sushreebarsa Dec 16, 2021

sanatmpa1 assigned sachinprasadhs and unassigned sanatmpa1 Dec 24, 2021

sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 27, 2021
