tf2 stops working under gpu mode with 3rd order derivative included in loss function #53410

Open
UsherWang opened this issue Dec 14, 2021 · 5 comments
Labels: comp:gpu (GPU related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.7 (Issues related to TF 2.7.0), type:support (Support issues)

Comments

@UsherWang

Please make sure that this is a bug. As per our
GitHub Policy,
we only address code/doc bugs, performance issues, feature requests and
build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): custom code
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 20H2, OS build 19042.1348
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): anaconda pip install
  • TensorFlow version (use command below): v2.7.0-rc1-69-gc256c071bb2 2.7.0
  • Python version: Python 3.7.4
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: cuda10.0/cudnn11.5
  • GPU model and memory: 2080TI & 1080TI

You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
The custom loss function in this case includes a third-order derivative. The attached zip archive contains the Python script NS_tf2.py.
In NS_tf2.py this term has the form

where pred represents the output of the sequential network, and var1 and var2 are tensors.
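
For readers without the attachment, here is a minimal sketch of how a third-order derivative can be taken in TF 2 with nested tf.GradientTape contexts. The exact term from NS_tf2.py is not reproduced in this post, so the function f and the input x below are hypothetical stand-ins and not the pred, var1, or var2 from the script.

```python
import tensorflow as tf


def third_derivative(f, x):
    """Return the elementwise d^3 f / dx^3 at x using three nested tapes.

    f and x are illustrative placeholders, not names from NS_tf2.py.
    """
    with tf.GradientTape() as tape3:
        tape3.watch(x)
        with tf.GradientTape() as tape2:
            tape2.watch(x)
            with tf.GradientTape() as tape1:
                tape1.watch(x)
                y = f(x)
            # First derivative, recorded by tape2 and tape3.
            dy = tape1.gradient(y, x)
        # Second derivative, recorded by tape3.
        d2y = tape2.gradient(dy, x)
    # Third derivative.
    return tape3.gradient(d2y, x)


x = tf.constant([1.0, 2.0])
# f(x) = x^4, so the third derivative is analytically 24 * x.
result = third_derivative(lambda v: v * v * v * v, x).numpy()
print(result)
```

Each gradient call sits inside the enclosing tape's context so that the outer tape records it. A loss built on such a term is differentiated once more with respect to the network weights during training.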

At the beginning of the Python script,
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
is used to choose between CPU and GPUs.
The code works well in CPU mode:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
but stalls when the device is set to either the 1080 Ti or the 2080 Ti.
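
The device switch described above can be sketched as a small helper. The name select_device is hypothetical and does not appear in NS_tf2.py, and which GPU index corresponds to the 1080 Ti or the 2080 Ti depends on the machine. The variable has to be set before TensorFlow is imported for it to take effect.

```python
import os


def select_device(device="cpu"):
    """Set CUDA_VISIBLE_DEVICES and return the value that was set.

    "cpu" hides all GPUs; any other value is used as the GPU index
    (for example 0 or 1). Call this before `import tensorflow`.
    """
    if device == "cpu":
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide every GPU
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device)  # expose one GPU
    return os.environ["CUDA_VISIBLE_DEVICES"]


select_device("cpu")
# import tensorflow as tf  # import only after the variable is set
```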

Describe the expected behavior
The code also runs in GPU mode.

Contributing

  • Do you want to contribute a PR? (yes/no):
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Please find the code in the attached archive. First cd to the scripts' directory, then run the scripts.

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.

tf_bug.zip

@UsherWang UsherWang added the type:bug Bug label Dec 14, 2021
@sushreebarsa sushreebarsa added TF 2.7 Issues related to TF 2.7.0 comp:gpu GPU related issues type:support Support issues and removed type:bug Bug labels Dec 14, 2021
@sushreebarsa
Contributor

@sanatmpa1 I was able to reproduce the issue on Colab using TF v2.7.0 and tf-nightly (2.8.0.dev20211215). Please find the attached gists for reference. Thanks!

@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 27, 2021
@JW1992
Contributor

JW1992 commented Jan 6, 2022

@UsherWang can you please further clarify what "stalled" means? I opened the script for v2.7.0 by @sushreebarsa and saw that it ran till epoch 840, then got KeyboardInterrupt.

@UsherWang
Author

Hi @JW1992 ,

Thank you for your reply. As mentioned in the description, the script works well in CPU mode but stalls in GPU mode with my environment setup. It is not clear which platform (CPU or GPU) @sushreebarsa used in the test. By "stall" I mean that TF gets stuck at a specific iteration step: the process keeps running but produces no results for an abnormally long time.
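
One way to pin down where such a stall happens is a per-step watchdog. The sketch below uses only the Python standard library. The names run_with_watchdog and step_fn are hypothetical, and step_fn stands in for whatever one training iteration of the script does.

```python
import faulthandler
import sys
import time


def run_with_watchdog(step_fn, num_steps, timeout_s=60.0, log=None):
    """Call step_fn(i) for each step and return the per-step durations.

    If a single step runs longer than timeout_s seconds, the stack traces
    of all Python threads are written to `log` (default: sys.stderr),
    which shows the line on which the process is waiting.
    """
    if log is None:
        log = sys.stderr
    durations = []
    for i in range(num_steps):
        # Arm the watchdog for this step only.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)
        start = time.perf_counter()
        try:
            step_fn(i)
        finally:
            # Disarm it as soon as the step returns.
            faulthandler.cancel_dump_traceback_later()
        durations.append(time.perf_counter() - start)
    return durations
```

The recorded durations also show whether the step that hangs is the same one on every run. The traceback only covers Python frames, so a hang inside a GPU kernel would appear as the Python call that launched it or is waiting on it.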

@JW1992
Contributor

JW1992 commented Jan 6, 2022

Thanks Usher, do you mean that on GPU the code gets blocked at a specific epoch? Is this repeatable (stalled at the same epoch)?

It would be really helpful if you could find a simpler reproduction, so that we can find the root cause much more easily.

@UsherWang
Author

Thank you @JW1992 for your reply. As far as I remember, the behavior is not deterministic on my system: the epoch at which it stops varies between attempts. By the way, I just tested the code on my Linux system, and the problem does not seem to occur on Linux, only on Windows.

5 participants