
Initialize only CPU devices in optimize_dataset_op #24461

Merged

Conversation

@samikama (Contributor) commented Dec 20, 2018

optimize_dataset_op performs a rogue device initialization for all devices in the system through grappler_item_builder, even though only CPU devices are used. This causes the problems mentioned in #24303 and #23458. This PR initializes only CPU devices to work around these issues. Adding @asimshankar, @jlebar, and @azaks2 for review. Please forward to the appropriate owner if needed.
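For context, a minimal sketch of the kind of per-process pinning this affects (TF 1.x API; the helper below is illustrative, not code from this PR):

```python
# Sketch of the multi-process setup affected by the rogue initialization
# (TF 1.x API; illustrative only). Each worker is told to use a single
# GPU, but before this fix the tf.data optimization pass still created
# CUDA contexts on every GPU in the machine.
import tensorflow as tf

def build_pinned_input(local_rank):
    # Pin this process to one GPU; the other GPUs should stay untouched.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(local_rank)

    # A tf.data pipeline whose graph rewrite (optimize_dataset_op) used to
    # initialize all devices via grappler_item_builder, not just the CPU.
    dataset = tf.data.Dataset.range(1000).batch(32)
    next_batch = dataset.make_one_shot_iterator().get_next()

    return tf.Session(config=config), next_batch
```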

@Harshini-Gadige self-assigned this on Dec 20, 2018

@jlebar requested review from ezhulenev and removed request for asimshankar, jlebar and azaks2 on Dec 20, 2018

@jlebar (Member) commented Dec 20, 2018

Thank you for the change!

> Adding @asimshankar, @jlebar, and @azaks2 for review.

You'll want someone who works on Grappler to review this.

Looking over this commit, I'll repeat what I said in a previous PR: you may want to get someone who sits near you to read over the comments.

@samikama (Contributor, Author) commented Dec 20, 2018

@tfboyd This PR is needed together with #24303 to run TF+XLA with Horovod or any other multi-process environment.

@DEKHTIARJonathan (Contributor) commented Jan 7, 2019

@hgadig Is there anything we can do to help get this PR merged? Thanks a lot for your time.

@Harshini-Gadige commented Jan 7, 2019

@DEKHTIARJonathan I'm taking care of this one and an internal CL has been submitted. I'll keep you posted.

feihugis pushed a commit to feihugis/tensorflow that referenced this pull request Jan 7, 2019

@tensorflow-copybara merged commit 85ef6de into tensorflow:master on Jan 7, 2019

14 of 18 checks passed

Failed:
- MacOS Contrib: Internal CI build failed
- Ubuntu Python2: Internal CI build failed
- Windows Bazel: Internal CI build failed
- Windows Bazel GPU: Internal CI build failed

Passed:
- Android Demo App: Internal CI build successful
- Experimental clang-format Check: Internal CI build successful
- GPU CC: Internal CI build successful
- GPU Python3: Internal CI build successful
- MacOS Python2 and CC: Internal CI build successful
- Ubuntu CC: Internal CI build successful
- Ubuntu Makefile: Internal CI build successful
- Ubuntu Python3: Internal CI build successful
- Ubuntu Python3 PIP: Internal CI build successful
- Ubuntu Sanity: Internal CI build successful
- Ubuntu contrib: Internal CI build successful
- XLA: Internal CI build successful
- cla/google: All necessary CLAs are signed
- import/copybara: Change imported to the internal review system
@DEKHTIARJonathan (Contributor) commented Jan 8, 2019

@feihugis @hgadig Thanks a lot for the merge into master. What are the odds that commits b2fa0df and a427c13 will be cherry-picked for the TF 1.13 release?

Thanks a lot


@asimshankar (Contributor) commented Jan 8, 2019

@DEKHTIARJonathan @jlebar : @aselle is looking into cherrypicking the two for the 1.13 release. Will keep you posted.

@DEKHTIARJonathan (Contributor) commented Jan 31, 2019

Hi @jlebar and @aselle, any news regarding the cherry-picking for TF 1.13?
I tested the current release (rc0) and the latest version of the r1.13 branch. It looks like the issue is not fixed, so commits b2fa0df and a427c13 were likely not merged/cherry-picked.

As TF 1.13 will probably be the last release of TF 1.x, it would be very unfortunate if these bug fixes could not make it into TF 1.13.

Thanks a lot for your help.

@alsrgv (Contributor) commented Feb 18, 2019

Any chance a427c13 can be cherry-picked before TF 1.13.0 is out? Since XLA is compiled-in by default, this causes many multi-process use cases, including Horovod, to fail with TF 1.13.0rc2:

Mon Feb 18 23:23:54 2019[1,1]<stderr>:2019-02-18 23:23:54.160127: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Mon Feb 18 23:23:54 2019[1,1]<stderr>:2019-02-18 23:23:54.163429: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Mon Feb 18 23:23:54 2019[1,1]<stderr>:I0218 23:23:54.176867 139666301507328 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Excerpt from nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     80320      C   python                                     10445MiB |
|    0     80321      C   python                                       147MiB |
|    0     80322      C   python                                       147MiB |
|    0     80323      C   python                                       147MiB |
|    1     80320      C   python                                       147MiB |
|    1     80321      C   python                                     10445MiB |
|    1     80322      C   python                                       147MiB |
|    1     80323      C   python                                       147MiB |
|    2     80320      C   python                                       147MiB |
|    2     80321      C   python                                       147MiB |
|    2     80322      C   python                                     10445MiB |
|    2     80323      C   python                                       147MiB |
|    3     80320      C   python                                       147MiB |
|    3     80321      C   python                                       147MiB |
|    3     80322      C   python                                       147MiB |
|    3     80323      C   python                                     10445MiB |
+-----------------------------------------------------------------------------+
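The ~147 MiB entries on the GPUs a process was not assigned to are the stray contexts this PR is about. For reference, a minimal sketch of the usual Horovod per-process pinning that they undermine (not the exact script behind the log above):

```python
# Typical Horovod-style per-process GPU pinning (TF 1.x). With the stray
# device initialization, every process still creates a small CUDA context
# on every GPU, which can starve cuDNN in the process that owns that GPU.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
sess = tf.Session(config=config)
```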

cc @martinwicke

@martinwicke (Member) commented Feb 19, 2019

@gunan assuming @aselle is gone already. Not sure what the status is.

@tfboyd (Member) commented Feb 19, 2019

Leaving a note to say I totally missed this in my mailbox, including Justin's direct forward. It seems it will not make the release, which has been in progress for 45+ days.

@tfboyd (Member) commented Feb 19, 2019

Falling on my sword aside, we now have a weekly meeting with NVIDIA where we track PRs/tasks more closely, which should reduce the number of items being missed and speed up resolution.

@samikama (Contributor, Author) commented Feb 19, 2019

Hi,
According to #24799 it was already merged on Jan 9. Did someone revert it accidentally?

@tfboyd (Member) commented Feb 19, 2019

@samikama it was not in the 1.13 branch that was cut before Jan 8th. That is the discussion.

@gunan (Member) commented Feb 19, 2019

As @samikama pointed out, @aselle already cherry-picked this.

@alsrgv (Contributor) commented Feb 19, 2019

@gunan, @tfboyd, @samikama, I think the piece that has been lost along the way is commit a427c13, which is also necessary. Without that commit, XLA uses a bit of memory on adjacent GPUs, which is often enough to make cuDNN go mad. Unfortunately, XLA and --allow_growth do not seem to live together well: #23888.

Any chance a427c13 could be cherry-picked before the release?
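For completeness, the mitigation that #23888 reports does not play well with XLA is the standard allow_growth option; in TF 1.x it looks roughly like this (a sketch, not a reliable workaround for this bug):

```python
# The usual TF 1.x mitigation for up-front GPU memory reservation.
# Per #23888, it does not reliably prevent the XLA/grappler path from
# touching adjacent GPUs, which is why the cherry-pick matters.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```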

gnperdue added a commit to gnperdue/BashDotFiles that referenced this pull request Feb 23, 2019

roll back to TF 1.10 due to this bug
tensorflow/tensorflow#24461
which seems to mess up training (certainly produces lots of warnings and errors)

@alsrgv (Contributor) commented Feb 25, 2019

@gunan, congrats on the 1.13.1 release!

What is a good path forward for this issue: will there be another 1.13.x release with a427c13 included, is 1.14.0rc0 around the corner, or should we recommend that folks stick with 1.12.1 or tf-nightly-gpu if they have issues with XLA grabbing adjacent GPUs?

@tfboyd (Member) commented Feb 26, 2019

I dislike answers that are not good, but they are better than silence.

I think the answer is to stick with 1.12 or roll the dice with tf-nightly-gpu. 1.13, other than cherry-picks, was cut back in December. I do not have a crystal ball on 1.14, and a wild guess is, well, wild and likely not useful.

I am sorry I did not see this problem back in December and get it picked in; that is on me. I also thought we might finish 1.13 mid-January, so many things did not go well for any of us for various reasons. :-)

@alsrgv (Contributor) commented Feb 26, 2019

@tfboyd, thanks for the quick response! I hope 1.14 is not too far off; moving to CUDA 10 would be great :-)

On the systemic side, I think it would be great to add an integration test that runs on a multi-GPU VM and verifies (using nvidia-smi) that visible_devices fencing works properly in various use cases.

@martinwicke (Member) commented Feb 26, 2019

I think such a test would be fantastic. We have multi-GPU machines for CI, so we could definitely do that. The tests would have to be whitelisted (since we run tests in parallel), but I think that's possible. @annarev do you know about the setup for running tests on more than one GPU on Kokoro?

@annarev (Member) commented Feb 26, 2019

It does sound possible. Right now we just run one test per GPU because we set CUDA_VISIBLE_DEVICES to a single GPU index, so we could just set it to multiple indices.
I suppose we can either whitelist the multi-GPU tests or create a separate build for them.
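A minimal sketch of what that would mean for a single multi-GPU test process (the Kokoro wiring itself is not shown; the variable must be set before CUDA is initialized):

```python
# Expose two GPUs to one multi-GPU test instead of a single index.
# This must happen before TensorFlow (or anything else) initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # this process now sees exactly GPUs 0 and 1
```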

@tfboyd (Member) commented Feb 26, 2019

As a small fork of this discussion, we should have Horovod added (it may get added due to some other testing) to our "sandbox", tensorflow/models/official/resnet (this is becoming a tf_cnn_benchmarks, but focused on using only high-level APIs with some FLAG-ed exceptions). It would be great to have a single-machine Horovod test running with my other nightly tests, and to get it working with TF 2.0. This is separate from the unit test you are talking about above.

@alsrgv (Contributor) commented Mar 1, 2019

@annarev, @martinwicke,

I was looking into adding the test. From what I reverse engineered, you're using https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh to run tests on available GPUs, and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/gpu/run_py3_core.sh is one of the test suites.

In order to support multi-GPU tests, I imagine we'd have to enhance parallel_gpu_execute.sh to support reserving multiple GPUs and make another test suite for these multi-GPU tests.

One simple way to verify memory usage would be to run nvidia-smi --query-gpu=index,memory.used --format=csv in a subprocess after running various steps, such as importing TensorFlow or creating a session, and analyze its output.
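
A rough sketch of that check (the helper name and the extra nvidia-smi format options are illustrative, not an existing utility):

```python
# Sketch of a memory-fencing check: after a step such as importing
# TensorFlow or creating a session, ask nvidia-smi which GPUs have memory
# in use and verify that only the expected devices appear.
import subprocess

def gpus_with_memory_in_use(threshold_mib=100):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]).decode()
    busy = []
    for line in out.strip().splitlines():
        index, used = [field.strip() for field in line.split(",")]
        if int(used) > threshold_mib:
            busy.append(int(index))
    return busy

# Example assertion inside a test, assuming the process under test was
# pinned to GPU 0:
# assert gpus_with_memory_in_use() == [0]
```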

Does this all sound reasonable?

@tfboyd,

This ResNet model could be a great test case for the upcoming HorovodDistributionStrategy, since it relies on Estimators. I'll check up on that thread with @guptapriya and @yuefengz.

@annarev (Member) commented Mar 11, 2019

Great meeting you at the dev summit! As discussed, I think it is better to set up a separate build that doesn't use parallel_gpu_execute.sh and runs multi-GPU tests sequentially. I will work on setting one up.
