
Unified mechanism for setting process-level settings #8136

Closed
yaroslavvb opened this issue Mar 6, 2017 · 8 comments
Assignees
Labels
type:feature Feature requests

Comments

@yaroslavvb
Contributor

yaroslavvb commented Mar 6, 2017

Some settings in TensorFlow apply to all sessions in the process. Examples: the size of the Eigen thread pool, the allocator growth strategy, and logging verbosity.

There are currently two places where such process properties are set:

  1. Environment variables
  2. tf.ConfigProto passed to the first tf.Session() or tf.Server() call

Option 1 lacks discoverability. For instance, the SM count required for a GPU to be visible to TensorFlow is set through TF_MIN_GPU_MULTIPROCESSOR_COUNT, which is not documented outside of gpu_device.cc. It also has unclear semantics: when does changing the TF_CPP_MIN_VLOG_LEVEL environment variable have an effect on logging? Empirically, changing it after import tensorflow still has an effect, but changing it after the first tf.Session call has no effect.
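
For illustration, a minimal sketch of the ordering described above (the verbosity values and script structure are made up; the point is that the cutover at which the variable stops mattering is not obvious):

import os
import tensorflow as tf

os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '1'   # after import but before the first session: empirically still takes effect

sess = tf.Session()                         # first session; process-level settings are now fixed

os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '2'   # after the first session: empirically no effect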

Option 2 leads to confusion when conflicting settings are specified. For instance, in #4455 the user was confused that config=tf.ConfigProto(intra_op_parallelism_threads=1) had no effect. The reason is that intra_op_parallelism_threads specifies the size of the process-global ThreadPool, and that setting was already fixed when the user called tf.Server earlier. (We also ran into this issue in our own deployment.)
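
A minimal sketch of that pattern, using a single-process local server for brevity (the thread count and ordering are illustrative, not taken from #4455):

import tensorflow as tf

server = tf.train.Server.create_local_server()    # process-global thread pool is sized here, with default settings

config = tf.ConfigProto(intra_op_parallelism_threads=1)
sess = tf.Session(server.target, config=config)   # intra_op_parallelism_threads is silently ignored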

cc @mrry
assigning to @tatatodd for triage since he asked me to file this issue

@tatatodd
Contributor

tatatodd commented Mar 9, 2017

Thanks for filing this feature-request issue @yaroslavvb, and the great analysis!

Assigning to @mrry since he knows about Session subtleties much better than I do.

@mrry mrry assigned poxvoculi and zheng-xq and unassigned mrry Mar 14, 2017
@mrry
Contributor

mrry commented Mar 14, 2017

It turns out there are some even subtler subtleties related to the ownership of GPU devices, allocators, and streams, which ought to be solved before we change anything else about configuration.

Assigning this to @zheng-xq and @poxvoculi, who're looking into the GPU issues.

@tensorflowbutler
Member

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@zheng-xq
Contributor

I agree with Derek that these are indeed subtle issues. We are systematically changing devices, allocators, and streams into global resources, but the API stays at the session level for backward compatibility.

Closing this one for now. Feel free to reopen if someone wants to contribute a new design.

@ppwwyyxx
Contributor

Another kind of failure:

# running on a machine with more than one GPU
import tensorflow as tf

print(tf.test.is_gpu_available())
# or call list_devices()

cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)   # fails

This is quite annoying: a line of code executed earlier leads to an error later, with a strange error message.

@anpark

anpark commented Feb 3, 2018

Same question for me, thanks.

@venuswu

venuswu commented Jun 4, 2018

Session configuration in TensorFlow is quite tricky.

@austinlostinboston

> Another kind of failure: [quoting the tf.test.is_gpu_available() / visible_device_list example from @ppwwyyxx above] This is quite annoying: a line of code executed earlier leads to an error later, with a strange error message.

WRT this problem: moving the tf.test.is_gpu_available() call to after sess = tf.Session(config=cfg) fixes the memory/CPU error you are probably seeing (a sketch of this reordering follows the list below). What's happening is that tf.test.is_gpu_available() has to start a session to check the resources. That session is started with the default config (which has a setting that allows it to consume all of the CPU/RAM resources). Then, when you try to initialize your own session, it throws an error because there are no GPU resources left for it.

IMO, one of two corrections should be made:

  1. If tf.test.is_gpu_available() has the authority to start its own session, it should kill that session after the resources have been checked.
  2. This really needs to be in the docs, since it isn't obvious to new users, or even to ones who have been using the codebase for a while.
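
For concreteness, a minimal sketch of the reordering described above (the device id '1' and the multi-GPU assumption come from the quoted example; whether this fully avoids the error depends on the rest of the program):

import tensorflow as tf

cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)         # create the configured session first

print(tf.test.is_gpu_available())     # check GPU availability afterwards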
