
Unified mechanism for setting process-level settings #8136

Closed
yaroslavvb opened this issue Mar 6, 2017 · 8 comments
Assignees
Labels
type:feature Feature requests

Comments

@yaroslavvb
Contributor

yaroslavvb commented Mar 6, 2017

Some settings in TensorFlow apply to all sessions in the process. Examples: the size of the Eigen thread pool, the allocator growth strategy, and logging verbosity.

There are currently two places where such process properties are set:

  1. Environment variables
  2. tf.ConfigProto passed to the first tf.Session() or tf.Server() call

Option 1 lacks discoverability. For instance, the SM count required for a GPU to be visible to TensorFlow is set through TF_MIN_GPU_MULTIPROCESSOR_COUNT, which is not documented outside of gpu_device.cc. It also has unclear semantics: when does changing the TF_CPP_MIN_VLOG_LEVEL environment variable have an effect on logging? Empirically, changing it after import tensorflow still has an effect, but changing it after the first tf.Session call has no effect.
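
For illustration, a minimal sketch of the ordering described above (the verbosity values and script structure are made up; the point is that the cutover at which the variable stops mattering is not obvious):

import os
import tensorflow as tf

os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '1'   # after import but before the first session: empirically still takes effect

sess = tf.Session()                         # first session; process-level settings are now fixed

os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '2'   # after the first session: empirically no effect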

Option 2 leads to confusion when conflicting settings are specified. For instance, in #4455 the user was confused that config=tf.ConfigProto(intra_op_parallelism_threads=1) had no effect. The reason is that intra_op_parallelism_threads specifies the size of the process-global ThreadPool, and that setting was already fixed when the user called tf.Server earlier. (We also ran into this issue in our own deployment.)
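
A minimal sketch of that pattern, using a single-process local server for brevity (the thread count and ordering are illustrative, not taken from #4455):

import tensorflow as tf

server = tf.train.Server.create_local_server()    # process-global thread pool is sized here, with default settings

config = tf.ConfigProto(intra_op_parallelism_threads=1)
sess = tf.Session(server.target, config=config)   # intra_op_parallelism_threads is silently ignored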

cc @mrry
assigning to @tatatodd for triage since he asked me to file this issue

@tatatodd
Contributor

tatatodd commented Mar 9, 2017

Thanks for filing this feature-request issue @yaroslavvb, and the great analysis!

Assigning to @mrry since he knows about Session subtleties much better than I do.

@mrry mrry assigned poxvoculi and zheng-xq and unassigned mrry Mar 14, 2017
@mrry
Contributor

mrry commented Mar 14, 2017

It turns out there are some even subtler subtleties related to the ownership of GPU devices, allocators, and streams, which ought to be solved before we change anything else about configuration.

Assigning this to @zheng-xq and @poxvoculi, who're looking into the GPU issues.

@tensorflowbutler
Member

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@zheng-xq
Contributor

I agree with Derek that these are indeed subtle issues. We are systematically changing devices, allocators, and streams into global resources, but the API stays at the session level for backward compatibility.

Closing this one for now. Feel free to reopen if someone wants to contribute a new design.

@ppwwyyxx
Contributor

Another kind of failure:

# running on a machine with more than one GPU
import tensorflow as tf

print(tf.test.is_gpu_available())
# or call list_devices()

cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)   # fails

This is quite annoying: a line of code executed earlier leads to an error later, with a strange error message.

@anpark

anpark commented Feb 3, 2018

Same question for me, thanks.

@venuswu

venuswu commented Jun 4, 2018

Session configuration in TensorFlow is quite tricky.

@austinlostinboston

> Another kind of failure: [quoting the tf.test.is_gpu_available() / visible_device_list example from @ppwwyyxx above] This is quite annoying: a line of code executed earlier leads to an error later, with a strange error message.

WRT this problem: moving the tf.test.is_gpu_available() call to after sess = tf.Session(config=cfg) fixes the memory/CPU error you are probably seeing (a sketch of this reordering follows the list below). What's happening is that tf.test.is_gpu_available() has to start a session to check the resources. That session is started with the default config (which has a setting that allows it to consume all of the CPU/RAM resources). Then, when you try to initialize your own session, it throws an error because there are no GPU resources left for it.

IMO, one of two corrections should be made:

  1. If tf.test.is_gpu_available() has the authority to start its own session, it should kill that session after the resources have been checked.
  2. This really needs to be in the docs, since it isn't obvious to new users, or even to ones who have been using the codebase for a while.
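
For concreteness, a minimal sketch of the reordering described above (the device id '1' and the multi-GPU assumption come from the quoted example; whether this fully avoids the error depends on the rest of the program):

import tensorflow as tf

cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)         # create the configured session first

print(tf.test.is_gpu_available())     # check GPU availability afterwards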
